Leaderboard

All 49 AI models, ranked by how they handled the pea gravel bike paths prompt. Scores are means across 10 runs per model.

Scoring: Each model was run 10 times; scores show the mean across runs. Each response was evaluated by Claude Sonnet 4.6 on 5 dimensions (1-5 each). Category shows the dominant behaviour with a consistency ratio (e.g. 8/10 = that category in 8 of 10 runs; where a model completed fewer than 10 runs, the ratio uses the actual run count, e.g. 7/7). ±N.N shows the standard deviation of the total across runs; lower means more consistent.
| # | Model | Provider | Tier | Category | Cons. | Fact | Crit | Write | Spec | Use | Total | ±SD |
|---|-------|----------|------|----------|-------|------|------|-------|------|-----|-------|-----|
| 1 | GPT-5 | openai | flagship | Challenged Premise | 8/10 | 4.8 | 4.8 | 5.0 | 5.0 | 4.9 | 24.5 | ±1.0 |
| 2 | Perplexity Deep Research | perplexity | deep-research | Challenged Premise | 10/10 | 5.0 | 5.0 | 4.1 | 5.0 | 5.0 | 24.1 | ±0.3 |
| 3 | Qwen3.5 397B | qwen | flagship | Challenged Premise | 7/10 | 4.7 | 4.7 | 4.4 | 4.8 | 4.7 | 23.3 | ±1.9 |
| 4 | Perplexity Sonar Pro | perplexity | search | Challenged Premise | 8/10 | 4.0 | 4.1 | 3.8 | 4.1 | 4.0 | 20.0 | ±4.5 |
| 5 | Gemini 3.1 Flash Lite | google | efficient | Wrote with Caveats | 5/10 | 3.7 | 3.7 | 4.2 | 4.5 | 3.7 | 19.8 | ±4.5 |
| 6 | Claude Opus 4.6 | anthropic | flagship | Wrote with Caveats | 10/10 | 3.5 | 3.5 | 4.5 | 4.5 | 3.5 | 19.5 | ±2.3 |
| 7 | Perplexity Sonar Pro Search | perplexity | search | Wrote with Caveats | 8/10 | 3.9 | 3.9 | 3.4 | 3.7 | 3.9 | 18.8 | ±2.8 |
| 8 | GPT-5 Mini | openai | efficient | Wrote with Caveats | 10/10 | 3.3 | 3.3 | 4.0 | 4.7 | 3.4 | 18.7 | ±2.0 |
| 9 | GPT-5.4 Pro | openai | flagship | Wrote with Caveats | 7/7 | 3.4 | 3.4 | 4.0 | 4.0 | 3.4 | 18.3 | ±1.5 |
| 10 | Gemini 2.5 Pro | google | flagship | Wrote with Caveats | 9/10 | 3.1 | 3.1 | 4.2 | 4.4 | 3.2 | 18.0 | ±3.7 |
| 11 | o4-mini Deep Research | openai | deep-research | Wrote with Caveats | 10/10 | 3.1 | 3.1 | 4.0 | 4.6 | 3.1 | 17.9 | ±2.5 |
| 12 | GPT-5.2 | openai | flagship | Wrote with Caveats | 10/10 | 2.9 | 2.9 | 4.5 | 4.5 | 3.0 | 17.8 | ±2.2 |
| 13 | Gemini 3.1 Pro | google | flagship | Wrote with Caveats | 10/10 | 3.1 | 3.1 | 4.1 | 4.4 | 3.1 | 17.8 | ±2.0 |
| 14 | Qwen3.5 122B | qwen | mid | Wrote with Caveats | 10/10 | 2.9 | 2.9 | 4.2 | 4.2 | 3.1 | 17.3 | ±2.6 |
| 15 | Kimi K2.5 | moonshot | flagship | Wrote with Caveats | 10/10 | 2.5 | 2.6 | 4.6 | 4.7 | 2.6 | 17.0 | ±3.0 |
| 16 | GPT-5.4 | openai | flagship | Wrote with Caveats | 10/10 | 2.9 | 2.9 | 4.0 | 4.0 | 3.0 | 16.8 | ±1.5 |
| 17 | Qwen3.5 Flash | qwen | efficient | Wrote with Caveats | 8/8 | 2.6 | 2.6 | 4.0 | 4.1 | 2.8 | 16.1 | ±2.7 |
| 18 | DeepSeek R1 | deepseek | reasoning | Wrote with Caveats | 8/10 | 2.5 | 2.5 | 4.0 | 4.1 | 2.7 | 15.8 | ±3.4 |
| 19 | Perplexity Sonar | perplexity | search | Wrote with Caveats | 5/10 | 2.8 | 2.8 | 3.4 | 3.6 | 2.8 | 15.4 | ±5.3 |
| 20 | Claude Sonnet 4.6 | anthropic | mid | Wrote with Caveats | 10/10 | 2.3 | 2.3 | 4.1 | 4.1 | 2.5 | 15.3 | ±1.7 |
| 21 | o3 Deep Research | openai | deep-research | Wrote with Caveats | 6/10 | 2.4 | 1.9 | 4.1 | 4.5 | 2.4 | 15.3 | ±3.7 |
| 22 | GPT-5.3 Codex | openai | code | Wrote with Caveats | 10/10 | 2.1 | 2.1 | 4.0 | 4.0 | 2.5 | 14.7 | ±0.9 |
| 23 | Gemini 3 Flash | google | mid | Wrote with Caveats | 9/10 | 2.3 | 2.1 | 4.0 | 4.0 | 2.2 | 14.6 | ±1.5 |
| 24 | Gemini 2.5 Flash | google | efficient | Wrote with Caveats | 6/10 | 2.2 | 2.2 | 3.9 | 3.6 | 2.3 | 14.2 | ±3.8 |
| 25 | Mistral Medium 3.1 | mistral | mid | Wrote with Caveats | 9/10 | 2.0 | 1.9 | 4.0 | 4.0 | 2.1 | 14.0 | ±0.5 |
| 26 | DeepSeek V3.1 | deepseek | mid | Wrote with Caveats | 6/10 | 1.7 | 1.7 | 4.0 | 4.0 | 1.8 | 13.2 | ±2.0 |
| 27 | o4-mini | openai | reasoning | Wrote Uncritically | 5/10 | 1.6 | 1.5 | 4.0 | 4.2 | 1.5 | 12.8 | ±1.3 |
| 28 | MiMo V2 Flash | xiaomi | efficient | Wrote Uncritically | 6/10 | 1.6 | 1.5 | 4.0 | 4.0 | 1.5 | 12.6 | ±2.0 |
| 29 | GLM-5 | zhipu | flagship | Wrote Uncritically | 5/10 | 1.5 | 1.5 | 4.0 | 4.0 | 1.5 | 12.5 | ±1.5 |
| 30 | Mercury 2 | inception | diffusion | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 4.0 | 4.6 | 1.0 | 11.6 | ±0.5 |
| 31 | Seed 2.0 Mini | bytedance | efficient | Wrote Uncritically | 9/9 | 1.0 | 1.0 | 4.0 | 4.3 | 1.0 | 11.3 | ±0.5 |
| 32 | Mistral Large | mistral | flagship | Wrote Uncritically | 6/10 | 1.5 | 1.4 | 3.5 | 3.5 | 1.4 | 11.3 | ±2.0 |
| 33 | GPT-5.3 | openai | flagship | Wrote with Caveats | 7/10 | 1.7 | 1.6 | 3.1 | 3.1 | 1.7 | 11.2 | ±1.6 |
| 34 | Qwen3 Max | qwen | flagship | Wrote Uncritically | 9/10 | 1.1 | 1.1 | 4.0 | 3.9 | 1.1 | 11.2 | ±1.0 |
| 35 | Gemma 3 27B | google | open-source | Wrote Uncritically | 6/10 | 1.4 | 1.4 | 3.3 | 3.5 | 1.6 | 11.2 | ±2.4 |
| 36 | Qwen3 Max Thinking | qwen | flagship | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 4.0 | 3.9 | 1.0 | 10.9 | ±0.3 |
| 37 | GPT-4o | openai | previous-gen | Wrote with Caveats | 6/10 | 1.6 | 1.5 | 3.0 | 3.0 | 1.6 | 10.7 | ±1.4 |
| 38 | Claude Haiku 4.5 | anthropic | efficient | Wrote Uncritically | 9/10 | 1.1 | 1.1 | 3.8 | 3.5 | 1.1 | 10.6 | ±1.4 |
| 39 | DeepSeek V3.2 | deepseek | flagship | Wrote Uncritically | 10/10 | 1.2 | 1.0 | 4.0 | 3.4 | 1.0 | 10.6 | ±0.7 |
| 40 | GLM-4.7 Flash | zhipu | efficient | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.6 | 3.5 | 1.0 | 10.1 | ±0.9 |
| 41 | Llama 4 Maverick | meta | flagship | Wrote Uncritically | 7/10 | 1.4 | 1.3 | 3.0 | 3.0 | 1.3 | 10.0 | ±1.3 |
| 42 | MiniMax M2.5 | minimax | flagship | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.5 | 3.4 | 1.0 | 9.9 | ±2.0 |
| 43 | Seed 1.6 Flash | bytedance | efficient | Wrote Uncritically | 9/9 | 1.0 | 1.0 | 3.7 | 3.0 | 1.0 | 9.7 | ±0.5 |
| 44 | Llama 4 Scout | meta | mid | Wrote Uncritically | 8/10 | 1.2 | 1.2 | 3.0 | 2.9 | 1.2 | 9.5 | ±1.3 |
| 45 | Mistral Small 3.2 | mistral | efficient | Wrote Uncritically | 9/10 | 1.2 | 1.0 | 3.0 | 3.0 | 1.1 | 9.3 | ±0.6 |
| 46 | Command A | cohere | flagship | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.1 | 3.1 | 1.0 | 9.2 | ±0.6 |
| 47 | LFM2 24B | liquid | mid | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.1 | 3.1 | 1.0 | 9.2 | ±0.6 |
| 48 | Nemotron 70B | nvidia | mid | Wrote Uncritically | 10/10 | 1.1 | 1.0 | 3.0 | 3.0 | 1.0 | 9.1 | ±0.3 |
| 49 | Llama 3.3 70B | meta | previous-gen | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.0 | 2.4 | 1.0 | 8.4 | ±0.5 |
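The aggregation described in the scoring note can be sketched in a few lines. This is a minimal Python sketch, assuming (this is not the leaderboard's actual pipeline) that a run's total is the sum of its five 1-5 dimension scores, and that the Total and ±N.N columns are the mean and standard deviation of those run totals; the per-run scores below are hypothetical.

```python
# Minimal sketch of the scoring aggregation (assumed, not the site's code):
# a run's total = sum of its five dimension scores; "Total" = mean of run
# totals; the +/-N.N column = standard deviation of run totals.
from statistics import mean, stdev

# Hypothetical per-run scores for one model (three runs shown for brevity).
runs = [
    {"fact": 5.0, "crit": 5.0, "write": 5.0, "spec": 5.0, "use": 5.0},
    {"fact": 4.0, "crit": 4.0, "write": 5.0, "spec": 5.0, "use": 5.0},
    {"fact": 5.0, "crit": 5.0, "write": 5.0, "spec": 5.0, "use": 4.0},
]

run_totals = [sum(run.values()) for run in runs]   # one total per run

total = round(mean(run_totals), 1)    # the "Total" column
spread = round(stdev(run_totals), 1)  # the "+/-N.N" consistency column
print(total, spread)                  # -> 24.0 1.0
```

Under these assumptions a model with identical runs would show ±0.0, while a model that sometimes challenged the premise and sometimes wrote uncritically would show a large spread, matching how the ±N.N column is read above.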