Leaderboard
All 49 AI models ranked by how they handled the pea gravel bike paths prompt. Scores are means across 10 runs per model.
Scoring: Each model was run 10 times; scores show the mean across runs. Responses were evaluated by Claude Sonnet 4.6 on 5 dimensions (1-5 each). Category shows the dominant behaviour with a consistency ratio (e.g. 8/10 = that category in 8 of 10 runs). ±N.N is the standard deviation of the total score; lower means more consistent.
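The aggregation behind each row (mean total, standard deviation, and dominant category with its consistency ratio) can be sketched in Python. This is a minimal sketch: the per-run record shape and the sample values are illustrative assumptions, not the benchmark's actual data format.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical per-run records for one model: five dimension scores
# (1-5 each) plus the judged behaviour category. Field names are assumed.
runs = [
    {"scores": [4.0, 4.0, 5.0, 5.0, 5.0], "category": "Challenged Premise"},
    {"scores": [5.0, 5.0, 5.0, 5.0, 5.0], "category": "Challenged Premise"},
    {"scores": [3.0, 3.0, 4.0, 4.0, 3.0], "category": "Wrote with Caveats"},
]

# Total per run is the sum of the five dimension scores (max 25).
totals = [sum(r["scores"]) for r in runs]

mean_total = mean(totals)   # reported in the "Total" column
spread = stdev(totals)      # reported as "±N.N"

# Dominant category with its consistency ratio, e.g. "Challenged Premise 2/3".
counts = Counter(r["category"] for r in runs)
category, hits = counts.most_common(1)[0]
label = f"{category} {hits}/{len(runs)}"

print(f"{label}: {mean_total:.1f} ±{spread:.1f}")
```

With the three sample runs above this prints `Challenged Premise 2/3: 21.7 ±4.2`; the real leaderboard applies the same reduction over 10 runs per model.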
| # | Model | Provider | Tier | Category | Fact | Crit | Write | Spec | Use | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-5 | openai | flagship | Challenged Premise 8/10 | 4.8 | 4.8 | 5.0 | 5.0 | 4.9 | 24.5 ±1.0 |
| 2 | Perplexity Deep Research SEARCH | perplexity | deep-research | Challenged Premise 10/10 | 5.0 | 5.0 | 4.1 | 5.0 | 5.0 | 24.1 ±0.3 |
| 3 | Qwen3.5 397B | qwen | flagship | Challenged Premise 7/10 | 4.7 | 4.7 | 4.4 | 4.8 | 4.7 | 23.3 ±1.9 |
| 4 | Perplexity Sonar Pro SEARCH | perplexity | search | Challenged Premise 8/10 | 4.0 | 4.1 | 3.8 | 4.1 | 4.0 | 20.0 ±4.5 |
| 5 | Gemini 3.1 Flash Lite | google | efficient | Wrote with Caveats 5/10 | 3.7 | 3.7 | 4.2 | 4.5 | 3.7 | 19.8 ±4.5 |
| 6 | Claude Opus 4.6 | anthropic | flagship | Wrote with Caveats 10/10 | 3.5 | 3.5 | 4.5 | 4.5 | 3.5 | 19.5 ±2.3 |
| 7 | Perplexity Sonar Pro Search SEARCH | perplexity | search | Wrote with Caveats 8/10 | 3.9 | 3.9 | 3.4 | 3.7 | 3.9 | 18.8 ±2.8 |
| 8 | GPT-5 Mini | openai | efficient | Wrote with Caveats 10/10 | 3.3 | 3.3 | 4.0 | 4.7 | 3.4 | 18.7 ±2.0 |
| 9 | GPT-5.4 Pro | openai | flagship | Wrote with Caveats 7/7 | 3.4 | 3.4 | 4.0 | 4.0 | 3.4 | 18.3 ±1.5 |
| 10 | Gemini 2.5 Pro | google | flagship | Wrote with Caveats 9/10 | 3.1 | 3.1 | 4.2 | 4.4 | 3.2 | 18.0 ±3.7 |
| 11 | o4-mini Deep Research SEARCH | openai | deep-research | Wrote with Caveats 10/10 | 3.1 | 3.1 | 4.0 | 4.6 | 3.1 | 17.9 ±2.5 |
| 12 | GPT-5.2 | openai | flagship | Wrote with Caveats 10/10 | 2.9 | 2.9 | 4.5 | 4.5 | 3.0 | 17.8 ±2.2 |
| 13 | Gemini 3.1 Pro | google | flagship | Wrote with Caveats 10/10 | 3.1 | 3.1 | 4.1 | 4.4 | 3.1 | 17.8 ±2.0 |
| 14 | Qwen3.5 122B | qwen | mid | Wrote with Caveats 10/10 | 2.9 | 2.9 | 4.2 | 4.2 | 3.1 | 17.3 ±2.6 |
| 15 | Kimi K2.5 | moonshot | flagship | Wrote with Caveats 10/10 | 2.5 | 2.6 | 4.6 | 4.7 | 2.6 | 17.0 ±3.0 |
| 16 | GPT-5.4 | openai | flagship | Wrote with Caveats 10/10 | 2.9 | 2.9 | 4.0 | 4.0 | 3.0 | 16.8 ±1.5 |
| 17 | Qwen3.5 Flash | qwen | efficient | Wrote with Caveats 8/8 | 2.6 | 2.6 | 4.0 | 4.1 | 2.8 | 16.1 ±2.7 |
| 18 | DeepSeek R1 | deepseek | reasoning | Wrote with Caveats 8/10 | 2.5 | 2.5 | 4.0 | 4.1 | 2.7 | 15.8 ±3.4 |
| 19 | Perplexity Sonar SEARCH | perplexity | search | Wrote with Caveats 5/10 | 2.8 | 2.8 | 3.4 | 3.6 | 2.8 | 15.4 ±5.3 |
| 20 | Claude Sonnet 4.6 | anthropic | mid | Wrote with Caveats 10/10 | 2.3 | 2.3 | 4.1 | 4.1 | 2.5 | 15.3 ±1.7 |
| 21 | o3 Deep Research SEARCH | openai | deep-research | Wrote with Caveats 6/10 | 2.4 | 1.9 | 4.1 | 4.5 | 2.4 | 15.3 ±3.7 |
| 22 | GPT-5.3 Codex | openai | code | Wrote with Caveats 10/10 | 2.1 | 2.1 | 4.0 | 4.0 | 2.5 | 14.7 ±0.9 |
| 23 | Gemini 3 Flash | google | mid | Wrote with Caveats 9/10 | 2.3 | 2.1 | 4.0 | 4.0 | 2.2 | 14.6 ±1.5 |
| 24 | Gemini 2.5 Flash | google | efficient | Wrote with Caveats 6/10 | 2.2 | 2.2 | 3.9 | 3.6 | 2.3 | 14.2 ±3.8 |
| 25 | Mistral Medium 3.1 | mistral | mid | Wrote with Caveats 9/10 | 2.0 | 1.9 | 4.0 | 4.0 | 2.1 | 14.0 ±0.5 |
| 26 | DeepSeek V3.1 | deepseek | mid | Wrote with Caveats 6/10 | 1.7 | 1.7 | 4.0 | 4.0 | 1.8 | 13.2 ±2.0 |
| 27 | o4-mini | openai | reasoning | Wrote Uncritically 5/10 | 1.6 | 1.5 | 4.0 | 4.2 | 1.5 | 12.8 ±1.3 |
| 28 | MiMo V2 Flash | xiaomi | efficient | Wrote Uncritically 6/10 | 1.6 | 1.5 | 4.0 | 4.0 | 1.5 | 12.6 ±2.0 |
| 29 | GLM-5 | zhipu | flagship | Wrote Uncritically 5/10 | 1.5 | 1.5 | 4.0 | 4.0 | 1.5 | 12.5 ±1.5 |
| 30 | Mercury 2 | inception | diffusion | Wrote Uncritically 10/10 | 1.0 | 1.0 | 4.0 | 4.6 | 1.0 | 11.6 ±0.5 |
| 31 | Seed 2.0 Mini | bytedance | efficient | Wrote Uncritically 9/9 | 1.0 | 1.0 | 4.0 | 4.3 | 1.0 | 11.3 ±0.5 |
| 32 | Mistral Large | mistral | flagship | Wrote Uncritically 6/10 | 1.5 | 1.4 | 3.5 | 3.5 | 1.4 | 11.3 ±2.0 |
| 33 | GPT-5.3 | openai | flagship | Wrote with Caveats 7/10 | 1.7 | 1.6 | 3.1 | 3.1 | 1.7 | 11.2 ±1.6 |
| 34 | Qwen3 Max | qwen | flagship | Wrote Uncritically 9/10 | 1.1 | 1.1 | 4.0 | 3.9 | 1.1 | 11.2 ±1.0 |
| 35 | Gemma 3 27B | google | open-source | Wrote Uncritically 6/10 | 1.4 | 1.4 | 3.3 | 3.5 | 1.6 | 11.2 ±2.4 |
| 36 | Qwen3 Max Thinking | qwen | flagship | Wrote Uncritically 10/10 | 1.0 | 1.0 | 4.0 | 3.9 | 1.0 | 10.9 ±0.3 |
| 37 | GPT-4o | openai | previous-gen | Wrote with Caveats 6/10 | 1.6 | 1.5 | 3.0 | 3.0 | 1.6 | 10.7 ±1.4 |
| 38 | Claude Haiku 4.5 | anthropic | efficient | Wrote Uncritically 9/10 | 1.1 | 1.1 | 3.8 | 3.5 | 1.1 | 10.6 ±1.4 |
| 39 | DeepSeek V3.2 | deepseek | flagship | Wrote Uncritically 10/10 | 1.2 | 1.0 | 4.0 | 3.4 | 1.0 | 10.6 ±0.7 |
| 40 | GLM-4.7 Flash | zhipu | efficient | Wrote Uncritically 10/10 | 1.0 | 1.0 | 3.6 | 3.5 | 1.0 | 10.1 ±0.9 |
| 41 | Llama 4 Maverick | meta | flagship | Wrote Uncritically 7/10 | 1.4 | 1.3 | 3.0 | 3.0 | 1.3 | 10.0 ±1.3 |
| 42 | MiniMax M2.5 | minimax | flagship | Wrote Uncritically 10/10 | 1.0 | 1.0 | 3.5 | 3.4 | 1.0 | 9.9 ±2.0 |
| 43 | Seed 1.6 Flash | bytedance | efficient | Wrote Uncritically 9/9 | 1.0 | 1.0 | 3.7 | 3.0 | 1.0 | 9.7 ±0.5 |
| 44 | Llama 4 Scout | meta | mid | Wrote Uncritically 8/10 | 1.2 | 1.2 | 3.0 | 2.9 | 1.2 | 9.5 ±1.3 |
| 45 | Mistral Small 3.2 | mistral | efficient | Wrote Uncritically 9/10 | 1.2 | 1.0 | 3.0 | 3.0 | 1.1 | 9.3 ±0.6 |
| 46 | Command A | cohere | flagship | Wrote Uncritically 10/10 | 1.0 | 1.0 | 3.1 | 3.1 | 1.0 | 9.2 ±0.6 |
| 47 | LFM2 24B | liquid | mid | Wrote Uncritically 10/10 | 1.0 | 1.0 | 3.1 | 3.1 | 1.0 | 9.2 ±0.6 |
| 48 | Nemotron 70B | nvidia | mid | Wrote Uncritically 10/10 | 1.1 | 1.0 | 3.0 | 3.0 | 1.0 | 9.1 ±0.3 |
| 49 | Llama 3.3 70B | meta | previous-gen | Wrote Uncritically 10/10 | 1.0 | 1.0 | 3.0 | 2.4 | 1.0 | 8.4 ±0.5 |