Leaderboard

All 49 AI models, ranked by how they handled the pea gravel bike paths prompt. Scores are means across 10 runs per model.

Scoring: Each model was run 10 times; scores show the mean across runs. Each response was evaluated by Claude Sonnet 4.6 on 5 dimensions (1-5 each). Category shows the dominant behaviour with a consistency ratio (e.g. 8/10 = that category in 8 of 10 runs; where a model completed fewer than 10 runs, the ratio uses the actual run count, e.g. 7/7). ±N.N shows the standard deviation of the total across runs; lower means more consistent.
| # | Model | Provider | Tier | Category | Cons. | Fact | Crit | Write | Spec | Use | Total | ±SD |
|---|-------|----------|------|----------|-------|------|------|-------|------|-----|-------|-----|
| 1 | GPT-5 | openai | flagship | Challenged Premise | 8/10 | 4.8 | 4.8 | 5.0 | 5.0 | 4.9 | 24.5 | ±1.0 |
| 2 | Perplexity Deep Research | perplexity | deep-research | Challenged Premise | 10/10 | 5.0 | 5.0 | 4.1 | 5.0 | 5.0 | 24.1 | ±0.3 |
| 3 | Qwen3.5 397B | qwen | flagship | Challenged Premise | 7/10 | 4.7 | 4.7 | 4.4 | 4.8 | 4.7 | 23.3 | ±1.9 |
| 4 | Perplexity Sonar Pro | perplexity | search | Challenged Premise | 8/10 | 4.0 | 4.1 | 3.8 | 4.1 | 4.0 | 20.0 | ±4.5 |
| 5 | Gemini 3.1 Flash Lite | google | efficient | Wrote with Caveats | 5/10 | 3.7 | 3.7 | 4.2 | 4.5 | 3.7 | 19.8 | ±4.5 |
| 6 | Claude Opus 4.6 | anthropic | flagship | Wrote with Caveats | 10/10 | 3.5 | 3.5 | 4.5 | 4.5 | 3.5 | 19.5 | ±2.3 |
| 7 | Perplexity Sonar Pro Search | perplexity | search | Wrote with Caveats | 8/10 | 3.9 | 3.9 | 3.4 | 3.7 | 3.9 | 18.8 | ±2.8 |
| 8 | GPT-5 Mini | openai | efficient | Wrote with Caveats | 10/10 | 3.3 | 3.3 | 4.0 | 4.7 | 3.4 | 18.7 | ±2.0 |
| 9 | GPT-5.4 Pro | openai | flagship | Wrote with Caveats | 7/7 | 3.4 | 3.4 | 4.0 | 4.0 | 3.4 | 18.3 | ±1.5 |
| 10 | Gemini 2.5 Pro | google | flagship | Wrote with Caveats | 9/10 | 3.1 | 3.1 | 4.2 | 4.4 | 3.2 | 18.0 | ±3.7 |
| 11 | o4-mini Deep Research | openai | deep-research | Wrote with Caveats | 10/10 | 3.1 | 3.1 | 4.0 | 4.6 | 3.1 | 17.9 | ±2.5 |
| 12 | GPT-5.2 | openai | flagship | Wrote with Caveats | 10/10 | 2.9 | 2.9 | 4.5 | 4.5 | 3.0 | 17.8 | ±2.2 |
| 13 | Gemini 3.1 Pro | google | flagship | Wrote with Caveats | 10/10 | 3.1 | 3.1 | 4.1 | 4.4 | 3.1 | 17.8 | ±2.0 |
| 14 | Qwen3.5 122B | qwen | mid | Wrote with Caveats | 10/10 | 2.9 | 2.9 | 4.2 | 4.2 | 3.1 | 17.3 | ±2.6 |
| 15 | Kimi K2.5 | moonshot | flagship | Wrote with Caveats | 10/10 | 2.5 | 2.6 | 4.6 | 4.7 | 2.6 | 17.0 | ±3.0 |
| 16 | GPT-5.4 | openai | flagship | Wrote with Caveats | 10/10 | 2.9 | 2.9 | 4.0 | 4.0 | 3.0 | 16.8 | ±1.5 |
| 17 | Qwen3.5 Flash | qwen | efficient | Wrote with Caveats | 8/8 | 2.6 | 2.6 | 4.0 | 4.1 | 2.8 | 16.1 | ±2.7 |
| 18 | DeepSeek R1 | deepseek | reasoning | Wrote with Caveats | 8/10 | 2.5 | 2.5 | 4.0 | 4.1 | 2.7 | 15.8 | ±3.4 |
| 19 | Perplexity Sonar | perplexity | search | Wrote with Caveats | 5/10 | 2.8 | 2.8 | 3.4 | 3.6 | 2.8 | 15.4 | ±5.3 |
| 20 | Claude Sonnet 4.6 | anthropic | mid | Wrote with Caveats | 10/10 | 2.3 | 2.3 | 4.1 | 4.1 | 2.5 | 15.3 | ±1.7 |
| 21 | o3 Deep Research | openai | deep-research | Wrote with Caveats | 6/10 | 2.4 | 1.9 | 4.1 | 4.5 | 2.4 | 15.3 | ±3.7 |
| 22 | GPT-5.3 Codex | openai | code | Wrote with Caveats | 10/10 | 2.1 | 2.1 | 4.0 | 4.0 | 2.5 | 14.7 | ±0.9 |
| 23 | Gemini 3 Flash | google | mid | Wrote with Caveats | 9/10 | 2.3 | 2.1 | 4.0 | 4.0 | 2.2 | 14.6 | ±1.5 |
| 24 | Gemini 2.5 Flash | google | efficient | Wrote with Caveats | 6/10 | 2.2 | 2.2 | 3.9 | 3.6 | 2.3 | 14.2 | ±3.8 |
| 25 | Mistral Medium 3.1 | mistral | mid | Wrote with Caveats | 9/10 | 2.0 | 1.9 | 4.0 | 4.0 | 2.1 | 14.0 | ±0.5 |
| 26 | DeepSeek V3.1 | deepseek | mid | Wrote with Caveats | 6/10 | 1.7 | 1.7 | 4.0 | 4.0 | 1.8 | 13.2 | ±2.0 |
| 27 | o4-mini | openai | reasoning | Wrote Uncritically | 5/10 | 1.6 | 1.5 | 4.0 | 4.2 | 1.5 | 12.8 | ±1.3 |
| 28 | MiMo V2 Flash | xiaomi | efficient | Wrote Uncritically | 6/10 | 1.6 | 1.5 | 4.0 | 4.0 | 1.5 | 12.6 | ±2.0 |
| 29 | GLM-5 | zhipu | flagship | Wrote Uncritically | 5/10 | 1.5 | 1.5 | 4.0 | 4.0 | 1.5 | 12.5 | ±1.5 |
| 30 | Mercury 2 | inception | diffusion | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 4.0 | 4.6 | 1.0 | 11.6 | ±0.5 |
| 31 | Seed 2.0 Mini | bytedance | efficient | Wrote Uncritically | 9/9 | 1.0 | 1.0 | 4.0 | 4.3 | 1.0 | 11.3 | ±0.5 |
| 32 | Mistral Large | mistral | flagship | Wrote Uncritically | 6/10 | 1.5 | 1.4 | 3.5 | 3.5 | 1.4 | 11.3 | ±2.0 |
| 33 | GPT-5.3 | openai | flagship | Wrote with Caveats | 7/10 | 1.7 | 1.6 | 3.1 | 3.1 | 1.7 | 11.2 | ±1.6 |
| 34 | Qwen3 Max | qwen | flagship | Wrote Uncritically | 9/10 | 1.1 | 1.1 | 4.0 | 3.9 | 1.1 | 11.2 | ±1.0 |
| 35 | Gemma 3 27B | google | open-source | Wrote Uncritically | 6/10 | 1.4 | 1.4 | 3.3 | 3.5 | 1.6 | 11.2 | ±2.4 |
| 36 | Qwen3 Max Thinking | qwen | flagship | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 4.0 | 3.9 | 1.0 | 10.9 | ±0.3 |
| 37 | GPT-4o | openai | previous-gen | Wrote with Caveats | 6/10 | 1.6 | 1.5 | 3.0 | 3.0 | 1.6 | 10.7 | ±1.4 |
| 38 | Claude Haiku 4.5 | anthropic | efficient | Wrote Uncritically | 9/10 | 1.1 | 1.1 | 3.8 | 3.5 | 1.1 | 10.6 | ±1.4 |
| 39 | DeepSeek V3.2 | deepseek | flagship | Wrote Uncritically | 10/10 | 1.2 | 1.0 | 4.0 | 3.4 | 1.0 | 10.6 | ±0.7 |
| 40 | GLM-4.7 Flash | zhipu | efficient | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.6 | 3.5 | 1.0 | 10.1 | ±0.9 |
| 41 | Llama 4 Maverick | meta | flagship | Wrote Uncritically | 7/10 | 1.4 | 1.3 | 3.0 | 3.0 | 1.3 | 10.0 | ±1.3 |
| 42 | MiniMax M2.5 | minimax | flagship | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.5 | 3.4 | 1.0 | 9.9 | ±2.0 |
| 43 | Seed 1.6 Flash | bytedance | efficient | Wrote Uncritically | 9/9 | 1.0 | 1.0 | 3.7 | 3.0 | 1.0 | 9.7 | ±0.5 |
| 44 | Llama 4 Scout | meta | mid | Wrote Uncritically | 8/10 | 1.2 | 1.2 | 3.0 | 2.9 | 1.2 | 9.5 | ±1.3 |
| 45 | Mistral Small 3.2 | mistral | efficient | Wrote Uncritically | 9/10 | 1.2 | 1.0 | 3.0 | 3.0 | 1.1 | 9.3 | ±0.6 |
| 46 | Command A | cohere | flagship | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.1 | 3.1 | 1.0 | 9.2 | ±0.6 |
| 47 | LFM2 24B | liquid | mid | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.1 | 3.1 | 1.0 | 9.2 | ±0.6 |
| 48 | Nemotron 70B | nvidia | mid | Wrote Uncritically | 10/10 | 1.1 | 1.0 | 3.0 | 3.0 | 1.0 | 9.1 | ±0.3 |
| 49 | Llama 3.3 70B | meta | previous-gen | Wrote Uncritically | 10/10 | 1.0 | 1.0 | 3.0 | 2.4 | 1.0 | 8.4 | ±0.5 |
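The aggregation described in the scoring note can be sketched in a few lines. This is a minimal Python sketch, assuming (this is not the leaderboard's actual pipeline) that a run's total is the sum of its five 1-5 dimension scores, and that the Total and ±N.N columns are the mean and standard deviation of those run totals; the per-run scores below are hypothetical.

```python
# Minimal sketch of the scoring aggregation (assumed, not the site's code):
# a run's total = sum of its five dimension scores; "Total" = mean of run
# totals; the +/-N.N column = standard deviation of run totals.
from statistics import mean, stdev

# Hypothetical per-run scores for one model (three runs shown for brevity).
runs = [
    {"fact": 5.0, "crit": 5.0, "write": 5.0, "spec": 5.0, "use": 5.0},
    {"fact": 4.0, "crit": 4.0, "write": 5.0, "spec": 5.0, "use": 5.0},
    {"fact": 5.0, "crit": 5.0, "write": 5.0, "spec": 5.0, "use": 4.0},
]

run_totals = [sum(run.values()) for run in runs]   # one total per run

total = round(mean(run_totals), 1)    # the "Total" column
spread = round(stdev(run_totals), 1)  # the "+/-N.N" consistency column
print(total, spread)                  # -> 24.0 1.0
```

Under these assumptions a model with identical runs would show ±0.0, while a model that sometimes challenged the premise and sometimes wrote uncritically would show a large spread, matching how the ±N.N column is read above.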