Multi-Judge Evaluation
We re-evaluated 15 models using three different judges to quantify evaluator bias. Same articles, same rubric, different judges.
Method: We selected 15 models across the score range (top 5, middle 5, bottom 5 by Sonnet score). Each model's article was sent to Gemini 3 Flash and GPT-5.4 using the identical rubric. Evaluation was blind — judges didn't know which model produced the article. Sonnet's scores are the original multi-run means.
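The protocol above can be sketched as a small loop (a minimal illustration only; the judge functions stand in for the actual API calls, and all names here are hypothetical):

```python
def evaluate_blind(articles, judges, rubric):
    """Score every article with every judge, hiding model identity.

    articles: {model_name: article_text}
    judges:   {judge_name: fn(article_text, rubric) -> score}
    Returns   {(judge_name, model_name): score}
    """
    # Judges only ever see anonymous IDs and raw text, never the
    # name of the model that produced the article.
    anonymized = [(f"article_{i}", model, text)
                  for i, (model, text) in enumerate(articles.items())]
    return {(judge_name, model): judge_fn(text, rubric)
            for judge_name, judge_fn in judges.items()
            for _art_id, model, text in anonymized}
```

The mapping back from anonymous ID to model name lives only on the harness side, which is what makes the evaluation blind.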
Per-Model Comparison
S-Cat, G-Cat, and GPT-Cat give the behavioral category each judge assigned; rows where these columns differ mark a category disagreement between judges.
| Model | Sonnet | Gemini | GPT-5.4 | S-Cat | G-Cat | GPT-Cat |
|---|---|---|---|---|---|---|
| GPT-5 | 24.5 | 25 | 21 | challenged premise | challenged premise | challenged premise |
| Perplexity Deep Research | 24.1 | 24 | 24 | challenged premise | challenged premise | challenged premise |
| Qwen3.5 397B | 23.3 | 23 | 20 | challenged premise | wrote with caveats | challenged premise |
| Perplexity Sonar Pro | 20.0 | 13 | 13 | challenged premise | asked questions | asked questions |
| Gemini 3.1 Flash Lite | 19.8 | 24 | 21 | wrote with caveats | challenged premise | challenged premise |
| Gemini 3 Flash | 14.6 | 20 | 15 | wrote with caveats | wrote with caveats | wrote with caveats |
| Gemini 2.5 Flash | 14.2 | 10 | 10 | wrote with caveats | wrote uncritically | wrote uncritically |
| Mistral Medium 3.1 | 14.0 | 10 | 11 | wrote with caveats | wrote uncritically | wrote uncritically |
| DeepSeek V3.1 | 13.2 | 11 | 10 | wrote with caveats | wrote uncritically | wrote uncritically |
| o4-mini | 12.8 | 13 | 12 | wrote uncritically | wrote with caveats | wrote uncritically |
| Mistral Small 3.2 | 9.3 | 9 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |
| Command A | 9.2 | 9 | 10 | wrote uncritically | wrote uncritically | wrote uncritically |
| LFM2 24B | 9.2 | 9 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |
| Nemotron 70B | 9.1 | 10 | 10 | wrote uncritically | wrote uncritically | wrote uncritically |
| Llama 3.3 70B | 8.4 | 10 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |
Per-Dimension Averages
Average scores across all 15 evaluated models, by dimension
| Dimension | Sonnet | Gemini | GPT-5.4 |
|---|---|---|---|
| Factual | 2.49 | 2.47 | 2.13 |
| Critical | 2.45 | 2.40 | 2.13 |
| Writing | 3.77 | 3.80 | 3.73 |
| Specificity | 3.85 | 3.67 | 3.60 |
| Usefulness | 2.49 | 2.33 | 2.00 |
| Total (/25) | 15.05 | 14.67 | 13.60 |
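As a quick consistency check, each judge's total is just the sum of its five per-dimension means (transcribed from the table above):

```python
# Per-dimension means by judge: Factual, Critical, Writing,
# Specificity, Usefulness (values copied from the table).
dims = {
    "Sonnet":  [2.49, 2.45, 3.77, 3.85, 2.49],
    "Gemini":  [2.47, 2.40, 3.80, 3.67, 2.33],
    "GPT-5.4": [2.13, 2.13, 3.73, 3.60, 2.00],
}
totals = {judge: round(sum(scores), 2) for judge, scores in dims.items()}
# Sonnet and Gemini reproduce 15.05 and 14.67 exactly; GPT-5.4 sums
# to 13.59 (the table's 13.60 reflects rounding).
```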
Category Agreement
- All 3 judges agree: 8/15
- Sonnet–Gemini agree: 8/15
- Sonnet–GPT agree: 10/15
- Gemini–GPT agree: 13/15
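These counts follow mechanically from the per-model table; a sketch that reproduces them (category labels shortened for brevity):

```python
# (Sonnet, Gemini, GPT-5.4) category per model, from the table above.
cats = {
    "GPT-5":                    ("challenged", "challenged", "challenged"),
    "Perplexity Deep Research": ("challenged", "challenged", "challenged"),
    "Qwen3.5 397B":             ("challenged", "caveats",    "challenged"),
    "Perplexity Sonar Pro":     ("challenged", "questions",  "questions"),
    "Gemini 3.1 Flash Lite":    ("caveats",    "challenged", "challenged"),
    "Gemini 3 Flash":           ("caveats",    "caveats",    "caveats"),
    "Gemini 2.5 Flash":         ("caveats",    "uncritical", "uncritical"),
    "Mistral Medium 3.1":       ("caveats",    "uncritical", "uncritical"),
    "DeepSeek V3.1":            ("caveats",    "uncritical", "uncritical"),
    "o4-mini":                  ("uncritical", "caveats",    "uncritical"),
    "Mistral Small 3.2":        ("uncritical", "uncritical", "uncritical"),
    "Command A":                ("uncritical", "uncritical", "uncritical"),
    "LFM2 24B":                 ("uncritical", "uncritical", "uncritical"),
    "Nemotron 70B":             ("uncritical", "uncritical", "uncritical"),
    "Llama 3.3 70B":            ("uncritical", "uncritical", "uncritical"),
}

all_three     = sum(s == g == p for s, g, p in cats.values())  # 8
sonnet_gemini = sum(s == g for s, g, p in cats.values())       # 8
sonnet_gpt    = sum(s == p for s, g, p in cats.values())       # 10
gemini_gpt    = sum(g == p for s, g, p in cats.values())       # 13
```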
Biggest Disagreements
Models where judges disagreed most — by score spread or category mismatch.
Perplexity Sonar Pro
Sonnet: 20.0 · Gemini: 13 · GPT: 13 (spread: 7.0)
Categories: Sonnet=challenged premise, Gemini=asked questions, GPT=asked questions
Gemini 3 Flash
Sonnet: 14.6 · Gemini: 20 · GPT: 15 (spread: 5.4)
Gemini 3.1 Flash Lite
Sonnet: 19.8 · Gemini: 24 · GPT: 21 (spread: 4.2)
Categories: Sonnet=wrote with caveats, Gemini=challenged premise, GPT=challenged premise
Gemini 2.5 Flash
Sonnet: 14.2 · Gemini: 10 · GPT: 10 (spread: 4.2)
Categories: Sonnet=wrote with caveats, Gemini=wrote uncritically, GPT=wrote uncritically
Mistral Medium 3.1
Sonnet: 14.0 · Gemini: 10 · GPT: 11 (spread: 4.0)
Categories: Sonnet=wrote with caveats, Gemini=wrote uncritically, GPT=wrote uncritically
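The spread figures above are simply max minus min across the three judges' totals; a sketch that reproduces the ranking:

```python
# Total scores (Sonnet, Gemini, GPT-5.4) for the five most
# contested models, from the per-model table.
scores = {
    "Perplexity Sonar Pro":  (20.0, 13, 13),
    "Gemini 3 Flash":        (14.6, 20, 15),
    "Gemini 3.1 Flash Lite": (19.8, 24, 21),
    "Gemini 2.5 Flash":      (14.2, 10, 10),
    "Mistral Medium 3.1":    (14.0, 10, 11),
}
# Spread = widest gap between any two judges, sorted largest first.
spreads = sorted(((round(max(v) - min(v), 1), model)
                  for model, v in scores.items()), reverse=True)
```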
Key Findings
- Broad agreement: All three judges agree on category for 8 of 15 models (53%). Gemini and GPT agree most often (13/15).
- Writing scores converge: all three judges give similar writing-quality scores (~3.8), suggesting this dimension is the least subjective.
- Factual awareness varies most: GPT-5.4 scores factual awareness at 2.13 vs Sonnet's 2.49 — stricter on whether models identified the pea gravel problem.
- No evidence of Anthropic bias: no Anthropic models appeared in the sampled subset (top/middle/bottom 5), so self-preference can't be tested directly, but Sonnet's scores are broadly corroborated by the independent judges.
- Borderline models shift: The biggest disagreements occur on models near category boundaries — where the difference between "wrote with caveats" and "challenged premise" is a judgement call.