Multi-Judge Evaluation

We re-evaluated 15 models using three different judges to quantify evaluator bias. Same articles, same rubric, different judges.

15 models re-evaluated · 3 judge models · 8/15 with all three judges agreeing on category
Method: We selected 15 models across the score range (top 5, middle 5, bottom 5 by Sonnet score). Each model's article was sent to Gemini 3 Flash and GPT-5.4 using the identical rubric. Evaluation was blind — judges didn't know which model produced the article. Sonnet's scores are the original multi-run means.
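
For concreteness, here is a minimal sketch of that re-evaluation loop. It is not the actual harness: call_judge() and parse_scores() are hypothetical placeholders for the real judge-API wrapper and rubric-score parser, and Sonnet is not re-run because its multi-run means are reused.

```python
# Minimal sketch of the blind re-evaluation loop (not the actual harness).
# call_judge() and parse_scores() are hypothetical placeholders for the real
# judge-API wrapper and rubric-score parser.

NEW_JUDGES = ["gemini-3-flash", "gpt-5.4"]  # Sonnet scores are reused from the original runs

def call_judge(judge: str, prompt: str) -> str:
    """Placeholder: send the prompt to the given judge model and return its reply."""
    raise NotImplementedError("wire this up to the relevant model API")

def parse_scores(reply: str) -> tuple[dict[str, float], str]:
    """Placeholder: extract the five dimension scores and the category label from a reply."""
    raise NotImplementedError

def re_evaluate(article_text: str, rubric: str) -> dict:
    """Score one article with each additional judge, blind to which model wrote it."""
    results = {}
    for judge in NEW_JUDGES:
        # The prompt carries only the rubric and the article itself, never the
        # name of the model that produced it, so the judge evaluates blind.
        prompt = f"{rubric}\n\nArticle to evaluate:\n\n{article_text}"
        scores, category = parse_scores(call_judge(judge, prompt))
        results[judge] = {
            "dimensions": scores,            # factual, critical, writing, specificity, usefulness
            "total": sum(scores.values()),   # out of 25
            "category": category,            # e.g. "challenged premise"
        }
    return results
```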

Per-Model Comparison

Rows marked with * have a category disagreement between the judges.

| Model | Sonnet (/25) | Gemini (/25) | GPT-5.4 (/25) | Sonnet category | Gemini category | GPT-5.4 category |
|---|---|---|---|---|---|---|
| GPT-5 | 24.5 | 25 | 21 | challenged premise | challenged premise | challenged premise |
| Perplexity Deep Research | 24.1 | 24 | 24 | challenged premise | challenged premise | challenged premise |
| Qwen3.5 397B * | 23.3 | 23 | 20 | challenged premise | wrote with caveats | challenged premise |
| Perplexity Sonar Pro * | 20.0 | 13 | 13 | challenged premise | asked questions | asked questions |
| Gemini 3.1 Flash Lite * | 19.8 | 24 | 21 | wrote with caveats | challenged premise | challenged premise |
| Gemini 3 Flash | 14.6 | 20 | 15 | wrote with caveats | wrote with caveats | wrote with caveats |
| Gemini 2.5 Flash * | 14.2 | 10 | 10 | wrote with caveats | wrote uncritically | wrote uncritically |
| Mistral Medium 3.1 * | 14.0 | 10 | 11 | wrote with caveats | wrote uncritically | wrote uncritically |
| DeepSeek V3.1 * | 13.2 | 11 | 10 | wrote with caveats | wrote uncritically | wrote uncritically |
| o4-mini * | 12.8 | 13 | 12 | wrote uncritically | wrote with caveats | wrote uncritically |
| Mistral Small 3.2 | 9.3 | 9 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |
| Command A | 9.2 | 9 | 10 | wrote uncritically | wrote uncritically | wrote uncritically |
| LFM2 24B | 9.2 | 9 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |
| Nemotron 70B | 9.1 | 10 | 10 | wrote uncritically | wrote uncritically | wrote uncritically |
| Llama 3.3 70B | 8.4 | 10 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |

Per-Dimension Averages

Average scores across all 15 evaluated models, by dimension

| Dimension | Sonnet | Gemini | GPT-5.4 |
|---|---|---|---|
| Factual | 2.49 | 2.47 | 2.13 |
| Critical | 2.45 | 2.40 | 2.13 |
| Writing | 3.77 | 3.80 | 3.73 |
| Specificity | 3.85 | 3.67 | 3.60 |
| Usefulness | 2.49 | 2.33 | 2.00 |
| Total (/25) | 15.05 | 14.67 | 13.60 |

Category Agreement

  • All 3 judges agree: 8/15
  • Sonnet–Gemini agree: 8/15
  • Sonnet–GPT agree: 10/15
  • Gemini–GPT agree: 13/15
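
These counts are plain pairwise comparisons over the category columns of the per-model table above. The self-contained sketch below reproduces them, with category labels abbreviated for brevity.

```python
from itertools import combinations

# Category labels from the per-model table, as (Sonnet, Gemini, GPT-5.4).
# Abbreviations: challenged = challenged premise, caveats = wrote with caveats,
# questions = asked questions, uncritical = wrote uncritically.
CATEGORIES = {
    "GPT-5":                    ("challenged", "challenged", "challenged"),
    "Perplexity Deep Research": ("challenged", "challenged", "challenged"),
    "Qwen3.5 397B":             ("challenged", "caveats",    "challenged"),
    "Perplexity Sonar Pro":     ("challenged", "questions",  "questions"),
    "Gemini 3.1 Flash Lite":    ("caveats",    "challenged", "challenged"),
    "Gemini 3 Flash":           ("caveats",    "caveats",    "caveats"),
    "Gemini 2.5 Flash":         ("caveats",    "uncritical", "uncritical"),
    "Mistral Medium 3.1":       ("caveats",    "uncritical", "uncritical"),
    "DeepSeek V3.1":            ("caveats",    "uncritical", "uncritical"),
    "o4-mini":                  ("uncritical", "caveats",    "uncritical"),
    "Mistral Small 3.2":        ("uncritical", "uncritical", "uncritical"),
    "Command A":                ("uncritical", "uncritical", "uncritical"),
    "LFM2 24B":                 ("uncritical", "uncritical", "uncritical"),
    "Nemotron 70B":             ("uncritical", "uncritical", "uncritical"),
    "Llama 3.3 70B":            ("uncritical", "uncritical", "uncritical"),
}
JUDGES = ("Sonnet", "Gemini", "GPT-5.4")
n = len(CATEGORIES)

# Models where all three judges assign the same category.
all_agree = sum(len(set(cats)) == 1 for cats in CATEGORIES.values())
print(f"All 3 judges agree: {all_agree}/{n}")                     # 8/15

# Pairwise agreement between judges.
for i, j in combinations(range(len(JUDGES)), 2):
    pair_agree = sum(cats[i] == cats[j] for cats in CATEGORIES.values())
    print(f"{JUDGES[i]}-{JUDGES[j]} agree: {pair_agree}/{n}")
# Sonnet-Gemini 8/15, Sonnet-GPT-5.4 10/15, Gemini-GPT-5.4 13/15
```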

Biggest Disagreements

Models where judges disagreed most, ranked by score spread (highest judge total minus lowest) or by category mismatch.

  • Perplexity Sonar Pro: Sonnet 20.0 · Gemini 13 · GPT-5.4 13 (spread 7.0). Categories: Sonnet challenged premise; Gemini and GPT-5.4 asked questions.
  • Gemini 3 Flash: Sonnet 14.6 · Gemini 20 · GPT-5.4 15 (spread 5.4).
  • Gemini 3.1 Flash Lite: Sonnet 19.8 · Gemini 24 · GPT-5.4 21 (spread 4.2). Categories: Sonnet wrote with caveats; Gemini and GPT-5.4 challenged premise.
  • Gemini 2.5 Flash: Sonnet 14.2 · Gemini 10 · GPT-5.4 10 (spread 4.2). Categories: Sonnet wrote with caveats; Gemini and GPT-5.4 wrote uncritically.
  • Mistral Medium 3.1: Sonnet 14.0 · Gemini 10 · GPT-5.4 11 (spread 4.0). Categories: Sonnet wrote with caveats; Gemini and GPT-5.4 wrote uncritically.
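
The spread ranking can be reproduced directly from the totals in the per-model table; the sketch below does exactly that (category mismatches are the rows flagged by the previous sketch).

```python
# Total scores from the per-model table, as (Sonnet, Gemini, GPT-5.4).
TOTALS = {
    "GPT-5":                    (24.5, 25, 21),
    "Perplexity Deep Research": (24.1, 24, 24),
    "Qwen3.5 397B":             (23.3, 23, 20),
    "Perplexity Sonar Pro":     (20.0, 13, 13),
    "Gemini 3.1 Flash Lite":    (19.8, 24, 21),
    "Gemini 3 Flash":           (14.6, 20, 15),
    "Gemini 2.5 Flash":         (14.2, 10, 10),
    "Mistral Medium 3.1":       (14.0, 10, 11),
    "DeepSeek V3.1":            (13.2, 11, 10),
    "o4-mini":                  (12.8, 13, 12),
    "Mistral Small 3.2":        (9.3, 9, 9),
    "Command A":                (9.2, 9, 10),
    "LFM2 24B":                 (9.2, 9, 9),
    "Nemotron 70B":             (9.1, 10, 10),
    "Llama 3.3 70B":            (8.4, 10, 9),
}

# Spread = highest judge total minus lowest judge total for each model.
spreads = {model: max(s) - min(s) for model, s in TOTALS.items()}

# Rank models by spread, largest disagreement first.
for model, spread in sorted(spreads.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: spread {spread:.1f}")
# Top of the list: Perplexity Sonar Pro (7.0), Gemini 3 Flash (5.4),
# Gemini 3.1 Flash Lite (4.2), Gemini 2.5 Flash (4.2),
# then GPT-5 and Mistral Medium 3.1 (both 4.0).
```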

Key Findings

  • Broad agreement: All three judges agree on category for 8 of 15 models (53%). Gemini and GPT agree most often (13/15).
  • Writing quality scores are consistent across judges: all three give similar writing scores (~3.8 on average), suggesting this is the least subjective dimension.
  • Factual awareness varies most: GPT-5.4 scores factual awareness at 2.13 vs Sonnet's 2.49 — stricter on whether models identified the pea gravel problem.
  • No evidence of Anthropic bias: the sampled subset (top/middle/bottom 5) happened to include no Anthropic models, so self-preference could not be tested directly, but Sonnet's scores are broadly corroborated by the independent judges.
  • Borderline models shift: The biggest disagreements occur on models near category boundaries — where the difference between "wrote with caveats" and "challenged premise" is a judgement call.