Multi-Judge Evaluation

We re-evaluated 15 models using three different judges to quantify evaluator bias. Same articles, same rubric, different judges.

15 models re-evaluated · 3 judge models · 8/15 with all three judges agreeing on category
Method: We selected 15 models across the score range (top 5, middle 5, bottom 5 by Sonnet score). Each model's article was sent to Gemini 3 Flash and GPT-5.4 using the identical rubric. Evaluation was blind — judges didn't know which model produced the article. Sonnet's scores are the original multi-run means.
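
For concreteness, here is a minimal sketch of that re-evaluation loop. It is not the actual harness: call_judge() and parse_scores() are hypothetical placeholders for the real judge-API wrapper and rubric-score parser, and Sonnet is not re-run because its multi-run means are reused.

```python
# Minimal sketch of the blind re-evaluation loop (not the actual harness).
# call_judge() and parse_scores() are hypothetical placeholders for the real
# judge-API wrapper and rubric-score parser.

NEW_JUDGES = ["gemini-3-flash", "gpt-5.4"]  # Sonnet scores are reused from the original runs

def call_judge(judge: str, prompt: str) -> str:
    """Placeholder: send the prompt to the given judge model and return its reply."""
    raise NotImplementedError("wire this up to the relevant model API")

def parse_scores(reply: str) -> tuple[dict[str, float], str]:
    """Placeholder: extract the five dimension scores and the category label from a reply."""
    raise NotImplementedError

def re_evaluate(article_text: str, rubric: str) -> dict:
    """Score one article with each additional judge, blind to which model wrote it."""
    results = {}
    for judge in NEW_JUDGES:
        # The prompt carries only the rubric and the article itself, never the
        # name of the model that produced it, so the judge evaluates blind.
        prompt = f"{rubric}\n\nArticle to evaluate:\n\n{article_text}"
        scores, category = parse_scores(call_judge(judge, prompt))
        results[judge] = {
            "dimensions": scores,            # factual, critical, writing, specificity, usefulness
            "total": sum(scores.values()),   # out of 25
            "category": category,            # e.g. "challenged premise"
        }
    return results
```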

Per-Model Comparison

Rows marked with * have a category disagreement between the judges.

| Model | Sonnet (/25) | Gemini (/25) | GPT-5.4 (/25) | Sonnet category | Gemini category | GPT-5.4 category |
|---|---|---|---|---|---|---|
| GPT-5 | 24.5 | 25 | 21 | challenged premise | challenged premise | challenged premise |
| Perplexity Deep Research | 24.1 | 24 | 24 | challenged premise | challenged premise | challenged premise |
| Qwen3.5 397B * | 23.3 | 23 | 20 | challenged premise | wrote with caveats | challenged premise |
| Perplexity Sonar Pro * | 20.0 | 13 | 13 | challenged premise | asked questions | asked questions |
| Gemini 3.1 Flash Lite * | 19.8 | 24 | 21 | wrote with caveats | challenged premise | challenged premise |
| Gemini 3 Flash | 14.6 | 20 | 15 | wrote with caveats | wrote with caveats | wrote with caveats |
| Gemini 2.5 Flash * | 14.2 | 10 | 10 | wrote with caveats | wrote uncritically | wrote uncritically |
| Mistral Medium 3.1 * | 14.0 | 10 | 11 | wrote with caveats | wrote uncritically | wrote uncritically |
| DeepSeek V3.1 * | 13.2 | 11 | 10 | wrote with caveats | wrote uncritically | wrote uncritically |
| o4-mini * | 12.8 | 13 | 12 | wrote uncritically | wrote with caveats | wrote uncritically |
| Mistral Small 3.2 | 9.3 | 9 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |
| Command A | 9.2 | 9 | 10 | wrote uncritically | wrote uncritically | wrote uncritically |
| LFM2 24B | 9.2 | 9 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |
| Nemotron 70B | 9.1 | 10 | 10 | wrote uncritically | wrote uncritically | wrote uncritically |
| Llama 3.3 70B | 8.4 | 10 | 9 | wrote uncritically | wrote uncritically | wrote uncritically |

Per-Dimension Averages

Average scores across all 15 evaluated models, by dimension

| Dimension | Sonnet | Gemini | GPT-5.4 |
|---|---|---|---|
| Factual | 2.49 | 2.47 | 2.13 |
| Critical | 2.45 | 2.40 | 2.13 |
| Writing | 3.77 | 3.80 | 3.73 |
| Specificity | 3.85 | 3.67 | 3.60 |
| Usefulness | 2.49 | 2.33 | 2.00 |
| Total (/25) | 15.05 | 14.67 | 13.60 |

Category Agreement

  • All 3 judges agree: 8/15
  • Sonnet–Gemini agree: 8/15
  • Sonnet–GPT agree: 10/15
  • Gemini–GPT agree: 13/15
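
These counts are plain pairwise comparisons over the category columns of the per-model table above. The self-contained sketch below reproduces them, with category labels abbreviated for brevity.

```python
from itertools import combinations

# Category labels from the per-model table, as (Sonnet, Gemini, GPT-5.4).
# Abbreviations: challenged = challenged premise, caveats = wrote with caveats,
# questions = asked questions, uncritical = wrote uncritically.
CATEGORIES = {
    "GPT-5":                    ("challenged", "challenged", "challenged"),
    "Perplexity Deep Research": ("challenged", "challenged", "challenged"),
    "Qwen3.5 397B":             ("challenged", "caveats",    "challenged"),
    "Perplexity Sonar Pro":     ("challenged", "questions",  "questions"),
    "Gemini 3.1 Flash Lite":    ("caveats",    "challenged", "challenged"),
    "Gemini 3 Flash":           ("caveats",    "caveats",    "caveats"),
    "Gemini 2.5 Flash":         ("caveats",    "uncritical", "uncritical"),
    "Mistral Medium 3.1":       ("caveats",    "uncritical", "uncritical"),
    "DeepSeek V3.1":            ("caveats",    "uncritical", "uncritical"),
    "o4-mini":                  ("uncritical", "caveats",    "uncritical"),
    "Mistral Small 3.2":        ("uncritical", "uncritical", "uncritical"),
    "Command A":                ("uncritical", "uncritical", "uncritical"),
    "LFM2 24B":                 ("uncritical", "uncritical", "uncritical"),
    "Nemotron 70B":             ("uncritical", "uncritical", "uncritical"),
    "Llama 3.3 70B":            ("uncritical", "uncritical", "uncritical"),
}
JUDGES = ("Sonnet", "Gemini", "GPT-5.4")
n = len(CATEGORIES)

# Models where all three judges assign the same category.
all_agree = sum(len(set(cats)) == 1 for cats in CATEGORIES.values())
print(f"All 3 judges agree: {all_agree}/{n}")                     # 8/15

# Pairwise agreement between judges.
for i, j in combinations(range(len(JUDGES)), 2):
    pair_agree = sum(cats[i] == cats[j] for cats in CATEGORIES.values())
    print(f"{JUDGES[i]}-{JUDGES[j]} agree: {pair_agree}/{n}")
# Sonnet-Gemini 8/15, Sonnet-GPT-5.4 10/15, Gemini-GPT-5.4 13/15
```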

Biggest Disagreements

Models where judges disagreed most, ranked by score spread (highest judge total minus lowest) or by category mismatch.

  • Perplexity Sonar Pro: Sonnet 20.0 · Gemini 13 · GPT-5.4 13 (spread 7.0). Categories: Sonnet challenged premise; Gemini and GPT-5.4 asked questions.
  • Gemini 3 Flash: Sonnet 14.6 · Gemini 20 · GPT-5.4 15 (spread 5.4).
  • Gemini 3.1 Flash Lite: Sonnet 19.8 · Gemini 24 · GPT-5.4 21 (spread 4.2). Categories: Sonnet wrote with caveats; Gemini and GPT-5.4 challenged premise.
  • Gemini 2.5 Flash: Sonnet 14.2 · Gemini 10 · GPT-5.4 10 (spread 4.2). Categories: Sonnet wrote with caveats; Gemini and GPT-5.4 wrote uncritically.
  • Mistral Medium 3.1: Sonnet 14.0 · Gemini 10 · GPT-5.4 11 (spread 4.0). Categories: Sonnet wrote with caveats; Gemini and GPT-5.4 wrote uncritically.
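
The spread ranking can be reproduced directly from the totals in the per-model table; the sketch below does exactly that (category mismatches are the rows flagged by the previous sketch).

```python
# Total scores from the per-model table, as (Sonnet, Gemini, GPT-5.4).
TOTALS = {
    "GPT-5":                    (24.5, 25, 21),
    "Perplexity Deep Research": (24.1, 24, 24),
    "Qwen3.5 397B":             (23.3, 23, 20),
    "Perplexity Sonar Pro":     (20.0, 13, 13),
    "Gemini 3.1 Flash Lite":    (19.8, 24, 21),
    "Gemini 3 Flash":           (14.6, 20, 15),
    "Gemini 2.5 Flash":         (14.2, 10, 10),
    "Mistral Medium 3.1":       (14.0, 10, 11),
    "DeepSeek V3.1":            (13.2, 11, 10),
    "o4-mini":                  (12.8, 13, 12),
    "Mistral Small 3.2":        (9.3, 9, 9),
    "Command A":                (9.2, 9, 10),
    "LFM2 24B":                 (9.2, 9, 9),
    "Nemotron 70B":             (9.1, 10, 10),
    "Llama 3.3 70B":            (8.4, 10, 9),
}

# Spread = highest judge total minus lowest judge total for each model.
spreads = {model: max(s) - min(s) for model, s in TOTALS.items()}

# Rank models by spread, largest disagreement first.
for model, spread in sorted(spreads.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: spread {spread:.1f}")
# Top of the list: Perplexity Sonar Pro (7.0), Gemini 3 Flash (5.4),
# Gemini 3.1 Flash Lite (4.2), Gemini 2.5 Flash (4.2),
# then GPT-5 and Mistral Medium 3.1 (both 4.0).
```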

Key Findings

  • Broad agreement: All three judges agree on category for 8 of 15 models (53%). Gemini and GPT agree most often (13/15).
  • Writing quality scores are consistent across judges: all three give similar writing scores (~3.8 on average), suggesting this is the least subjective dimension.
  • Factual awareness varies most: GPT-5.4 scores factual awareness at 2.13 vs Sonnet's 2.49 — stricter on whether models identified the pea gravel problem.
  • No evidence of Anthropic bias: the sampled subset (top/middle/bottom 5) happened to include no Anthropic models, so self-preference could not be tested directly, but Sonnet's scores are broadly corroborated by the independent judges.
  • Borderline models shift: The biggest disagreements occur on models near category boundaries — where the difference between "wrote with caveats" and "challenged premise" is a judgement call.