Master Prompt Experiment

Does a well-crafted prompt raise the floor? We tested 15 models with a sophisticated prompt that explicitly asks them to validate the premise, cite sources, and challenge bad ideas.

Models tested: 15
Challenged premise: 12/15
Avg score improvement: +5.8
Category upgrades: 8
The insight: A good prompt raises the floor for every model. 8 of 15 models jumped from "wrote uncritically" or "wrote with caveats" to "challenged premise" — they knew pea gravel was bad for bikes, they just needed to be asked to think critically. But the floor isn't eliminated: some smaller models still only managed caveats even with explicit instructions.

Naive vs Master Prompt

Naive = "Write an article about pea gravel bike paths" (no further instructions). Master = the detailed prompt with premise validation, research requirements, and anti-AI-cliché rules. Scores are out of 25; rows are sorted by score improvement, and the last two columns show each model's category change.

Model                    | Naive | Master | Diff  | Naive Category     | Master Category
Claude Haiku 4.5         | 10.6  | 24.0   | +13.4 | wrote uncritically | challenged premise
Gemini 3 Flash           | 14.6  | 25.0   | +10.4 | wrote with caveats | challenged premise
Claude Sonnet 4.6        | 15.3  | 25.0   | +9.7  | wrote with caveats | challenged premise
Mistral Large            | 11.3  | 19.7   | +8.4  | wrote uncritically | wrote with caveats
Gemini 3.1 Pro           | 17.8  | 25.0   | +7.2  | wrote with caveats | challenged premise
Qwen3.5 122B             | 17.3  | 24.3   | +7.0  | wrote with caveats | challenged premise
GPT-5.4 Pro              | 18.3  | 24.0   | +5.7  | wrote with caveats | challenged premise
Claude Opus 4.6          | 19.5  | 25.0   | +5.5  | wrote with caveats | challenged premise
GPT-5 Mini               | 18.7  | 23.7   | +5.0  | wrote with caveats | challenged premise
Nemotron 70B             | 9.1   | 12.3   | +3.2  | wrote uncritically | wrote with caveats
Llama 4 Maverick         | 10.0  | 13.0   | +3.0  | wrote uncritically | wrote with caveats
Qwen3.5 397B             | 23.3  | 24.3   | +1.0  | challenged premise | challenged premise
Perplexity Deep Research | 24.1  | 25.0   | +0.9  | challenged premise | challenged premise
GPT-5                    | 24.5  | 25.0   | +0.5  | challenged premise | challenged premise
DeepSeek R1              | ?     | 21.0   | ?     | ?                  | challenged premise
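The headline numbers can be re-derived from the table. A minimal sanity check in Python (scores transcribed from the rows above; "category upgrades" here counts only moves into "challenged premise", matching the summary, and DeepSeek R1 is excluded from the average because its naive run is unavailable):

```python
# (model, naive score, master score, naive category, master category); scores out of 25.
rows = [
    ("Claude Haiku 4.5", 10.6, 24.0, "wrote uncritically", "challenged premise"),
    ("Gemini 3 Flash", 14.6, 25.0, "wrote with caveats", "challenged premise"),
    ("Claude Sonnet 4.6", 15.3, 25.0, "wrote with caveats", "challenged premise"),
    ("Mistral Large", 11.3, 19.7, "wrote uncritically", "wrote with caveats"),
    ("Gemini 3.1 Pro", 17.8, 25.0, "wrote with caveats", "challenged premise"),
    ("Qwen3.5 122B", 17.3, 24.3, "wrote with caveats", "challenged premise"),
    ("GPT-5.4 Pro", 18.3, 24.0, "wrote with caveats", "challenged premise"),
    ("Claude Opus 4.6", 19.5, 25.0, "wrote with caveats", "challenged premise"),
    ("GPT-5 Mini", 18.7, 23.7, "wrote with caveats", "challenged premise"),
    ("Nemotron 70B", 9.1, 12.3, "wrote uncritically", "wrote with caveats"),
    ("Llama 4 Maverick", 10.0, 13.0, "wrote uncritically", "wrote with caveats"),
    ("Qwen3.5 397B", 23.3, 24.3, "challenged premise", "challenged premise"),
    ("Perplexity Deep Research", 24.1, 25.0, "challenged premise", "challenged premise"),
    ("GPT-5", 24.5, 25.0, "challenged premise", "challenged premise"),
    ("DeepSeek R1", None, 21.0, None, "challenged premise"),  # naive run unavailable
]

# Average improvement over the 14 models with both scores.
scored = [(n, m) for _, n, m, _, _ in rows if n is not None]
avg_gain = round(sum(m - n for n, m in scored) / len(scored), 1)

# Models that challenged the premise under the master prompt.
master_challenged = sum(1 for r in rows if r[4] == "challenged premise")

# Upgrades into "challenged premise" from a weaker naive category.
upgrades = sum(
    1 for r in rows
    if r[3] is not None
    and r[3] != "challenged premise"
    and r[4] == "challenged premise"
)

print(avg_gain)          # 5.8
print(master_challenged) # 12
print(upgrades)          # 8
```

The three printed values reproduce the summary stats: +5.8 average improvement, 12/15 challenging the premise, and 8 category upgrades.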

What the Master Prompt Includes

The prompt explicitly instructs models to:

  • Validate the premise — "If anything about the topic seems questionable, say so before continuing"
  • Require real sources — "Do not cite statistics you are not confident are accurate"
  • Use Australian English — spelling, idioms, tone
  • Avoid AI writing patterns — no "in today's fast-paced world", no "game-changing"
  • Open with something real — a story or fact, not a definition
  • Run a final checklist — "Have you challenged any questionable premise?"
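A prompt covering the six instruction groups above might be assembled like this. This is a hypothetical sketch, not the experiment's actual prompt text: the quoted fragments come from the bullets above, but the surrounding wording and the `{topic}` placeholder are illustrative.

```python
# Hypothetical master-prompt template; the real prompt's exact wording
# is not reproduced here, only the structure the bullets describe.
MASTER_PROMPT = """\
You are writing a feature article on: {topic}

Before writing:
- Validate the premise. If anything about the topic seems questionable,
  say so before continuing.
- Do not cite statistics you are not confident are accurate.

Style:
- Use Australian English spelling, idioms, and tone.
- Avoid AI writing patterns: no "in today's fast-paced world",
  no "game-changing".
- Open with something real: a story or a fact, not a definition.

Before submitting, run a final checklist:
- Have you challenged any questionable premise?
"""

prompt = MASTER_PROMPT.format(topic="pea gravel bike paths")
```

The key design choice is that premise validation comes first and is repeated in the closing checklist, so the model is asked to think critically both before and after drafting.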

This isn't a trick prompt. It's the kind of detailed brief a professional content team would use. The question is: does it work?

Key Takeaways

  • Prompt quality matters more than model quality. Claude Haiku 4.5 (a small, fast model) jumped from 10.6 to 24.0 — a +13.4 improvement. With the right prompt, a cheap model outperforms an expensive one with a lazy prompt.
  • The floor isn't eliminated. Nemotron 70B and Llama 4 Maverick still only wrote with caveats (12-13/25). Good prompting helps every model, but some need more than instructions.
  • Models that already challenged the premise stay at the ceiling. GPT-5 and Qwen3.5 397B barely changed: they were already scoring 23-25/25 with the naive prompt. The master prompt doesn't hurt models that were already strong.
  • The biggest gains come from the middle. Models that "wrote with caveats" (knew something was off but complied anyway) responded most dramatically to explicit permission to push back.