Native Tools: Best-in-Class with Web Search

Each provider's flagship model, called via its API with web search tools enabled (two models run without search as controls). Same naive prompt. Does access to web search fix the problem?

Models Tested: 6
Safe Output: 4/6
Avg Score (Search): 18.1
Avg Score (No Search): 13.7

The insight: 4 of 6 models produced safe output, meaning content you could publish without misleading readers. The other 2 wrote dangerous content: DeepSeek R1, which had no search access, and Mistral Large, which had search enabled but still scored just 9.7/25. Web search helps the good models get better, but doesn't save the weak ones.

Results

Each model was tested 3 times via its native API (Perplexity Sonar Pro via OpenRouter). Scores are means over the 3 runs, out of 25. Models without search serve as controls.

Model | Search Tool | Score | Category | Used Search?
GPT-5 | 🔍 web_search_preview | 24.0 | Challenged Premise | No
Gemini 2.5 Pro | 🔍 google_search grounding | 22.0 | Challenged Premise | Yes
Claude Opus 4.6 | none (control) | 17.0 | Wrote with Caveats | No
Perplexity Sonar Pro | 🔍 always-on search (via OpenRouter) | 16.7 | Wrote with Caveats | Yes
DeepSeek R1 | none (control) | 10.3 | Wrote Uncritically | No
Mistral Large | 🔍 web_search built-in | 9.7 | Wrote Uncritically | No
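
The 4/6 safe-output figure can be re-derived from the table. The following is a minimal Python sketch, assuming that "Challenged Premise" and "Wrote with Caveats" are the categories counted as safe; that mapping is inferred from the summary rather than stated outright. The names `RESULTS` and `SAFE_CATEGORIES` are illustrative only.

```python
# (model, mean score out of 25, category), copied from the results table.
RESULTS = [
    ("GPT-5", 24.0, "Challenged Premise"),
    ("Gemini 2.5 Pro", 22.0, "Challenged Premise"),
    ("Claude Opus 4.6", 17.0, "Wrote with Caveats"),
    ("Perplexity Sonar Pro", 16.7, "Wrote with Caveats"),
    ("DeepSeek R1", 10.3, "Wrote Uncritically"),
    ("Mistral Large", 9.7, "Wrote Uncritically"),
]

# Assumption: these two categories are what the summary treats as "safe output".
SAFE_CATEGORIES = {"Challenged Premise", "Wrote with Caveats"}

safe_models = [name for name, _score, category in RESULTS
               if category in SAFE_CATEGORIES]

print(f"{len(safe_models)}/{len(RESULTS)} safe: {', '.join(safe_models)}")
```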

Search vs No Search

The two models without search access (Claude Opus 4.6 and DeepSeek R1) serve as controls. Both use only their training data — no web lookup.

With Search (4 models)

Average: 18.1/25

2 challenged premise

Without Search (2 models)

Average: 13.7/25

0 challenged premise
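
The two group averages follow directly from the per-model means in the results table. As a quick check, here is a small Python sketch; the `SCORES` structure and variable names are illustrative, and the grouping is by whether a search tool was enabled, not by whether the model actually used it.

```python
from statistics import mean

# (model, search tool enabled?, mean score out of 25), copied from the results table.
SCORES = [
    ("GPT-5", True, 24.0),
    ("Gemini 2.5 Pro", True, 22.0),
    ("Claude Opus 4.6", False, 17.0),
    ("Perplexity Sonar Pro", True, 16.7),
    ("DeepSeek R1", False, 10.3),
    ("Mistral Large", True, 9.7),
]

with_search = [score for _name, enabled, score in SCORES if enabled]
without_search = [score for _name, enabled, score in SCORES if not enabled]

# Round to one decimal place, matching the precision reported on the page.
avg_with = round(mean(with_search), 1)
avg_without = round(mean(without_search), 1)

print(f"With search ({len(with_search)} models): {avg_with}/25")
print(f"Without search ({len(without_search)} models): {avg_without}/25")
```

With only two models in the control group, the 13.7 average is sensitive to either score, so the gap between the groups is best read as indicative rather than conclusive.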

Key Takeaways

  • Search is not automatic quality. Mistral Large had web search enabled but still recommended pea gravel. Having tools and using them well are different things.
  • The best models don't need search. GPT-5 scored 24.0/25 with search enabled but did not use it, which is consistent with its 24.4/25 in the original experiment without search: it caught the problem from training data alone.
  • Google grounding helps Gemini. Gemini 2.5 Pro with Google Search grounding scored 22.0/25 and challenged the premise — a notable improvement over its performance without grounding tools.
  • Tools + model quality together matter. The ideal combination is a capable model with search access. But a good model with no tools beats a weak model with all the tools in the world: Claude Opus 4.6 scored 17.0/25 without search, while Mistral Large scored 9.7/25 with it.
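
The distinction between having a search tool and actually using it can be read straight off the "Used Search?" column. A minimal Python sketch under that reading, where the `RUNS` tuples and the `enabled_but_unused` name are illustrative:

```python
# (model, search tool enabled?, search actually used?), copied from the results table.
RUNS = [
    ("GPT-5", True, False),
    ("Gemini 2.5 Pro", True, True),
    ("Claude Opus 4.6", False, False),
    ("Perplexity Sonar Pro", True, True),
    ("DeepSeek R1", False, False),
    ("Mistral Large", True, False),
]

# Models that had a search tool available but answered without calling it.
enabled_but_unused = [name for name, enabled, used in RUNS
                      if enabled and not used]

print("Search enabled but not used:", ", ".join(enabled_but_unused))
```

Both the top-scoring and the bottom-scoring model in the search group land in this list, which is why tool availability alone does not predict the outcome.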