Native Tools: Best-in-Class with Web Search

Each provider's flagship model, called via its API with the same naive prompt. Four of the six had web search tools enabled; the other two ran without search as controls. Does access to web search fix the problem?

Models Tested: 6

Safe Output: 4/6

Avg Score (Search): 18.1

Avg Score (No Search): 13.7

The insight: 4 of 6 models produced safe output, meaning content you could publish without misleading readers. The other 2 wrote uncritically: DeepSeek R1, which had no search access, and Mistral Large, which had search enabled but did not use it and scored just 9.7/25. Web search helps the good models get better, but doesn't save the weak ones.

Results

Each model was tested 3 times via its native API (Perplexity Sonar Pro via OpenRouter). Scores are means across the three runs, out of 25. The two models without search serve as controls.

Model	Search	Tool	Score	Category	Used Search?
GPT-5	🔍	web_search_preview	24.0	Challenged Premise	No
Gemini 2.5 Pro	🔍	google_search grounding	22.0	Challenged Premise	Yes
Claude Opus 4.6	—	none (control)	17.0	Wrote with Caveats	No
Perplexity Sonar Pro	🔍	always-on search (via OpenRouter)	16.7	Wrote with Caveats	Yes
DeepSeek R1	—	none (control)	10.3	Wrote Uncritically	No
Mistral Large	🔍	web_search built-in	9.7	Wrote Uncritically	No
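
The "4/6 safe output" figure can be re-derived from the Category column. The sketch below is only a tally of the table above, not the experiment's scoring code; it assumes that "safe" means any category other than "Wrote Uncritically", which is how the summary reads but is not stated as a formal rule. The names `RESULTS` and `SAFE_CATEGORIES` are made up for this illustration.

```python
from collections import Counter

# (model, category) pairs copied from the results table above.
RESULTS = [
    ("GPT-5", "Challenged Premise"),
    ("Gemini 2.5 Pro", "Challenged Premise"),
    ("Claude Opus 4.6", "Wrote with Caveats"),
    ("Perplexity Sonar Pro", "Wrote with Caveats"),
    ("DeepSeek R1", "Wrote Uncritically"),
    ("Mistral Large", "Wrote Uncritically"),
]

# Assumption: these two categories count as "safe output".
SAFE_CATEGORIES = {"Challenged Premise", "Wrote with Caveats"}

# How many models landed in each category.
category_counts = Counter(category for _, category in RESULTS)

# Models whose output falls in a safe category.
safe_models = [model for model, category in RESULTS if category in SAFE_CATEGORIES]

print(dict(category_counts))
print(f"Safe output: {len(safe_models)}/{len(RESULTS)}")
```

Under that assumption the tally matches the headline: two models per category, four of six safe.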

Search vs No Search

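The two models without search access (Claude Opus 4.6 and DeepSeek R1) serve as controls. Both use only their training data, with no web lookup. Because the two groups contain different models, the gap between the averages reflects model differences as well as search access.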

With Search (4 models)

Average: 18.1/25

2 challenged the premise

Without Search (2 models)

Average: 13.7/25

0 challenged the premise
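
The two group averages follow directly from the Score column of the results table. Here is a minimal sketch of that arithmetic, assuming the averages are plain means of the per-model scores rounded half-up to one decimal place (the no-search mean is exactly 13.65, so the rounding rule matters). `ROWS` and `group_average` are names invented for this example.

```python
from decimal import Decimal, ROUND_HALF_UP

# (model, search_enabled, mean_score) copied from the results table.
ROWS = [
    ("GPT-5", True, Decimal("24.0")),
    ("Gemini 2.5 Pro", True, Decimal("22.0")),
    ("Claude Opus 4.6", False, Decimal("17.0")),
    ("Perplexity Sonar Pro", True, Decimal("16.7")),
    ("DeepSeek R1", False, Decimal("10.3")),
    ("Mistral Large", True, Decimal("9.7")),
]


def group_average(search_enabled: bool) -> Decimal:
    """Mean score of one group, rounded half-up to one decimal (assumed rule)."""
    scores = [score for _, enabled, score in ROWS if enabled == search_enabled]
    mean = sum(scores, Decimal(0)) / len(scores)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


with_search = group_average(True)      # (24.0 + 22.0 + 16.7 + 9.7) / 4
without_search = group_average(False)  # (17.0 + 10.3) / 2

print(f"With search:    {with_search}/25")
print(f"Without search: {without_search}/25")
```

`Decimal` is used instead of floats so that the 13.65 midpoint rounds predictably rather than depending on binary floating-point representation.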

Key Takeaways

Search access does not guarantee quality. Mistral Large had web search enabled but did not use it, and still recommended pea gravel. Having tools and using them well are different things.
The best models don't need search. GPT-5 scored 24.0/25 with search enabled but did not use it, so its answer came from training data alone. That matches the original experiment, where it scored 24.4/25 without search.
Google grounding helps Gemini. Gemini 2.5 Pro with Google Search grounding scored 22.0/25 and challenged the premise — a notable improvement over its performance without grounding tools.
Tools + model quality together matter. The ideal combination is a capable model with search access. But a good model with no tools beats a weak model that has them: Claude Opus 4.6 scored 17.0/25 without search, while Mistral Large scored 9.7/25 with search enabled.