Native Tools: Best-in-Class with Web Search
Each provider's flagship model, called via its native API with web search tools enabled where available. Same naive prompt. Does access to web search fix the problem?
The insight: 4 of 6 models produced safe output, meaning content you could publish without misleading readers. The other 2 still wrote dangerous content: DeepSeek R1, a no-search control, and Mistral Large, which had search enabled but never used it and scored just 9.7/25. Web search helps the good models get better, but doesn't save the weak ones.
Results
Each model was tested 3 times via its native API (Perplexity Sonar Pro was reached through OpenRouter). Scores are means across the three runs, out of 25. Models without search serve as controls.
| Model | Search | Tool | Score | Category | Used Search? |
|---|---|---|---|---|---|
| GPT-5 | 🔍 | web_search_preview | 24.0 | Challenged Premise | No |
| Gemini 2.5 Pro | 🔍 | google_search grounding | 22.0 | Challenged Premise | Yes |
| Claude Opus 4.6 | — | none (control) | 17.0 | Wrote with Caveats | No |
| Perplexity Sonar Pro | 🔍 | always-on search (via OpenRouter) | 16.7 | Wrote with Caveats | Yes |
| DeepSeek R1 | — | none (control) | 10.3 | Wrote Uncritically | No |
| Mistral Large | 🔍 | web_search built-in | 9.7 | Wrote Uncritically | No |
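The headline counts can be re-derived from the table. The following is a minimal sketch, not the experiment's own tooling: the tuple layout and the `SAFE_CATEGORIES` set are assumptions made for illustration, with "safe" taken to mean any category other than "Wrote Uncritically".

```python
from collections import Counter

# (model, search_enabled, used_search, mean_score, category), copied from the table above.
RESULTS = [
    ("GPT-5", True, False, 24.0, "Challenged Premise"),
    ("Gemini 2.5 Pro", True, True, 22.0, "Challenged Premise"),
    ("Claude Opus 4.6", False, False, 17.0, "Wrote with Caveats"),
    ("Perplexity Sonar Pro", True, True, 16.7, "Wrote with Caveats"),
    ("DeepSeek R1", False, False, 10.3, "Wrote Uncritically"),
    ("Mistral Large", True, False, 9.7, "Wrote Uncritically"),
]

# Assumption: these two categories are what the article counts as "safe output".
SAFE_CATEGORIES = {"Challenged Premise", "Wrote with Caveats"}

# How many models landed in each category.
category_counts = Counter(category for *_, category in RESULTS)

# Models whose output falls in a safe category.
safe_models = [model for model, *_, category in RESULTS if category in SAFE_CATEGORIES]

# Models that had a search tool available but never invoked it.
enabled_but_unused = [
    model for model, enabled, used, *_ in RESULTS if enabled and not used
]

print(f"{len(safe_models)} of {len(RESULTS)} models produced safe output")
print("Search enabled but unused:", ", ".join(enabled_but_unused))
```

Separating "search enabled" from "search used" is the point of the last list: GPT-5 and Mistral Large both had a tool available and skipped it, yet they sit at opposite ends of the score range.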
Search vs No Search
The two models without search access (Claude Opus 4.6 and DeepSeek R1) serve as controls. Both use only their training data — no web lookup.
| Group | Models | Average | Challenged Premise |
|---|---|---|---|
| With Search | 4 | 18.1/25 | 2 |
| Without Search | 2 | 13.7/25 | 0 |
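Both group averages follow from the per-model means in the results table. Here is a small sketch of the arithmetic; the `summarize` helper is hypothetical, and half-up rounding is an assumption chosen because the no-search mean works out to exactly 13.65 and is reported as 13.7.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-model mean scores and categories, copied from the results table.
WITH_SEARCH = [
    (Decimal("24.0"), "Challenged Premise"),   # GPT-5
    (Decimal("22.0"), "Challenged Premise"),   # Gemini 2.5 Pro
    (Decimal("16.7"), "Wrote with Caveats"),   # Perplexity Sonar Pro
    (Decimal("9.7"), "Wrote Uncritically"),    # Mistral Large
]
WITHOUT_SEARCH = [
    (Decimal("17.0"), "Wrote with Caveats"),   # Claude Opus 4.6
    (Decimal("10.3"), "Wrote Uncritically"),   # DeepSeek R1
]


def summarize(group):
    """Return (average rounded to one decimal, number of premise challenges)."""
    total = sum(score for score, _ in group)
    # Assumption: half-up rounding, which reproduces the reported 13.7 from 13.65.
    average = (total / len(group)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    challenged = sum(1 for _, category in group if category == "Challenged Premise")
    return average, challenged


for label, group in (("With search", WITH_SEARCH), ("Without search", WITHOUT_SEARCH)):
    average, challenged = summarize(group)
    print(f"{label}: {average}/25 across {len(group)} models, {challenged} challenged premise")
```

`Decimal` is used instead of floats so that the 13.65 midpoint rounds predictably. Because the inputs are themselves rounded means, the group averages are only as precise as the table.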
Key Takeaways
- Search does not guarantee quality. Mistral Large had web search enabled but never invoked it, and still recommended pea gravel. Having tools and using them well are different things.
- The best models don't need search. GPT-5 scored 24.0/25 with search enabled yet never invoked it, so it caught the problem from training data alone. That is consistent with its 24.4/25 in the original experiment without search.
- Google grounding helps Gemini. Gemini 2.5 Pro with Google Search grounding scored 22.0/25 and challenged the premise — a notable improvement over its performance without grounding tools.
- Tools + model quality together matter. The ideal combination is a capable model with search access. But a good model with no tools beats a weak model with all the tools in the world: Claude Opus 4.6 scored 17.0/25 without search, while Mistral Large scored 9.7/25 with it.