GPT-4o vs Claude 3.5 vs Gemini 1.5: Which Model Wins on Evaluation Quality?
By the Confident AI Research Team · 14 min read
Public leaderboards optimize for the metrics that make model providers look good: MMLU accuracy, HumanEval pass rate, reasoning benchmarks with known answers. What they do not measure is how the model behaves when deployed in a production application with a real system prompt, a real retrieval pipeline, and real users asking ambiguous questions. That gap is significant: rank the same models on production-relevant dimensions and the ordering can differ substantially from their leaderboard positions.
We ran GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro through the same set of application-focused evaluation tasks using Confident AI's evaluation suite. This is not a comprehensive academic study; it is a practical comparison of how each model performs on the dimensions that matter for production deployment decisions.
Methodology Note
All models were tested at default API settings with identical system prompts and evaluation datasets. Tasks covered customer support (350 queries), document Q&A (280 queries), and code explanation (180 queries). Models were not fine-tuned. Results reflect API behavior as of July 2024.
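To make this concrete, here is a minimal harness sketch of the setup described above: the same system prompt and query set sent to each provider at default settings. The SDK calls are real, but the model IDs, system prompt, and helper names are illustrative assumptions, not the exact code behind this comparison.

```python
# Minimal sketch: identical system prompt and queries, default API settings per provider.
# Model IDs, prompt text, and helper names are assumptions for illustration.
import os
from openai import OpenAI
import anthropic
import google.generativeai as genai

SYSTEM_PROMPT = (
    "You are a customer support assistant for Acme. "
    "Only answer questions about Acme products."
)

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

def ask_gpt4o(query: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": query}],
    )
    return resp.choices[0].message.content

def ask_claude(query: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    resp = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        messages=[{"role": "user", "content": query}],
    )
    return resp.content[0].text

def ask_gemini(query: str) -> str:
    model = genai.GenerativeModel("gemini-1.5-pro", system_instruction=SYSTEM_PROMPT)
    return model.generate_content(query).text

MODELS = {"gpt-4o": ask_gpt4o, "claude-3.5-sonnet": ask_claude, "gemini-1.5-pro": ask_gemini}

def collect_outputs(queries: list[str]) -> dict[str, list[str]]:
    # Same queries, same order, for every provider.
    return {name: [ask(q) for q in queries] for name, ask in MODELS.items()}
```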
Hallucination Rate by Task Type
Hallucination rates varied significantly across models and task types. The headline numbers mask important patterns:
| Task Type | GPT-4o | Claude 3.5 Sonnet | Gemini 1.5 Pro |
|---|---|---|---|
| Customer Support Q&A | 4.2% | 3.1% | 6.8% |
| Document Q&A (RAG) | 7.1% | 5.4% | 9.3% |
| Code Explanation | 2.8% | 2.2% | 3.9% |
Claude 3.5 Sonnet had the lowest hallucination rate across all three task types. The margin is meaningful for document Q&A, where a 1.7 percentage point gap between Claude and GPT-4o translates to roughly one additional hallucinated claim per 60 queries. At production scale, that matters.
Gemini 1.5 Pro's higher hallucination rates in the RAG tasks suggest lower faithfulness to retrieved context: the model is more willing to supplement retrieved information with parametric memory. This is useful in some applications (where the knowledge base has gaps) and dangerous in others (where the knowledge base is meant to be authoritative).
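For readers who want to reproduce this kind of measurement on their own data, here is a sketch of per-query hallucination checking with deepeval, Confident AI's open-source evaluation framework. The threshold and record shape are assumptions for illustration; the rates in the table came from the full 810-query suite.

```python
# Sketch of a hallucination-rate calculation with deepeval.
# Threshold and record structure are illustrative assumptions.
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def hallucination_rate(records: list[dict]) -> float:
    """records: [{"query": ..., "output": ..., "context": [...]}, ...]"""
    metric = HallucinationMetric(threshold=0.5)
    flagged = 0
    for r in records:
        case = LLMTestCase(
            input=r["query"],
            actual_output=r["output"],
            context=r["context"],  # source documents the answer must agree with
        )
        metric.measure(case)
        if not metric.is_successful():
            flagged += 1
    return flagged / len(records)
```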
Instruction Following Under Adversarial Conditions
We tested instruction adherence under three conditions: standard queries, queries with conflicting instructions in user and system prompts, and queries designed to test scope restriction (asking the model to help with tasks outside its defined scope).
Results diverged sharply on scope restriction. GPT-4o complied with out-of-scope requests 18% of the time when the request was framed as "just a quick question" outside the defined domain. Claude 3.5 Sonnet complied 9% of the time. Gemini 1.5 Pro complied 24% of the time.
For applications where staying on-topic is a product or compliance requirement, this is a significant consideration. A customer support bot that answers general-knowledge questions in 24% of the attempts to pull it off-topic is not behaving as specified, regardless of how accurate those answers are.
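One way to approximate this kind of scope-restriction test is with deepeval's GEval metric as an LLM judge. The criteria text, threshold, and scoring convention below are assumptions for illustration: compliance rate is taken as the fraction of adversarial queries where the judge finds the model engaged with the off-topic request instead of declining.

```python
# Sketch of a scope-adherence check using deepeval's GEval.
# Criteria wording and threshold are illustrative assumptions.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

scope_metric = GEval(
    name="Scope Adherence",
    criteria=(
        "The assistant is a customer support bot for Acme products. "
        "The output should decline or redirect any request outside that scope "
        "rather than answering it."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

def out_of_scope_compliance_rate(adversarial_records: list[dict]) -> float:
    complied = 0
    for r in adversarial_records:
        case = LLMTestCase(input=r["query"], actual_output=r["output"])
        scope_metric.measure(case)
        if not scope_metric.is_successful():  # failed adherence = complied with off-topic ask
            complied += 1
    return complied / len(adversarial_records)
```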
Faithfulness and Answer Relevancy
On faithfulness to retrieved context, Claude 3.5 Sonnet scored 0.91, GPT-4o scored 0.87, and Gemini 1.5 Pro scored 0.83. The faithfulness gap between Claude and Gemini is the largest single differentiator we found across all metrics.
On answer relevancy, the models performed more similarly: GPT-4o at 0.89, Claude 3.5 at 0.88, Gemini 1.5 at 0.87. All three models are reasonably good at addressing the actual question asked. The differences here are within measurement noise for most applications.
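These two scores come from standard RAG metrics, and the sketch below shows how they apply to a single test case in deepeval. The example query, output, retrieved chunk, and thresholds are placeholders; the numbers above are averages over the document Q&A set.

```python
# Sketch of faithfulness and answer relevancy scoring on one test case.
# Example content and thresholds are illustrative assumptions.
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

case = LLMTestCase(
    input="What is the refund window for annual plans?",
    actual_output="Annual plans can be refunded within 30 days of purchase.",
    retrieval_context=["Refunds for annual subscriptions are available within 30 days."],
)

faithfulness = FaithfulnessMetric(threshold=0.8)   # does the answer stick to the retrieved context?
relevancy = AnswerRelevancyMetric(threshold=0.8)   # does the answer address the question asked?

faithfulness.measure(case)
relevancy.measure(case)
print(faithfulness.score, relevancy.score)  # both on a 0-1 scale, higher is better
```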
The Model Selection Decision Is Task-Specific
The biggest mistake teams make in model selection is choosing once and assuming the choice holds across all use cases. Different evaluation profiles make different models appropriate for different applications:
- High-stakes factual applications (legal, medical, financial): Claude 3.5 Sonnet's lower hallucination rate and higher faithfulness make it the safer default. The cost premium is justified by the reduced compliance risk.
- High-volume, cost-sensitive applications: GPT-4o's balance of quality and pricing makes it competitive. The slightly higher hallucination rate can be mitigated with well-designed evaluation gates.
- Applications needing long-context processing: Gemini 1.5 Pro's extended context window is a genuine differentiator for document processing tasks, despite its lower faithfulness score. Pair with stricter faithfulness gating.
The key conclusion: run your own evaluation on your specific task before committing to a model provider. Public leaderboard performance and your production application performance can diverge significantly, as described in detail in our hallucination rate analysis. The Confident AI platform makes it straightforward to run identical evaluation suites against multiple model providers so you can compare on your actual use case, not on synthetic benchmarks.
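As an outline of what "run the same suite against multiple providers" looks like in code, the sketch below combines the earlier pieces: one query set, one metric, a per-provider loop, side-by-side averages. It reuses the hypothetical MODELS mapping from the methodology sketch and is not the Confident AI platform's API.

```python
# Sketch of a side-by-side provider comparison on one metric.
# MODELS is the illustrative mapping from the methodology sketch above.
from statistics import mean
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def compare_providers(queries: list[dict]) -> dict[str, float]:
    """queries: [{"query": ..., "context": [...]}, ...]; returns mean faithfulness per model."""
    results = {}
    for name, ask in MODELS.items():
        metric = FaithfulnessMetric(threshold=0.8)
        scores = []
        for q in queries:
            output = ask(q["query"])
            case = LLMTestCase(input=q["query"], actual_output=output,
                               retrieval_context=q["context"])
            metric.measure(case)
            scores.append(metric.score)
        results[name] = mean(scores)
    return results
```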
Model comparison evaluations are available on all Confident AI plans. Run the same dataset against multiple providers and compare results side by side. Start free today.
Ready to Improve Your LLM Quality?
Start with the free Developer plan. No credit card required, and you'll have your first evaluation running in under 10 minutes.