
What LLM Evaluation Looks Like at Scale: Lessons From 10,000 Test Runs

By Jeffrey Ip · 11 min read


Running evaluation at scale generates patterns that are not visible from individual evaluation runs. Since we began collecting aggregate metrics across the Confident AI platform, a number of findings have been consistent enough to be actionable for any team building production LLM applications. Some confirm what practitioners expect. Others do not.

The following is based on over 10,000 evaluation runs across more than 500 AI applications. We have removed identifying information from all examples. The patterns described reflect the aggregate dataset, not any individual customer's experience.

Finding 1: Evaluation Dataset Size Matters Less Than Teams Expect

Teams frequently ask how large their evaluation dataset needs to be to produce reliable results. The common expectation is that larger is consistently better. The data is more nuanced.

For regression detection (catching quality drops between deploys), evaluation datasets of 50 well-chosen cases perform comparably to datasets of 500 cases at detecting regressions of more than 5 percentage points on any metric. The small dataset's advantage is speed: a 50-case evaluation runs in under 3 minutes at typical API speeds, while a 500-case evaluation takes 25-30 minutes. Its disadvantage is sensitivity: small regressions (1-2 percentage points) require larger datasets to detect reliably.

The practical recommendation: start with 50-75 carefully curated cases. Add cases when you need to catch smaller regressions, not before. Curating 50 good cases is more valuable than collecting 500 mediocre ones.
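
To make the regression-detection framing concrete, here is a minimal sketch of a gate check comparing mean metric scores between a baseline run and a candidate run over the same 50 cases. The metric names and score arrays are illustrative and not tied to any particular evaluation SDK; the point is that a drop of more than 5 points on a mean score is detectable even with a compact dataset.

```python
# Minimal sketch of a regression gate over a compact evaluation set.
# Metric names and scores are illustrative; plug in your own harness.
from statistics import mean

REGRESSION_THRESHOLD = 0.05  # flag drops larger than 5 percentage points


def detect_regressions(baseline: dict[str, list[float]],
                       candidate: dict[str, list[float]]) -> dict[str, float]:
    """Compare per-metric mean scores between two runs over the same cases."""
    regressions = {}
    for metric, baseline_scores in baseline.items():
        delta = mean(candidate[metric]) - mean(baseline_scores)
        if delta < -REGRESSION_THRESHOLD:
            regressions[metric] = delta
    return regressions


# Two runs of the same 50 curated cases, each scored per case in [0, 1].
baseline_run = {"faithfulness": [0.91] * 50, "answer_relevancy": [0.88] * 50}
candidate_run = {"faithfulness": [0.84] * 50, "answer_relevancy": [0.89] * 50}

print(detect_regressions(baseline_run, candidate_run))
# flags faithfulness (a ~7-point drop); answer_relevancy passes
```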

Finding 2: Faithfulness and Hallucination Are Not the Same Metric

Teams often treat faithfulness and hallucination rate as two names for the same measurement. The aggregate data shows they measure different things and diverge meaningfully in specific configurations.

Faithfulness measures whether claims are grounded in retrieved context. Hallucination rate measures whether claims are factually incorrect. A model with high faithfulness can still hallucinate if the retrieved context itself contains errors. A model with low faithfulness can have low hallucination rate if its parametric knowledge happens to be accurate for the task at hand.

In practice, faithfulness is the better operational metric because it is more controllable: you can improve faithfulness by fixing the retriever and tuning the system prompt. You have less direct control over parametric knowledge accuracy. Track both metrics but optimize for faithfulness first.
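
A toy sketch makes the distinction concrete. The "judges" below are naive string-matching stand-ins for what is, in practice, an LLM-as-judge call; what matters is the reference each metric scores against: retrieved context for faithfulness, known-correct facts for hallucination rate.

```python
# Toy illustration of why faithfulness and hallucination rate diverge.
# The judges are naive placeholders for an LLM-as-judge call.

def _appears_in(claim: str, reference: list[str]) -> bool:
    return any(claim.lower() in passage.lower() for passage in reference)


def faithfulness(claims: list[str], retrieved_context: list[str]) -> float:
    """Fraction of claims grounded in the retrieved context."""
    return sum(_appears_in(c, retrieved_context) for c in claims) / len(claims)


def hallucination_rate(claims: list[str], ground_truth: list[str]) -> float:
    """Fraction of claims not backed by known-correct facts."""
    return sum(not _appears_in(c, ground_truth) for c in claims) / len(claims)


claims = ["the warranty lasts 36 months"]
retrieved = ["the warranty lasts 36 months"]  # retriever surfaced a stale doc
truth = ["the warranty lasts 12 months"]

print(faithfulness(claims, retrieved))    # 1.0 -- fully grounded in context
print(hallucination_rate(claims, truth))  # 1.0 -- still factually wrong
```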

Finding 3: Model Provider Updates Are the Most Common Source of Regression

We analyzed the trigger events for evaluation gate failures across the platform. The distribution was not what most teams expect:

  • Model provider updates (silent): 41% of quality regressions
  • System prompt changes: 28%
  • RAG corpus updates: 19%
  • Application code changes: 12%

The 41% from silent model provider updates is the most consequential finding. These are cases where the application code, system prompt, and corpus are unchanged, but the model version at the provider changes without a public announcement. OpenAI, Anthropic, and Google all release intermediate model updates that can shift behavior without a major version number change.

Teams that only run evaluation on deploy will miss these regressions. Scheduled evaluation at least once per week is necessary to catch silent model updates in a reasonable timeframe.
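
One lightweight way to make silent updates visible is to log the provider-reported model identifier alongside every scheduled run, so a score shift can be lined up against a change in the serving model. The sketch below is illustrative only: `call_model` and `score_run` are hypothetical stand-ins for your own provider client and evaluation harness.

```python
# Sketch of a weekly scheduled evaluation (triggered by cron or CI) that
# records the provider-reported model identifier next to the scores.
# `call_model` and `score_run` are hypothetical placeholders.
import datetime
import json


def call_model(prompt: str) -> dict:
    # Placeholder: the return shape mimics a chat-completion style response,
    # where many providers include the exact model revision that served it.
    return {"model": "example-model-2025-01-15", "output": f"answer to: {prompt}"}


def score_run(responses: list[dict]) -> float:
    # Placeholder for your faithfulness scoring.
    return 0.91


def weekly_eval(test_inputs: list[str]) -> dict:
    responses = [call_model(x) for x in test_inputs]
    record = {
        "date": datetime.date.today().isoformat(),
        "reported_model": responses[0].get("model", "unknown"),
        "mean_faithfulness": score_run(responses),
    }
    with open("eval_history.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


weekly_eval(["What does the enterprise plan include?"])
```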

Finding 4: Improvement Plateaus Around Faithfulness 0.92

Across teams actively optimizing their RAG pipelines, faithfulness improvement tends to plateau around 0.92. Teams that push from 0.85 to 0.92 see meaningful improvement in production quality metrics. Teams that attempt to push from 0.92 to 0.97 spend significantly more engineering time for diminishing returns on user experience.

The practical threshold to target for most customer-facing applications: faithfulness above 0.90. Healthcare and financial services applications need higher thresholds (0.93-0.95). Closing the remaining gap is possible, but it requires increasingly specialized retrieval optimization whose ROI is positive only in high-stakes domains.
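
In configuration terms, that guidance reduces to something like the sketch below. The threshold values mirror the ranges discussed above; the gate function itself is a hypothetical illustration, not a platform feature.

```python
# Domain-tuned faithfulness gates reflecting the thresholds discussed above.
FAITHFULNESS_THRESHOLDS = {
    "default": 0.90,             # most customer-facing applications
    "financial_services": 0.93,  # high-stakes: worth the extra retrieval work
    "healthcare": 0.95,
}


def gate_passes(domain: str, faithfulness_score: float) -> bool:
    threshold = FAITHFULNESS_THRESHOLDS.get(domain, FAITHFULNESS_THRESHOLDS["default"])
    return faithfulness_score >= threshold
```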

Finding 5: Teams That Fix Gate Failures Quickly Have Better Long-Term Quality

This one is more intuitive, but the correlation is measurable: teams with a mean gate-failure-to-fix time under 4 hours show lower absolute hallucination rates over six-month observation windows than teams with mean fix times over 24 hours, holding initial quality level constant.

The mechanism is probably selection: teams that invest in fast incident response have also invested in evaluation infrastructure quality, which means their evaluation is more predictive of production quality. But there may also be a direct effect: fast response to gate failures means fewer compounding regressions where one unfixed issue masks a second issue that then goes undetected.

The implication for teams setting up evaluation gates: define your incident response process before your first gate failure. Know who gets paged, what constitutes an emergency override, and what the 4-hour fix protocol looks like. The technical infrastructure is only half the investment; the process around it determines whether it actually reduces production quality incidents. The CI/CD quality gates guide covers the failure response workflow in detail.
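
One way to force that decision before the first failure is to encode the playbook as reviewed configuration rather than tribal knowledge. The field names below are hypothetical; adapt them to your own on-call and override tooling.

```python
# Hypothetical gate-failure playbook checked into the repo; field names are
# illustrative and should map onto your own on-call and override tooling.
GATE_FAILURE_PLAYBOOK = {
    "page": "oncall-llm-quality",   # who gets paged when the gate fails
    "fix_sla_hours": 4,             # target gate-failure-to-fix time
    "emergency_override": {
        "requires_second_approver": True,
        "allowed_reasons": ["security_patch", "provider_outage"],
        "expires_after_hours": 24,  # overrides must be revisited
    },
}
```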

Confident AI's analytics dashboard aggregates evaluation trends over time so you can see whether your quality is improving, holding steady, or drifting. Start a free trial to begin collecting your baseline data.

