
Building an LLM Evaluation Dataset That Actually Predicts Production Quality

By the Confident AI Engineering Team · 15 min read


The most common evaluation dataset failure mode is not low-quality labeling or insufficient size. It is construction by engineers who know the system. When the people building the evaluation dataset are the same people who built the system prompt and the retrieval pipeline, the dataset reflects their mental model of how users will interact with the application, not how users actually do.

This manifests as evaluation datasets that are too clean: well-formed queries, complete context, no ambiguity, no multi-intent, no misused domain terminology. Real user queries exhibit exactly the messiness this excludes: they are ambiguous, bundle multiple intents, and misuse domain terms. An evaluation dataset built without those properties will consistently underestimate production failure rates, as we covered in depth in the dev-to-production gap analysis.

The Structural Requirement: Coverage Across Five Query Dimensions

A robust evaluation dataset should have deliberate coverage across at least five query dimensions (a minimal coverage-check sketch follows the list):

1. Clarity spectrum. Include queries ranging from well-formed and unambiguous to genuinely ambiguous and underspecified. The model's behavior on ambiguous input is often what users notice first, because it determines whether the model asks clarifying questions, makes reasonable assumptions, or hallucinates a resolution.

2. Domain vocabulary range. Include queries using correct technical terminology, queries using lay terminology for the same concepts, and queries using incorrect terminology that implies a misunderstanding. A customer support chatbot for a technical product will regularly receive queries from non-technical users. If the evaluation dataset only includes expert-vocabulary queries, you will miss the failure modes that ordinary, non-expert users encounter most often.

3. Multi-intent queries. Real users combine multiple questions in a single message. "What's the return policy and also do you offer student discounts and also how do I contact support" is a realistic input. Evaluation datasets that only test single-intent queries miss a significant portion of the production input distribution.

4. Out-of-scope queries. At least 10-15% of your evaluation dataset should consist of queries the model is supposed to decline or redirect. Measuring out-of-scope adherence requires having out-of-scope test cases. This is the dimension most commonly missing from evaluation datasets.

5. Known failure cases. After your first month of production traffic, you should have a set of queries the model has already failed on. These become permanent members of the evaluation dataset. They are the highest-value test cases because they represent documented production failures that must not recur.
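To make the coverage requirement concrete, here is a minimal sketch of how a dataset could be tagged and checked before it is accepted, assuming a simple EvalCase record and the dimension names above. The field names and thresholds are illustrative, not a prescribed schema.

```python
# Illustrative sketch: tag each evaluation case with the query dimension it
# covers and verify coverage before accepting the dataset. The dimension
# names, dataclass fields, and thresholds are assumptions for this example.
from collections import Counter
from dataclasses import dataclass, field

DIMENSIONS = {"clarity", "vocabulary", "multi_intent", "out_of_scope", "known_failure"}

@dataclass
class EvalCase:
    query: str
    dimension: str  # one of DIMENSIONS
    criteria: list[str] = field(default_factory=list)  # properties a correct answer must satisfy

def check_coverage(cases: list[EvalCase]) -> None:
    counts = Counter(case.dimension for case in cases)
    missing = DIMENSIONS - counts.keys()
    if missing:
        raise ValueError(f"no cases cover dimensions: {sorted(missing)}")
    # Out-of-scope queries should be at least ~10% of the dataset (see item 4).
    share = counts["out_of_scope"] / len(cases)
    if share < 0.10:
        raise ValueError(f"only {share:.0%} out-of-scope cases; expected at least 10%")
```

A check like this is cheap enough to run whenever the dataset is edited, so coverage gaps surface before an evaluation run rather than after.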

Sourcing Strategies: Where to Get Real Queries

The best source for evaluation queries is production logs. If your application is in beta or early access, you should be logging every user query (with appropriate privacy controls) and mining that log for evaluation cases within the first two weeks.
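As a rough sketch of what that mining step can look like, the snippet below deduplicates logged queries and samples across length buckets so the candidate set is not dominated by the most frequent short queries. The JSONL log format, the query field name, and the length heuristic are assumptions for illustration.

```python
# Illustrative sketch: pull candidate evaluation cases from a query log.
# Deduplicate, then sample across length buckets so short FAQ-style queries
# do not crowd out long, messy ones. The log format and the length-bucket
# heuristic are assumptions for this example.
import json
import random

def sample_candidates(log_path: str, n: int = 200, seed: int = 0) -> list[str]:
    with open(log_path) as f:
        queries = [json.loads(line)["query"].strip() for line in f]

    unique = list(dict.fromkeys(queries))  # dedupe, keep first-seen order

    # Bucket by word count as a cheap proxy for query complexity.
    buckets = {"short": [], "medium": [], "long": []}
    for q in unique:
        words = len(q.split())
        key = "short" if words <= 8 else "medium" if words <= 25 else "long"
        buckets[key].append(q)

    rng = random.Random(seed)
    per_bucket = n // len(buckets)
    sample = []
    for qs in buckets.values():
        sample.extend(rng.sample(qs, min(per_bucket, len(qs))))
    return sample
```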

For pre-launch datasets, two sourcing strategies work well:

User research sessions. Have 5-10 people in your target user category use a prototype version of the application and observe their natural query patterns. Record the actual queries they type, not what you expected them to type. Even a single 60-minute session will surface query patterns the engineering team had not anticipated.

Synthetic expansion with perturbation. Start with a seed set of engineer-written queries. Then apply systematic perturbations: add typos, split queries into fragments, combine multiple queries, replace technical terms with lay equivalents. This mechanically increases distribution coverage without requiring user research sessions at scale.
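These perturbations are simple enough to script. The sketch below applies a typo, a lay-term substitution, and a multi-intent combination to each seed query; the term map and the specific edit rules are illustrative placeholders you would replace with your own domain vocabulary.

```python
# Illustrative sketch: expand a seed set of engineer-written queries with
# mechanical perturbations. The typo model and term substitutions are
# simplistic placeholders; a real term map comes from your domain.
import random

LAY_TERMS = {"authenticate": "log in", "latency": "slowness", "API key": "access code"}

def add_typo(query: str, rng: random.Random) -> str:
    """Drop one character from a random word to simulate a typo."""
    words = query.split()
    i = rng.randrange(len(words))
    if len(words[i]) > 3:
        j = rng.randrange(len(words[i]))
        words[i] = words[i][:j] + words[i][j + 1:]
    return " ".join(words)

def use_lay_terms(query: str) -> str:
    """Replace technical terms with lay equivalents."""
    for technical, lay in LAY_TERMS.items():
        query = query.replace(technical, lay)
    return query

def combine_intents(query: str, other: str) -> str:
    """Merge two seed queries into one multi-intent query."""
    return f"{query} and also {other.rstrip('?').lower()}"

def expand(seeds: list[str], seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    expanded = list(seeds)
    for q in seeds:
        expanded.append(add_typo(q, rng))
        expanded.append(use_lay_terms(q))
        expanded.append(combine_intents(q, rng.choice(seeds)))
    return expanded
```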

Labeling Strategy: Expected Outputs vs. Evaluation Criteria

Two approaches to evaluation dataset labeling are commonly used, and they are appropriate for different metrics:

Expected output labeling assigns a ground-truth correct answer to each query. This is required for answer correctness metrics and optional for others. It is labor-intensive but enables the most precise evaluation. Use this approach for factual queries where there is a clear correct answer: product specifications, policy terms, documented procedures.

Criteria labeling instead specifies what properties a correct answer must have, without specifying the exact output. For example: "The answer must mention the 30-day return window, must not claim the policy applies to digital purchases, and must not provide specific dollar refund amounts." Criteria labeling is faster to produce and more robust to the natural output variation of LLMs.

For most production evaluation datasets, a hybrid approach works best: expected output labeling for high-stakes factual cases, criteria labeling for open-ended and conversational queries. The Confident AI platform supports both labeling modes and uses them appropriately based on the metric being computed.
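To make the two labeling modes concrete, here is one way a hybrid dataset could represent them side by side. The field names and example cases are our own illustration (echoing the return-policy criteria above), not any particular platform's schema.

```python
# Illustrative sketch: a single dataset format that supports both labeling
# modes. Field names are assumptions for this example, not a platform schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LabeledCase:
    query: str
    expected_output: Optional[str] = None  # ground truth, for factual cases
    criteria: list[str] = field(default_factory=list)  # properties a correct answer must have

# Expected-output labeling: a factual query with one clear correct answer.
factual = LabeledCase(
    query="How long is the return window for hardware purchases?",
    expected_output="Hardware purchases can be returned within 30 days of delivery.",
)

# Criteria labeling: an open-ended query judged against required properties.
open_ended = LabeledCase(
    query="I'm not happy with my purchase, what are my options?",
    criteria=[
        "Mentions the 30-day return window",
        "Does not claim the policy applies to digital purchases",
        "Does not state specific dollar refund amounts",
    ],
)
```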

Dataset Size: The Tradeoff Between Coverage and CI Speed

Dataset size is a tradeoff between statistical coverage and CI pipeline speed. Larger datasets give more reliable estimates of true population performance but run slower and cost more per evaluation run.

A practical tiered approach (a subset-selection sketch follows the list):

  • CI gate dataset (50-100 cases): Runs on every deploy. Designed to catch regressions quickly, not to provide comprehensive coverage. Biased toward known failure cases and high-risk query categories.
  • Weekly evaluation dataset (500-1,000 cases): Runs on a schedule, not on every deploy. Provides statistically meaningful coverage for trend analysis and threshold calibration.
  • Model selection dataset (2,000+ cases): Runs when evaluating a new model provider or a major model version upgrade. Comprehensive coverage across all dimensions.
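One way to operationalize the tiers is to maintain a single master dataset and derive the smaller tiers from it. The sketch below selects a CI gate subset biased toward known failure cases and out-of-scope queries, assuming each case is a dict carrying the dimension tags used in the coverage sketch earlier; the budget split is an illustrative choice, not a recommendation from the platform.

```python
# Illustrative sketch: derive the CI gate tier from the master dataset,
# biased toward known failures and out-of-scope queries. Each case is a
# dict with a "dimension" tag; the budget split is an assumption.
import random

def build_ci_gate(cases: list[dict], budget: int = 100, seed: int = 0) -> list[dict]:
    """Select a small, regression-focused subset of the master dataset."""
    rng = random.Random(seed)
    failures = [c for c in cases if c["dimension"] == "known_failure"]
    out_of_scope = [c for c in cases if c["dimension"] == "out_of_scope"]
    other = [c for c in cases if c["dimension"] not in ("known_failure", "out_of_scope")]

    gate = failures[:]  # every documented production failure runs on every deploy
    gate += rng.sample(out_of_scope, min(budget // 5, len(out_of_scope)))
    rng.shuffle(other)
    gate += other[: max(0, budget - len(gate))]
    return gate
```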

Maintaining the Dataset Over Time

An evaluation dataset that is not updated is a dataset that will gradually become less predictive. User behavior shifts, new use cases emerge, and the application's system prompt evolves. Your dataset needs to evolve with it.

Establish a monthly dataset maintenance process: review the past month's production failures, add representative cases to the CI dataset, update expected outputs for any policy changes, and retire cases that have become unrepresentative (old product versions, discontinued features). This process typically takes two to three hours per month for a mature application and is one of the highest-value activities an ML engineer can do.
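If the dataset lives in version control or a managed workspace, part of this pass can be scripted. The sketch below adds last month's documented failures and retires cases that reference discontinued features; the dict fields and the feature-matching heuristic are assumptions for illustration, and the judgment calls (which failures are representative, which cases are stale) remain manual.

```python
# Illustrative sketch of the monthly maintenance pass: add representative
# production failures, retire cases tied to discontinued features. The
# dict fields and the substring match on retired features are assumptions.
def monthly_maintenance(
    dataset: list[dict],
    production_failures: list[dict],
    retired_features: set[str],
) -> list[dict]:
    # Retire cases that reference features the product no longer ships.
    kept = [
        case for case in dataset
        if not any(feature in case["query"] for feature in retired_features)
    ]
    # Documented production failures become permanent members of the dataset.
    existing = {case["query"] for case in kept}
    for failure in production_failures:
        if failure["query"] not in existing:
            kept.append({**failure, "dimension": "known_failure"})
    return kept
```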

The dataset is not a deliverable. It is a living artifact that reflects the current understanding of where your application can fail. Teams that treat it as a one-time construction project consistently see evaluation metrics diverge from production quality over time. Teams that treat it as ongoing infrastructure consistently see evaluation metrics track production quality closely.

Confident AI includes a dataset management workspace for building, versioning, and maintaining evaluation datasets. Learn about the dataset tools or start a free trial.
