
Building an LLM Test Suite From Scratch

February 28, 2026


Most teams that want to run LLM evaluations get stuck at the same point: where to start. The mental model for testing comes from traditional software, where you're asserting that a function returns a specific value. LLMs don't work that way — the outputs are probabilistic and the quality is often a matter of degree, not binary pass/fail.

This makes starting a test suite feel harder than it is. In practice, you can get something useful running in a few hours. Here's the structure that works for most teams.

Start with what you already know is wrong

Before writing any new test cases, write down the failures you've already seen. Every team has them: the prompts that occasionally produce nonsense, the edge cases that came up in user feedback, the categories of output that required manual review after launch. These are your first test cases.

They're better than hypothetical cases because they're grounded in your actual system's behavior. You already know what "wrong" looks like for these inputs. That makes them easy to write assertions for.

Define your input/output pairs

Each test case needs two things: an input and a definition of what a passing output looks like. The input can be a single user message, a conversation context, or a document paired with a question. The expected output is more flexible — it doesn't have to be an exact string.
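As a rough sketch, a test case can be as simple as a small record that carries the input and whatever pass definition applies to it. The field names below are illustrative, not from any particular framework.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # What goes to the model: a user message, a conversation, or a document plus question.
    input: str
    # How a passing output is defined. Only the fields relevant to this case need to be set.
    expected_substring: str | None = None            # for exact/contains checks
    criteria: list[str] = field(default_factory=list)  # names of criteria checks to apply
    min_rubric_score: float | None = None            # pass threshold for rubric scoring

cases = [
    TestCase(
        input="What plan am I currently on?",
        criteria=["no_pii", "under_200_words", "mentions_product_name"],
    ),
]
```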

A few common formats:

Exact match: The output must equal or contain a specific string. Useful for classification tasks, structured output, or anything with a definitive correct answer.

Criteria-based: The output must satisfy a set of conditions: contains no PII, is under 200 words, mentions the product name, doesn't include competitor references. Each criterion is evaluated independently.

Rubric scoring: A second model evaluates the output against a rubric and produces a score. Useful for open-ended generation where quality matters but exact answers don't. You set the minimum acceptable score as the pass threshold.

Most real test suites use a mix. Deterministic checks catch hard failures cheaply. Rubric scoring covers the quality dimension that deterministic checks miss.
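Here is a minimal sketch of the rubric-scoring approach, using a second model as the judge. It assumes the OpenAI Python SDK as the client; the rubric text, the 1-5 scale, and the judge model name are placeholders you would replace with your own.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API works the same way

client = OpenAI()

RUBRIC = """Score the response from 1 to 5:
5 = accurate, grounded, and directly answers the question
3 = partially correct or missing important context
1 = incorrect, ungrounded, or off-topic
Reply with the number only."""

def rubric_score(question: str, response: str) -> int:
    judge = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {response}"},
        ],
    )
    # A production judge needs more robust parsing than this; kept minimal for the sketch.
    return int(judge.choices[0].message.content.strip())

# A case passes if the judge's score meets the threshold you set, e.g. rubric_score(...) >= 4.
```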

Organize by behavior, not by topic

A common mistake is organizing test cases by subject matter — all the billing questions together, all the product questions together. The more useful structure is by behavior: what property of the output are you testing?

Categories like accuracy, tone, safety, length, format, and grounding map onto different metrics and different failure modes. When a category's scores drop after a change, you know exactly what kind of problem you're looking at. When tests are organized by topic, a regression shows up as "billing tests failing," which could mean almost anything.
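One way to make the behavior-first structure pay off is to tag every case with the behavior it targets and aggregate results along that axis. A small sketch, assuming each result record carries a "behavior" tag and a pass/fail flag:

```python
from collections import defaultdict

def pass_rate_by_behavior(results):
    # Group pass/fail results by the behavior each case targets, so a regression
    # shows up as "grounding dropped", not "billing tests failing".
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:  # each r: {"behavior": "grounding", "passed": True, ...}
        totals[r["behavior"]] += 1
        passes[r["behavior"]] += r["passed"]
    return {behavior: passes[behavior] / totals[behavior] for behavior in totals}
```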

Seed your dataset deliberately

Ten test cases are enough to start. Two hundred are enough to run a serious evaluation. A split that works for most teams: 60% common cases (your typical user queries), 30% edge cases (unusual phrasing, complex questions, ambiguous inputs), and 10% adversarial cases (inputs designed to trigger failures).
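One easy way to keep yourself honest about that split is to tag each case and check the proportions as the dataset grows. A sketch, assuming a JSONL file where each case carries a case_type field (file and field names are illustrative):

```python
import json
from collections import Counter

with open("eval_cases.jsonl") as f:
    cases = [json.loads(line) for line in f]

counts = Counter(case["case_type"] for case in cases)
total = len(cases)
for case_type in ("common", "edge", "adversarial"):  # target: roughly 60 / 30 / 10
    print(f"{case_type}: {counts[case_type] / total:.0%}")
```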

Don't generate all your test cases synthetically. Synthetic cases tend to cluster around the same patterns. The most valuable test cases come from real user interactions — queries that users actually sent, conversations that ended in escalations, outputs that required correction. Start collecting those as soon as you have a system in production, even informally.

Write your first runner

A runner is just the code that takes your dataset, sends each input to your model, collects the outputs, and applies your metrics. You can start with a simple Python script that calls your model's API in a loop and prints results.
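A minimal runner can look like the sketch below. It assumes the OpenAI Python SDK as the model client and a JSONL dataset with an "input" field per case; swap in whatever client and schema your system actually uses.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; any model client works here

client = OpenAI()

def run_suite(cases, model="gpt-4o-mini"):  # model name is a placeholder
    results = []
    for case in cases:
        # Send each test input to the model and collect the raw output.
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["input"]}],
        )
        results.append({"case": case, "output": response.choices[0].message.content})
    return results

with open("eval_cases.jsonl") as f:
    cases = [json.loads(line) for line in f]

results = run_suite(cases)
```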

Once you have a working loop, connect your metrics library. For criteria-based checks, this might be a few assertion functions you write yourself. For LLM-as-a-judge scoring, you'll need a framework that handles the evaluation calls and standardizes the scores.
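The self-written criteria checks can stay very small. A few examples matching the criteria mentioned earlier; the regexes, word limit, and product name are illustrative, not exhaustive:

```python
import re

def contains_no_pii(output: str) -> bool:
    # Crude screens for emails and phone numbers; a real check would cover more patterns.
    email = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
    phone = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
    return not (email.search(output) or phone.search(output))

def under_200_words(output: str) -> bool:
    return len(output.split()) < 200

def mentions_product_name(output: str, product: str = "Acme") -> bool:  # placeholder name
    return product.lower() in output.lower()
```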

The output you want at this stage: a table showing each test case, the model's response, and the score on each metric. Anything that drops below threshold is a failure. Review the failures, understand why they failed, and update your prompts, model config, or test cases as needed.
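A small summary step gets you that table. The sketch below assumes each result record has picked up a "scores" dict from whatever checks and judges you wired in; the threshold is a placeholder.

```python
def summarize(results, threshold=0.7):
    failures = []
    print(f"{'case':<40} {'metric':<20} {'score':>6}")
    for r in results:
        for metric, score in r["scores"].items():
            print(f"{r['case']['input'][:38]:<40} {metric:<20} {score:>6.2f}")
            if score < threshold:
                failures.append((r["case"], metric, score))
    return failures  # the list you review by hand before changing prompts or config
```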

The test suite is never done

The most useful way to think about an LLM test suite is as a living document, not a checklist you complete. Every user complaint is a potential new test case. Every production incident should generate several. Every model update should be run against the suite to reveal anything that changed.

The goal isn't 100% coverage — that's not achievable for open-ended AI systems. The goal is a dataset that's representative enough to catch real regressions quickly. Teams that treat their evaluation dataset as seriously as their codebase end up with dramatically better visibility into what their AI is doing.

What to measure first

If you're starting from nothing, pick two metrics and instrument them properly before adding more. Hallucination rate and task completion rate are usually the highest-signal combination for most use cases. Hallucination tells you about factual reliability. Task completion tells you whether the model is doing what users ask at all.
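Both metrics reduce to simple rates over your results once you have a per-case judgment, however you produce it (deterministic checks or a judge model). The field names below are illustrative:

```python
def hallucination_rate(results):
    # Fraction of cases where the output was flagged as containing unsupported claims.
    return sum(1 for r in results if r["hallucinated"]) / len(results)

def task_completion_rate(results):
    # Fraction of cases where the output actually did what the user asked.
    return sum(1 for r in results if r["task_completed"]) / len(results)
```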

Add more metrics once you're confident your runner is reliable and your dataset is representative. More metrics on a broken runner don't help anyone.
