December 20, 2025
Retrieval-Augmented Generation adds a step before generation: fetch relevant documents, then generate from them. This architecture solves the knowledge cutoff problem and reduces hallucinations on grounded tasks. It also doubles the failure surface. Bad retrieval produces bad context, which produces bad answers — and the generation model looks fine because it answered correctly from the context it was given.
Most teams evaluating RAG systems focus entirely on final answer quality. They measure whether the output is accurate and helpful, and stop there. This tells you whether the system is failing, but not why. Fixing a retrieval problem by improving your prompts doesn't work. You need to measure both layers.
Contextual precision. Of the chunks your retrieval system returns, how many are actually relevant to the question? If you retrieve five chunks and three of them contain useful information while two are noise, your precision is 0.6. High noise in retrieved context forces the model to reason around irrelevant material, increasing the chance of an inaccurate synthesis.
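As a concrete sketch: assuming you already have a relevance judgment for each retrieved chunk (from a human label or an LLM judge — obtaining that judgment is the hard part and is out of scope here), precision is just the fraction of retrieved chunks judged relevant.

```python
def contextual_precision(retrieved_chunks, is_relevant) -> float:
    """Fraction of retrieved chunks judged relevant to the query.

    `is_relevant` is any callable (label lookup, LLM judge, ...) that
    returns True when a chunk is relevant to the query.
    """
    if not retrieved_chunks:
        return 0.0
    relevant = sum(1 for chunk in retrieved_chunks if is_relevant(chunk))
    return relevant / len(retrieved_chunks)

# Three of five retrieved chunks are relevant -> precision 0.6
labels = {"c1": True, "c2": True, "c3": False, "c4": True, "c5": False}
print(contextual_precision(list(labels), lambda c: labels[c]))  # 0.6
```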
Contextual recall. Of all the chunks in your knowledge base that contain the answer to a query, how many did your retrieval system find? Low recall means the model doesn't have access to the information it needs, regardless of how good the generation is. For document QA on a large corpus, recall is usually the harder metric to optimize.
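A matching sketch for recall, assuming your evaluation set records which chunk IDs in the knowledge base actually answer each query (more on building that mapping below):

```python
def contextual_recall(retrieved_ids, relevant_ids) -> float:
    """Fraction of the known-relevant chunks that retrieval actually returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no known-relevant chunks for this query; nothing to measure
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)

# Retrieval returned one of the two chunks that contain the answer -> recall 0.5
print(contextual_recall(["c1", "c7", "c9"], ["c7", "c12"]))  # 0.5
```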
Chunk relevance by query type. Retrieval quality often varies significantly across query types. Short, specific queries ("what is the refund policy?") tend to retrieve well. Long, ambiguous queries ("explain why my subscription might have changed") often don't. Breaking retrieval metrics down by query category tells you where to focus your embedding or chunking strategy improvements.
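One way to get that breakdown, assuming each evaluation query is tagged with a category (the field names below are illustrative): group per-query scores by category and average.

```python
from collections import defaultdict
from statistics import mean

def precision_by_category(results):
    """Average contextual precision per query category.

    `results` is a list of dicts with illustrative keys "category" and
    "precision", one entry per evaluated query.
    """
    buckets = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["precision"])
    return {cat: mean(scores) for cat, scores in buckets.items()}

results = [
    {"category": "specific", "precision": 0.8},
    {"category": "specific", "precision": 1.0},
    {"category": "ambiguous", "precision": 0.4},
]
print(precision_by_category(results))  # {'specific': 0.9, 'ambiguous': 0.4}
```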
For RAG specifically, faithfulness is the most important generation metric. It measures whether every factual claim in the response is supported by the retrieved context. A response that introduces information not present in the retrieved chunks has hallucinated — regardless of whether that information happens to be true.
This is a stricter definition than general hallucination detection. It treats the retrieved context as the ground truth, not the model's general knowledge. For applications where accuracy is a product requirement, this is the right definition: if the model is supposed to answer from a specific knowledge base, anything it adds from outside that base is a deviation from the intended behavior.
Answer relevance — does the response actually address the question — is the second generation metric worth tracking. A high-faithfulness response that doesn't answer the question is accurate but useless.
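Both generation metrics are typically scored with an LLM judge. A minimal sketch of one common approach — extract claims, then check each against the context — assuming a hypothetical `call_llm(prompt)` helper that wraps whatever judge model you use (not a real library call):

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your judge model; replace with a real client."""
    raise NotImplementedError

def faithfulness(answer: str, context: str) -> float:
    """Fraction of factual claims in the answer supported by the retrieved context."""
    claims = [
        c.strip() for c in call_llm(
            f"List each factual claim in this answer, one per line:\n{answer}"
        ).splitlines() if c.strip()
    ]
    if not claims:
        return 1.0
    supported = sum(
        1 for claim in claims
        if call_llm(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Is the claim supported by the context? Answer yes or no."
        ).strip().lower().startswith("yes")
    )
    return supported / len(claims)

def answer_relevance(answer: str, question: str) -> bool:
    """Does the answer actually address the question, regardless of faithfulness?"""
    verdict = call_llm(
        f"Question: {question}\nAnswer: {answer}\n"
        "Does the answer address the question? Answer yes or no."
    )
    return verdict.strip().lower().startswith("yes")
```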
You need both. End-to-end evaluation — run the full pipeline, evaluate the final answer — tells you whether the system is working. Component evaluation — retrieval precision/recall, generation faithfulness — tells you why it isn't.
The failure diagnosis workflow looks like this: end-to-end score drops after a knowledge base update. You drill into component metrics. Contextual recall is down — the new documents aren't being retrieved consistently. The cause is a chunking strategy that doesn't work well for the new document format. The fix is in the retrieval layer, not the prompt.
Without component metrics, you'd know the system degraded but you'd have no path to the cause. Prompt engineering wouldn't fix it. You'd be debugging blind.
A good RAG evaluation dataset pairs queries with the specific chunks that contain the correct answer. This lets you calculate recall directly: did the retrieval system return the known-relevant chunk for this query?
Building this dataset requires knowing your knowledge base well enough to identify which chunks answer which queries. For large knowledge bases, this is done semi-automatically: generate synthetic queries from each chunk, verify which queries that chunk answers, and use that as ground truth for recall measurement.
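A sketch of that loop, again assuming a hypothetical `call_llm` helper for generation and verification; real pipelines add deduplication and human spot-checks on top of this.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your generation model; replace with a real client."""
    raise NotImplementedError

def build_synthetic_eval_set(chunks):
    """Generate candidate queries from each chunk and keep the ones the chunk answers.

    `chunks` is an iterable of (chunk_id, chunk_text) pairs. Each kept record
    becomes ground truth for recall: the query should retrieve that chunk.
    """
    records = []
    for chunk_id, text in chunks:
        candidates = call_llm(
            "Write three questions a user might ask that this passage answers, "
            f"one per line:\n{text}"
        ).splitlines()
        for query in (q.strip() for q in candidates if q.strip()):
            verdict = call_llm(
                f"Passage:\n{text}\n\nQuestion: {query}\n"
                "Does the passage answer the question? Answer yes or no."
            )
            if verdict.strip().lower().startswith("yes"):
                records.append({"query": query, "relevant_chunk_ids": [chunk_id]})
    return records
```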
The synthetic generation approach has a bias: synthetically generated queries tend to match the document content more directly than real user queries do. Supplement your dataset with real queries from your query logs as soon as you have them.
Retrieval quality can degrade without any code changes. A knowledge base update can introduce documents that pollute the embedding space. An increase in query diversity as your user base grows can expose retrieval blind spots. A change in your document ingestion pipeline can alter chunk boundaries in ways that break retrieval.
Ongoing monitoring means running a sample of real queries through your retrieval evaluation metrics on a regular schedule — not just at deployment time. A drop in contextual precision on live traffic is an early warning that something in the retrieval layer has changed, and it arrives before users start complaining about answer quality.
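A minimal monitoring sketch under those assumptions: sample recent production queries, score retrieval precision with whatever judge you already use offline, and flag a drop below a threshold. The `retrieve`, `judge_precision`, and alerting hooks are placeholders for your own infrastructure.

```python
import random
from statistics import mean

def monitor_retrieval_precision(recent_queries, retrieve, judge_precision,
                                sample_size=50, alert_threshold=0.7):
    """Score a sample of live queries and flag a drop in contextual precision.

    `retrieve` maps a query to retrieved chunks; `judge_precision` scores how
    relevant those chunks are to the query. Both are placeholders to wire up
    to real components.
    """
    sample = random.sample(recent_queries, min(sample_size, len(recent_queries)))
    scores = [judge_precision(q, retrieve(q)) for q in sample]
    avg = mean(scores) if scores else 0.0
    if avg < alert_threshold:
        print(f"ALERT: contextual precision {avg:.2f} below {alert_threshold}")
    return avg
```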