
RAG Evaluation: Measuring What Actually Matters

By the Confident AI Engineering Team · 13 min read

Retrieval-augmented generation has become the default architecture for production knowledge applications, and almost every team building RAG systems evaluates them incorrectly. The problem is not that teams skip evaluation entirely; it is that they measure end-to-end answer quality and stop there, leaving the two biggest sources of failure completely unmeasured.

A RAG pipeline has three distinct failure modes: the retriever surfaces wrong documents, the generator ignores the retrieved documents and hallucinates anyway, or the generator faithfully reproduces retrieved content that was itself incorrect. End-to-end accuracy metrics blend all three failures into a single number that tells you something went wrong but not what or where.

The Five Metrics That Cover the RAG Failure Space

Effective RAG evaluation requires metrics that decompose failure across the pipeline. These five cover the failure space without redundancy:

1. Contextual Recall. Does the retriever surface the documents that contain the information necessary to answer the query? Measured by asking a judge model whether each piece of information required for the expected answer can be found somewhere in the retrieved documents. Low recall means the generator is forced to answer from parametric knowledge, which increases hallucination risk. Target: >0.75 for most knowledge assistant applications.

2. Contextual Precision. What fraction of retrieved documents actually contain relevant information? A retriever with high recall but low precision surfaces noise along with signal, and the generator then has to distinguish relevant from irrelevant context, which degrades answer quality. A low-precision retriever is often more dangerous than a low-recall one because it hands the generator plausible-looking but irrelevant material that can be woven into a confident-sounding wrong answer.

3. Faithfulness. Are the claims in the generated answer grounded in the retrieved documents? Faithfulness is the key anti-hallucination metric for RAG. A faithfulness score of 0.92 means 92% of verifiable claims in the output can be traced to a specific retrieved document; the remaining 8% come from parametric memory or are fabricated. This is the metric most teams skip and the one most correlated with production hallucination incidents (a minimal scoring sketch follows this list).

4. Answer Relevancy. Does the answer actually address what the user asked? A high-faithfulness answer can still be irrelevant if the generator answers a related but different question than was posed. This metric catches cases where the model technically says something accurate but sidesteps the actual query.

5. Answer Correctness. For queries with known correct answers, is the answer factually right? This requires ground truth labels, which makes it the most expensive metric to compute but also the most directly meaningful for user trust.
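
To make the faithfulness calculation in metric 3 concrete, here is a minimal sketch of how such a score can be computed: extract atomic claims from the generated answer, ask a judge model whether each claim is supported by the retrieved documents, and take the supported fraction. The claim list and the `is_supported` judge callback are assumptions for illustration, not a specific platform API.

```python
from typing import Callable, List

def faithfulness_score(
    answer_claims: List[str],
    retrieved_docs: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of verifiable claims in the answer that are grounded in retrieved documents.

    answer_claims: atomic factual statements extracted from the generated answer
                   (claim extraction is itself typically done by an LLM).
    is_supported:  a judge-model callback (hypothetical here) returning True when
                   a claim can be traced to the retrieved context.
    """
    if not answer_claims:
        return 1.0  # nothing to verify
    supported = sum(1 for claim in answer_claims if is_supported(claim, retrieved_docs))
    return supported / len(answer_claims)

# Example: 23 of 25 extracted claims supported -> 0.92, as in the score discussed above.
```

The unsupported claims are exactly the items worth surfacing in evaluation reports, since they are the candidate hallucinations.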

The Common Mistake: Using BLEU and ROUGE for RAG

Teams with NLP backgrounds often reach for BLEU and ROUGE scores because they are familiar and easy to compute. For RAG evaluation, they are close to useless. Both metrics measure lexical overlap with a reference answer. An LLM that paraphrases the correct answer in entirely different words can score near zero on BLEU, while a response that copies large chunks of the reference or source text will score high even if it contains injected hallucinations.
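
To see the lexical-overlap failure concretely, here is a small sketch using the sacrebleu package (the package choice and the example sentences are assumptions for illustration): a correct paraphrase scores low, while a near-verbatim copy with an injected error scores high.

```python
import sacrebleu  # assumed dependency: pip install sacrebleu

reference = ["The warranty covers battery replacement for 24 months from the date of purchase."]

# A correct paraphrase that shares almost no n-grams with the reference.
paraphrase = "Battery swaps are included at no cost for two years after you buy the device."

# A near-verbatim copy with an injected factual error (24 months -> 12 months).
copied_with_error = "The warranty covers battery replacement for 12 months from the date of purchase."

print(sacrebleu.sentence_bleu(paraphrase, reference).score)         # low, despite being correct
print(sacrebleu.sentence_bleu(copied_with_error, reference).score)  # high, despite the error
```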

Semantic similarity metrics like BERTScore are better than BLEU/ROUGE but still operate on the end-to-end response, not the retrieval-generation pipeline decomposition. They are appropriate as an additional signal, not as the primary evaluation method.

Evaluating Retrieval Quality Independently

Retrieval and generation should be evaluated independently before they are evaluated together. This means building a retrieval evaluation dataset where each query has labeled relevant documents, distinct from your generation evaluation dataset where each query has a labeled expected output.
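
As a sketch of what that separation might look like in practice (field names here are illustrative, not a required schema), the two datasets label different things for the same query:

```python
# Retrieval evaluation record: the label is which documents should be retrieved.
retrieval_eval_example = {
    "query": "How do I rotate an API key?",
    "relevant_doc_ids": ["docs/security/api-keys.md", "docs/admin/key-rotation.md"],
}

# Generation evaluation record: the label is the expected answer text.
generation_eval_example = {
    "query": "How do I rotate an API key?",
    "expected_output": "Open Settings > API Keys, click Rotate, then update any services still using the old key.",
}
```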

Separate evaluation makes it possible to diagnose whether a degradation in end-to-end quality comes from the retriever (changed embedding model, corpus update) or the generator (system prompt change, model version update). Without this separation, you spend debugging time asking the wrong questions.

For retrieval evaluation, track Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) alongside the RAG-specific contextual recall and precision metrics. A sudden drop in MRR after a corpus update tells you the retriever is struggling with the new content before it degrades end-to-end answer quality.
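
Both ranking metrics can be computed directly from the retriever's ranked results and the labeled relevant documents. A minimal sketch, assuming binary relevance labels:

```python
import math
from typing import Sequence, Set

def mean_reciprocal_rank(ranked_ids_per_query: Sequence[Sequence[str]],
                         relevant_per_query: Sequence[Set[str]]) -> float:
    """MRR: average over queries of 1/rank of the first relevant document (0 if none retrieved)."""
    total = 0.0
    for ranked_ids, relevant in zip(ranked_ids_per_query, relevant_per_query):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)

def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """NDCG@k with binary relevance: discounted gain of hits, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked_ids[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```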

Handling Knowledge Base Updates

Knowledge base updates are underappreciated as a source of RAG quality regression. When documents are added, removed, or modified, the retriever's behavior changes without any change to the application code. A new document that contradicts a previous source can cause faithfulness and correctness to drop. A deleted document that was frequently retrieved can cause recall to collapse for a specific query category.

The practical implication: knowledge base updates should trigger evaluation runs just as code deploys do. Track which queries' performance changes after each corpus update. Queries that newly fail after an update are pointing at document-level conflicts or gaps that need content review, not code fixes.
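
A hedged sketch of that per-query tracking, assuming you store per-query scores (faithfulness, recall, or any gated metric) for the runs before and after a corpus update; the 0.10 drop threshold is illustrative:

```python
from typing import Dict

def regressed_queries(before: Dict[str, float],
                      after: Dict[str, float],
                      min_drop: float = 0.10) -> Dict[str, float]:
    """Return queries whose score dropped by at least min_drop after a knowledge base update.

    Keys are query strings; values are per-query metric scores (e.g. faithfulness).
    """
    return {
        query: round(before[query] - after[query], 3)
        for query in before
        if query in after and before[query] - after[query] >= min_drop
    }

# Queries returned here are candidates for content review (conflicting or missing documents),
# not application code fixes.
```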

A Practical Baseline for RAG Evaluation

If you are setting up RAG evaluation for the first time, this is a viable starting configuration that will surface the most common failure modes:

  • Dataset size: 50-100 queries, mix of factual and reasoning tasks representative of real user queries
  • Metrics: faithfulness (primary), contextual recall, answer relevancy
  • Gate thresholds: faithfulness >0.85, recall >0.70, relevancy >0.80
  • Run triggers: every deploy, and every corpus update that changes document count by more than 5%
  • Baseline capture: run on day one, compare all subsequent runs against baseline delta
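
A minimal sketch of turning those gate thresholds into a pass/fail check in CI, assuming aggregate scores are mean per-query values; the metric names are illustrative:

```python
# Gate thresholds from the baseline configuration above.
THRESHOLDS = {"faithfulness": 0.85, "contextual_recall": 0.70, "answer_relevancy": 0.80}

def evaluation_gate(run_scores: dict) -> bool:
    """Return False (block the deploy or corpus update) if any aggregate metric is below threshold."""
    passed = True
    for metric, threshold in THRESHOLDS.items():
        score = run_scores.get(metric)
        if score is None or score < threshold:
            print(f"GATE FAIL: {metric}={score} (threshold {threshold})")
            passed = False
    return passed

# Example: evaluation_gate({"faithfulness": 0.91, "contextual_recall": 0.73, "answer_relevancy": 0.78})
# fails on answer_relevancy and blocks the release.
```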

Faithfulness is the non-negotiable starting point. If you can only instrument one metric, instrument faithfulness. An application with high faithfulness has dramatically lower hallucination risk regardless of the retrieval quality metrics, because the model is at least staying grounded in retrieved content. Everything else is optimization from there. As we covered in the hallucination rate analysis, faithfulness in controlled evaluation does correlate with production hallucination rates when the evaluation dataset is representative.

Confident AI supports all five RAG metrics natively with no custom evaluator configuration required. See the RAG evaluation setup guide or start a free trial.
