Teams that optimize LLMs on a single accuracy metric often end up with models that score well on their evaluation dataset while performing poorly in production. The reason is almost always the same: accuracy measures whether outputs are correct, but users experience many more dimensions of quality than correctness alone.
This guide covers the metrics framework we have found most useful for LLM evaluation across production applications — including dimensions that are frequently overlooked and the practical approaches to measuring each one.
Why Accuracy Alone Is Insufficient
Consider a customer service LLM evaluated only on whether it answers questions correctly. The model achieves 92% accuracy on your test set. It ships. Users complain. Why?
Because it answers correctly but at three times the necessary length, burying the key information in filler text. Because it uses a tone that feels cold and robotic compared to what customers expect from your brand. Because it sometimes answers with technically correct information that is not actually responsive to what the user asked. And because on 8% of questions it gives wrong answers, and those happen to be the answers that damage trust most severely.
Accuracy alone tells you whether the model is right. It tells you nothing about whether the model is useful.
The Core Quality Dimensions
Faithfulness / Grounding. Does the output accurately reflect the information in the provided context, without introducing unsupported claims? Faithfulness is the primary metric for RAG applications and any system that generates outputs based on provided documents. A model that scores well on accuracy but poorly on faithfulness is generating plausible-sounding additions to the provided context — a form of hallucination even when the core answer is correct.
Measurement approach: Natural language inference (NLI) scoring against source documents. For each claim in the output, determine whether it is entailed by, neutral relative to, or contradicts the source material. The ratio of entailed claims to total claims gives you a faithfulness score.
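As a concrete sketch, the scoring loop is simple once you have a claim splitter and an NLI call. Both helpers below, split_into_claims and nli_label, are hypothetical stand-ins for whatever claim extractor and NLI model you use:

```python
# Minimal sketch of claim-level faithfulness scoring.
# split_into_claims() and nli_label() are hypothetical helpers: the first is
# a sentence splitter or LLM-based claim extractor, the second wraps any NLI
# model and returns "entailment", "neutral", or "contradiction".

def faithfulness_score(output: str, source: str) -> float:
    claims = split_into_claims(output)
    if not claims:
        return 1.0  # nothing asserted, so nothing unfaithful
    entailed = sum(
        1 for claim in claims
        if nli_label(premise=source, hypothesis=claim) == "entailment"
    )
    return entailed / len(claims)  # fraction of claims supported by the source
```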
Relevance. Does the response actually address what the user asked? This sounds basic but is frequently violated in subtle ways — models often partially address a question while missing its core intent, or answer a related question that was easier to answer. Relevance evaluation requires semantic understanding of query intent, not just keyword overlap.
Measurement approach: Cosine similarity between question embedding and answer embedding captures surface-level relevance but misses intent-level mismatches. LLM-as-judge evaluation with a relevance rubric is more reliable for nuanced cases. Measuring user engagement (did the user ask a follow-up indicating the answer was unsatisfying?) is a useful complementary signal from production.
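For the surface-level check, a minimal sketch with sentence-transformers (the model name is only an illustrative choice) looks like this:

```python
# Surface-level relevance: cosine similarity between question and answer
# embeddings. Catches topical drift, not intent-level mismatches.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_relevance(question: str, answer: str) -> float:
    q_vec, a_vec = model.encode([question, answer], convert_to_tensor=True)
    return util.cos_sim(q_vec, a_vec).item()  # roughly -1.0 to 1.0; higher means more related
```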
Completeness. Does the output include all the information necessary to be useful? A response that is technically accurate but omits critical information may be worse than no response at all. For summarization tasks, completeness measures whether key information from the source was included. For Q&A tasks, it measures whether all aspects of a multi-part question were addressed.
Measurement approach: For summarization, recall-oriented metrics (ROUGE-2 recall, BERTScore recall) capture whether key content was included. For structured tasks with defined required outputs, schema validation against required fields is most reliable. For conversational Q&A, use LLM-based completeness scoring against a rubric.
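For the structured-output case, the schema check is plain rule-based code; the field names below are hypothetical:

```python
# Rule-based completeness for structured outputs: every required field must
# be present and non-empty. REQUIRED_FIELDS is a placeholder schema.
import json

REQUIRED_FIELDS = ["order_id", "status", "resolution"]

def completeness_score(raw_output: str) -> float:
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output counts as fully incomplete
    present = sum(1 for field in REQUIRED_FIELDS if data.get(field) not in (None, "", []))
    return present / len(REQUIRED_FIELDS)
```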
Coherence and Logical Consistency. Is the response internally consistent and logically structured? Does it contradict itself? Does the reasoning flow in a way that makes sense? A model can produce responses that are individually accurate sentence by sentence but collectively incoherent as a complete response. This failure mode is more common on longer outputs and on complex reasoning tasks.
Measurement approach: Coherence is one of the harder dimensions to measure automatically. Trained coherence classifiers exist for some domains. For general-purpose applications, LLM-as-judge scoring against a coherence rubric ("does this response follow a clear logical structure without internal contradictions?") is practical. Discourse coherence metrics from NLP research (entity grid models, RST-based metrics) offer more rigorous measurement for structured text generation tasks.
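A minimal LLM-as-judge sketch for that rubric, assuming a call_llm wrapper around whatever chat model serves as the judge (the 1-to-5 scale is illustrative):

```python
# LLM-as-judge coherence scoring. call_llm() is a hypothetical wrapper around
# your chat-completion client; in practice you would validate and retry on
# malformed judge output.
COHERENCE_RUBRIC = """Rate the response from 1 (incoherent) to 5 (fully coherent).
Consider: Does it follow a clear logical structure? Does any statement
contradict another? Does the conclusion follow from the reasoning?
Respond with a single integer.

Response to rate:
{response}"""

def coherence_score(response: str) -> int:
    raw = call_llm(COHERENCE_RUBRIC.format(response=response))
    return int(raw.strip())
```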
Tone and Register Adherence. Does the model maintain the appropriate tone for your application and user base? Tone mismatches are particularly noticeable to users even when they cannot articulate why a response feels wrong. A formal tone in a casual consumer app, or an overly casual response in a professional financial context, both erode trust even when the content is correct.
Measurement approach: Fine-tuned sentiment and formality classifiers can detect obvious tone deviations. For more nuanced brand voice adherence, LLM-as-judge with a specific tone rubric based on your brand guidelines performs better. Collecting user satisfaction ratings stratified by user segment helps validate that your tone metric captures what users actually care about.
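For the classifier route, an off-the-shelf sentiment model works as a crude first pass; a formality or brand-voice classifier plugs in the same way. The truncation and threshold below are assumptions to tune per application:

```python
# Crude tone-deviation check using a generic sentiment classifier as a proxy.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default English sentiment model

def flags_tone_deviation(response: str, min_confidence: float = 0.8) -> bool:
    result = sentiment(response[:512])[0]  # crude character-level truncation for the classifier
    # Flag responses that read as strongly negative when the brand voice is meant to be warm.
    return result["label"] == "NEGATIVE" and result["score"] > min_confidence
```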
Conciseness and Format Compliance. Does the response fit within expected length bounds? Does it follow format requirements (markdown, bullet lists, specific response structures)? Format violations often have measurable downstream consequences — responses that are too long get truncated or ignored; responses that violate expected structure break downstream parsing.
Measurement approach: Length bounds and format validation are rule-based and straightforward to implement. Token count distributions across your test cases reveal whether a model update has made responses systematically longer or shorter.
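These checks amount to a few lines of validation code; the bounds and the bullet-list requirement below are example constraints, not recommendations:

```python
# Rule-based length and format checks for a single response.
def format_report(response: str, min_tokens: int = 20, max_tokens: int = 300) -> dict:
    n_tokens = len(response.split())  # whitespace tokens as a cheap proxy for token count
    return {
        "within_length": min_tokens <= n_tokens <= max_tokens,
        "has_bullets": any(line.lstrip().startswith(("-", "*")) for line in response.splitlines()),
        "token_count": n_tokens,
    }
```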
Operational Quality Metrics
Beyond the linguistic quality dimensions, operational metrics measure whether the model is working well as a production system component.
Latency. P50 and P95 latency matter differently. P50 latency determines the typical user experience; P95 determines whether the long tail of slow responses creates support tickets. Track both, and track how they change across model updates — a more accurate model that is 40% slower may not be a net improvement for your use case.
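Computing both from per-request latencies is a one-liner; the values below are made up for illustration:

```python
# P50 / P95 latency from a list of per-request timings (milliseconds).
import numpy as np

latencies_ms = [820, 940, 1100, 760, 2900, 880, 910, 1500, 790, 860]  # illustrative data
p50, p95 = np.percentile(latencies_ms, [50, 95])
print(f"P50: {p50:.0f} ms, P95: {p95:.0f} ms")
```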
Cost per output. Cost per request and per evaluation run are direct operational metrics. Track cost alongside quality scores so you can make informed trade-offs when comparing model versions or providers. A model that is 5% more accurate but 3x more expensive deserves a cost-benefit analysis, not automatic adoption.
Error and retry rate. API errors, timeouts, and content policy refusals are quality failures even though they are not output quality issues. A model that refuses to answer 3% of legitimate queries has an effective 97% availability on your use case — include that in your quality reporting.
Building a Composite Quality Score
With multiple quality dimensions tracked, you need a way to synthesize them into actionable release decisions. A weighted composite score — where each dimension's weight reflects its importance to your specific application — is more useful than tracking each metric independently for go/no-go decisions.
Set dimension weights based on what matters most for your users, not on which dimensions are easiest to measure. For a medical information application, faithfulness and safety should carry the highest weights. For a creative writing assistant, coherence and tone adherence matter more than strict factual accuracy.
Use the composite score for trend monitoring and initial regression detection. Use individual dimension scores for root cause analysis when a regression is detected. This combination gives you both the high-level signal you need for operational decisions and the diagnostic detail you need to act on them.
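A minimal sketch of the composite, assuming each dimension has already been normalized to a 0-1 score; the weights are illustrative, not a recommendation:

```python
# Weighted composite over per-dimension quality scores (all on a 0-1 scale).
WEIGHTS = {"faithfulness": 0.35, "relevance": 0.25, "completeness": 0.15,
           "coherence": 0.10, "tone": 0.10, "format": 0.05}  # example weights summing to 1.0

def composite_score(scores: dict) -> float:
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# Use the composite for the go/no-go signal, the per-dimension scores for diagnosis.
candidate = {"faithfulness": 0.91, "relevance": 0.88, "completeness": 0.84,
             "coherence": 0.93, "tone": 0.90, "format": 0.97}
print(f"composite: {composite_score(candidate):.3f}")
```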
Confident AI tracks all of these quality dimensions in a unified dashboard: accuracy, faithfulness, relevance, coherence, tone adherence, and operational metrics, all versioned and comparable across model runs. See the metrics dashboard →