G-Eval and Beyond: Modern Metrics for LLM Performance

January 16, 2026

For years, NLP evaluation relied on metrics like BLEU and ROUGE — statistical measures that compare n-gram overlap between generated text and a reference string. They were easy to compute and gave consistent numbers. They also correlated poorly with whether anyone found the output useful.

The shift to LLM-as-a-judge evaluation, formalized in frameworks like G-Eval, changed what's possible. Instead of comparing tokens, you use a model to evaluate quality against a rubric. The results correlate much better with human judgment — and for product teams, that alignment is what actually matters.

How G-Eval works

G-Eval's core mechanism is a chain-of-thought prompt that asks an evaluator model to score a response on a specific criterion. Rather than asking "rate this response 1-5," the prompt walks the evaluator through a reasoning process: consider the criterion, generate a judgment, then produce a score.

The reasoning step is what makes G-Eval scores more stable and more interpretable than direct scoring. The evaluator model has to articulate why a response scores where it does. That reasoning is often as useful as the score itself — it tells you what specifically caused a low score, which points to what to fix.
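
To make this concrete, here is a minimal sketch of the kind of prompt involved. The wording and the 1-5 range are illustrative, not the exact template from the G-Eval paper:

```python
# A minimal G-Eval-style prompt for a single criterion (coherence).
# The structure -- criterion, evaluation steps, reasoning, then a score --
# is the important part; the exact wording here is illustrative.
COHERENCE_PROMPT = """You are evaluating a response for coherence.

Criterion:
Coherence (1-5): the response is well-structured, each point follows
logically from the previous one, and nothing contradicts anything else.

Evaluation steps:
1. Read the task and the response.
2. Note any jumps in logic, repeated points, or internal contradictions.
3. Explain your reasoning in two or three sentences.
4. On the final line, output only: Score: <1-5>

Task:
{task}

Response:
{response}
"""
```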

Scores are typically continuous rather than integers: the final value is an average of the candidate scores, weighted by the evaluator's token probability for each one. This produces more differentiation between close cases than integer scoring alone.
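
If your evaluator API exposes token log-probabilities, the weighting step is a few lines. A sketch, assuming the log-probabilities for the candidate score tokens have already been extracted:

```python
import math

def expected_score(score_logprobs: dict[int, float]) -> float:
    """Collapse the evaluator's distribution over integer scores into a
    single continuous value: the probability-weighted average of the
    candidate scores, renormalized over the scores actually observed."""
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p for score, p in probs.items()) / total

# The evaluator leans toward 4 but gives 3 real weight: the continuous
# score lands between them instead of snapping to an integer.
print(expected_score({3: math.log(0.3), 4: math.log(0.6), 5: math.log(0.1)}))  # 3.8
```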

The five metrics that cover most use cases

Coherence. Does the response flow logically from one point to the next? Is it internally consistent? This catches responses that technically answer the question but are disorganized or contradictory.

Faithfulness. For grounded tasks — document QA, RAG responses, summarization — does every claim in the output trace back to the provided source? This is the primary hallucination check for systems that have access to a ground-truth context.

Relevance. Does the response actually address what was asked? This sounds obvious, but it catches a common failure mode: responses that are accurate and well-written but answer a question adjacent to the one actually posed.

Fluency. Is the response grammatically correct and natural-sounding? Lower-weight for most product use cases but important for customer-facing applications where rough outputs affect perceived quality.

Task completion. Did the model do what the user asked? This is a custom metric you define based on your specific task — for a coding assistant, it might mean "the code compiles and passes the specified tests"; for a support assistant, "the user's question is answered."
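
In practice these tend to live as a small table of criterion definitions that feed the same evaluation prompt. A sketch with illustrative wording:

```python
# Hypothetical criterion definitions; the wording should be tuned to your
# task, especially task_completion, which is yours to define.
CRITERIA = {
    "coherence": "The response is well-organized and internally consistent.",
    "faithfulness": "Every claim in the response is supported by the provided context.",
    "relevance": "The response addresses the question that was actually asked.",
    "fluency": "The response is grammatical and reads naturally.",
    "task_completion": (
        "The user's request is fully carried out (for a coding assistant: "
        "the code compiles and passes the specified tests)."
    ),
}
```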

Where LLM-as-a-judge breaks down

The evaluator model has its own biases and failure modes. Known issues: evaluator models tend to prefer longer responses, even when shorter responses are better. They show positional bias — rating the first response in a comparison higher, regardless of content. They score outputs from their own model family more favorably.

These biases are manageable but need to be accounted for. Using a different model family for evaluation than for generation reduces self-preference bias. Randomizing response order in A/B comparisons neutralizes positional effects. For length bias, including an explicit rubric element that penalizes unnecessary verbosity helps.
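
Order randomization is easy to mechanize: judge each pair twice with the positions swapped, and only count a preference that survives the swap. A sketch, assuming a judge_prefers(first, second) call that returns "A" or "B":

```python
def debiased_preference(response_1: str, response_2: str, judge_prefers) -> str | None:
    """Ask the judge twice with the positions swapped; return "1", "2",
    or None (tie / positional artifact) so positional bias cannot decide
    the comparison on its own."""
    first_pass = judge_prefers(response_1, response_2)   # "A" means the first slot won
    second_pass = judge_prefers(response_2, response_1)

    if first_pass == "A" and second_pass == "B":
        return "1"   # response_1 wins from both positions
    if first_pass == "B" and second_pass == "A":
        return "2"   # response_2 wins from both positions
    return None      # the winner flipped with the order: treat as a tie
```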

Calibration matters too: what does a 7/10 actually mean for your rubric? Run a set of known-quality responses through the evaluator early to establish what the score distribution looks like and where your thresholds should sit.
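
That calibration run can be as simple as scoring a hand-labeled set and looking at where the good and bad distributions separate. A sketch, assuming you already have evaluator scores for responses labeled acceptable or not:

```python
import statistics

def suggest_threshold(good_scores: list[float], bad_scores: list[float]) -> float:
    """Pick a starting threshold halfway between the typical good score and
    the typical bad score, then inspect the overlap by hand."""
    good_median = statistics.median(good_scores)
    bad_median = statistics.median(bad_scores)
    return (good_median + bad_median) / 2

# Example: known-good responses cluster around 4.2, known-bad around 2.8,
# so a first-pass threshold near 3.5 is a reasonable place to start.
print(suggest_threshold([4.4, 4.1, 4.3, 3.9], [2.9, 2.6, 3.1, 2.7]))  # 3.5
```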

Beyond G-Eval: task-specific metrics

G-Eval covers general quality dimensions. For specialized applications, you need task-specific metrics that capture what "good" means for your particular use case.

For RAG systems, contextual precision and recall: of the retrieved chunks, how many were actually relevant? Of the relevant chunks in the knowledge base, how many were retrieved? These measure the retrieval quality independently of the generation quality.
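
Both quantities are set arithmetic once you know which chunks were relevant. A sketch, with chunk IDs standing in for the chunks themselves:

```python
def contextual_precision(retrieved: set[str], relevant: set[str]) -> float:
    """Of the chunks we retrieved, what fraction were actually relevant?"""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def contextual_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Of the relevant chunks in the knowledge base, what fraction did we retrieve?"""
    if not relevant:
        return 1.0   # nothing to find, so nothing was missed
    return len(retrieved & relevant) / len(relevant)

retrieved = {"doc-3", "doc-7", "doc-9", "doc-12"}
relevant = {"doc-3", "doc-7", "doc-15"}
print(contextual_precision(retrieved, relevant))  # 2/4 = 0.5
print(contextual_recall(retrieved, relevant))     # 2/3 ~ 0.67
```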

For code generation, functional correctness: does the code produce the expected output given specified inputs? This is deterministic and can run in CI without an LLM evaluation step.
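
A sketch of what that looks like as an ordinary unit test in CI; the slugify function and its expected outputs are hypothetical stand-ins for whatever the model generated:

```python
# Runs as a normal unit test in CI; no LLM call involved.
# `generated_module.slugify` is a hypothetical function produced by the model.
import pytest
from generated_module import slugify

CASES = [
    ("Hello World", "hello-world"),
    ("  Already--slugged  ", "already-slugged"),
    ("Trailing punctuation!", "trailing-punctuation"),
]

@pytest.mark.parametrize("text,expected", CASES)
def test_slugify_matches_expected_output(text, expected):
    assert slugify(text) == expected
```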

For conversational agents, multi-turn coherence: does the model maintain consistent context across a conversation, remember what was established earlier, and not contradict itself across turns?
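
One way to operationalize this is to hand the judge the full transcript and ask it to check the final turn against everything established before it. An illustrative prompt sketch, wording hypothetical:

```python
# Illustrative multi-turn consistency prompt; tune the wording to your task.
CONSISTENCY_PROMPT = """Below is a conversation between a user and an assistant.

{transcript}

Check the assistant's final message against the earlier turns:
1. Does it respect facts and decisions established earlier in the conversation?
2. Does it contradict anything the assistant itself said before?
Explain briefly, then output only: Score: <1-5>
"""
```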

Combining metrics into a decision

Individual metrics produce individual scores. Your evaluation suite needs a way to combine them into a deployment decision. The simplest approach is a threshold per metric: a PR fails if any blocking metric drops below its threshold. More sophisticated setups weight metrics by importance and compute a composite score.

The composite score approach has a trap: a high score on easy metrics can mask a failure on a critical one. If faithfulness drops to 0.4 but your composite stays above threshold because coherence and fluency are high, you'll ship a response that hallucinates while being well-organized. Thresholds per metric prevent this.
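
A sketch of the per-metric gate, with illustrative thresholds; a composite can still be computed for reporting, but it never makes the pass/fail call:

```python
# Illustrative thresholds; tune them against your calibration runs.
BLOCKING_THRESHOLDS = {
    "faithfulness": 0.7,
    "relevance": 0.6,
    "task_completion": 0.8,
}

def gate(scores: dict[str, float]) -> tuple[bool, list[str]]:
    """Fail the run if any blocking metric is below its threshold,
    regardless of how high the other scores are."""
    failures = [
        name for name, threshold in BLOCKING_THRESHOLDS.items()
        if scores.get(name, 0.0) < threshold
    ]
    return (not failures, failures)

# High coherence and fluency cannot rescue a faithfulness of 0.4.
ok, failed = gate({"faithfulness": 0.4, "relevance": 0.9, "task_completion": 0.9,
                   "coherence": 0.95, "fluency": 0.98})
print(ok, failed)  # False ['faithfulness']
```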

Build your metric stack incrementally. Two metrics you understand and calibrate well are more valuable than seven metrics you're not sure how to interpret.
