Hallucination Detection: How to Catch What Your AI Gets Wrong

March 14, 2026

A hallucination isn't a model acting unpredictably. It's a model acting very predictably — confidently producing text that sounds right but isn't. The problem isn't that models occasionally make things up. The problem is that they make things up in ways that are hard to spot without a systematic check.

For low-stakes use cases, this is annoying. For anything touching financial data, legal documents, medical information, or customer support for a real product, it's a liability. Detecting hallucinations before users do is not optional at that point.

What hallucinations actually look like

There's a tendency to think of hallucinations as obvious — the model invents a fictional statistic or attributes a quote to the wrong person. Those do happen. But the harder category is subtle hallucinations: answers that are almost right, responses that blend real and invented facts, summaries that accurately convey the gist but quietly change a specific number.

In a RAG system, hallucinations often look like this: the model retrieves a document and summarizes it, but introduces a detail that wasn't in the source. The answer is grounded in real content — just not accurately. This is the failure mode that automated string matching misses entirely.

Three types of detection

Not all hallucination detection works the same way. Which approach you use depends on your system architecture and what kind of accuracy you need.

Source-grounding checks. If your model is supposed to answer only from a provided context — a document, a knowledge base, a set of retrieved chunks — you can check whether every factual claim in the response traces back to that source. This works well for RAG systems and document QA. It's more precise than general hallucination scoring because it defines "hallucinated" as "not present in the provided context."
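Here's a minimal sketch of that idea in Python. It assumes a hypothetical `call_llm` function standing in for whatever model client you use, and the claim-extraction and verification prompts are illustrative, not any specific library's API:

```python
# Minimal sketch of a source-grounding check. `call_llm` is a hypothetical
# wrapper around your model API; the prompts are illustrative only.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def extract_claims(response: str) -> list[str]:
    # Ask a model to break the response into atomic factual claims,
    # one per line. Simpler variants just split on sentence boundaries.
    raw = call_llm(
        "List each factual claim in the following text, one per line:\n\n"
        + response
    )
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]

def grounded_in_context(claim: str, context: str) -> bool:
    # "Hallucinated" here means "not supported by the provided context".
    verdict = call_llm(
        "Context:\n" + context + "\n\n"
        "Claim: " + claim + "\n\n"
        "Is the claim fully supported by the context? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def grounding_score(response: str, context: str) -> float:
    claims = extract_claims(response)
    if not claims:
        return 1.0  # nothing factual to check
    supported = sum(grounded_in_context(c, context) for c in claims)
    return supported / len(claims)
```

A score of 1.0 means every extracted claim traced back to the context; anything lower tells you which claims to look at.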

Factual consistency scoring. For models that draw on general knowledge rather than a specific source, you need a different approach. This typically involves using a second model to evaluate whether the response is internally consistent and whether specific claims hold up against verifiable facts. LLM-as-a-judge setups can do this at scale, though they introduce their own failure modes.
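A judge for this can be as simple as a rubric prompt that returns a numeric score. The sketch below reuses the hypothetical `call_llm` wrapper from the previous example; the rubric wording and 1–5 scale are illustrative choices, not a standard:

```python
# Sketch of a factual-consistency judge built on the same hypothetical
# call_llm wrapper. The rubric and score scale are illustrative.

JUDGE_PROMPT = """You are checking a response for factual accuracy.

Question: {question}
Response: {response}

Rate the response from 1 (contains clear factual errors) to 5 (every
specific claim is accurate). Reply with the number only."""

def factual_consistency(question: str, response: str) -> int:
    raw = call_llm(JUDGE_PROMPT.format(question=question, response=response))
    try:
        return max(1, min(5, int(raw.strip().split()[0])))
    except (ValueError, IndexError):
        return 1  # treat an unparseable verdict as a failure, not a pass
```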

Contradiction detection. A lighter-weight check: does the model contradict itself within a single response, or contradict something it said earlier in the conversation? This doesn't catch all hallucinations but catches a class of errors that often indicates a model struggling with the input.
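One way to sketch this is a pairwise check of the newest model turn against everything said earlier in the conversation. Again this assumes the hypothetical `call_llm` wrapper; a cheaper variant swaps in a dedicated NLI model for the yes/no question:

```python
# Sketch of a contradiction check across conversation turns. Assumes the
# hypothetical call_llm wrapper from the earlier sketches.

def contradicts(earlier: str, latest: str) -> bool:
    verdict = call_llm(
        "Statement A: " + earlier + "\n"
        "Statement B: " + latest + "\n\n"
        "Does statement B contradict statement A? Answer YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def self_consistent(turns: list[str]) -> bool:
    # Compare the newest model turn against everything said before it.
    latest = turns[-1]
    return not any(contradicts(earlier, latest) for earlier in turns[:-1])
```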

Building detection into your pipeline

Running hallucination detection manually on a sample of outputs is better than nothing, but it doesn't scale and it doesn't catch regressions. The goal is to have detection running automatically, on every output you care about.

For evaluation pipelines, this means adding hallucination metrics to your test suite alongside other metrics like relevance and conciseness. When you run evaluations on a new model version or a prompt change, hallucination scores should be part of the report.
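Concretely, that can look like the sketch below, which wires the earlier checks into an eval run so hallucination scores sit next to the other metrics. The case fields, metric names, and `generate` callable are illustrative, not a particular framework's schema:

```python
# Sketch of an eval run that reports hallucination metrics alongside
# other scores. `generate(question, context)` is a stand-in for your
# application's generation path.

def evaluate_case(case: dict, generate) -> dict:
    response = generate(case["question"], case["context"])
    return {
        "id": case["id"],
        "grounding": grounding_score(response, case["context"]),
        "consistency": factual_consistency(case["question"], response),
        # ...relevance, conciseness, and other metrics go here too
    }

def run_suite(cases: list[dict], generate) -> dict:
    results = [evaluate_case(c, generate) for c in cases]
    return {
        "mean_grounding": sum(r["grounding"] for r in results) / len(results),
        "worst_case": min(results, key=lambda r: r["grounding"])["id"],
        "results": results,
    }
```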

For production monitoring, it means sampling live outputs and running them through your detection layer continuously. You're not catching individual hallucinations before they reach users — you're catching patterns that indicate your model has drifted into a higher hallucination rate.
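A rough sketch of that sampling loop, with illustrative numbers for the sample rate, window size, and alert cutoff:

```python
# Sketch of continuous sampling in production: score a fraction of live
# outputs and alert on drift rather than on individual failures. The
# sample rate, window size, and alert threshold are illustrative.

import random
from collections import deque

SAMPLE_RATE = 0.05           # check roughly 5% of traffic
window = deque(maxlen=500)   # rolling window of recent grounding scores

def alert(message: str) -> None:
    print("ALERT:", message)  # stand-in for your paging/alerting hook

def observe(question: str, context: str, response: str) -> None:
    if random.random() > SAMPLE_RATE:
        return
    window.append(grounding_score(response, context))
    if len(window) == window.maxlen:
        rate = sum(1 for s in window if s < 0.8) / len(window)
        if rate > 0.10:  # more than 10% of sampled outputs look ungrounded
            alert(f"hallucination rate drifted to {rate:.1%}")
```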

Setting thresholds that make sense

Detection without thresholds is just metrics. You need to decide which scores count as "pass" and which count as "fail." This is domain-specific and there's no universal right answer.

A medical information tool should have a very low tolerance for hallucinations — maybe zero tolerance for claims about dosages or contraindications. A creative writing assistant can tolerate much more because the outputs aren't meant to be factually precise. A customer support bot falls somewhere in between: it needs to be accurate about your product and policies, but not every turn of phrase needs to be verifiable.

Start with a threshold that blocks obvious failures and tighten it over time as you understand your system's behavior. A threshold you actually enforce is more valuable than a perfect threshold you never set.
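In code, the gate can be as simple as a lookup table keyed by content category. The categories and cutoffs below are illustrative; the point is that the threshold is explicit and actually enforced:

```python
# Sketch of enforcing domain-specific thresholds on a grounding score.
# Category names and cutoffs are illustrative, not recommendations.

THRESHOLDS = {
    "dosage_or_contraindication": 1.00,  # zero tolerance: every claim grounded
    "product_and_policy": 0.90,
    "general_chat": 0.70,
}

def passes_gate(category: str, score: float) -> bool:
    # Unknown categories default to the strict side.
    return score >= THRESHOLDS.get(category, 0.90)
```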

The limits of current detection

Hallucination detection is not a solved problem. Current approaches have real limitations: LLM-as-a-judge evaluation introduces the evaluator's own biases and occasional errors. Source-grounding checks don't work for questions that require reasoning beyond what's in the context. Factual consistency scoring can miss hallucinations phrased carefully enough to avoid obvious contradictions.

Knowing these limits matters because it shapes how you use the results. Hallucination scores are signal, not ground truth. A score of zero doesn't mean the model never hallucinates — it means none of your current test cases triggered it. Your test suite needs to keep growing as you see new failure patterns in production.

Starting point for most teams

If you're using a RAG system, start with faithfulness scoring: does each response stay within the bounds of the retrieved context? This is the highest-leverage check for most document QA and customer support applications.

If you're not using RAG, start with a small set of known-answer questions that cover your most sensitive topics. Run them on every model change. Track the scores over time. If a change makes things worse, you'll see it before it ships.
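A minimal version of that regression check might look like the following. The question set, the exact-substring matching, and the history file are deliberate simplifications — real sensitive-topic checks usually need a judge rather than string matching:

```python
# Sketch of a known-answer regression check run on every model change,
# with scores appended to a history file so drops are visible over time.
# The questions, fields, and substring check are illustrative.

import json
import time

KNOWN_ANSWERS = [
    {"question": "What is the maximum refund window?", "must_contain": "30 days"},
    {"question": "Which plan includes priority support?", "must_contain": "Enterprise"},
]

def run_regression(generate, model_version: str,
                   path: str = "hallucination_history.jsonl") -> float:
    hits = 0
    for case in KNOWN_ANSWERS:
        response = generate(case["question"])
        hits += case["must_contain"].lower() in response.lower()
    score = hits / len(KNOWN_ANSWERS)
    with open(path, "a") as f:
        f.write(json.dumps({"ts": time.time(),
                            "model": model_version,
                            "score": score}) + "\n")
    return score
```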

The goal isn't to eliminate hallucinations completely — that's not possible with current models. The goal is to catch the ones that matter, at the scale you need to operate.