How a B2B SaaS Team Reduced Chatbot Hallucinations by 78% in Six Weeks
By the Confident AI Customer Success Team · 12 min read
The team lead's description of the problem was specific: "We know the chatbot is hallucinating because support tickets reference answers the bot gave that we cannot find anywhere in our documentation. We do not know how often it happens, where in the conversation it happens, or why." That specificity is what made the engagement solvable. They knew what was wrong; they just had no instrumentation to measure it, locate it, or track whether fixes were working.
This is a composite case study representing patterns from multiple customer engagements, with specific numbers anonymized. The problem structure, implementation approach, and results are representative of what we see across B2B SaaS customer support applications.
The Starting State
The application: a customer support chatbot for a B2B SaaS platform, built on GPT-4 with a RAG pipeline indexing the product documentation, API reference, and support KB articles. Approximately 400 queries per day, handled by the bot with human escalation for unresolved cases.
Known symptoms: support ticket escalations citing bot answers that did not match documented behavior. Approximately 8-12 tickets per week with this pattern. Estimated hallucination rate based on ticket volume: somewhere between 5% and 25%, with no precise measurement.
Previous attempts: the team had modified the system prompt three times in the previous month, each time adding more explicit instructions intended to reduce hallucination. Each modification changed the behavior, but with no measurement in place there was no way to tell whether the hallucination rate had actually improved.
Week 1: Establishing the Baseline
The first task was getting a precise measurement of the actual hallucination rate. We built an initial evaluation dataset of 80 queries by sampling from two weeks of production logs, prioritizing queries that had resulted in escalations and queries that covered the product areas with the most documentation depth.
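In practice that sampling step is a small script. Here is a minimal sketch, assuming a hypothetical logs.jsonl export with query, response, retrieval_context, and escalated fields; the field names, the escalation cap, and the fixed seed are illustrative choices, not the exact pipeline used in this engagement.

```python
import json
import random

# Hypothetical export of two weeks of production logs: one JSON object
# per line with "query", "response", "retrieval_context", and
# "escalated" fields. Field names are illustrative.
with open("logs.jsonl") as f:
    records = [json.loads(line) for line in f]

escalated = [r for r in records if r["escalated"]]
others = [r for r in records if not r["escalated"]]

# Take escalated conversations first (capped at half the dataset), then
# fill the remainder with a random sample of everything else.
random.seed(42)  # fixed seed so the baseline dataset is reproducible
sample = escalated[:40] + random.sample(others, 80 - len(escalated[:40]))

with open("eval_dataset.jsonl", "w") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")
```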
Running the full evaluation suite against the current system revealed a faithfulness score of 0.73 and an answer correctness score of 0.71 on the sampled dataset. The hallucination rate on faithfulness-scored queries was 18.4%, higher than the team had assumed and concentrated in a specific query category: questions about feature capabilities that had been deprecated or changed in the most recent product release.
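For reference, the kind of evaluation run that produces these numbers looks roughly like this with DeepEval, the open-source framework Confident AI maintains. The query, answer, and context below are placeholder values, not customer data, and the threshold is illustrative.

```python
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# One LLMTestCase per sampled production query; placeholder values shown.
test_case = LLMTestCase(
    input="Can I deactivate users in bulk through the API?",
    actual_output="Yes, send a DELETE request to /v1/users/bulk.",
    retrieval_context=[
        "Users can be deactivated one at a time from the admin console.",
    ],
)

# Faithfulness checks whether every claim in the output is supported by
# the retrieved context; cases scoring below the threshold are flagged.
evaluate(test_cases=[test_case], metrics=[FaithfulnessMetric(threshold=0.85)])
```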
The RAG corpus had not been updated for the most recent release. The hallucination was not a model problem; it was a retrieval problem. The model was faithfully answering from documentation that described the previous product version, because that was all the corpus contained.
Weeks 2-3: Root Cause and Initial Fixes
Updating the RAG corpus with current documentation brought the hallucination rate down from 18.4% to 9.1% by eliminating most of the out-of-date-corpus failures. Progress, but not yet acceptable. The remaining hallucinations fell into two categories visible in the trace-level evaluation data: queries where no retrieved document contained the relevant information (the model was answering from parametric knowledge), and queries where multiple retrieved documents gave conflicting information (the model was choosing one version and presenting it with false confidence).
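Sorting failures into those two buckets was mostly manual review of the traces, but per-case scores and reasons make the triage quick. A rough sketch of the loop, assuming the Week 1 dataset file; the empty-context check is a deliberately crude stand-in for reading the retrieved passages by hand.

```python
import json

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

with open("eval_dataset.jsonl") as f:
    records = [json.loads(line) for line in f]

metric = FaithfulnessMetric(threshold=0.85)
no_context, conflicting = [], []

for record in records:
    case = LLMTestCase(
        input=record["query"],
        actual_output=record["response"],
        retrieval_context=record["retrieval_context"],
    )
    metric.measure(case)
    if metric.score >= 0.85:
        continue  # supported by the retrieved context; not a hallucination

    # Crude triage heuristic: nothing retrieved vs. several passages that
    # may disagree. The real categorization reads metric.reason and the
    # passages themselves.
    if not record["retrieval_context"]:
        no_context.append((record["query"], metric.reason))
    else:
        conflicting.append((record["query"], metric.reason))

print(f"{len(no_context)} failures with no supporting context")
print(f"{len(conflicting)} failures with potentially conflicting context")
```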
The fix for the first category was adding an explicit instruction to the system prompt: when the retrieved context does not contain enough information to answer the query, say so rather than infer an answer. We verified the instruction's effectiveness by running the evaluation suite on a sample of no-context queries before and after the change.
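The exact wording below is illustrative rather than the production prompt, but the shape of the instruction is what matters: give the model a specific action to take when the context comes up empty, so "I don't know" becomes cheaper than guessing.

```python
# Illustrative excerpt of the system prompt addition; not the production
# wording. The escalation phrasing is a placeholder.
SYSTEM_PROMPT_ADDITION = """
Answer using only the information in the provided context passages.
If the context does not contain enough information to answer the
question, say: "I don't have documentation that covers this. I can
connect you with our support team." Never answer from general knowledge
and never guess at product behavior.
"""
```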
The fix for the second category required documentation work: identifying the five specific topics where KB articles contained contradictory information and resolving the contradictions. This was a content problem, not a model problem, and it would not have been diagnosed without the trace-level faithfulness analysis that showed which retrieved documents were contributing to which responses.
Weeks 4-6: CI/CD Gates and Continuous Monitoring
With the baseline hallucination rate reduced to 4.7%, we implemented evaluation gates in the deployment pipeline. Any change to the system prompt, RAG corpus, or model version triggers a full evaluation run. Deploys are blocked if faithfulness drops below 0.87 or hallucination rate rises above 6%.
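A gate like this can live in the CI pipeline as a short script that re-runs the suite against the candidate build and fails the job when either threshold is breached. Here is a sketch, assuming the eval_dataset.jsonl from Week 1 and a hypothetical generate_response() wrapper around the chatbot under test; the per-case cutoff used to count hallucinations is illustrative.

```python
import json
import sys

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

from chatbot import generate_response  # hypothetical wrapper around the candidate build

FAITHFULNESS_FLOOR = 0.87     # block the deploy below this average score
HALLUCINATION_CEILING = 0.06  # block the deploy above this failure rate
PER_CASE_CUTOFF = 0.85        # illustrative per-query pass/fail line

with open("eval_dataset.jsonl") as f:
    records = [json.loads(line) for line in f]

metric = FaithfulnessMetric(threshold=PER_CASE_CUTOFF)
scores, failures = [], 0

for record in records:
    # Run the query through the candidate build; returns the answer and
    # the passages its retriever pulled.
    answer, context = generate_response(record["query"])
    case = LLMTestCase(input=record["query"], actual_output=answer,
                       retrieval_context=context)
    metric.measure(case)
    scores.append(metric.score)
    if metric.score < PER_CASE_CUTOFF:
        failures += 1

avg_faithfulness = sum(scores) / len(scores)
hallucination_rate = failures / len(records)
print(f"faithfulness={avg_faithfulness:.3f}  hallucination_rate={hallucination_rate:.1%}")

if avg_faithfulness < FAITHFULNESS_FLOOR or hallucination_rate > HALLUCINATION_CEILING:
    sys.exit(1)  # non-zero exit status blocks the deploy
```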
In weeks 5 and 6, the gates caught two regressions that would previously have reached production: one from a system prompt change that unintentionally narrowed the model's scope of acceptable answers (causing more no-answer responses, which would have frustrated users), and one from a corpus update that introduced a new documentation page with conflicting pricing information.
Both were caught and fixed before reaching production. Total gate blocking time: under 90 minutes combined. Neither resulted in a user-facing incident.
Results at Six Weeks
- Hallucination rate: 18.4% to 4.1% (78% reduction)
- Escalation tickets citing bot errors: 8-12/week to 1-2/week
- Mean time to detect quality regression: 24-48 hours (manual) to <15 minutes (automated)
- Deploy pipeline regressions caught before production: 2 in the monitoring period
The most important outcome was not the specific numbers but the organizational shift: the team now has a precise, reproducible way to answer the question "did this change make the chatbot better or worse?" That measurement capability is what makes all subsequent improvements possible. Without it, every system prompt change is a guess. With it, every change is an experiment with a measurable outcome.
If you recognize your application in this case study, contact our customer success team to discuss a similar implementation. Or start a free trial and run your first evaluation suite today.
Ready to Improve Your LLM Quality?
Start with the free Developer plan. No credit card required, and you'll have your first evaluation running in under 10 minutes.