October 31, 2025
Every team shipping AI features has some version of a pre-deployment checklist, even if it's informal. "We tested a bunch of prompts. It looked fine. We shipped it." That's a checklist of one item: looked fine.
Teams that run reliable AI systems in production have a more structured version. Not because they're more cautious — they ship just as fast — but because they've learned which checks actually catch failures before users see them, and they don't skip those checks under deadline pressure.
Here's a checklist based on what those teams actually do.
Define the failure modes. Write down the three to five ways this feature could fail that would be worst for users. Not a general "might give wrong answers" — specific cases: "tells users their account balance is zero when it isn't," "reveals information about other users," "refuses to complete a valid task." You'll write test cases for these.
Set quality thresholds in advance. What score on your evaluation metrics means this is ready to ship? Define this before you see the scores. If you define thresholds after seeing results, you'll rationalize away numbers that should block the release.
Identify the evaluation dataset gap. Do you have test cases for this feature? If not, build at least 20 before writing the first production prompt. Prompts evolve during development; your evaluation data should lead, not follow.
Run eval on every significant prompt change. "Significant" means any change to the system prompt structure, new instruction added, retrieval configuration modified, or model version changed. Minor wording tweaks can have non-obvious effects; the only way to know is to run the suite.
Test the edge cases, not just the expected cases. Your happy-path test cases will almost certainly pass — you wrote the prompt to handle them. The interesting failures happen at the boundaries: empty inputs, extremely long inputs, inputs in unexpected languages, inputs that are ambiguous between multiple valid interpretations.
Verify refusal behavior. For any feature with content restrictions, explicitly test that the restrictions work. Don't assume the system prompt is sufficient — test it. Run at least five adversarial inputs per restriction and verify the model handles them correctly.
Full evaluation suite passes. All blocking metrics above threshold. No exceptions for "it'll be fine in production" or "that test case is unrealistic." If the test case is unrealistic, remove it from the suite with a documented reason. If it's realistic, fix the failure.
Hallucination rate within bounds. For any feature that generates information users might act on, run a targeted hallucination evaluation against your domain data. The result should be at or below your defined threshold. If you haven't defined one, 5% is a reasonable starting point for most product applications — lower for anything involving money, health, or legal information.
Safety checks complete. If your feature handles sensitive topics, run your safety evaluation category. Verify that out-of-scope requests are handled gracefully — the model declines appropriately rather than producing harm or revealing system internals.
Staging environment evaluation. Run your full evaluation suite against the staging environment, not just your development setup. Infrastructure differences, latency, and environment-specific configurations can all affect model behavior in ways that don't show up in local testing.
Latency profiled. Measure the 95th percentile response time under realistic load. Slow AI responses are a UX failure mode that won't show up in quality evaluation but will affect user experience immediately.
Rollback plan documented. What specifically happens if quality degrades in the first 24 hours? Which signals trigger a rollback? Who makes the call? Write this down before you deploy, not during an incident.
Monitor hallucination rate on live traffic. Sample 5-10% of live outputs and run them through your hallucination detection. Compare to your pre-launch baseline. A significant increase in the first 48 hours is a signal that real user query patterns are different from your test cases.
Collect failing examples. Any output that gets a user complaint or a negative feedback signal should go directly into your evaluation dataset. These are your best test cases — real failures on real inputs that slipped past your pre-launch checks.
This checklist doesn't make shipping slower. It makes it safer. Most of these steps take minutes once they're part of your workflow. The value is in the ones that catch something — and they will.