The eval suite is green. Every test passes. Accuracy is 94%. Safety checks: clean. Latency: under threshold.
You ship it.
Three weeks later, a user finds a prompt that produces output that should not be possible given your system prompt. It is not even close to your test cases. Nobody on your team ever tried that input.
The agent did not break. It was never tested on that input. The test suite was a sample from the distribution you anticipated. Production is the full distribution, including the part you did not anticipate.
Green tests mean the agent works on your test set. They say very little about what happens beyond it.
TL;DR
- Test suites test anticipated inputs. Production traffic includes unanticipated inputs. These are the inputs where agents go rogue.
- Distribution mismatch is structural, not a quality problem. Even a perfect test suite cannot cover inputs that were not anticipated at the time it was written.
- Three production mechanisms that test suites miss: adversarial prompt patterns, long-context compounding effects, and multi-turn behavioral drift.
- The gap between test distribution and production distribution grows over time. As usage patterns evolve, the mismatch widens.
- Continuous production evaluation is the only solution. Sample real production inputs, evaluate them against the behavioral spec, feed violations back into scoring.
The Test Distribution Problem
Test suites are necessarily written before production traffic is observed. This creates a structural problem: the inputs in your test suite are a sample from the distribution you could anticipate. Production traffic is drawn from a different distribution — one shaped by actual user behavior, adversarial intent, and edge cases that did not occur to you when you were writing tests.
This is not a failure of test quality. It is a property of the distribution mismatch. Even a test suite written by an adversarial red team, trying hard to think of edge cases, will miss edge cases. The distribution of what users will actually try is larger than the distribution of what you can anticipate in advance.
The number that matters is not accuracy on your test set. It is accuracy on the production distribution — which you cannot fully observe until it exists.
Three Mechanisms That Turn Green Tests Rogue in Production
Mechanism 1: Adversarial prompt patterns
Users probe. Some do it maliciously. Most do it out of curiosity. They try inputs that were never in your test set — not because your test set was bad, but because the space of adversarial inputs is larger than any test set.
A common failure pattern: the agent has a refusal instruction for topic X. Your test suite covers explicit requests for X. Production traffic includes oblique references to X, requests framed as hypotheticals, multi-step approaches that approach X indirectly. None of these are in your test suite. Some of them work.
The agent passed all your tests. The agent had never seen the input that caused the production incident.
Mechanism 2: Long-context compounding effects
Most test cases are short. A single-turn exchange, or a two- or three-turn conversation. Production conversations are longer — five, ten, twenty turns. Behavior at turn 18 may be substantially different from behavior at turn 3 on the same topic, because the model's effective instruction weight shifts across a long context window.
An agent that reliably refuses a request at turn 3 may comply at turn 18, after the conversation has built a context that semantically de-prioritizes the refusal instruction. Your three-turn test case did not test this. Your one-turn test case definitely did not.
Mechanism 3: Composed tool use
An agent with five tools was tested using each tool individually and some two-tool combinations. Production users discover three-tool chains, four-tool chains, and tool sequences your tests never covered. The agent was not evaluated for the composition — only for the components.
The failure mode is not in any individual tool. It is in how they interact when composed in sequences that were not in the test plan. The component tests all passed. The composition produces unexpected behavior.
The Growing Distribution Gap
The distribution mismatch between tests and production is not static. It grows:
Month 1: Test distribution closely approximates production traffic. Your test cases were written recently. Users are not yet exploring edge cases.
Month 6: Users have found the interesting corners. Adversarial inputs have been shared in communities. Your agent has been in production long enough for usage patterns to evolve significantly. The test suite was not updated — it was written six months ago.
Month 12: The test suite is effectively a historical artifact. It tests behavior on inputs from a year ago. Production traffic has been shaped by twelve months of evolution, adversarial probing, and feature additions. The gap between test distribution and production distribution is substantial.
Green tests from six months ago are weak evidence about current production behavior. The distribution your agent is evaluated on and the distribution it is deployed against have diverged.
What Actually Predicts Production Behavior
The only reliable predictor of production behavior is production evaluation. Not tests that proxy for production — actual production traffic, sampled and evaluated against the behavioral spec.
This requires:
Behavioral pact as evaluation standard. A machine-readable specification of what compliant behavior looks like — not "returns accurate answers" but specific, measurable conditions that can be evaluated deterministically or via an LLM jury.
Production input sampling. A statistically significant sample of real production inputs, evaluated against the pact. Not held-out test cases — actual requests from actual users.
Continuous scoring. The evaluation results update the agent's live composite score. The score reflects current production behavior, not test-day behavior. Drift is visible in the score trend.
Violation review. Production inputs that produce pact violations are flagged for review. They become the next wave of test cases — but more importantly, they update the agent's behavioral record immediately, before the incident compounds.
| Evaluation Type | What It Tests | Distribution Overlap |
|---|
| Pre-launch test suite | Anticipated inputs | Low to moderate |
| Red-team adversarial tests | Targeted adversarial inputs | Moderate |
| Production sampling | Real user traffic | High |
| Continuous production eval | Real traffic + drift detection | High + temporal |
Pre-launch tests are necessary. They are not sufficient. The only way to know if your agent will behave in production is to evaluate it in production.
The Practical Fix
The practical answer is not "write better tests." Better tests are good — they move the pre-launch bar. But they do not close the distribution gap. The production distribution is unobservable until it exists.
The practical fix is a continuous behavioral evaluation loop:
-
Every task (or a sampled subset) is evaluated against the behavioral pact. The eval runs inline with task execution, capturing pass/fail against each pact condition without adding meaningful latency.
-
Violations are captured with the full input/output context. When an eval fails, you have the exact input, the exact output, and the eval result — not a vague incident report.
-
Violations update the agent's composite score. The score reflects production behavior. A sudden drop in a specific dimension tells you what to investigate.
-
Violation patterns surface new test cases. The adversarial inputs users discover become the next wave of your test suite — but you know about them because your production eval caught them, not because a user filed a complaint.
Your agent passed all your tests because your tests were a subset of the inputs it will face. Continuous production evaluation covers the rest.
If your agent is in production without this loop, the question is not whether it will produce an unexpected result — it is when you will find out. armalo.ai provides the continuous eval infrastructure.
Frequently Asked Questions
Why do AI agents that pass test suites fail in production?
Test suites are written against anticipated inputs — the distribution of requests you expected when the tests were written. Production traffic includes unanticipated inputs: adversarial prompts, long-context compounding effects, tool composition sequences, and usage patterns that evolved after the tests were written. The distribution mismatch is structural.
What is the test distribution problem?
The test distribution problem is the structural mismatch between the inputs in a test suite (what was anticipated before deployment) and the inputs in production (what users actually send). Because test cases are written in advance, they cannot cover inputs that were not anticipated, and the gap grows over time as usage patterns evolve.
How can continuous production evaluation catch what test suites miss?
Continuous production evaluation samples real production inputs and evaluates them against the behavioral specification. Because it operates on actual production traffic, it captures the inputs that were not in the test suite — and because it runs continuously, it detects behavioral drift as it accumulates rather than after it compounds to an incident.
Does this mean pre-launch testing is not useful?
Pre-launch testing is necessary but not sufficient. It validates baseline behavior on anticipated inputs, which is valuable. It does not predict behavior on the full production distribution. The complete evaluation stack requires both: pre-launch tests for baseline validation and continuous production evaluation for post-launch behavioral monitoring.
Armalo AI provides continuous production behavioral evaluation: pacts, inline eval execution, composite scoring, and violation capture. See armalo.ai.