Live Production Eval: Sampling Real Traffic Without Slowing It Down
Lab evals lie about production. Live sampling is the only way to know how an agent really behaves. Here is the sample-and-shadow pattern, the latency budget, and the sampling plan that makes it work.
TL;DR
Lab evaluations measure how an agent performs on the cases you wrote. Production evaluations measure how the agent performs on the cases your customers actually send. The two are systematically different, and the difference is rarely flattering. Live production evaluation closes the gap by sampling real traffic and running it through evaluators in a way that does not slow down the customer-facing path. This essay defines the sample-and-shadow pattern, walks through the four decisions every live eval program has to make (sample rate, shadow vs inline, latency budget, evidence retention), and presents a Live Eval Sampling Plan template. The point is not to replace lab evaluation but to add the production telemetry that catches what lab evaluation cannot see.
The lab said the agent was great. Production disagreed.
The agent that prompted this essay scored consistently above 90 in lab evaluations. Six months of weekly eval runs, all of them in the green. The team was confident. The buyer was confident. The trust oracle was returning a Gold-tier score.
Then we started sampling production traffic. We picked one in fifty live customer interactions, captured the full evidence (input, agent output, tool calls, system context), and ran them through the same eval pipeline that the lab suite used. The first week's results were not Gold. They were not Silver. The aggregate score on production-sampled traffic was 71.
The gap between 90 in lab and 71 in production was not a measurement error. It was a structural truth. The lab cases were carefully curated, with clean inputs and well-formed contexts. The production cases were not. They had typos, ambiguous intents, missing fields, multi-turn context the lab cases did not exercise, edge formats the lab cases did not anticipate. The agent handled the lab cases well because the lab cases were designed to be handle-able. The agent struggled with the production cases because production was messier than the lab.
None of this would have been visible without the live sampling. The lab suite would have continued to report 90s. The agent would have continued to be sold as Gold-tier. Customers would have continued to be quietly underserved. The buyer would have eventually figured it out and the trust oracle's score would have been quietly discredited as a fiction.
Live production evaluation is the discipline that prevents this. It samples real traffic, runs it through the evaluator, and reports an empirical score that reflects actual customer experience rather than curated test cases. It is more expensive operationally than lab evaluation, more constrained by latency budgets, and more politically delicate (operators do not love seeing production scores below their lab scores). It is also the only way to produce scores that are not fiction.
This essay walks through how to do it well. The mechanics are not exotic but they require care. Done badly, live evaluation either slows production or produces biased samples that look like measurement but are not. Done well, it transforms the trust layer from a lab artifact into a production-grounded signal.
The four shapes of live evaluation
There are four ways to integrate evaluation with live traffic, and they make different trade-offs. Understanding the trade-offs is prerequisite to picking the right shape for your situation.
The first shape is full inline evaluation. Every customer interaction goes through the evaluator before the response is returned to the customer. The evaluator's verdict either gates the response (block bad outputs) or annotates it (let everything through with a quality flag). This is the most rigorous shape and the most operationally heavy. Latency goes up by however long the evaluator takes; cost scales with the full traffic volume; the evaluator is on the critical path of every customer interaction.
The second shape is full shadow evaluation. Every customer interaction is sent to the customer normally; a copy is also sent to the evaluator asynchronously, off the critical path. The evaluator's verdict does not block or modify the customer-facing response; it just produces a verdict that gets recorded for analysis. This is operationally lighter than inline (no latency impact on customers) but still scales cost with full traffic volume.
The third shape is sampled shadow evaluation. A fraction of interactions are shadowed to the evaluator; the rest are not evaluated at all. The sampling reduces both cost and infrastructure load. The trade-off is that the evaluator only sees a subset of traffic; rare but important interactions might be missed in the sample.
The fourth shape is sampled inline evaluation. A fraction of interactions go through inline evaluation; the rest pass through normally without evaluation. This shape is rare because the cost of having two paths (one slow with eval, one fast without) is operationally complex and the latency hit on the sampled fraction is visible to customers.
For most agents, the right shape is sampled shadow. Inline evaluation is overkill for non-safety-critical operations. Full shadow is usually too expensive at production volume. Sampled inline introduces customer-visible latency variance. Sampled shadow combines the benefits (low operational impact, statistically sound quality measurement, manageable cost) with no customer-facing latency penalty.
For safety-critical operations (financial transactions, medical recommendations, anything with regulatory exposure), the right shape may be inline evaluation on the critical paths and sampled shadow on the rest. The latency cost of inline evaluation is justifiable when the consequence of a bad output is severe enough to gate.
The shape choice cascades into every other decision in the program. Get the shape right before you optimize anything else.
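A minimal sketch of the sampled shadow shape, assuming an in-process bounded queue sits between the request handler and the evaluator. The names `handle_request`, `run_agent`, `shadow_queue` and the one-in-fifty rate are hypothetical choices for illustration, not a description of any particular platform.

```python
import queue
import random

SAMPLE_RATE = 0.02  # hypothetical: roughly one in fifty interactions
shadow_queue: "queue.Queue[dict]" = queue.Queue(maxsize=1000)
dropped = 0  # samples lost because the evaluator fell behind
_default_rng = random.Random()


def run_agent(request: str) -> str:
    """Stand-in for the real agent call."""
    return f"answer to: {request}"


def handle_request(request: str, rng: random.Random = _default_rng) -> str:
    """Serve the customer first; enqueue a copy for evaluation if sampled."""
    global dropped
    response = run_agent(request)
    if rng.random() < SAMPLE_RATE:
        try:
            # put_nowait never blocks, so a slow evaluator cannot add latency.
            shadow_queue.put_nowait({"input": request, "output": response})
        except queue.Full:
            dropped += 1  # record the loss; dropped samples are a bias signal
    return response
```

A separate worker (a thread or another process) would drain `shadow_queue` and call the evaluator. The design choice worth noting is that the handler drops a sample rather than waits when the queue is full, which keeps the customer path isolated at the price of a possible gap in the sample.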
Sample rate: how much is enough
If you have settled on sampled shadow, the next decision is sample rate. What fraction of production traffic gets shadowed?
The naive answer is "as much as the budget allows." This produces a sample rate that fluctuates with budget and a measurement program that is not statistically grounded. Better to think about sample rate from a measurement-quality perspective and let the budget set the upper bound: the precision target tells you the minimum sample you need, and the budget tells you the maximum you can afford.
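The measurement-quality perspective starts with the question: what verdict precision do you need? If you need to know your agent's aggregate quality score within plus or minus one point at 95 percent confidence, you need a sample large enough to support that precision. The relevant statistics are straightforward: the margin of error is about 1.96 times the per-case standard deviation divided by the square root of the sample size. For a score on a 0 to 100 scale with a per-case standard deviation around 20 points (an illustrative figure; measure your own), roughly 1,500 cases gives a margin of one point. Larger samples tighten the interval and smaller samples widen it, but only with the square root of the sample size: halving the margin takes four times the cases.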
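The precision calculation can be sketched in a few lines, assuming independent samples and a normal approximation for the mean score. The function names are hypothetical, and the standard deviation is something you would estimate from your own verdicts.

```python
import math


def margin_of_error(std_dev: float, n: int, z: float = 1.96) -> float:
    """Half-width of the confidence interval for a mean score (z=1.96 is 95%)."""
    return z * std_dev / math.sqrt(n)


def required_sample_size(std_dev: float, margin: float, z: float = 1.96) -> int:
    """Smallest n whose margin of error is at most `margin`."""
    return math.ceil((z * std_dev / margin) ** 2)


if __name__ == "__main__":
    # Assumed per-case standard deviation of 20 points on a 0-100 scale.
    print(required_sample_size(std_dev=20, margin=1.0))
    print(round(margin_of_error(std_dev=20, n=1000), 2))
```

Running the calculation with your own measured spread is the point; a noisier agent needs a larger sample for the same precision.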
A practical rule of thumb: target a weekly sample of at least 1,000 cases for any agent whose composite score is being publicly reported. This produces verdict precision in the one-to-two point range, which is enough to detect meaningful changes from week to week. Agents with very low traffic volume may not be able to hit this target without inflating sample rate to high fractions; agents with very high traffic volume can hit it with sample rates well below one percent.
The sample rate to hit the target is the weekly sample target divided by the weekly traffic volume. An agent doing 10,000 interactions per week needs roughly 10 percent sampling to hit 1,000 evaluated cases. An agent doing 1,000,000 interactions per week needs roughly 0.1 percent. The sample rate adapts to traffic.
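As a small sketch of that division, with a cap for agents whose traffic is below the target (the function name is hypothetical):

```python
def sample_rate_for_target(weekly_target: int, weekly_traffic: int) -> float:
    """Fraction of traffic to shadow so the weekly sample reaches the target."""
    if weekly_traffic <= 0:
        raise ValueError("weekly_traffic must be positive")
    # A rate above 1.0 is impossible; low-volume agents simply sample everything.
    return min(1.0, weekly_target / weekly_traffic)


if __name__ == "__main__":
    print(sample_rate_for_target(1000, 10_000))     # 0.1
    print(sample_rate_for_target(1000, 1_000_000))  # 0.001
    print(sample_rate_for_target(1000, 400))        # 1.0
```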
Beyond aggregate verdict precision, you also need per-category precision. The aggregate score might be tightly bounded but the score for a specific capability cluster might be loosely bounded if that cluster is rare in traffic. The Coverage Heatmap from the prior essay is the relevant tool: per-cell coverage in production sampling has to support per-cell verdicts, not just the aggregate.
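This usually requires stratified sampling. Uniform random sampling mirrors the traffic mix, so common capability clusters dominate the sample and rare ones contribute too few cases for a confident verdict. Stratified sampling adjusts the sampling rate per cluster to ensure each cluster gets enough cases for confident per-cluster verdicts. The total sample size grows but the per-category verdicts become reliable. Because rare clusters are deliberately over-sampled, weight each cluster by its true traffic share when rolling the verdicts up into the aggregate score.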
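A sketch of per-cluster rates and the matching aggregate, assuming you know each cluster's weekly traffic. The cluster names and the 300-case target in the usage lines are made up for illustration.

```python
def stratum_rates(weekly_traffic: dict[str, int], per_stratum_target: int) -> dict[str, float]:
    """Per-cluster sampling rate so each cluster reaches the same case count."""
    return {
        name: min(1.0, per_stratum_target / volume) if volume > 0 else 0.0
        for name, volume in weekly_traffic.items()
    }


def reweighted_score(mean_scores: dict[str, float], weekly_traffic: dict[str, int]) -> float:
    """Aggregate score weighted by true traffic share, not by sample share."""
    total = sum(weekly_traffic[name] for name in mean_scores)
    return sum(mean_scores[name] * weekly_traffic[name] / total for name in mean_scores)


if __name__ == "__main__":
    traffic = {"billing": 90_000, "refunds": 9_000, "legal": 1_000}  # hypothetical
    print(stratum_rates(traffic, per_stratum_target=300))
    print(reweighted_score({"billing": 90, "refunds": 80, "legal": 50}, traffic))
```

Without the reweighting step, the over-sampled rare clusters would pull the aggregate toward their scores and misstate what the average customer experiences.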
The budget bound: sample rate × traffic volume × evaluation cost per case = monthly evaluation cost, with traffic volume counted per month, and that cost has to fit the monthly evaluation budget. With a target sample of 1,000 cases per week and per-case eval cost from the Cost Model, the monthly cost is calculable. If the cost exceeds budget, you either tighten the per-case cost (see the eval cost engineering essay) or accept lower verdict precision (smaller sample).
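The budget arithmetic, sketched with hypothetical function names. The 0.05 cost per case in the usage lines is a placeholder, not a figure from the Cost Model.

```python
def monthly_eval_cost(sample_rate: float, monthly_traffic: int, cost_per_case: float) -> float:
    """Sample rate x monthly traffic x cost per case."""
    return sample_rate * monthly_traffic * cost_per_case


def max_affordable_rate(monthly_budget: float, monthly_traffic: int, cost_per_case: float) -> float:
    """Highest sample rate the budget supports, capped at sampling everything."""
    return min(1.0, monthly_budget / (monthly_traffic * cost_per_case))


if __name__ == "__main__":
    # 4,000,000 interactions a month sampled at 0.1 percent, 0.05 per case (assumed).
    print(monthly_eval_cost(0.001, 4_000_000, 0.05))
    print(max_affordable_rate(100.0, 4_000_000, 0.05))
```

If `max_affordable_rate` comes out below the rate your precision target requires, that is the signal to cut per-case cost or accept a wider margin.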
Do not skip the precision calculation. A sample rate chosen by intuition tends to under-sample for the precision you actually want. Calculate from the precision target backward.
Latency budget: the inline case
If you are doing any inline evaluation (full or sampled), you are spending latency. Customers wait while the evaluator runs. The latency budget governs how much of that you can spend before customers notice.
The latency tolerance is application-specific. A chat interface has a budget set by conversational tolerance: users will wait seconds but not minutes. A real-time trading system has a latency budget measured in milliseconds. A batch document processing system has a latency budget measured in minutes or hours. The evaluator must fit in the available latency without making the application feel slow.
The evaluator's latency depends on its panel composition (frontier judges are slower than efficient judges), the case complexity (longer evidence takes longer to process), and the parallelism (multiple judges in parallel finish faster than sequential). Designing the evaluator to hit a latency budget is the same engineering as any other low-latency system: pick the cheapest configuration that meets the quality bar, parallelize where possible, cache aggressively, and measure tail latency rather than mean.
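Two of those levers, parallel judges under a hard budget and tail-latency measurement, can be sketched with the standard library. The fake judges built by `make_judge` are hypothetical stand-ins for real model calls.

```python
import asyncio
import statistics


def make_judge(delay_s: float, verdict: int):
    """Build a fake judge that sleeps to simulate model latency."""
    async def judge(evidence: dict) -> int:
        await asyncio.sleep(delay_s)
        return verdict
    return judge


async def run_panel(judges, evidence: dict, budget_s: float):
    """Run all judges concurrently; return None if the panel misses the budget."""
    try:
        return await asyncio.wait_for(
            asyncio.gather(*(judge(evidence) for judge in judges)),
            timeout=budget_s,
        )
    except asyncio.TimeoutError:
        return None


def p99(latencies_ms: list[float]) -> float:
    """99th percentile latency: the tail, not the mean, is what customers feel."""
    return statistics.quantiles(latencies_ms, n=100)[98]
```

Running in parallel means the panel takes about as long as its slowest judge rather than the sum of all of them. What to do on a timeout (pass, block, or fall back to a cheaper judge) is a policy decision this sketch leaves open.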
A practical pattern for inline evaluation: use a single fast judge as the inline gate, with a full panel run shadowed asynchronously. The inline judge is calibrated to be conservative: it blocks any output that fails the fast check, even if the full panel might have passed it. False positives (blocked outputs that the panel would have approved) are recorded for analysis and used to retune the inline judge over time. False negatives (outputs that the inline judge passed but the shadow panel later flagged) are recorded as quality incidents and used to identify rubric gaps.
This pattern keeps the inline latency low (one judge, fast model) while preserving the full panel's verdict quality through the shadow path. The customer experience is fast; the quality measurement is rigorous; the inline gate catches the worst outputs without paying for full panel latency.
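The bookkeeping behind this pattern can be sketched as a comparison of the two verdicts per case. The labels and function names are hypothetical.

```python
from collections import Counter


def reconcile(inline_passed: bool, panel_passed: bool) -> str:
    """Label one case by comparing the inline gate with the shadow panel."""
    if inline_passed and panel_passed:
        return "agree_pass"
    if not inline_passed and not panel_passed:
        return "agree_block"
    if not inline_passed and panel_passed:
        return "false_positive"  # blocked, but the panel would have approved
    return "false_negative"      # passed inline, later flagged by the panel


def gate_report(cases: list[tuple[bool, bool]]) -> Counter:
    """Tally outcomes over (inline_passed, panel_passed) pairs."""
    return Counter(reconcile(inline, panel) for inline, panel in cases)
```

Tracking the two error counts separately matters because they call for different fixes: a rising false-positive count suggests the gate is too strict, while false negatives point at gaps in the fast check or the rubric.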
The latency budget also affects what fraction can go inline at all. Sampling inline means some fraction of customer requests pay the latency penalty. If the penalty is small enough that customers do not notice the variance, you can sample inline freely. If the penalty is noticeable, sampled inline produces a customer-experience inconsistency that erodes trust. Most teams choose to keep inline evaluation either always on or entirely off for a given path, avoiding the sampled-inline middle ground for this reason.
Evidence retention: how long to keep what
Live evaluation produces a stream of evidence from production. Evidence retention policy decides what to keep and for how long. The decisions have privacy implications, cost implications, and operational implications.
The operational case for retention is forensics. When an evaluation flags a problem, you want the original evidence to inspect. Without it, the verdict is just a number; with it, you can see exactly what happened and remediate root cause. Retention enables debugging.
The operational case against retention is cost and risk. Storage of evidence at production volume is expensive. Long retention windows mean more storage. Privacy regulations often constrain how long customer data can be retained; production evidence that contains customer inputs is subject to the same constraints.
The sensible retention policy is layered. Hot retention (full evidence available for fast access) for the most recent N days. Cold retention (evidence available with longer retrieval latency, optionally compressed and de-identified) for an additional period. Eventually, deletion or full anonymization beyond a retention horizon.
Numbers are domain-specific. A common pattern: 30 days hot, 180 days cold, deletion after 12 months. Privacy regulations may force shorter horizons. Forensic needs may justify longer.
The verdicts themselves have a different retention profile from the evidence. Verdicts are smaller, do not contain customer data (if the eval pipeline strips identifiers correctly), and are forever useful for trend analysis. Verdicts can be retained essentially indefinitely at low cost. The Replay Disclosure Policy from the first essay assumes verdicts are immutable and retained.
One specific risk: evidence retention can leak customer data if the agent's output contained sensitive information. The retention policy has to consider not just the input but also what the agent produced. If the agent might have output PII, the retention has to handle PII carefully: encrypt at rest, restrict access, strip on archive. This is more work than people expect.
A practical operational pattern: pseudonymize evidence at capture time. Replace customer identifiers with stable keyed hashes (a plain unkeyed hash of a guessable identifier can be reversed by brute force). Replace specific PII fields with placeholder tokens. Keep the structural information (what kind of input, what kind of output) but remove the specific personal details. This reduces both privacy risk and the value of the data to attackers, while preserving most of the forensic utility for evaluation purposes.
The pseudonymization can be reversed for legitimate forensic needs (a customer complaint requiring inspection of their specific interactions) by maintaining a separate, secured mapping table. The mapping table itself becomes the sensitive asset; the evidence store becomes lower-risk.
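A minimal sketch of capture-time pseudonymization using Python's standard `hmac` and `hashlib` modules. The field names (`customer_id`, `email`, `phone`), the `cust_` prefix and the in-memory `mapping` dict are assumptions for illustration; a real mapping table would live in a separate, access-controlled store.

```python
import hashlib
import hmac


def pseudonymize_id(customer_id: str, secret_key: bytes) -> str:
    """Stable keyed hash: the same ID and key always give the same token."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return "cust_" + digest.hexdigest()[:16]


def capture(evidence: dict, secret_key: bytes, mapping: dict,
            pii_fields: tuple = ("email", "phone")) -> dict:
    """Return a pseudonymized copy and record the reverse mapping separately."""
    token = pseudonymize_id(evidence["customer_id"], secret_key)
    mapping[token] = evidence["customer_id"]  # the sensitive asset
    cleaned = {k: v for k, v in evidence.items() if k != "customer_id"}
    cleaned["customer_token"] = token
    for name in pii_fields:
        if name in cleaned:
            cleaned[name] = f"<{name.upper()}>"  # placeholder keeps the structure
    return cleaned
```

This only handles structured fields. PII embedded in free text, in either the customer input or the agent output, needs a separate detection step that the sketch does not attempt.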
Bias in the sample: the silent failure
The most subtle failure of live evaluation is bias in the sample. The sample is supposed to represent the production traffic; if it does not, the verdict is misleading. Bias can creep in through many paths.
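First path: the sampling mechanism itself. If you sample by hashing some field of the request, the choice of field matters. Hashing a customer or session identifier puts all of that customer's traffic in or out of the sample together, so a few heavy users can skew it; hashing a field that correlates with request type biases the sample by type. Use a clean random sampler that does not correlate with request characteristics (a random draw, or a hash of a unique per-request ID), or use a deliberate stratified sampler that intentionally over-samples specific cells.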
Second path: the evidence capture. If certain request types fail to capture cleanly (the agent's output is malformed, the tool calls do not serialize), they get dropped from the sample. The sample then over-represents the cleanly-handled traffic, biasing the verdict upward. The capture path has to be robust enough to capture even the messy cases.
Third path: the evaluation pipeline. If the eval pipeline cannot handle certain evidence shapes (very long outputs, unusual tool call patterns), those cases get errored out. The sample then under-represents the cases the pipeline cannot handle, which are often the cases most worth evaluating. The pipeline has to be robust to the full range of evidence shapes.
Fourth path: the evaluator's rubric. If the rubric is well-defined for routine cases but ambiguous for edge cases, the verdicts on edge cases will be noisy or absent. The sample's effective verdict is biased toward the well-defined region. The rubric has to cover the full range of behaviors the sample contains.
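For the first path (the sampling mechanism), one way to get a sampler that is deterministic yet uncorrelated with request content is to hash a unique per-request ID. This sketch assumes such an ID exists; the function name and the salt value are hypothetical.

```python
import hashlib


def should_sample(request_id: str, rate: float, salt: str = "live-eval-v1") -> bool:
    """Deterministic sampling decision keyed on a unique per-request ID."""
    digest = hashlib.sha256(f"{salt}:{request_id}".encode("utf-8")).digest()
    # Map the first 8 bytes to an approximately uniform number between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

The decision is reproducible, so any service that sees the same request ID reaches the same answer without coordination. Changing the salt reshuffles which requests are chosen, which is useful if you suspect a particular sample has gone stale.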
Detecting bias requires measuring the sample against the population. The Coverage Heatmap is the relevant tool: compute the per-cell distribution in the sample and compare to the per-cell distribution in production. Significant divergence indicates bias. Re-tune the sampler to fix it.
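The sample-versus-population comparison can be sketched as per-cell share differences plus a single summary number. Total variation distance is one reasonable summary, chosen here as an assumption rather than prescribed by the Coverage Heatmap; the function names are hypothetical.

```python
def cell_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-cell counts into shares that sum to 1."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def sample_bias(sample: dict[str, int], population: dict[str, int]):
    """Per-cell share difference (sample minus population) and total variation distance."""
    s, p = cell_shares(sample), cell_shares(population)
    diffs = {c: s.get(c, 0.0) - p.get(c, 0.0) for c in set(s) | set(p)}
    tvd = 0.5 * sum(abs(d) for d in diffs.values())
    return diffs, tvd
```

A distance of 0 means the sample mix matches production exactly and 1 means no overlap. For a deliberately stratified sampler, compare against the intended per-cell shares instead of raw production shares, since divergence from production is then by design.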
This is monitoring work, not one-time setup. Production traffic distributions drift over time as customers' usage evolves. A sampler that was unbiased six months ago may be biased now because the underlying distribution shifted. Re-validate sample representativeness quarterly.
The Live Eval Sampling Plan
Here is the named artifact this essay produces. The Sampling Plan is a one-page document that any agent's live evaluation program should produce and review quarterly. It has nine sections.
Section one: shape. Specify the evaluation shape (full inline, full shadow, sampled shadow, sampled inline, or hybrid). For hybrid, specify which paths get which shape.
Section two: sample rate. State the target sample rate, the precision target it is meant to achieve, and the calculation that links them. Include both aggregate and per-cluster precision targets.
Section three: stratification. Specify the strata (capability clusters, customer tiers, channels, request types) and the per-stratum sampling rate. State the rationale for each stratum's rate.
Section four: latency budget. For inline evaluation paths, state the latency budget, the evaluator configuration that fits within it, and the customer-experience implications.
Section five: evidence capture. Specify what evidence is captured per sampled case, what serialization format it uses, and what completeness checks ensure the capture is robust.
Section six: pseudonymization. Specify what fields are pseudonymized at capture time, what fields are dropped entirely, and what mapping is maintained for forensic reversal.
Section seven: retention. Specify the hot retention period, the cold retention period, the deletion horizon, and any per-data-type variations.
Section eight: aggregation. Specify how the sampled verdicts roll up to the agent's composite score, how often the rollup runs, and how the rollup interacts with the lab suite verdicts.
Section nine: monitoring. Specify the sample-versus-population comparison process, the cadence of re-validation, and the trigger for sampler retuning.
The Plan is a living document. It changes when the agent's traffic patterns change, when the budget changes, when the precision needs change, when privacy requirements change. Quarterly review is the minimum cadence; major changes warrant immediate revision.
The Plan being explicit and reviewed is the discipline that separates production-grounded evaluation from theater. Without the Plan, sample rates drift, evidence capture gets sloppy, retention becomes accidental, and the verdicts become unreliable. With the Plan, every aspect of the program is intentional and defensible.
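One way to keep the Plan reviewable is to hold it as structured data with a few consistency checks. This is a sketch only: the field names, the shape identifiers and the specific checks are assumptions, and the comments map each field to the section it would represent.

```python
from dataclasses import dataclass
from typing import Optional

SHAPES = {"full_inline", "full_shadow", "sampled_shadow", "sampled_inline", "hybrid"}


@dataclass
class SamplingPlan:
    shape: str                               # section 1
    sample_rate: float                       # section 2
    precision_target_points: float           # section 2
    strata_rates: dict[str, float]           # section 3
    inline_latency_budget_ms: Optional[int]  # section 4 (None if no inline path)
    evidence_fields: list[str]               # section 5
    pseudonymized_fields: list[str]          # section 6
    hot_days: int                            # section 7
    delete_after_days: int                   # section 7
    rollup_cadence: str                      # section 8
    revalidation_cadence: str                # section 9

    def problems(self) -> list[str]:
        """Return readable issues; an empty list means the plan is coherent."""
        issues = []
        if self.shape not in SHAPES:
            issues.append(f"unknown shape: {self.shape}")
        if not 0.0 < self.sample_rate <= 1.0:
            issues.append("sample_rate must be in (0, 1]")
        if "inline" in self.shape and self.inline_latency_budget_ms is None:
            issues.append("inline shapes need a latency budget")
        if self.hot_days > self.delete_after_days:
            issues.append("hot retention exceeds the deletion horizon")
        return issues
```

Keeping the plan in version control alongside a check like `problems()` gives each quarterly review a concrete diff to discuss.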
How live verdicts interact with lab verdicts
A mature evaluation program runs both lab and live evaluation. The two produce different verdicts on the same agent. How do they combine into a single picture?
The answer depends on what question you are trying to answer.
If the question is "how does the agent perform on the standardized cases that define our quality bar?", the lab verdict is the answer. Lab cases are reproducible, well-defined, and stable across time. They are the right tool for measuring against an explicit quality bar.
If the question is "how does the agent perform on the actual cases customers are sending?", the live verdict is the answer. Live cases are messy, drift with customer behavior, and reflect production reality. They are the right tool for measuring real customer experience.
If the question is "is the agent ready to ship to production?", both verdicts contribute. The lab verdict is the precondition (must pass standardized cases). The live verdict, in shadow mode before production launch, is the validation (handles real-traffic patterns). An agent that passes lab but fails shadow live should not ship.
If the question is "is the agent ready for high-stakes use?", both verdicts contribute, with the live verdict typically weighted more heavily because it reflects the actual conditions of use. A Gold-tier certification might require both a high lab score and a live score within an acceptable band of the lab score. The live-versus-lab gap itself becomes a quality signal.
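The certification rule described for high-stakes use can be sketched as a three-part check. The threshold values below are placeholders chosen for illustration, not published tier requirements.

```python
def certifiable(lab: float, live: float, lab_min: float = 90.0,
                live_min: float = 80.0, max_gap: float = 10.0) -> bool:
    """True only if both scores clear their floors and the gap stays in band."""
    return lab >= lab_min and live >= live_min and (lab - live) <= max_gap


if __name__ == "__main__":
    print(certifiable(lab=90, live=71))  # False: live floor missed and gap too wide
    print(certifiable(lab=92, live=84))  # True under these assumed thresholds
```

Checking the gap separately from the two floors is what turns the live-versus-lab difference into a signal of its own.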
The trust oracle should expose both verdicts, not just one. Buyers can choose to act on the lab verdict ("how good is this agent's underlying capability?") or the live verdict ("how does this agent perform in real conditions?") depending on their needs. Both are valid measurements; they answer different questions.
This is more nuanced than a single composite score. The nuance is the point. Buyers making serious decisions deserve verdicts that reflect both controlled-condition performance and field-condition performance. Hiding either reduces the information available to the decision.
What to do when live scores are worse than lab scores
The most predictable result of starting live evaluation is that live scores are worse than lab scores. The agent that scored 90 in lab will likely score 70-85 in live sampling, depending on how representative the lab cases were. This is the gap that motivated the essay.
The wrong response is to game it. You can game the numbers by curating the lab suite toward cases the agent already handles well (which artificially inflates lab scores) or by filtering live samples to the cases most similar to lab (which artificially inflates live scores). Either is the evaluation equivalent of academic dishonesty. The trust layer cannot tolerate it.
The right response is to investigate and remediate. The gap is information about where the agent is weaker than the lab suggests. Inspect the live cases the agent handled poorly. Identify patterns. Update the agent (better prompts, additional training data, better tool selection) to handle the patterns. Add representative cases from the patterns to the lab suite so future lab evaluations test for them. The gap should shrink over time as the agent improves on real-world conditions.
This is an iterative loop. Live evaluation reveals weaknesses; agent improvement remediates them; lab suite updates ensure the weaknesses stay tested in controlled conditions. The agent gets better. The gap closes. The composite score becomes a more reliable representation of real performance.
The trap is impatience. When the gap is first revealed, the temptation is to either disclose the bare live score and accept embarrassment or hide it and hope no one looks. Neither is the right move. The right move is to disclose with context ("live score is 71, lab score is 90, gap is being remediated, here is the timeline") and execute the remediation. Buyers reading the disclosure see both honesty and active improvement, which is the right signal.
The counter-argument
The sharpest counter-argument is that live production evaluation produces operational risk that exceeds its measurement value. The argument: live sampling adds infrastructure complexity, introduces failure modes (eval pipeline failures during production sampling can disrupt production logging), creates privacy exposure (production data flowing through additional systems), and generates organizational tension (live scores consistently below lab scores create pressure to either fudge or stop measuring).
All four points are real concerns. We have seen each of them in practice.
The response is that these are operational problems with operational solutions. Infrastructure complexity is mitigated by good engineering. Failure modes are mitigated by careful isolation (the eval pipeline should never block production paths). Privacy exposure is mitigated by pseudonymization and access controls. Organizational tension is mitigated by leadership commitment to measuring honestly even when the results are uncomfortable.
The alternative, not running live evaluation, produces worse outcomes. Agents continue to be sold on lab scores that overstate real performance. Buyers eventually figure out the gap and lose trust. The trust layer becomes a marketing artifact rather than an honest signal. The market for trustworthy agents shrinks because the trust signal is not credible.
The counter-argument is right that live evaluation is harder than lab evaluation. It is wrong that the additional difficulty is not worth it. The trust layer's value depends on its honesty; live evaluation is the discipline that makes honesty possible.
What Armalo does
Armalo runs sampled shadow evaluation on production traffic by default for any agent registered with the platform. The sampling is stratified by capability cluster to ensure per-cluster verdicts have adequate precision. Sample rates adapt to traffic volume to maintain a target weekly evaluated case count.
The shadow path is fully isolated from the production critical path. Eval pipeline failures cannot affect production response delivery. Evidence capture is robust to messy inputs and unusual output shapes; capture failures are logged but do not halt production.
Evidence is pseudonymized at capture time. Customer identifiers are replaced with stable hashes; PII fields are tokenized. A separate access-controlled mapping enables forensic reversal for legitimate inspection. Hot retention is 30 days; cold retention extends with compression for an additional period; full deletion respects domain-specific retention horizons.
The trust oracle exposes both lab and live composite scores. Both are presented with their respective lineages, sampling metadata, and confidence bounds. Buyers can query either or both. Certification tiers require both lab and live scores above their thresholds, with bounded gaps between them.
Live scores below lab scores trigger a remediation workflow for the agent operator. The workflow surfaces the live cases the agent handled poorly, the patterns identified, and recommended remediations. Operators can address the patterns and re-validate.
The Live Eval Sampling Plan is published per agent and reviewed quarterly. Plan changes are versioned with explicit rationale.
Frequently asked questions
Does live evaluation slow down customer responses? In the sampled shadow shape, no. The shadow path is fully off the customer-facing critical path. Inline evaluation does add latency on the paths it gates; that is a deliberate trade-off for safety-critical operations.
What sample rate is correct? Depends on traffic volume and precision target. Calculate from the precision target backward. 1,000 evaluated cases per week is a workable target for most public-facing agents; higher-traffic agents need correspondingly lower rates to hit that target.
Can live evaluation cover regions the lab missed? Yes, that is part of its value. Production traffic naturally exercises capability regions that lab cases did not anticipate. Live sampling discovers these regions; the discoveries feed back into lab suite updates.
What about privacy? Pseudonymize at capture time, restrict access to the mapping, respect domain retention horizons. Privacy and live evaluation are compatible if the privacy work is done carefully. Skipping the privacy work is not acceptable.
How do I disclose live scores to buyers? Alongside lab scores, with explicit lineage and confidence. "Lab score: 92. Live score (last 30 days, sample size 4,200): 84. Gap reflects real-world condition variance." Honest disclosure builds trust; hiding the gap erodes it.
What if my agent is too low-volume to sample meaningfully? Low-volume agents have to sample at high fractions, possibly 100 percent. The cost is real; the verdict precision is what it is. Either accept lower precision or invest in higher-volume usage to make sampling viable.
Does this work for batch agents that do not have real-time interactions? Yes, with adapted mechanics. Batch agents process inputs that can be sampled at the batch boundary; the shadow evaluation runs on the sampled inputs alongside production. The latency budget is more relaxed because the batch context is more tolerant.
What is the relationship to red-team evaluation? Complementary. Live evaluation samples real traffic; red-team evaluation constructs adversarial cases. Both feed verdicts about the agent. Live verdicts reflect real conditions; red-team verdicts reflect stress conditions. A complete program runs both.
Bottom line
Lab evaluation tells you how the agent performs on the cases you wrote. Live evaluation tells you how the agent performs on the cases your customers send. The two are different, and the difference matters. Sampled shadow evaluation lets you measure the live verdict without slowing down production. The Sampling Plan makes the program intentional and defensible. The discipline produces composite scores that reflect real performance, not curated performance. Buyers reading the scores get more useful information. Operators seeing the gaps fix them. The trust layer becomes credible because it is grounded in production reality, not just lab cases. Run live evaluation. Run it honestly. Disclose the results even when they are uncomfortable. The trust layer's value depends on it.