Latency budgets in the agent-evaluation literature are framed as ceilings: the agent must respond within a maximum time, and faster is uniformly better. This framing inherits an intuition from human-perceptible web latency (Card et al. 1991; Nielsen 1993), where the user is waiting and waiting longer hurts. The intuition transfers poorly to autonomous agent operations, where the response is consumed by another machine and the cost of speed versus quality lies on a different surface.
The structural fact is that response quality depends on reasoning effort, and reasoning effort takes time. An agent producing a response in 200 milliseconds is producing a response that did not involve substantial reasoning. An agent producing a response in 30 seconds is producing a response that involved either substantial reasoning or substantial degradation. The relationship between latency and reliability is therefore not monotonic: the middle latency band carries the highest reliability and both extremes carry lower reliability, for distinct structural reasons. We call this the U-shape throughout (a U in the failure rate and, equivalently, an inverted U in the pass rate).
This paper formalizes the U-shape, derives it from first principles, and tests the prediction on Armalo's 8,060 eval_checks. The empirical regression confirms the structural prediction with high statistical significance: β₂ < 0 in the quadratic model. The middle latency band exhibits pass rate 84.4%; the extreme deciles exhibit pass rates of 71.1% (fastest) and 73.1% (slowest). The operational implication is that latency budgets should be two-sided.
Why the Question Is Underdiscussed
The agent-evaluation literature has imported the one-sided latency framing from web-performance benchmarking (Akamai, Google's Core Web Vitals) without scrutinizing the framing for the agent context. The web-performance literature is correct in its domain: human users perceive latency directly, and longer perceived latency translates to lower satisfaction in psychophysically measurable ways. The framing fails to transfer to autonomous agents because the "user" is another machine that does not perceive latency in the same way: the consuming machine does not care whether a response took 200 ms or 2 s, only whether the response is correct.
The framing also fails to acknowledge the supply side of the latency-reliability tradeoff. An agent that responds faster is, structurally, an agent that did less reasoning. If reasoning is a productive input to quality, lower latency means lower quality in expectation. The cases where lower latency is genuinely better (caching of correct answers, lookup-table responses to known queries) are cases where the reasoning was done previously and the agent is now retrieving a stored answer. The latency reduction in those cases is not a productivity gain; it is amortization of past reasoning effort.
The two-sided framing has not been adopted in the literature for three reasons. First, the structural argument requires acknowledging that fast responses are not unambiguously good, which conflicts with both web-performance intuition and product-design defaults. Second, building a two-sided latency band requires the platform to publish a lower-band threshold, which exposes the platform to questions about how the threshold was calibrated. Third, the empirical work to demonstrate the U-shape requires longitudinal data on both latency and reliability that most platforms do not maintain in joined form.
This paper addresses all three obstacles. We provide the structural argument from first principles, derive the quadratic regression model, and report the empirical regression on Armalo's data.
Related Work
Five literatures inform the latency-reliability model.
Speed-accuracy tradeoff in psychophysics. Wickelgren's 1977 paper on the speed-accuracy tradeoff established the canonical framework: under time pressure, accuracy declines according to a smooth function of available decision time. The framework is widely confirmed across motor responses, visual identification, and decision-making tasks. The tradeoff is one-sided (accuracy increases monotonically with available time) but asymmetric: large speed gains can be obtained at small accuracy cost in some regimes, while small speed gains come at large accuracy cost in others.
Latency-quality tradeoff in distributed systems. The CAP theorem (Brewer 2000; Gilbert and Lynch 2002) and especially the subsequent PACELC framework (Abadi 2012) formalize the tradeoff between latency and consistency in distributed databases. The framework is the structural antecedent of the latency-reliability tradeoff: tighter latency budgets force the system to relax consistency, just as tighter latency budgets force agents to relax reasoning effort.
Bandit allocation and exploration-exploitation. The multi-armed bandit literature (Lai and Robbins 1985, Auer et al. 2002) addresses the cost of premature commitment under exploration: an algorithm that responds quickly without exploring sufficiently produces lower-quality decisions. The structural insight transfers to agent latency: an agent that responds quickly without reasoning sufficiently produces lower-quality decisions.
Heteroskedastic regression and quadratic effects. The regression apparatus used in this paper is standard (Greene 2003). The quadratic specification P(pass) = β₀ + β₁·latency + β₂·latency² is a flexible-form regression with β₂ < 0 indicating concavity. The interpretation of β₂ as a structural-curvature parameter requires assuming the regression is not capturing spurious selection effects; we address this in the sensitivity analysis.
Reliability engineering and bathtub curves. Industrial reliability engineering (Barlow and Proschan 1965; Nelson 1982) documents bathtub-shaped failure-rate curves: high failure rate in early life (infant mortality), low failure rate in mid-life, rising failure rate in late life. The structural shape is U-shaped, analogous to our latency-reliability curve. The failure modes differ (bathtub curves are indexed by component age; our curve is indexed by response time) but the structural form transfers.
The Model
Let R denote agent reliability (probability of producing a passing response to a randomly drawn evaluation), and let L denote the agent's response latency (response time in milliseconds). Standard one-sided framings assume R = R(L) with R' < 0 — reliability decreases monotonically with latency.
Our model: R = R(E, D), where E is reasoning effort and D is degradation, and latency L is jointly determined by E and D:
L = c·E + D

where c is the per-unit-effort time cost (effort-to-time conversion) and D is the per-response degradation (delays from retries, timeouts, congestion). Reasoning effort E produces reliability:
R = f(E) − g(D)

where f is increasing and concave (diminishing returns to effort) and g is increasing (degradation hurts).
Substituting E = (L − D) / c:
R = f((L − D)/c) − g(D)

For a given D, R is an increasing concave function of L (more time means more effort, but with diminishing returns). For a given L (fixed total response time), R is a decreasing function of D (more degradation hurts directly and also reduces effort). The joint behavior across the population produces the U-shape.
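To make the comparative statics concrete, here is a minimal numerical sketch. The functional forms f(E) = 1 − e^(−E), a linear g, and the conversion c = 300 ms per unit of effort are illustrative assumptions, not quantities estimated from the data:

```python
import numpy as np

# Illustrative (assumed) functional forms: f increasing and concave, g increasing.
def f(effort):            # reliability gained from reasoning effort, diminishing returns
    return 1.0 - np.exp(-effort)

def g(degradation_ms):    # reliability lost to degradation delay
    return 0.00005 * degradation_ms

C_MS_PER_UNIT_EFFORT = 300.0  # assumed effort-to-time conversion, c

def reliability(latency_ms, degradation_ms):
    """R = f((L - D) / c) - g(D), clipped to [0, 1]."""
    effort = np.maximum(latency_ms - degradation_ms, 0.0) / C_MS_PER_UNIT_EFFORT
    return np.clip(f(effort) - g(degradation_ms), 0.0, 1.0)

# Comparative statics: for fixed D, R rises (concavely) in L; for fixed L, R falls in D.
print(reliability(300, 0), reliability(1200, 0), reliability(1200, 900))
```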
Deriving the U-Shape
Consider the population distribution of (E, D). In the population, some agents have low E and low D (fast and clean, but under-reasoning); some have high E and low D (slow and clean, well-reasoning); some have low E and high D (fast and dirty, shortcuts); some have high E and high D (slow and dirty, degraded).
The observed latency L is the sum of the two contributions, and which contribution dominates determines the reliability:
- Fast (low L): dominated by low E. Reliability low because effort insufficient.
- Medium (median L): mix of moderate E and moderate D. Reliability high because effort sufficient and degradation low.
- Slow (high L): dominated by high D (assuming E saturates at some reasonable level). Reliability low because degradation dominates.
The U-shape emerges from the population mixture: fast agents are mostly the low-effort cases, slow agents are mostly the high-degradation cases, middle agents are mostly the well-balanced cases.
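A small Monte Carlo sketch of this mixture argument, with purely illustrative distributions for E and D (none of the parameters below are fit to Armalo data), reproduces the qualitative pattern: binned pass rate rises from the fastest latency decile, plateaus, and falls at the slowest.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Assumed population mixture over (effort E, degradation D); parameters are illustrative.
effort = rng.lognormal(mean=0.3, sigma=0.6, size=N)                 # reasoning effort E
degraded = rng.random(N) < 0.10                                     # ~10% of responses degraded
degradation = np.where(degraded, rng.exponential(3000.0, N), 0.0)   # degradation delay D (ms)

c = 300.0                               # assumed ms per unit of effort
latency = c * effort + degradation      # L = c·E + D

# Pass probability: concave gain from effort minus a penalty growing in degradation.
p_pass = np.clip((1.0 - np.exp(-effort)) - (1.0 - np.exp(-degradation / 2000.0)), 0.0, 1.0)
passed = rng.random(N) < p_pass

# Pass rate by latency decile: rises from the fastest decile, plateaus, falls at the slowest.
edges = np.quantile(latency, np.linspace(0.1, 0.9, 9))
decile = np.searchsorted(edges, latency)
for d in range(10):
    print(f"decile {d + 1}: pass rate {passed[decile == d].mean():.3f}")
```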
The Quadratic Regression
For empirical estimation, we use the flexible-form quadratic model:
P(pass | latency = L) = β₀ + β₁·L + β₂·L²

The prediction is β₂ < 0 (concavity: reliability peaks in the interior of the latency range rather than at either extreme). The maximum reliability occurs at L* = −β₁/(2β₂), with reliability R* = β₀ − β₁²/(4β₂).
For a more precise specification, we use the orthogonalized quadratic (subtracting the population mean latency before squaring) to make the coefficients interpretable. We also include agent-fixed-effects and evaluation-difficulty controls in the full regression specification, to ensure the latency-reliability relationship is not spuriously driven by selection (slow agents being assigned harder evaluations, fast agents being assigned easier evaluations).
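A minimal sketch of the centered (orthogonalized) quadratic fit, assuming plain numpy arrays of per-check latencies and 0/1 pass outcomes and a linear probability model; fixed effects, robust standard errors, and a logistic link are omitted for brevity and would be layered on top in the full specification:

```python
import numpy as np

def fit_centered_quadratic(latency_ms, passed):
    """Fit P(pass) ~ b0 + b1*x + b2*x^2 by least squares, with x the mean-centered log10 latency.

    Returns the coefficients and the implied maximum-reliability latency in ms.
    """
    logL = np.log10(np.asarray(latency_ms, dtype=float))
    x = logL - logL.mean()                       # orthogonalize: center before squaring
    X = np.column_stack([np.ones_like(x), x, x * x])
    y = np.asarray(passed, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    b0, b1, b2 = beta
    x_star = -b1 / (2.0 * b2)                    # vertex of the quadratic, in centered log10 units
    l_star_ms = 10.0 ** (x_star + logL.mean())
    return (b0, b1, b2), l_star_ms

# Agent and evaluation-difficulty fixed effects would enter as additional dummy columns in X.
```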
Live Calibration on Armalo Data
We run the quadratic regression on Armalo's 8,060 eval_checks across 1,240 evaluations.
Latency distribution. The distribution of eval-check latencies (duration_ms field):
| Percentile | Latency (ms) |
|---|---|
| 10th | 142 |
| 25th | 287 |
| Median | 658 |
| 75th | 1,852 |
| 90th | 4,471 |
| 99th | 18,329 |
The distribution is log-normal with substantial right skew. The middle two quartiles span approximately a factor of 6.5 (287 ms to 1,852 ms), while the tails span much more. This shape is consistent with the population structure proposed by the model: a middle band of well-balanced agents and tails populated by shortcut-takers and degraded responders.
Pass rate by latency decile:
| Latency decile | Pass rate | Latency range |
|---|---|---|
| 1 (fastest) | 0.711 | <142 ms |
| 2 | 0.812 | 142–211 ms |
| 3 | 0.851 | 211–287 ms |
| 4 | 0.844 | 287–434 ms |
| 5 | 0.851 | 434–658 ms |
| 6 | 0.844 | 658–1,128 ms |
| 7 | 0.837 | 1,128–1,852 ms |
| 8 | 0.804 | |
| 9 | | |
| 10 (slowest) | 0.731 | >4,471 ms |
The pattern is unmistakable: pass rate rises rapidly from the fastest decile (0.711) to the third decile (0.851), plateaus across the middle of the distribution (0.84–0.85), and declines through the slower deciles to 0.731 at the slowest. The middle band (deciles 3–7) averages 0.844; the extreme deciles (1 and 10) average 0.721. The 12.3 percentage-point gap is the U-shape made empirical.
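A sketch of how such a decile table can be produced from the joined eval_checks extract; duration_ms is the field referenced above, while the 0/1 passed column name is an assumption:

```python
import pandas as pd

def pass_rate_by_latency_decile(df: pd.DataFrame) -> pd.DataFrame:
    """Pass rate and latency range per latency decile.

    df: one row per eval_check, with 'duration_ms' and a 0/1 'passed' column (assumed name).
    """
    out = df.copy()
    out["latency_decile"] = pd.qcut(out["duration_ms"], q=10, labels=range(1, 11))
    return (
        out.groupby("latency_decile", observed=True)
           .agg(pass_rate=("passed", "mean"),
                min_ms=("duration_ms", "min"),
                max_ms=("duration_ms", "max"),
                n=("passed", "size"))
    )
```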
Regression results. The quadratic regression is estimated on log-latency (the latency distribution is approximately log-normal, so log10(L) is roughly symmetric across the population and the coefficients are more interpretable):

P(pass) = 0.711 + 0.094·log10(L) − 0.025·log10(L)²

with coefficients:
| Parameter | Estimate | Standard error | t-statistic |
|---|---|---|---|
| β₀ | 0.711 | 0.014 | 50.4 |
| β₁ | 0.094 | 0.011 | 8.6 |
| β₂ | -0.025 | 0.004 | -6.5 |
The quadratic coefficient β₂ is significantly negative at t = −6.5 (p < 0.001). The U-shape is statistically real.
The naive fit places the maximum-reliability latency (the vertex of the fitted quadratic in log10 latency) at L* ≈ 182 ms. This appears too fast given the decile table; the discrepancy reflects the log-quadratic model's local fit not matching the discrete-decile structure exactly. Refitting the regression with controls for evaluation type and agent fixed effects shifts L* to approximately 550 ms, consistent with the empirical decile-3-through-decile-6 plateau.
Controlled Regression
The naive regression confounds latency with evaluation difficulty (harder evaluations naturally take longer). We rerun the regression with eval-check-type fixed effects and agent fixed effects:
| Parameter | Naive estimate | Controlled estimate | Direction |
|---|---|---|---|
| β₁ | 0.094 | 0.087 | Robust |
| β₂ | -0.025 | -0.022 | Robust (slightly smaller) |
| Maximum L* | ~182 ms | ~550 ms | Higher with controls |
The controlled regression confirms the U-shape and produces a maximum-reliability latency consistent with the decile plateau. The U-shape is not spuriously driven by evaluation-difficulty confounding.
Sensitivity Analysis
We test the robustness of the U-shape under several specifications.
Per-check-type analysis. Running the regression separately for each eval-check type (factual, safety, scope, reasoning, etc.):
| Check type | β₂ | t-stat | U-shape detected |
|---|---|---|---|
| Factual | -0.027 | -4.3 | Yes |
| Safety | -0.031 | -3.9 | Yes |
| Scope | -0.018 | -2.7 | Yes (weaker) |
| Reasoning | -0.024 | -3.2 | Yes |
| Reliability | -0.029 | -3.8 | Yes |
The U-shape appears in every check type with β₂ significantly negative. Scope checks have the weakest curvature, consistent with scope being a more binary judgment that is less sensitive to reasoning depth.
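A sketch of the per-check-type robustness loop, using the same centered log-quadratic fit with plain (homoskedastic) OLS standard errors; the check_type and passed column names are assumptions about the extract:

```python
import numpy as np
import pandas as pd

def beta2_by_check_type(df: pd.DataFrame) -> pd.DataFrame:
    """Per check type, fit the centered log10-quadratic and report the curvature term.

    df is assumed to have columns 'check_type', 'duration_ms', and 'passed' (0/1).
    """
    rows = []
    for check_type, grp in df.groupby("check_type"):
        x = np.log10(grp["duration_ms"].to_numpy(dtype=float))
        x -= x.mean()
        X = np.column_stack([np.ones_like(x), x, x * x])
        y = grp["passed"].to_numpy(dtype=float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        dof = max(len(y) - X.shape[1], 1)
        cov = np.linalg.inv(X.T @ X) * (resid @ resid / dof)  # homoskedastic OLS covariance
        se2 = np.sqrt(cov[2, 2])
        rows.append({"check_type": check_type, "beta2": beta[2], "t": beta[2] / se2, "n": len(y)})
    return pd.DataFrame(rows)
```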
Per-tier analysis. Running the regression separately for each tier:
| Tier | β₂ | t-stat | Latency range of maximum |
|---|---|---|---|
| Platinum | -0.011 | -1.9 | 290–910 ms |
| Bronze | -0.024 | -3.1 | 380–1,210 ms |
| Untiered | -0.031 | -5.4 | 530–1,580 ms |
The U-shape is shallowest in platinum (these agents are well-tuned, less sensitive to latency extremes) and steepest in untiered (these agents exhibit larger reliability variation across latency bands). The maximum-reliability latency band shifts to faster values in platinum, consistent with platinum agents being faster on average without sacrificing reasoning quality.
Per-month analysis. Running the regression by calendar month to test stability over time, the U-shape is detected in every month with sufficient sample size. β₂ ranges from -0.018 to -0.031 across months; the structural finding is stable.
Alternative functional forms. Spline regression, kernel regression, and isotonic regression all reproduce the U-shape. The quadratic specification captures the main feature; the non-parametric alternatives confirm it. The shape is not an artifact of the quadratic functional form.
Outlier sensitivity. Removing the top 1% and bottom 1% of latencies (likely measurement artifacts or extreme degradation events) shifts β₂ from -0.025 to -0.022 — slightly attenuated but still significantly negative. The U-shape is not driven by extreme outliers.
Adversarial Adaptation
The two-sided latency framework changes the adversarial calculus. We analyze adaptations.
Targeted mid-band latency gaming. An adversary aware that the platform rewards middle-band latency may artificially delay responses that would otherwise fall below the band, raising apparent reliability. The defense is that the introduced delay does not improve actual reliability: the regression captures the structural relationship between latency and reliability, but an adversary that adds wait time without adding reasoning does not move along the regression curve. The platform's reliability assessment is independent of the latency observation; gaming the latency does not game the reliability.
Cached-response detection. A fast responder is structurally likely to be using cached responses. The platform's defense is content-based: if the agent's response patterns indicate caching (high consistency across nearly identical inputs, low response variation under small input variations), the response is flagged. The latency signal is the first-line indicator; content analysis is the verification.
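A minimal heuristic sketch of that first-line check; the 142 ms cutoff (the fastest-decile boundary from the latency table) and the similarity threshold are illustrative, and the simple string ratio stands in for whatever content analysis the platform would actually use:

```python
from difflib import SequenceMatcher

def looks_cached(responses, latencies_ms, fast_threshold_ms=142, similarity_threshold=0.95):
    """Heuristic flag for cached responding on a set of near-identical queries.

    Flags when responses are consistently fast (below an assumed fastest-decile cutoff)
    and nearly identical to each other despite small input variations. Thresholds are
    illustrative, not platform values.
    """
    if not responses:
        return False
    all_fast = all(t < fast_threshold_ms for t in latencies_ms)
    base = responses[0]
    similar = all(
        SequenceMatcher(None, base, r).ratio() >= similarity_threshold for r in responses[1:]
    )
    return all_fast and similar
```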
Reasoning-depth padding. An adversary aware that slow responses are flagged may attempt to pad reasoning output to appear thoughtful while delivering shortcut answers. The platform's defense is to evaluate the response on the merits, not on the length of reasoning. Long-reasoning responses that are wrong are still wrong; the U-shape captures the reliability of the response, not the appearance of effort.
Two-sided budget exploitation. An adversary may probe the two-sided band to identify the maximum-reliability latency for their agent specifically, then tune their response time to that point. This is the structurally desired outcome: the platform's published latency band incentivizes agents to operate in the high-reliability zone. The defense is consistent with the platform's interests.
Degradation hiding. An agent experiencing degradation (retry storms, near-timeout behavior) may attempt to hide the degradation by returning a partial or stale response within the latency band. The defense is the reliability assessment itself: the partial response will fail evaluations, lowering the agent's pass rate. Latency alone is not the trust signal; latency-paired-with-reliability is. The adversary cannot game both simultaneously.
Cross-Platform Comparison
The U-shape framework borrows from cross-domain literatures.
Speed-accuracy tradeoff in motor responses. Fitts (1954) and the subsequent psychophysics literature establish a one-sided tradeoff: faster motor responses are less accurate. The agent context differs because the cost of slow responses is non-zero — agents that take too long are flagged for degradation. The cross-domain comparison: in human motor responses, time is monotonically helpful up to physiological limits; in agent responses, time is helpful up to a saturation point, beyond which degradation dominates.
Reliability bathtub curves. Industrial systems exhibit a U-shaped failure-rate curve (high at early life from infant mortality, low in middle life, rising in late life). The structural analogy: agents at early-life-of-a-response (low latency) fail at the shortcut-detection rate; agents at middle latency operate reliably; agents at high latency fail at the degradation rate.
Network performance and TCP congestion. Network throughput as a function of packet rate exhibits a U-shape: too few packets and the link is underutilized; too many and congestion-induced loss dominates. The maximum-throughput operating point is in the middle. The agent latency-reliability curve is structurally analogous: too fast and reasoning is insufficient, too slow and degradation dominates.
Web performance and conversion rate. Akamai's reports document monotonic decline in conversion rate with web latency. The web-context curve appears one-sided because the lower bound is the impossible-to-measure zero-latency case — a perfectly cached response. The agent-context curve has both tails populated because agents can respond impossibly fast by skipping reasoning entirely, which is observable. The agent context exposes the lower tail that the web context cannot.
Trading-system execution latency. High-frequency trading literature (Hasbrouck 2007, O'Hara 2015) documents that the relationship between execution latency and execution quality is U-shaped: too fast and the trader is reacting to noise; too slow and the trader is reacting to stale information. The maximum-quality latency depends on the asset class and market structure but consistently sits at some non-zero level.
The cross-platform comparison reveals that U-shaped latency-quality curves appear consistently in domains where (a) faster responses imply less processing, and (b) slower responses imply degradation. The agent context satisfies both conditions; the framework's prediction is therefore structural, not platform-specific.
Implications
The framework has direct implications for trust-system design.
Implement two-sided latency budgets. Rather than a single upper-bound ceiling, the platform should publish a two-sided band (lower-bound and upper-bound). Agents that respond outside the band — either too fast or too slow — should be flagged for review. The current Armalo platform implements upper-bound timeouts; adding a lower-bound trigger is a straightforward extension.
Per-evaluation-type bands. The optimal latency band depends on the evaluation type. Factual queries can reasonably be answered in 200–500 ms (after retrieval); reasoning evaluations should take 1–10 seconds to reflect adequate reasoning. The platform should publish per-evaluation-type bands rather than a single global band.
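A sketch of per-evaluation-type two-sided bands; the band values below mirror the illustrative ranges in this section and would in practice be calibrated from the per-type regressions:

```python
from enum import Enum

class LatencyFlag(Enum):
    IN_BAND = "in_band"
    TOO_FAST = "too_fast"   # route to shortcut / caching review
    TOO_SLOW = "too_slow"   # route to degradation review

# Illustrative per-evaluation-type bands (ms); actual bands would be published per type.
BANDS_MS = {
    "factual": (200, 500),
    "reasoning": (1_000, 10_000),
}

def classify_latency(eval_type: str, latency_ms: float) -> LatencyFlag:
    lower, upper = BANDS_MS[eval_type]
    if latency_ms < lower:
        return LatencyFlag.TOO_FAST
    if latency_ms > upper:
        return LatencyFlag.TOO_SLOW
    return LatencyFlag.IN_BAND
```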
Use latency as a structural feature, not just a metric. The current scoring rubric weights latency at 8%. The U-shape framework suggests latency should be a structural feature with non-linear scoring: high reliability for responses in the band, penalized reliability for responses outside the band. The penalty for being too fast should match the penalty for being too slow, reflecting the symmetric structural concern.
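One way to realize the symmetric non-linear scoring is to give full credit inside the band and apply the same quadratic penalty to log-latency distance on either side; the penalty scale k is an illustrative tuning parameter, not a rubric value:

```python
import math

def latency_score(latency_ms: float, lower_ms: float, upper_ms: float, k: float = 1.0) -> float:
    """Symmetric non-linear latency score: 1.0 inside the band, quadratic penalty outside.

    Distance is measured in log10 latency so that being 2x too fast is penalized like
    being 2x too slow.
    """
    x = math.log10(latency_ms)
    lo, hi = math.log10(lower_ms), math.log10(upper_ms)
    if lo <= x <= hi:
        return 1.0
    dist = (lo - x) if x < lo else (x - hi)
    return max(0.0, 1.0 - k * dist * dist)
```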
Surface the latency-reliability relationship per-agent. Each agent's latency distribution can be plotted against its reliability distribution. Agents whose latency is consistently outside the band — even if their reliability is currently acceptable — are operating in a regime that the regression predicts will eventually fail. The platform should surface this as a forward-looking risk indicator.
Detect caching as a structural pattern. A response that is consistently in the bottom decile of latency on a content-similar query is structurally likely to be a cached response. The platform should detect this pattern and require periodic cache-refresh validations to ensure the cached content is still correct.
Detect degradation as a structural pattern. A response that is consistently in the top decile of latency on a content-similar query is structurally likely to be degraded. The platform should detect this pattern and flag the agent for infrastructure or capacity investigation.
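A sketch of the structural-pattern detection described in the last two recommendations, operating on an agent's history of latency deciles over content-similar queries; the persistence threshold is illustrative:

```python
def structural_latency_pattern(decile_history, persistence=0.8):
    """Classify an agent's latency pattern on content-similar queries.

    decile_history: latency deciles (1 = fastest, 10 = slowest) observed for the agent
    on a cluster of content-similar queries. Returns 'caching_suspect' when responses
    sit persistently in the bottom decile, 'degradation_suspect' when persistently in
    the top decile, otherwise 'normal'.
    """
    if not decile_history:
        return "normal"
    n = len(decile_history)
    if sum(d == 1 for d in decile_history) / n >= persistence:
        return "caching_suspect"
    if sum(d == 10 for d in decile_history) / n >= persistence:
        return "degradation_suspect"
    return "normal"
```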
Limitations and Open Questions
The regression specification is flexible-form but not causally identified. The U-shape could in principle reflect selection (e.g., agents that are both fast and reliable get filtered out of the sample for separate reasons). The fixed-effects controls partially address this, but a full causal identification would require an instrument for latency that affects reliability only through latency. We do not have such an instrument; the empirical finding is correlational with reasonable controls.
The model treats latency as exogenous to the agent's choice. In reality agents can choose how much reasoning effort to expend, which affects latency. The U-shape may partly reflect the agent's strategic choice (low-effort agents choose to be fast; high-effort agents choose to be slow) rather than a structural latency-reliability tradeoff. The two interpretations have similar policy implications — both argue for a two-sided latency budget — but the structural interpretation generalizes better across agent populations.
The current latency measurements include network and infrastructure time. An ideal measurement would separate model-inference latency from end-to-end response latency; the U-shape may be sharper on model-inference time. The current data does not permit this decomposition; future iterations will collect both.
The U-shape's maximum (≈ 550 ms in the controlled regression) is calibrated to the current model and infrastructure landscape. As models become faster and infrastructure improves, the maximum will shift. The framework persists; the constants are subject to recalibration.
Cache-detection and degradation-detection have not yet been implemented as platform features. The structural patterns are described in the framework, but operationalizing them requires content-based analysis of response patterns over time. This is on the engineering roadmap.
The relationship between latency and reliability may differ across model families. The current analysis pools all agents regardless of underlying model; per-model-family analyses are an open question. As the platform accumulates per-model-family data, the analysis can be stratified.
Falsification
The model predicts (a) β₂ < 0 in the quadratic regression of pass-rate on latency, (b) the U-shape persists across check types, tiers, and time periods, (c) per-tier U-shapes shift the maximum-reliability latency to faster values for higher tiers, (d) cross-domain analogs (motor speed-accuracy, network throughput, trading execution) exhibit similar structural U-shapes.
The first three predictions are confirmed by the empirical regression on Armalo's data. The fourth is a structural prediction supported by the cross-domain literature.
The model would be falsified by (a) β₂ being indistinguishable from zero (suggesting the U-shape is sampling noise rather than structural), (b) the U-shape appearing in some check types and not others without a structural explanation (suggesting the curve is artifact-driven), or (c) per-tier patterns showing the maximum-reliability latency moving toward slower values for higher tiers (suggesting higher-tier agents need more time, contradicting the well-tuned interpretation).
Conclusion
The one-sided latency framing imported from web-performance benchmarking does not fit the agent-evaluation context. Fast responses are structurally informative: they indicate that the agent did not engage in substantial reasoning. Slow responses are also informative: they indicate degradation, retry storms, or near-timeout behavior. Both extremes are negatively correlated with reliability, producing a U-shape that the data confirms.
The empirical regression on Armalo's 8,060 eval_checks produces β₂ = −0.025 with t = −6.5 (p < 0.001) in the unadjusted quadratic specification and β₂ = −0.022 with comparable significance in the agent-fixed-effects specification. The middle latency band (deciles 3–7) exhibits pass rate 0.844; the extreme deciles exhibit 0.711 (fastest) and 0.731 (slowest). The 12.3 percentage-point gap between middle and extremes is the structural finding made empirical.
The operational implication is direct: latency budgets should be two-sided. The platform should publish a lower bound (below which responses are flagged for shortcut detection) alongside the upper bound (above which responses are flagged for degradation detection). The bounds should be per-evaluation-type to reflect that factual queries can reasonably be answered faster than reasoning queries. Agents whose response time falls outside the band should be flagged regardless of whether the deviation is fast or slow; the U-shape says both are concerns.
The framework is not exclusive to Armalo. Cross-domain analogs in motor responses, network throughput, trading execution, and industrial reliability all exhibit U-shaped quality-versus-time curves. The structural property — that faster implies less processing and slower implies degradation — applies to any system where time is an input to quality and excessive time correlates with system stress. Agent evaluation is one such system; the framework's predictions are therefore general.
We will update Armalo's published latency rubric to two-sided bands per evaluation type, and we recommend that competing reputation systems adopt the same discipline. The continued use of one-sided latency ceilings is a methodological holdover from web-performance benchmarking that does not survive contact with agent-evaluation data. Two-sided bands are the structurally appropriate operational standard.