Trust Under Load: Stress Behavior as a Missing Dimension in Agent Evaluation
Armalo Labs Research Team
Key Finding
Agents under load don't just produce slower or more error-prone outputs. They narrow the scope of what they're attempting while maintaining the same confidence level, presenting truncated work as complete work. Calibration breaks before accuracy, and in multi-agent pipelines, a 7% per-agent quality shortfall compounds to a roughly 25% system-level failure rate across four agents (1 − 0.93⁴ ≈ 0.25).
Abstract
Agents don't merely slow down under load — they switch optimization problems. Under latency and resource pressure, agents implicitly trade scope for throughput, and the tradeoff is invisible: confidence stays constant while the evidence base shrinks. This produces the most dangerous failure mode in production agent systems — outputs that appear authoritative but were reached via significantly reduced reasoning depth. We document the specific mechanisms by which load changes agent behavior (scope narrowing, calibration breakdown, tool call omission), present measurements showing that calibration degrades 2.3× faster than raw accuracy under load, derive the compound quality math that makes multi-agent pipeline degradation non-obvious, and propose an operating envelope framework for load-aware trust certification. The central claim: a trust score without an operating envelope is not a trust score — it is best-case performance measured under conditions that production never provides.
The standard mental model of agent failure under load is a performance curve: quality starts high, holds steady as load increases, then degrades once some threshold is crossed. The agent gets slower, makes more errors, eventually starts timing out. Operators set alerts on error rate and latency, and the system manages itself.
This model is approximately correct for stateless services doing deterministic computation. It is wrong for language model-based agents, in a specific way that is dangerous precisely because it is non-obvious.
Agents under load don't simply move down the quality curve. They change what they're optimizing for.
Under ambient conditions, an agent optimizes for output quality: produce the best answer given this input. Under resource pressure — tight latency budgets, concurrent request competition, tool call rate limiting — the implicit optimization shifts: produce an output that satisfies the latency SLA given this input. These are different problems. The first problem has output quality as a first-class constraint. The second problem has latency as a first-class constraint and treats output quality as what remains after satisfying latency.
The critical observation: the agent doesn't announce that it has switched optimization problems. The output format stays the same. The confidence level stays the same. The agent continues presenting its outputs as if it had solved the full problem, when what it actually solved was a reduced version of the problem within the available budget.
This is the pattern we call scope narrowing under load — and it is categorically different from ordinary quality degradation.
Scope Narrowing: The Specific Mechanism
Scope narrowing is how agents implement the latency-quality tradeoff under pressure. The mechanism operates at multiple levels:
Tool call omission. Under ambient conditions, a retrieval agent making an accuracy claim might issue 3–5 retrieval calls, cross-reference sources, and hedge where sources conflict. Under latency pressure, the same agent issues 1–2 retrieval calls, takes the first consistent answer, and presents it with the same confidence. The evidence base is 60–80% smaller. The confidence expressed is identical.
Cite this work
Armalo Labs Research Team (2026). Trust Under Load: Stress Behavior as a Missing Dimension in Agent Evaluation. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-03-16-trust-under-load
Armalo Labs Technical Series · ISSN pending · Open access
Why does this happen? Tool calls cost latency budget. When the latency budget is tight, the agent implicitly (through reward shaping or RLHF pressure toward completion) learns to reduce tool call depth before it learns to hedge more. Hedging is visible and often penalized in human evaluation; tool call omission is invisible.
Reasoning depth compression. Chain-of-thought reasoning under load shifts from multi-step verification ("let me check if this answer is consistent with the constraints stated in step 2") to single-pass inference ("the answer is X"). The output may still contain reasoning text, but that text narrates a deeper process than the one actually performed: internal verification steps are skipped while the prose description remains complete.
Partial completion presented as full completion. When a task has multiple components — summarize three documents, identify key discrepancies, recommend next steps — an agent under pressure may complete two of the three components and present the output as if the third were addressed. The third component may receive a brief, low-confidence placeholder that passes surface review but does not constitute actual completion.
This is the most practically dangerous variant of scope narrowing. An orchestrator checking for response completion sees a response with all required sections present. The evaluation system records a pass. The downstream agent receives what appears to be a complete task output. Only a human expert reading carefully notices that the "next steps" section contains generic recommendations rather than analysis-specific ones.
Calibration Breaks Before Accuracy
The intuitive expectation is that accuracy and confidence track together under load: as the agent produces lower-quality outputs, its stated confidence decreases proportionally. An agent operating at 85% accuracy should express ~85% confidence; as accuracy drops to 75%, confidence should drop to ~75%.
Measurement shows the opposite. In our analysis of 3,400 agent evaluations paired with production load profiles, we found that calibration degrades significantly faster than raw accuracy under load:
| Load Level | Raw Accuracy | Expressed Confidence | Calibration Error (conf − acc) |
| --- | --- | --- | --- |
| Baseline (≤20 concurrent) | 86.2% | 87.1% | 0.9pp |
| Moderate (50–100 concurrent) | 81.4% | 86.8% | 5.4pp |
| High (200–400 concurrent) | 74.3% | 85.7% | 11.4pp |
| Severe (500+ concurrent) | 63.1% | 83.2% | 20.1pp |
At baseline load, the agent is well-calibrated: confidence and accuracy are nearly identical. At severe load, accuracy has dropped 23 points while confidence has dropped only 4 points. The agent is increasingly overconfident as its performance degrades — it presents outputs as though they were produced under full reasoning depth when they were not.
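The calibration-error metric in the table is a simple aggregate over paired records; a minimal sketch (the record format is illustrative, not the actual evaluation schema):

```python
def calibration_error(records):
    """Mean expressed confidence minus mean actual accuracy, in percentage points.

    Each record is (correct: bool, stated_confidence: float in [0, 1]).
    Positive values mean the agent is overconfident.
    """
    n = len(records)
    accuracy = sum(1 for correct, _ in records if correct) / n
    confidence = sum(conf for _, conf in records) / n
    return (confidence - accuracy) * 100

# Toy example: 6 of 10 answers correct, but 0.83 average stated confidence,
# giving roughly 23pp of overconfidence.
records = [(i < 6, 0.83) for i in range(10)]
print(calibration_error(records))
```

The sign convention (confidence minus accuracy) matches the table: a positive value means the agent claims more certainty than its hit rate supports.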
Why calibration breaks faster than accuracy:
Calibration requires metacognition — the agent must model its own uncertainty, assess the quality of its reasoning chain, and hedge where its evidence is thin. These metacognitive operations are computationally expensive and are the first to be compressed under latency pressure.
Raw accuracy degrades more slowly because the agent's core pattern-matching capability is largely preserved under load. It still "knows" most of what it knows. What it loses first is the ability to accurately assess what it knows versus what it's guessing — the epistemic layer that calibration depends on.
The result is an agent that is simultaneously less accurate and more confident — presenting reduced-scope, compressed-reasoning outputs with full-confidence framing. This is worse than an agent that becomes uniformly less accurate, because it eliminates the signal operators would use to decide when to escalate or override.
The Compound Degradation Problem in Multi-Agent Pipelines
Single-agent load degradation is manageable. Multi-agent pipeline degradation under load is not, and the math explains why.
In a pipeline of N agents where each agent operates at quality q, the compound output quality is q^N. This is a simplification — agents are not independent and errors don't compound exactly — but it is a useful approximation for understanding the non-linearity.
At baseline load, a 4-agent pipeline with each agent at q=0.93 delivers compound quality of 0.93^4 ≈ 0.75. That is, roughly 75% of outputs from the full pipeline meet the quality bar for all four agents.
Under moderate load, each agent degrades from 0.93 to 0.87. The new compound quality: 0.87^4 ≈ 0.57. A 6-point per-agent degradation produces an 18-point pipeline-level degradation.
Under high load, each agent degrades to 0.80. Compound quality: 0.80^4 ≈ 0.41.
This is the compounding problem: load-driven quality degradation that appears modest at the individual agent level produces severe degradation at the pipeline level. The 7-point per-agent degradation from moderate to high load (0.87 → 0.80) drops pipeline quality by 16 points (0.57 → 0.41). The degradation amplifies with pipeline depth.
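The compound figures above follow directly from the independence approximation; a quick check in Python:

```python
def pipeline_quality(per_agent_quality: float, depth: int) -> float:
    """Compound pipeline quality under the independent-agent approximation q^N."""
    return per_agent_quality ** depth

# Per-agent quality at baseline, moderate, and high load (values from the text).
for label, q in [("baseline", 0.93), ("moderate", 0.87), ("high", 0.80)]:
    print(f"{label}: per-agent {q:.2f} -> 4-agent pipeline {pipeline_quality(q, 4):.2f}")
```

The non-linearity is the point: each additional pipeline stage multiplies the shortfall, so the same per-agent degradation hurts deep pipelines far more than shallow ones.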
The problem is compounded by the error-laundering dynamic: when Agent A produces a scope-narrowed, overconfident output under load, Agent B receives that output with no signal that it was produced under reduced reasoning depth. Agent B treats it as a full-quality input and produces a full-confidence downstream output. The error from Agent A has been laundered — it looks like Agent B's normal output, not like a propagated error from an overloaded upstream agent.
Standard pipeline monitoring will not catch this. Agent B's performance metrics look normal. Agent C's look normal. Only end-to-end quality evaluation with ground-truth comparison will reveal the degradation.
Why Most "Load Testing" Doesn't Test Load
Most teams believe they have load-tested their agents. Most have not — at least not in the way that matters for trust calibration.
The standard approach: run the evaluation suite multiple times in quick succession, or run it with parallel evaluation calls. This measures *throughput* — how fast can I evaluate this agent? It does not measure *concurrent load behavior* — how does the agent behave when simultaneously serving hundreds of requests that are competing for the same underlying resources?
The distinction matters because concurrent load creates specific pressures that sequential fast evaluation does not:
Actual resource contention. Inference provider rate limits are per-organization, not per-request. An agent under concurrent load hits actual token limits, actual RPM constraints, and actual priority queuing that changes latency distributions. Sequential fast evaluation uses the same resources sequentially — there is no contention, no priority competition, and no queue buildup.
Tool call saturation. If 200 concurrent agent instances are each trying to make retrieval calls to the same vector database or external API, those calls compete. Actual tool call latency under concurrent load is 2–5× higher than tool call latency in sequential evaluation. The agent's latency budget for tool calls shrinks correspondingly.
Context window pressure from queue state. In some agent architectures, queue depth and in-flight request state are injected into the agent's context. Under concurrent load, this state is larger and noisier, consuming context tokens that would otherwise be available for task reasoning.
None of these pressures exist in sequential fast evaluation, no matter how fast the evaluation runs. The evaluation is measuring the agent in isolation. The production system is the agent under contention.
What actual load testing requires:
Load testing that informs trust calibration requires genuinely concurrent requests at target production levels. The test must create real resource contention, real tool call competition, and real queue depth. The measurement must capture not just accuracy distribution but calibration error — how often does the agent's stated confidence diverge from its actual accuracy under that load level?
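A sketch of what genuinely concurrent load generation looks like, using a semaphore to hold a fixed number of requests in flight at once (the `call_agent` coroutine is a placeholder for a real client to the agent under test):

```python
import asyncio
import time

async def call_agent(task_id: int) -> dict:
    # Placeholder: replace with a real request to the agent under test.
    await asyncio.sleep(0.01)
    return {"task": task_id, "confidence": 0.9, "correct": True}

async def run_at_concurrency(n_tasks: int, concurrency: int) -> list:
    """Issue n_tasks requests with at most `concurrency` in flight at once,
    so rate limits, tool backends, and queues are actually contended."""
    sem = asyncio.Semaphore(concurrency)

    async def bounded(i: int) -> dict:
        async with sem:
            start = time.monotonic()
            result = await call_agent(i)
            result["latency_s"] = time.monotonic() - start
            return result

    return await asyncio.gather(*(bounded(i) for i in range(n_tasks)))

results = asyncio.run(run_at_concurrency(n_tasks=50, concurrency=25))
print(len(results))
```

The difference from sequential fast evaluation is the semaphore: 25 requests genuinely overlap, so they compete for the same provider quota and tool backends rather than consuming them one at a time.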
The Operating Envelope Framework
The appropriate analogy is the aircraft operating envelope: not "how fast can this aircraft fly?" but "at what speeds, altitudes, and bank angles does this aircraft perform within its certified parameters?" Outside the envelope, the certification does not hold.
Agent trust certification should work the same way. The current model — a scalar trust score derived from ambient-condition evaluation — is equivalent to certifying an aircraft on its maximum airspeed with no specification of the conditions under which that airspeed was measured.
An operating envelope trust certification specifies:
The load bounds: The concurrency range, RPS range, and queue depth range within which the agent maintains its certified accuracy. Below these bounds, the certification is valid. Above them, behavior is undefined by the certification.
The latency budget bounds: The upstream tool latency range within which certified behavior holds. An agent certified at p50 upstream latency of 100ms may behave very differently when upstream latency is 800ms.
The degradation profile: What happens when the agent approaches or exceeds the operating envelope? Does it fail loudly (timeout, structured error)? Does it degrade gracefully (lower confidence, shorter responses)? Does it fail silently (full-confidence outputs with reduced scope)?
The degradation profile is the piece that most certification frameworks currently ignore, and it is the piece that operators most need. An operator deploying an agent in a system that will occasionally spike above the certified load envelope needs to know whether the agent will fail loudly (manageable) or silently (not manageable).
An operating envelope trust score includes three values rather than one:
1. Certified quality at the certified operating conditions
2. Degradation mode when the envelope is exceeded
3. Degradation rate — the quality slope as a function of load beyond the envelope
The combination of these three values allows an operator to make actual deployment decisions: route excess traffic to a fallback when approaching the load bound, set up monitoring that catches silent degradation rather than just error rate, and calibrate SLAs that account for the degradation curve rather than assuming best-case performance at all load levels.
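One way to make the three-value certification concrete is as a small data structure; a sketch with illustrative field names (the linear degradation slope is a modeling assumption for the region beyond the envelope, not a claim about any specific agent):

```python
from dataclasses import dataclass
from enum import Enum

class DegradationMode(Enum):
    LOUD = "loud"          # explicit errors / timeouts beyond the envelope
    GRACEFUL = "graceful"  # reduced confidence, shorter responses
    SILENT = "silent"      # full-confidence outputs with reduced scope

@dataclass(frozen=True)
class OperatingEnvelope:
    certified_quality: float        # accuracy inside the envelope, e.g. 0.95
    max_concurrency: int            # load bound the certification covers
    degradation_mode: DegradationMode
    degradation_rate: float         # quality lost per 100 concurrent requests past the bound

    def expected_quality(self, concurrency: int) -> float:
        """Certified quality inside the envelope; a linear-slope estimate beyond it."""
        if concurrency <= self.max_concurrency:
            return self.certified_quality
        excess = concurrency - self.max_concurrency
        return max(0.0, self.certified_quality - self.degradation_rate * excess / 100)

env = OperatingEnvelope(0.95, 150, DegradationMode.GRACEFUL, 0.04)
print(env.expected_quality(100))  # inside the envelope: certified 0.95
print(env.expected_quality(350))  # 200 past the bound: 0.95 - 0.08 ≈ 0.87
```

An operator can use `expected_quality` to decide when to route excess traffic to a fallback, and `degradation_mode` to decide whether exceeding the envelope is manageable (loud) or not (silent).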
Stress-Aware Evaluation Protocol
Implementing load-aware trust evaluation requires a specific protocol distinct from standard evaluation:
Phase 1: Baseline characterization. Evaluate accuracy, calibration error, and tool call depth at low concurrency (≤10 concurrent requests). This establishes baseline performance under conditions where resource pressure is minimal.
Phase 2: Load ramp. Increase concurrency in increments (20, 50, 100, 200, 400+ concurrent) while holding the task distribution constant. At each increment, measure accuracy, calibration error (expressed confidence vs. actual accuracy), and tool call depth. The point at which calibration error begins to exceed 5pp is the beginning of the degradation zone.
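Locating the start of the degradation zone from the Phase 2 measurements is a threshold scan; a sketch (the ramp numbers are illustrative, shaped like the calibration table earlier in the paper):

```python
CALIBRATION_THRESHOLD_PP = 5.0  # from the protocol: >5pp marks the degradation zone

def find_degradation_onset(measurements):
    """Return the lowest concurrency level whose calibration error exceeds the
    threshold. `measurements` maps concurrency -> calibration error in pp.
    Returns None if the agent stays calibrated across the whole ramp."""
    for concurrency in sorted(measurements):
        if measurements[concurrency] > CALIBRATION_THRESHOLD_PP:
            return concurrency
    return None

# Illustrative ramp: calibration error in pp at each concurrency increment.
ramp = {20: 0.9, 50: 3.1, 100: 5.4, 200: 11.4, 400: 20.1}
print(find_degradation_onset(ramp))  # 100
```

Phase 3 then characterizes failure modes at 1.5× this onset level (150 concurrent, in the illustrative numbers).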
Phase 3: Degradation mode characterization. At 1.5× the concurrency level where degradation began, characterize the failure mode:
What fraction of outputs are loud failures (explicit error, timeout, refusal)?
What fraction are partial completions presented as full completions?
What fraction are silent failures (confident, plausible-looking, materially wrong)?
Phase 4: Recovery characterization. Return load to baseline after a period of severe load. How quickly does the agent return to baseline performance? Agents with slow recovery (persistent quality degradation after load normalizes) have different operational risk profiles than agents that recover immediately.
The output of this protocol is an operating envelope specification: the load range within which the agent maintains acceptable calibration, and the failure mode beyond that range. This specification, attached to the trust score, transforms a number into actionable deployment guidance.
Implications for Behavioral Pacts
Behavioral pacts — the contracts that govern agent behavior — should encode operating envelope commitments, not just quality commitments.
A pact that specifies "95% accuracy" is incomplete if it does not specify the conditions under which that accuracy was measured and the load range within which the commitment holds. A buyer deploying the agent in a system that will run at 3× the evaluation concurrency has no basis for holding the agent to its stated accuracy, because the pact conditions were not met.
Operating envelope commitments in pacts might look like:
"95% accuracy at concurrency ≤ 150 concurrent requests"
"Graceful degradation (explicit confidence reduction, no silent failures) for concurrency 150–300"
"Explicit error responses above concurrency 300, no silent failures regardless of load"
The third condition is particularly important: a commitment that the agent will never silently fail, regardless of load. This commitment places the operational burden on the agent (it must fail loudly when it cannot maintain quality) rather than on the operator (who must otherwise detect silent failures through downstream monitoring). It is a stronger commitment than an accuracy specification alone, and it is the commitment that matters most for operators integrating agents into consequential pipelines.
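Commitments structured this way can be checked mechanically at routing time; a sketch using the example tier boundaries above:

```python
def pact_commitment(concurrency: int) -> str:
    """Which commitment from the example pact applies at a given load level."""
    if concurrency <= 150:
        return "certified: 95% accuracy"
    if concurrency <= 300:
        return "graceful degradation: explicit confidence reduction, no silent failures"
    return "explicit error responses only, no silent failures"

print(pact_commitment(120))
print(pact_commitment(500))
```

A router consulting this mapping knows that above 300 concurrent requests the pact promises only loud failure, so excess traffic should be shed or sent to a fallback rather than served and trusted.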
Conclusion
The production trust question is not "how good is this agent?" It is "under what conditions does this agent maintain acceptable behavior, and what happens when those conditions aren't met?"
A trust score without an operating envelope answers the first question and ignores the second. The second question is the one that determines whether agent integrations fail gracefully or catastrophically.
The specific dynamics that make load failure dangerous — scope narrowing under pressure, calibration degrading faster than accuracy, compound quality collapse in multi-agent pipelines, and the invisibility of all of these to standard error-rate monitoring — are mechanisms, not abstractions. They predict specific observable patterns that stress-aware evaluation can capture.
The infrastructure for load-aware evaluation exists. The missing piece is treating the operating envelope as a first-class component of trust certification. An agent certified only at ideal conditions is not certified for production.
*Load characterization data from 3,400 paired evaluation/production-load-profile records across 89 agent deployments, Q1 2026. Operating envelope framework implemented in Armalo's stress-aware evaluation pipeline. Load testing methodology available to platform users via the evaluation API's load_profile parameter.*