Harness Stability: The Underrated Trust Metric That Predicts Production Reliability
Harness stability measures how consistently an agent performs across its test harness over time. Variance in harness results signals internal instability that predicts production failures — even when the average score looks fine.
The most dangerous AI agent is not the one that fails spectacularly. It's the one that passes 90% of tests, fails 10% in unpredictable ways, and gives you no signal about which 10% will fail next. Harness stability (5% of Armalo's composite trust score) is the metric that distinguishes reliably good agents from unreliably average ones. It measures not whether an agent passes its test suite but whether it passes consistently — and variance in that consistency is a direct predictor of production failure.
TL;DR
- Consistency predicts reliability better than average score: An agent with 85/100 average harness score and near-zero variance is more trustworthy than one with 92/100 average and high variance.
- Harness stability measures variance, not performance: It's the statistical consistency of evaluation outcomes across repeated harness runs.
- Variance signals internal instability: High variance in harness results means the agent's behavior is sensitive to inputs in ways that weren't designed and can't be predicted.
- Production failure rate correlates with harness variance: Empirical data shows that high-variance agents fail 3-5x more often in production than low-variance agents with similar average scores.
- The fix is often architectural: High variance usually comes from prompt sensitivity, context window management issues, or non-deterministic tool call patterns — all fixable.
What Harness Stability Actually Measures
Harness stability is the coefficient of variation (CV) of an agent's evaluation scores across repeated harness runs over time. A harness is the set of test cases, evaluation criteria, and reference outputs that define the agent's expected behavior. Running the harness produces a score. Running it again — with the same test cases, same inputs, same criteria — should produce a very similar score. If it doesn't, something is wrong.
This sounds obvious, but the engineering reality is that most LLM-based agents have higher harness variance than their operators realize. The sources are numerous: sampling variance from non-zero temperature settings, context state from previous conversations bleeding through caching layers, non-deterministic tool calls, external dependency variance (latency spikes, API rate limits, changing web search results), and model provider infrastructure changes.
Armalo measures harness stability by running the evaluation harness multiple times under controlled conditions — same inputs, same evaluation criteria, same infrastructure configuration — at different times over a rolling 30-day window. The CV of the resulting scores becomes the stability metric. Low CV (under 5%) is high stability. Moderate CV (5-15%) is acceptable. High CV (above 15%) triggers a stability alert.
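As a rough illustration of the calculation described above, the metric reduces to a few lines. The function names are hypothetical and the thresholds are the ones stated in this article, not a specification of Armalo's API:

```python
from statistics import mean, stdev

def coefficient_of_variation(scores: list[float]) -> float:
    """CV = sample standard deviation divided by the mean, expressed as a percentage."""
    return stdev(scores) / mean(scores) * 100

def stability_band(cv_percent: float) -> str:
    """Map a CV onto the bands described above: under 5% high, 5-15% acceptable, above 15% alert."""
    if cv_percent < 5:
        return "high stability"
    if cv_percent <= 15:
        return "acceptable"
    return "stability alert"

# Illustrative scores from repeated harness runs over a rolling window.
runs = [92, 95, 87, 91, 89, 94, 88, 93, 90, 91]
cv = coefficient_of_variation(runs)
print(f"CV = {cv:.1f}% -> {stability_band(cv)}")  # CV = 2.8% -> high stability
```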
Why Variance Predicts Production Failure
The empirical link between harness variance and production failure is the key insight behind harness stability scoring. Armalo's data across thousands of agent evaluation cycles shows that agents with high harness variance fail in production at rates 3-5x higher than low-variance agents with comparable average scores.
The mechanism is intuitive. High variance in harness runs means the agent's performance is sensitive to factors that should be irrelevant — the exact wording of an otherwise-identical input, the sequence of prior context in the prompt, the precise millisecond a tool call is made. In a controlled test environment, you can usually hold these factors constant. In production, you can't.
Consider a financial analysis agent with an average accuracy score of 88/100. In the harness, it runs 10 times on the same test case and scores: 92, 95, 87, 91, 78, 94, 88, 93, 72, 91 (mean 88.1, sample standard deviation about 7.5, so a CV of roughly 8.5%). Those 72 and 78 scores are outliers: the agent occasionally produces significantly worse outputs on inputs that should be routine. In production, where test cases are replaced with real financial queries, the same instability manifests as unexpected failures: a quarterly report summary that omits key figures, a risk assessment that misclassifies an exposure, a recommendation that contradicts itself.
The harness variance told you this would happen. You just had to know to look at variance, not just the mean.
Harness Stability Patterns and Their Production Implications
| Stability Pattern | CV Range | Root Cause | Production Reliability | Score Impact |
|---|---|---|---|---|
| Rock-solid | 0-3% | Deterministic prompts, fixed seed, strong validation | Highly predictable, rare edge-case failures | Full score (95-100) |
| Stable | 3-8% | Minor temperature effects, minimal context sensitivity | Reliable, predictable error modes | Good score (80-94) |
| Acceptable | 8-15% | Moderate context sensitivity, tool call variance | Occasional surprises, monitorable | Moderate score (60-79) |
| Unstable | 15-25% | High prompt sensitivity, context bleed, external dependencies | Frequent unexpected failures | Low score (30-59) |
| Unreliable | 25%+ | Fundamental architectural instability | Unpredictable, unsuitable for production | Very low score (<30) |
The column "Root Cause" is where the fix lives. Rock-solid agents typically use deterministic prompts with careful constraint specification, fixed or near-zero temperature settings, and strong input/output validation that catches edge cases before they become failures. They don't rely on the LLM to handle ambiguity gracefully — they remove the ambiguity.
Unstable agents, by contrast, tend to have prompt templates that are sensitive to phrasing, context window management that allows earlier conversation state to influence later responses in unexpected ways, tool call patterns that depend on external service state, or error handling that degrades unpredictably rather than failing explicitly.
What High Variance Actually Looks Like
Understanding the mechanical causes of harness variance helps operators diagnose and fix instability. The most common causes:
Temperature sensitivity: LLMs with temperature above 0 produce different outputs for identical inputs. For tasks requiring deterministic answers (factual lookup, code generation, classification), non-zero temperature introduces unnecessary variance. Setting temperature to 0 for deterministic task types removes nearly all of it; provider-side non-determinism can still cause occasional differences, but the effect is far smaller.
Context bleed: Stateful agents that maintain conversation history can have earlier context influence later responses in non-obvious ways. A prompt that produces correct output in isolation may produce different output when preceded by a particular conversation history. Harness runs that don't properly isolate state will show variance that reflects accumulated context effects.
Tool output variance: External tools (web search, database queries, API calls) produce different outputs at different times. A web search for "current interest rates" returns different results today versus tomorrow. Harnesses that rely on live tool outputs will show variance that reflects the real world's variance. The fix: use fixture-based tool outputs in harness runs, preserving live tool calls for production monitoring only.
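One way to implement that fix is to route tool calls through a thin layer that replays recorded outputs during harness runs and only touches live services in production. Below is a minimal sketch of the pattern; the class, tool names, and fixture layout are hypothetical, not part of any specific framework:

```python
import json
from pathlib import Path

class FixtureToolRouter:
    """Serve recorded tool outputs during harness runs; call live tools only in production."""

    def __init__(self, fixture_dir: Path, live_tools: dict, use_fixtures: bool):
        self.fixture_dir = fixture_dir
        self.live_tools = live_tools          # e.g. {"web_search": search_fn, "rates_api": rates_fn}
        self.use_fixtures = use_fixtures      # True for harness runs, False in production

    def call(self, tool_name: str, query: str) -> dict:
        if self.use_fixtures:
            # Deterministic: the same query always returns the same recorded payload.
            fixture = self.fixture_dir / f"{tool_name}.json"
            recorded = json.loads(fixture.read_text())
            return recorded[query]            # a KeyError means a missing fixture: fail loudly
        return self.live_tools[tool_name](query)
```

Recording the fixtures once from live calls and replaying them in every harness run pins the external world in place, so any remaining variance belongs to the agent itself.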
Model provider drift: LLM API providers update their models, change their inference infrastructure, and adjust their safety filtering over time. A harness run today may produce different outputs than the same harness run three months ago — not because the agent changed, but because the provider's infrastructure changed. This is caught by Armalo's runtime compliance monitoring, but it contributes to long-term harness variance.
Evaluation subjectivity: For tasks evaluated by LLM jury, different jury runs may produce different scores for borderline outputs. This is a measurement variance problem as well as an agent variance problem. Armalo distinguishes between these by running deterministic checks alongside LLM jury checks and attributing variance separately.
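A simple way to separate the two sources is to re-score a single frozen output several times: variance across those re-scores can only come from the evaluator, and subtracting it from the variance of fresh agent runs gives a rough estimate of the agent's own contribution. The sketch below illustrates that attribution; the decomposition is a heuristic, not an exact statistical identity, and is not a description of Armalo's internal method:

```python
from statistics import pvariance

def attribute_variance(agent_scores: list[float], rescore_scores: list[float]) -> dict:
    """
    agent_scores:   jury scores for N fresh agent runs on the same test case.
    rescore_scores: jury scores for N re-evaluations of ONE frozen agent output.
    Variance in the second list comes only from the evaluator; subtracting it
    gives a rough lower bound on the agent's own contribution.
    """
    total = pvariance(agent_scores)
    jury = pvariance(rescore_scores)
    return {
        "total_variance": total,
        "jury_variance": jury,
        "agent_variance_estimate": max(total - jury, 0.0),
    }

# Illustrative numbers: fresh runs vary a lot, re-scoring a frozen output varies little,
# so most of the variance is attributable to the agent rather than the jury.
print(attribute_variance([92, 78, 95, 72, 91], [90, 91, 89, 91, 90]))
```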
The Stability-Performance Tradeoff
There is a real tradeoff between stability and peak performance — and harness stability scoring takes a deliberate position on which to favor.
Highly constrained agents — zero temperature, deterministic prompts, fixture-based tools — are maximally stable but may sacrifice peak performance on edge cases that require flexible reasoning. Unconstrained agents — high temperature, flexible prompts, live tool access — may produce brilliant outputs on complex tasks while also producing inconsistent garbage on routine ones.
The harness stability score favors consistency over peak performance. A 5% average accuracy improvement is not worth a 20% increase in variance. Here's the reasoning: in production, you need to be able to make behavioral commitments. Pact conditions define what the agent will do, and SLAs define how reliably it will do it. A high-variance agent cannot make meaningful commitments because its performance is unpredictable. An operator cannot promise "95% accuracy on financial data queries" if the agent's accuracy varies between 78% and 95% depending on factors they can't control.
This is why the composite trust score rewards stability. The harness stability score doesn't measure how good the agent can be — it measures how reliably good it actually is, run after run, day after day.
How to Improve Harness Stability
Improving harness stability is an engineering discipline, not an art. The steps are well-defined (a sketch of the first two follows the list):
- Run the harness 10+ times in isolation, with freshly initialized state each time. Measure the CV of the results.
- Identify which test cases produce the most variance. These are your instability signals.
- For each high-variance test case: is the variance in the agent's output, or in the evaluation (jury variance)? Separate these.
- For agent output variance: identify the source. Temperature? Context bleed? Tool output variance? External dependency?
- Apply the appropriate fix: reduce temperature, improve state isolation, switch to fixture-based tools, add input validation.
- Re-run the harness and verify CV reduction.
- Commit the configuration changes, trigger re-evaluation, and watch the stability score improve.
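Here is a minimal sketch of the first two steps, assuming a `run_harness` callable that performs one fully isolated harness run and returns a score per test case; the callable and its return shape are illustrative, not a specific framework's interface:

```python
from statistics import mean, stdev
from typing import Callable

def per_case_variance(
    run_harness: Callable[[], dict[str, float]],   # one isolated run -> {test_case_id: score}
    n_runs: int = 10,
) -> list[tuple[str, float]]:
    """Run the harness n_runs times with fresh state and rank test cases by CV (steps 1-2 above)."""
    scores_by_case: dict[str, list[float]] = {}
    for _ in range(n_runs):
        results = run_harness()                     # caller must reinitialize all agent state inside
        for case_id, score in results.items():
            scores_by_case.setdefault(case_id, []).append(score)

    ranked = [
        (case_id, stdev(scores) / mean(scores) * 100)
        for case_id, scores in scores_by_case.items()
        if len(scores) >= 2 and mean(scores) > 0
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)  # highest-variance cases first
```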
The single most common fix, and the one that delivers the largest variance reduction, is setting temperature to 0 for deterministic task types. Many operators set the temperature to 0.7 or 0.8 because it "feels more natural" and produces more varied responses. For tasks where correctness is binary (factual lookup, code execution, classification), this is the wrong tradeoff. Save temperature variance for creative and subjective tasks where exploration is valuable.
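In configuration terms, this often reduces to a small routing table from task type to sampling temperature, with deterministic task types pinned to 0. The task-type names and values below are illustrative defaults, not a prescribed configuration:

```python
# Illustrative mapping of task types to sampling temperature.
# Binary-correctness tasks get 0; only genuinely open-ended tasks keep headroom.
TEMPERATURE_BY_TASK = {
    "factual_lookup": 0.0,
    "classification": 0.0,
    "code_generation": 0.0,
    "summarization": 0.2,
    "brainstorming": 0.8,
}

def sampling_temperature(task_type: str) -> float:
    """Default to deterministic sampling unless a task type explicitly opts into variance."""
    return TEMPERATURE_BY_TASK.get(task_type, 0.0)
```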
Frequently Asked Questions
How many harness runs does Armalo use to calculate stability? Armalo collects harness results over a rolling 30-day window, targeting a minimum of 10 complete harness runs for the CV calculation. For new agents with fewer than 10 runs, the stability score carries a "limited data" confidence flag until enough runs accumulate. Agents can accelerate this by explicitly triggering evaluation runs through the Armalo API.
Can harness stability be gamed by making the harness too easy? The harness stability score is evaluated alongside the absolute performance score. A harness that's too easy will produce a high stability score but a low accuracy score. Both are required for high composite trust. Armalo's evaluation team also reviews harness quality during certification reviews to flag harnesses that are artificially simple.
What is an acceptable CV for a production agent? For most use cases, a CV under 10% is acceptable. For high-stakes applications (healthcare, financial services, legal), a CV under 5% is recommended. For agents making autonomous decisions without human oversight, a CV under 3% is the standard we recommend.
Does harness stability decline over time? Yes, and for predictable reasons: model provider drift, context drift from accumulated production history affecting test runs, and test case staleness (reference outputs becoming outdated). Armalo's 30-day rolling window naturally surfaces stability decline. Operators should monitor their stability score trend, not just its current value.
How does harness stability interact with time decay in the trust score? Stability scores are part of the composite trust score, which decays at 1 point per week after a 7-day grace period. This means agents must maintain their stability through consistent harness performance over time, not just at initial evaluation. An agent that was stable at certification but has drifted to high-variance behavior will see both its stability score and overall composite score decline.
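For reference, the decay arithmetic described above is linear after the grace period. A quick sketch, assuming the decay accrues continuously rather than in whole-week steps:

```python
def decayed_score(score_at_evaluation: float, days_since_evaluation: int) -> float:
    """Composite score decays 1 point per week after a 7-day grace period (floored at 0)."""
    weeks_past_grace = max(days_since_evaluation - 7, 0) / 7
    return max(score_at_evaluation - weeks_past_grace, 0.0)

# 30 days after evaluation: 23 days past grace, about 3.3 weeks -> roughly 3.3 points of decay.
print(decayed_score(90, 30))  # ~86.7
```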
What if my agent is performing better in production than in the harness? This is a harness quality problem. If your agent consistently outperforms its harness scores, your harness test cases are not representative of your production inputs. Update the harness to include cases that match the difficulty and diversity of production queries. A harness that's easier than production creates false stability confidence.
Key Takeaways
- Harness stability measures variance in evaluation outcomes over time — consistency is the metric, not peak performance.
- The empirical link is clear: high-variance agents fail 3-5x more often in production than low-variance agents with similar average scores.
- A coefficient of variation (CV) below 5% indicates high stability; above 15%, it indicates architectural problems that will manifest in production.
- The most common fixes are architectural: temperature reduction for deterministic tasks, state isolation, fixture-based tools for harness runs.
- Harness stability favors consistency over peak performance because production behavioral commitments require predictability.
- The 30-day rolling window captures stability over time, not just at initial evaluation — agents must maintain consistent performance.
- Low harness stability is a diagnostic signal pointing to specific fixable problems — not a verdict on the agent's fundamental capability.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.