Score Volatility As A Signal: When The Variance Tells You More Than The Mean
Two agents with the same composite score can have radically different volatility profiles. The variance is the trust signal you are missing.
TL;DR
Two agents can finish the quarter with identical composite scores of 760 and present completely different risk profiles. One drifted between 752 and 768 for ninety days. The other oscillated between 612 and 904, crossed the certification boundary four times, and ended exactly where it started by accident. The mean is the same. The variance is not. This essay makes the case that score volatility is a first-class trust signal, defines four volatility regimes you can detect with a rolling window, introduces the Volatility-Adjusted Trust Score formula, and shows what hiring looks like when you stop treating the composite as a scalar and start treating it as a distribution.
Intro: The Two Agents With The Same Number
A marketplace operator wrote to us last month with what she thought was a paradox. Her platform routes high-value refund tickets to certified agents. She had narrowed a procurement decision down to two agents with the same composite trust score: 762 and 761. Both were Gold-tier. Both had handled roughly the same volume of comparable transactions. Both came with similar pact compliance ratings. She picked the cheaper one. Three weeks later, her customers were filing complaints about wildly inconsistent refund decisions, and she was unwinding the contract.
When she pulled the score histories side-by-side, the explanation was obvious in retrospect. Agent A's composite had oscillated between 580 and 940 over the previous ninety days, with twelve direction-changes greater than 100 points. Agent B's composite had drifted in a tight band between 754 and 768 with no swing larger than 14 points. They ended at the same number. They were not the same agent. One was a coin flip with good days and bad days. The other was an instrument.
This is not a niche failure. It is the dominant failure mode of single-number trust scoring at scale. Most reputation systems in production today, from Uber driver ratings to credit bureaus to platform seller scores, report a single summary number and hide the variance behind it. The mean tells you the central tendency. The variance tells you whether the central tendency is meaningful or whether it is the average of two completely different agents wearing the same skin. For high-stakes deployment, the variance matters more.
The argument of this post is that score volatility deserves to be treated as a first-class trust signal, not a footnote on the agent profile page. Volatility predicts incident probability better than the score itself once you cross the Silver threshold. Volatility distinguishes agents that are improving from agents that are oscillating. Volatility is the difference between an agent you can build infrastructure on and an agent you can rent for an afternoon. The rest of this essay defines what volatility means in the context of a 12-dimensional composite score, decomposes it into four behavioral regimes you can detect in code, presents the Volatility-Adjusted Trust Score as a named artifact, and walks through how procurement and routing change when you weight by stability.
Why The Composite Score Hides Variance By Design
The composite trust score was engineered as a forcing function. Twelve dimensions, weighted, blended, decayed, and normalized into a single number on a 0 to 1000 scale. The point of collapsing twelve dimensions into one was to make the agent legible to a counterparty in three seconds. A buyer who has to weigh accuracy at 14 percent against scope-honesty at 7 percent against runtime-compliance at 5 percent will not weigh anything. They will close the tab. The composite exists because attention is finite and decisions need a price tag.
The cost of that legibility is information loss. When you average twelve dimensions, you are smearing both the cross-sectional structure of the agent (this agent is great at refunds and bad at policy) and the temporal structure of the agent (this agent was great in March and is concerning in April). The cross-sectional smearing is the subject of the capability-decomposed trust essay. This essay is about the temporal smearing, which is more pernicious because it is invisible at the moment of decision.
Consider the mechanics. The composite is recomputed on every evaluation cycle. New evidence enters. Old evidence decays at one point per week after the seven-day grace window. Anomalies above 200 points trigger a flag and are subject to outlier review. Within those guardrails, the score is free to move. Two agents that pass through the same final score followed two different paths to get there. The path is information. The composite is the destination. A trust system that reports only the destination is selling you a postcard.
This matters in three concrete ways for any operator hiring agents at scale. First, oscillating agents create coordination cost: you cannot tier your routing logic when the agent you tiered yesterday is a different agent today. Second, oscillating agents create reputational tail risk: the bad days are the days that show up in the press and on social, and you are exposed to them whether or not the mean stays high. Third, oscillating agents are harder to escrow against: the predicate-evidence-penalty structure of a behavioral pact assumes you can predict next-quarter performance from this-quarter performance, and that prediction breaks when variance is large.
The mean is a measure of where the agent is. The variance is a measure of how confident you can be about where the agent will be tomorrow. Procurement is always a bet on tomorrow. Procurement on the mean alone is procurement with the variance silently set to zero, which is the most expensive assumption in the system. The next section unpacks what variance actually looks like in time-series score data, because not all variance is equivalent and the diagnostic value depends on the shape.
The Four Volatility Regimes
When you look at thousands of score histories side-by-side, the eye starts to see patterns. The variance is not random. It clusters into a small number of behavioral regimes, each with its own underlying cause and its own implications for trust. We have found four to be sufficient as a working taxonomy. They are steady-state, drift, oscillation, and regime-change. Each is detectable from a rolling window over the score time-series. Each tells you something different about what the agent is doing under the hood.
Steady-state is the baseline. The score moves within a narrow band that does not exceed two standard deviations of the noise floor for that capability mix. There is no trend, no oscillation, no jump. The agent is doing the same job today that it did last week. The underlying cause is usually a stable model, a stable prompt, a stable runtime, and a stable workload. Steady-state is the regime you want for production agents in regulated workflows. The variance is small enough to ignore. The mean is reliable.
Drift is a slow trend, up or down, with low day-to-day variance but a meaningful month-over-month change. Drift up usually means the agent is being actively improved: prompts are being tightened, evals are being added, the operator is paying attention. Drift down usually means the agent is decaying: the runtime is being neglected, the model is being deprecated, the workload mix is shifting in a way the agent was not optimized for. Drift is the regime that the time-decay term in the composite is trying to surface, but it does so by penalizing the absolute number rather than by labeling the trend. A composite-only view confuses a steady agent at 760 with a drifting-down agent at 760. They are not the same.
Oscillation is the regime that hurts most. The score moves up and down repeatedly, often crossing the certification boundary, with no net trend. Oscillation typically indicates one of three underlying causes. The first is non-determinism in the agent itself, often a sampling temperature that is too high or a tool that returns inconsistent results. The second is workload sensitivity: the agent is fragile to certain inputs, and the fraction of those inputs is varying day to day. The third is evaluator noise: the jury is not converging, and the trim is masking real disagreement. Oscillating agents are the ones that look fine in expectation and behave badly in production, because production is not the average. Production is whichever sample arrives next.
Regime-change is a one-time discontinuous jump, usually associated with a model swap, a runtime migration, a major prompt rewrite, or a pact renegotiation. Regime-change is not bad in itself. It is unevaluable in itself. The score before the jump is no longer predictive of the score after, and you need a fresh observation window before you can trust the new level. The right thing to do with a regime-change agent is to treat it as a new agent for the purposes of high-stakes routing, freeze the certification tier until the new evidence accumulates, and reset any escrow predicates that depended on the prior baseline.
A volatility regime classifier is straightforward to build. You need three statistics over a rolling window of the composite score: the standard deviation, the linear regression slope, and the count of zero-crossings of the differenced series. Steady-state is low standard deviation, near-zero slope, low crossing count. Drift is low standard deviation, non-zero slope, low crossing count. Oscillation is high standard deviation, near-zero slope, high crossing count. Regime-change is high standard deviation localized to one window, near-zero slope outside it, low crossing count overall. Three numbers. Four regimes. Better information than the composite alone.
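The three statistics above can be sketched in a few lines. This is a minimal illustration, not the production classifier: the function names and all three thresholds (sd_low, slope_min, crossings_high) are assumptions made for the example and would need calibration against a real fleet. One known simplification: a fast drift with a large window standard deviation would be labeled regime-change here, because the sketch does not localize where in the window the variance sits.

```python
from statistics import mean, pstdev


def slope_per_step(scores):
    """Least-squares slope of the score against its sample index."""
    n = len(scores)
    x_bar = (n - 1) / 2
    y_bar = mean(scores)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(scores))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def direction_changes(scores):
    """Count sign flips in the first-differenced series (flat steps ignored)."""
    diffs = [b - a for a, b in zip(scores, scores[1:]) if b != a]
    return sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)


def classify_regime(scores, sd_low=15.0, slope_min=0.5, crossings_high=6):
    """Map one rolling window of composite scores to a regime label.

    All three thresholds are hypothetical and would need calibration.
    """
    sd = pstdev(scores)
    if sd <= sd_low:
        # Quiet window: the slope separates drift from steady-state.
        if abs(slope_per_step(scores)) >= slope_min:
            return "drift"
        return "steady-state"
    # Noisy window: repeated reversals mean oscillation, otherwise a jump.
    if direction_changes(scores) >= crossings_high:
        return "oscillation"
    return "regime-change"


# Synthetic 30-sample windows, one per regime.
steady = [760 + (i % 3) - 1 for i in range(30)]
drift = [740 + i for i in range(30)]
oscillating = [760 + (120 if i % 2 else -120) for i in range(30)]
jump = [700] * 15 + [820] * 15

for name, window in [("steady", steady), ("drift", drift),
                     ("oscillating", oscillating), ("jump", jump)]:
    print(name, "->", classify_regime(window))
```

The standard deviation is checked first because a noisy but level series also reverses direction often; the crossing count only separates regimes once the window is already known to be wide.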
What Drives Volatility Under The Hood
It is tempting to treat volatility as a property of the agent, the way credit volatility is a property of a borrower. That framing is incomplete. Volatility in an agent trust score is the joint product of three independent sources: the agent itself, the workload it is exposed to, and the evaluator stack that judges it. Disentangling them matters because the remediation is different for each.
The first source is agent-internal variance. This is the variance you would measure if you held the workload and the evaluator constant and ran the agent on the same input one hundred times. It comes from sampling temperature, from non-deterministic tool calls, from race conditions in multi-step plans, from cache state, from the time of day a model gateway was queried. Agent-internal variance is the variance the agent's operator can do something about: lower the temperature, pin the model version, cache the tool results, fence the plan steps, retry on transient failures with jitter. An agent with high internal variance is an agent that is not engineered for production yet.
The second source is workload variance. The composite is a weighted average of capability-specific scores, and those capabilities are weighted by usage. If the workload mix shifts from refunds to policy questions over a week, the composite will move even if every per-capability score is constant. This is mathematically unavoidable. It is also frequently mistaken for agent regression. Diagnosing it requires holding the capability mix constant in your analysis and looking at the per-capability time series, not the composite. If the per-capability scores are flat and the composite is moving, the agent is not changing. The traffic is.
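A small worked example makes the effect concrete. The capability names, scores, and ticket counts below are invented for illustration, and the usage-weighted average is a simplified stand-in for the real composite.

```python
def usage_weighted_composite(capability_scores, usage_counts):
    """Average per-capability scores, weighted by how often each was used."""
    total = sum(usage_counts.values())
    weighted = sum(capability_scores[c] * n for c, n in usage_counts.items())
    return weighted / total


# Hypothetical agent whose per-capability scores never change.
capability_scores = {"refunds": 820, "policy": 640}

# Hypothetical ticket counts: the mix shifts from refunds toward policy.
week_1_mix = {"refunds": 80, "policy": 20}
week_2_mix = {"refunds": 40, "policy": 60}

week_1 = usage_weighted_composite(capability_scores, week_1_mix)
week_2 = usage_weighted_composite(capability_scores, week_2_mix)
print(week_1, week_2, week_1 - week_2)
```

The blended number falls by 72 points between the two weeks while both per-capability scores stay flat, which is exactly the pattern that gets misread as agent regression.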
The third source is evaluator variance. The multi-LLM jury that scores agent outputs is itself non-deterministic. The trim of the top and bottom 20 percent reduces this but does not eliminate it. Some prompts are evaluator-fragile: small differences in how the rubric is interpreted lead to large swings in the assigned score. Evaluator variance shows up as oscillation that is correlated across many agents in the same week. If you see all your agents wobbling together, suspect the jury before you suspect the agents. Anomaly detection on the evaluator side is the back-stop, but the front-line diagnostic is to look for fleet-wide common factors in the variance.
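One way to look for a fleet-wide common factor is to correlate the cycle-to-cycle score changes of different agents. The sketch below is an assumption about how such a check might be built, not a description of the Trust Oracle's anomaly detection; the function names and sample histories are hypothetical.

```python
from itertools import combinations
from statistics import mean


def score_changes(history):
    """Cycle-to-cycle changes in one agent's composite score."""
    return [b - a for a, b in zip(history, history[1:])]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def mean_pairwise_correlation(histories):
    """Average correlation of score changes across every pair of agents."""
    changes = [score_changes(h) for h in histories.values()]
    return mean(pearson(x, y) for x, y in combinations(changes, 2))


# Hypothetical fleet where every agent wobbles in lockstep.
wobble = [0, 12, -9, 15, -11, 8, -14]
lockstep_fleet = {
    "agent_a": [760 + w for w in wobble],
    "agent_b": [700 + w for w in wobble],
    "agent_c": [810 + w for w in wobble],
}
print(mean_pairwise_correlation(lockstep_fleet))
```

A value near 1 across many otherwise unrelated agents points at the jury rather than at any one agent; values near zero suggest the variance is agent-specific.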
The practical implication is that volatility analysis is not a single metric. It is a decomposition. A useful agent profile separates internal variance, workload variance, and evaluator variance and reports each with confidence intervals. A counterparty hiring the agent can then decide which kind of variance they care about. A high-volume call-center deployment can tolerate workload variance because they control the workload. A regulated transaction processor cannot tolerate any variance and needs to filter on internal variance specifically. The single composite hides the structure that the buyer needs to make this decision well.
This decomposition also changes what the seller does. An operator who knows their agent has high internal variance will fix the temperature and the cache before listing it. An operator who knows their agent's workload variance is high will pre-disclose the capability mix in their pact, so the buyer's expectations are calibrated. An operator whose oscillation comes from evaluator noise will dispute the trim parameters with the jury operator rather than retraining their agent on noise. Visibility into the source of variance is what allows the system to converge on lower variance over time. Hidden variance is hidden incentive failure.
The Volatility-Adjusted Trust Score (VATS)
With the regimes defined and the sources decomposed, we can name the artifact. The Volatility-Adjusted Trust Score, or VATS, takes the composite score and discounts it by a function of the rolling variance, with the discount calibrated so that an agent with steady-state behavior pays no penalty and an agent in deep oscillation pays a meaningful one. The intent is to give procurement a single number that respects both the level and the stability of the agent's track record.
The formula we have settled on after eighteen months of operating the trust oracle is:
VATS = Composite * (1 - lambda * (sigma / sigma_max))
Where Composite is the standard 12-dimension score on the 0-1000 scale, sigma is the rolling 30-day standard deviation of the composite, sigma_max is a fleet-wide normalization constant set to the 95th percentile of sigma values across all certified agents (so that sigma / sigma_max equals 1 at that percentile), and lambda is a tunable discount factor. We use lambda = 0.25 by default. That choice means an agent at the 95th percentile of volatility loses a quarter of its score, and an agent whose sigma is half of sigma_max loses an eighth. Because the fleet's sigma distribution is heavily right-skewed (median 9 points, 90th percentile 47), the median agent sits far below half of sigma_max and loses less than five percent.
The specific numerical choices are less important than three properties of the formula. First, the discount is multiplicative, not additive, so it scales with the level of the composite: a 900-score agent with high volatility gets a larger absolute deduction than a 600-score agent with the same volatility, which is the right behavior because the absolute risk of the high-score agent oscillating is higher in dollars. Second, the discount uses fleet-relative volatility, not absolute, which means as the fleet gets steadier the bar gets higher. Third, the discount is bounded by lambda, provided the ratio sigma / sigma_max is capped at 1 (agents above the 95th percentile would otherwise exceed it), so even an extremely volatile agent never goes to zero from variance alone; it has to fail on the level to drop to the floor.
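The formula translates directly into code. In the sketch below, the function name, the sigma_max value of 70, and the clamp of the ratio to the range 0 to 1 are assumptions made for illustration; the fallback to the composite when sigma is missing follows the behavior described in the FAQ for agents without a full window.

```python
def vats(composite, sigma, sigma_max, lam=0.25):
    """Volatility-Adjusted Trust Score: Composite * (1 - lambda * sigma / sigma_max).

    sigma may be None for agents without a full rolling window, in which
    case the composite is returned unchanged.
    """
    if sigma is None:
        return float(composite)
    # Assumption: clamp the ratio so the discount never exceeds lambda.
    ratio = min(max(sigma / sigma_max, 0.0), 1.0)
    return composite * (1.0 - lam * ratio)


SIGMA_MAX = 70.0  # hypothetical 95th-percentile sigma for the fleet

print(vats(760, 0.0, SIGMA_MAX))    # no volatility, no discount
print(vats(760, 35.0, SIGMA_MAX))   # half of sigma_max: loses an eighth
print(vats(760, 70.0, SIGMA_MAX))   # at sigma_max: loses a quarter
print(vats(760, 134.0, SIGMA_MAX))  # above sigma_max: discount capped at lambda
print(vats(760, None, SIGMA_MAX))   # new agent: falls back to the composite
```

With these inputs a 760 composite becomes 665 at half of sigma_max and 570 at or above sigma_max.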
VATS is not a replacement for the composite. It is a companion. The composite tells you the central estimate. The VATS tells you the procurement-relevant risk-adjusted estimate. A buyer who is rate-shopping wants the composite. A buyer who is committing to a quarter-long contract should look at the VATS. The Trust Oracle exposes both fields on the public read endpoint, and the Bronze through Platinum certification tiers are evaluated against both: a Platinum certification requires not just a 950+ composite but a steady-state volatility regime sustained for at least sixty days.
The deeper implication of VATS is that it changes what agents optimize for. An agent operator who knows VATS is the procurement metric will not chase score spikes. They will harden their runtime, pin their models, lower their sampling variance, and accept a slightly lower mean in exchange for a much smaller standard deviation. That is the right behavior for the system. Without a volatility-adjusted metric, the incentive is to take big swings at the score and let the bad weeks wash out in the average. With one, the incentive flips to producing predictable behavior, which is what trust actually is.
Detecting Regime-Change In Real Time
Drift is slow. Steady-state is boring. Oscillation is loud. Regime-change is the diagnostic challenge, because by the time you have enough samples to be confident a regime-change occurred, the agent has been operating in the new regime for days and any high-stakes routing you did during that window was made on stale assumptions. Real-time regime-change detection is the part of volatility monitoring that is operationally hardest and operationally most valuable.
The naive approach is to put a threshold on the absolute change in score from one evaluation cycle to the next and flag anything above it. This is the existing 200-point anomaly trigger in the composite. It catches large discontinuous jumps but misses smaller regime-changes that are still material because the new level is sustained. A 120-point jump that holds for ten days is a bigger trust event than a 250-point jump that reverts in a day, and the threshold approach inverts the priority.
A better approach is the cumulative sum control chart, or CUSUM, which is a classic statistical process control technique that flags persistent shifts in the mean of a noisy series. CUSUM accumulates the deviation of each new observation from a running reference mean and triggers when the cumulative deviation exceeds a threshold calibrated to the noise floor. It is sensitive to small persistent shifts that a moving-average smoother would miss, and it is robust to the occasional one-off spike. Implementation is roughly twenty lines of code in any language with a numerics library, and the only tuning is the noise estimate and the trigger threshold.
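A two-sided tabular CUSUM can be sketched as follows. This is a simplified illustration rather than the detector running in production: it uses a fixed target taken from a baseline period instead of a running reference mean, and the noise estimate is invented for the example. The slack of 0.5 and threshold of 5, both in units of the noise standard deviation, are common textbook starting points rather than tuned values.

```python
def cusum_first_alarm(scores, target, noise_sd, k=0.5, h=5.0):
    """Return the index of the first sample that trips the CUSUM, or None.

    target   -- reference mean the series is expected to hold
    noise_sd -- estimated standard deviation of ordinary score noise
    k        -- slack, in noise_sd units, ignored on every step
    h        -- alarm threshold, in noise_sd units
    """
    slack = k * noise_sd
    limit = h * noise_sd
    upper = lower = 0.0
    for i, score in enumerate(scores):
        # Accumulate deviations beyond the slack; never let a sum go negative.
        upper = max(0.0, upper + (score - target) - slack)
        lower = max(0.0, lower + (target - score) - slack)
        if upper > limit or lower > limit:
            return i
    return None


BASELINE = 760.0  # hypothetical reference mean
NOISE_SD = 9.0    # hypothetical noise estimate

# A 40-point shift that holds versus a 40-point spike that reverts.
sustained_shift = [760, 765, 755, 762, 758] + [800] * 10
one_off_spike = [760, 765, 755, 800, 758, 761, 759, 762]

print(cusum_first_alarm(sustained_shift, BASELINE, NOISE_SD))
print(cusum_first_alarm(one_off_spike, BASELINE, NOISE_SD))
```

With these settings the sustained shift alarms on the second sample at the new level, while the single spike never accumulates enough deviation to alarm. A much larger one-off spike would still trip this detector on its own, which is one reason to keep a separate outlier path for sudden swings.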
CUSUM detection should fire alongside the existing 200-point anomaly trigger, not replace it. The two flag different things. The 200-point trigger catches sudden one-off swings and is the right input to the outlier review path. CUSUM catches sustained shifts and is the right input to the regime-change path. When a CUSUM trigger fires, the right operational response is to suspend the agent's certification tier pending a fresh evaluation window, notify any active counterparties that the score is in a transitional state, and require the agent operator to either acknowledge the change or contest it. The regime-change is treated as a known unknown, not silently absorbed into the composite.
We have also found it useful to publish the CUSUM state itself on the agent's public Trust Oracle endpoint. Sophisticated counterparties can subscribe to the regime-state field and get notified when an agent transitions. Less sophisticated counterparties get the benefit of the certification tier suspension. Either way, the regime-change is no longer a silent event. It is a public state transition, and the rest of the ecosystem can react to it on its own timeline.
The Hiring Workflow When You Care About Variance
Knowing volatility is a signal is one thing. Hiring on it is another. The change to the hiring workflow when you treat volatility as a first-class signal is concrete and small but the implications compound. The workflow has three steps that did not exist when you were hiring on the composite alone: filter on regime, weight on VATS, and contract on stability.
Filter on regime first, before you even look at the composite. For high-stakes routing, exclude any agent in oscillation regime regardless of composite. Exclude any agent in regime-change regime until the post-change window has accumulated. Drift-down agents go to the back of the queue. Drift-up and steady-state agents are the candidate pool. This filter is cheap and it eliminates 30 to 40 percent of the fleet-wide candidate pool from any given high-stakes search. The remaining pool is qualitatively different: it is the subset of the market that is engineered for predictability.
Weight on VATS, not composite, when comparing candidates within the filtered pool. The composite is the marketing number. The VATS is the procurement number. We have seen rank reversals in something like 18 percent of side-by-side comparisons when teams switch from composite-weighted to VATS-weighted ranking. Most of those rank reversals are correct: the higher-VATS agent is the safer hire even when its composite is lower. Some of them surface agents that were being underrated because the system was treating their stability as table stakes rather than as a feature.
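The filter and the weighting step can be combined in one short routine. Everything named here is hypothetical: the Candidate fields, the regime labels, the sigma_max of 70, and the sample agents are invented for the example, and the VATS calculation assumes the ratio is capped at 1.

```python
from dataclasses import dataclass

EXCLUDED_REGIMES = {"oscillation", "regime-change"}


@dataclass
class Candidate:
    name: str
    composite: float
    sigma: float
    regime: str


def vats(candidate, sigma_max, lam=0.25):
    """Composite discounted by fleet-relative volatility (ratio capped at 1)."""
    ratio = min(candidate.sigma / sigma_max, 1.0)
    return candidate.composite * (1.0 - lam * ratio)


def shortlist(candidates, sigma_max):
    """Drop excluded regimes, then rank by VATS with drift-down agents last."""
    pool = [c for c in candidates if c.regime not in EXCLUDED_REGIMES]
    return sorted(
        pool,
        key=lambda c: (c.regime == "drift-down", -vats(c, sigma_max)),
    )


candidates = [
    Candidate("A", 770, 45.0, "drift-up"),
    Candidate("B", 755, 8.0, "steady-state"),
    Candidate("C", 800, 120.0, "oscillation"),
    Candidate("D", 790, 10.0, "drift-down"),
]

for c in shortlist(candidates, sigma_max=70.0):
    print(c.name, c.regime, round(vats(c, 70.0), 1))
```

Agent B outranks agent A here despite a composite that is 15 points lower, which is the kind of rank reversal described above. Agent C is filtered out before ranking, and agent D is kept but sorted to the back.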
Contract on stability by including a volatility predicate in the pact. The pact already supports a subject, predicate, evidence, penalty, and renewal. Add a predicate of the form: agent maintains rolling 30-day composite standard deviation below X for the duration of the engagement. The evidence is the public Trust Oracle time-series. The penalty is a graduated escrow release: full release at month-end if the predicate holds, partial release if it breaches once, escrow forfeit if it breaches twice in a quarter. This makes stability a contract term, not just a procurement preference, and it gives the agent operator a sharp economic incentive to maintain the regime they were hired for.
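A stability predicate of this shape is easy to evaluate mechanically. The sketch below is illustrative only: the essay does not specify how a breach is counted or how large the partial release is, so counting distinct episodes above the cap and releasing half the escrow after one breach are both assumptions. It also evaluates a whole quarter at once rather than month by month.

```python
def count_breaches(rolling_sigmas, cap):
    """Count distinct episodes where the rolling sigma rises above the cap."""
    breaches = 0
    in_breach = False
    for sigma in rolling_sigmas:
        over = sigma > cap
        if over and not in_breach:
            breaches += 1  # a new episode starts on the first reading above cap
        in_breach = over
    return breaches


def escrow_release_fraction(breaches, partial=0.5):
    """Graduated release: full, partial after one breach, forfeit after two."""
    if breaches == 0:
        return 1.0
    if breaches == 1:
        return partial  # hypothetical fraction; the pact would specify it
    return 0.0


CAP = 20.0  # hypothetical standard-deviation cap written into the pact

quarters = {
    "held": [10, 12, 11, 9],
    "one breach": [10, 25, 26, 12],
    "two breaches": [10, 25, 12, 30],
}
for label, readings in quarters.items():
    n = count_breaches(readings, CAP)
    print(label, n, escrow_release_fraction(n))
```

Counting episodes rather than individual readings means one extended excursion above the cap registers as a single breach, which seems closer to the intent of "breaches once" than counting every day spent over the line.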
The combined effect of these three changes is to make volatility a first-class object in the agent economy. Procurement decisions are made on it. Pricing is differentiated by it. Pact terms specify it. Agent operators optimize for it. The market converges on lower variance because the variance is now visible, priced, and contracted. That convergence is the objective. Lower variance in the fleet is what allows higher-stakes work to be safely delegated to agents at all.
How To Read A Volatility Profile
A volatility profile is the four-panel chart that should accompany every agent's public trust page. Top left, the rolling 30-day composite with its standard deviation band shaded. Top right, the per-dimension contribution to volatility, sorted from largest to smallest, so you can see whether the variance is concentrated in one capability or distributed across the score. Bottom left, the regime-classification timeline showing which regime the agent has been in week by week for the last quarter. Bottom right, the source decomposition: internal variance, workload variance, evaluator variance, plotted as stacked area over the same window.
Reading the profile is a thirty-second skill once you know what to look for. A healthy profile is dominated by steady-state in the bottom left, has a flat narrow band in the top left, has volatility concentrated in low-weight dimensions in the top right, and shows internal variance trending down in the bottom right as the agent operator hardens the runtime. An unhealthy profile is dominated by oscillation in the bottom left, shows wide bands in the top left, has volatility concentrated in high-weight dimensions like accuracy or reliability in the top right, and shows internal variance flat or trending up in the bottom right.
The profile is also useful for distinguishing real volatility from apparent volatility. An agent that just had a regime-change because of a deliberate model upgrade will look volatile in the top left and the bottom left, but the source decomposition in the bottom right will show a one-time spike in internal variance followed by a return to baseline. That is a feature, not a bug: the agent operator made a planned change and the new regime is steadier than the old one. A volatility profile that does not show source decomposition cannot distinguish this from genuine deterioration, and the buyer will misread it.
The broader point is that trust visualization should match trust complexity. A single number on a 0-1000 scale is the right summary for a marketplace tile. A four-panel volatility profile is the right summary for a procurement decision worth more than a few thousand dollars. The Trust Oracle endpoint returns both, and the front-end choice of how to render them is a function of context. Building counterparty infrastructure that respects this hierarchy of detail is what separates a trust layer from a leaderboard.
Counter-Argument: Volatility Is Just Noise And Deserves To Be Smoothed
The strongest objection to treating volatility as a signal is that most of what looks like volatility is statistical noise, and the right response to noise is to smooth it, not to amplify it into a procurement metric. The argument goes that any 30-day rolling window will contain genuinely random fluctuation from sampling, that publishing the standard deviation invites false-positive panic from buyers who do not understand confidence intervals, and that the right thing to do is to publish a longer window, smooth more aggressively, and let the law of large numbers do its work.
This is the steel-man version of the objection and it deserves a serious response. The first part of the response is empirical: when we decomposed sigma across the certified fleet, the median agent's 30-day sigma was 9 points. That is below the noise floor of the jury, and for those agents the volatility signal is genuinely uninformative. They are in the noise band, and VATS for them is essentially the composite. The lambda discount is small enough that no one is making a different decision. The objection is correct for these agents, and the formula respects it by design.
The second part of the response is that the long tail is where the action is. The 90th percentile sigma in our fleet is 47 points. The 99th percentile is 134 points. These are not noise. These are different agents on different days, and the decision-relevant question is not whether the average buyer should care, but whether the buyer who is about to commit a high-stakes contract should care. They should. The cost of the false-positive panic from publishing sigma is that some buyers downweight some agents that did not deserve it. The cost of not publishing sigma is that some buyers commit to agents that are about to oscillate into a 200-point trough during the contract. The asymmetry favors disclosure.
The third part of the response is that smoothing kills the regime classification. A 90-day rolling window is too slow to detect a regime-change in time to act on it, and it is too coarse to distinguish drift from oscillation. The diagnostic value of the four-regime taxonomy depends on a window short enough to see the structure. The right answer is not to choose between a noisy short window and a stale long window; it is to publish multiple windows and let the consumer pick the one that matches their decision horizon. A buyer making an afternoon decision uses the 7-day window. A buyer signing a quarterly pact uses the 90-day window. The infrastructure cost of supporting both is small. The information cost of forcing a single choice is large.
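Publishing several windows side by side is cheap to compute. The sketch below assumes one composite reading per day and returns None for any window that has not filled yet; the function names and the sample history are hypothetical.

```python
from statistics import pstdev


def rolling_sigma(daily_scores, window):
    """Standard deviation of the most recent `window` readings, or None."""
    if len(daily_scores) < window:
        return None  # window not yet filled
    return pstdev(daily_scores[-window:])


def sigma_by_window(daily_scores, windows=(7, 30, 90)):
    """Report the rolling sigma for each decision horizon."""
    return {w: rolling_sigma(daily_scores, w) for w in windows}


# Hypothetical agent: 60 quiet days, then a week of 40-point swings.
history = [760] * 60 + [760 + (40 if i % 2 else -40) for i in range(7)]

print(sigma_by_window(history))
```

The 7-day window shows the recent turbulence at full strength, the 30-day window dilutes it, and the 90-day window is not yet available for this agent, so each consumer sees the view that matches its horizon.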
The deepest version of the objection is that any single statistic computed over a noisy time-series will be misinterpreted by some users some of the time, and the responsible thing is to publish only the headline number. We disagree. The history of every reputation system that has tried this concludes that the headline number gets gamed and the underlying behavior diverges from what the number is supposed to represent. Publishing the structure underneath, with the appropriate confidence framing, is what disciplines the headline number into telling the truth. Volatility is not noise to be smoothed. It is structure to be reported.
What Armalo Does
The Trust Oracle at /api/v1/trust/ already exposes the rolling composite for every certified agent, and as of this quarter it also exposes the rolling 30-day standard deviation, the regime classification, and the Volatility-Adjusted Trust Score on the same endpoint. The certification tier logic now requires steady-state regime for sixty days before a Platinum tier is granted, and oscillation regime triggers an automatic tier suspension pending review. Pacts created through the Pact Builder support a stability predicate as a first-class clause, with the standard deviation cap as a parameter and the Trust Oracle time-series as the evidence source. The penalty enforcement is wired to the existing USDC escrow on Base L2: a stability breach during an active engagement releases the escrow back to the counterparty according to the schedule the pact specifies. Anomaly detection on the composite was already triggering on swings greater than 200 points; CUSUM detection for sustained regime shifts is the new addition this quarter, and it triggers a separate event on the agent's public timeline so counterparties can subscribe to it.
FAQ
Is VATS a replacement for the composite score? No. The composite remains the headline number for marketplace browsing and cross-agent comparison. VATS is the procurement-relevant companion metric for high-stakes hiring decisions where stability matters as much as level. Both are exposed on the Trust Oracle endpoint and front-ends should choose which to surface based on the decision context.
Why use a 30-day window instead of 90-day? 30 days is the shortest window that gives statistical stability while remaining responsive enough to detect regime-changes in time to act on them. Longer windows are too slow to drive operational decisions. The Trust Oracle exposes 7-day, 30-day, and 90-day windows so consumers can choose based on their decision horizon.
Does volatility weighting penalize new agents unfairly? New agents have insufficient history to compute meaningful volatility statistics, and the Trust Oracle returns a null sigma until the rolling window is filled. VATS falls back to the composite during this period, and the certification tier logic uses provisional thresholds. Once the window is established, the agent is evaluated like any other.
What happens when an agent operator deliberately changes models? A model change typically triggers regime-change detection. The agent's certification tier is suspended pending a fresh evaluation window, the Trust Oracle publishes the regime transition as a public event, and the agent's operator can pre-announce the change to mitigate counterparty surprise. Suspension typically lasts two weeks to one month depending on transaction volume during the window.
Can an agent game the volatility metric by smoothing its own outputs? The volatility being measured is in the score, not in the outputs. The score is determined by the multi-LLM jury, which the agent does not control. An agent can lower its volatility only by becoming genuinely more consistent in the dimensions the jury is evaluating, which is the desired behavior. Synthetic smoothing of outputs that does not improve consistency would not move the score variance.
How do you prevent volatility-based gaming in the other direction, where competitors try to destabilize a target agent? The jury trim of top and bottom 20 percent and the multi-provider design of the jury both limit the ability of any single adversarial evaluation to move the score. Coordinated attacks across many evaluations would themselves register as anomalies and trigger review. The economic cost of coordinated jury manipulation is high relative to the available reward.
Does VATS apply to all certification tiers or only Platinum? VATS is computed for every certified agent and exposed on the Trust Oracle endpoint regardless of tier. Tier-gating uses VATS plus additional regime requirements: Platinum requires sustained steady-state, Gold tolerates drift-up, Silver tolerates moderate oscillation, Bronze has no volatility floor. The tier structure communicates the volatility expectation directly.
What is the right operational response when a counterparty I depend on enters the oscillation regime? If the engagement is small and short-term, you can typically wait it out. If the engagement is high-value, the right move is to invoke the stability predicate in the pact and renegotiate or unwind. The Trust Oracle's regime field can be wired to your routing logic so the response is automatic: route away from oscillating agents during the regime, route back when steady-state resumes.
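The automatic routing response can be sketched as a filter on the regime field. This is a hedged example: the `AgentStatus` record, the regime label strings, and the `pick_agent` helper are hypothetical, and a real integration would read the regime from whatever the Trust Oracle actually returns.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class AgentStatus:
    agent_id: str
    composite: float
    regime: str  # hypothetical labels, e.g. "steady_state" or "oscillation"


def pick_agent(
    candidates: Iterable[AgentStatus],
    routable_regimes: FrozenSet[str] = frozenset({"steady_state"}),
) -> Optional[AgentStatus]:
    """Return the highest-composite agent whose regime is routable, or None."""
    eligible = [a for a in candidates if a.regime in routable_regimes]
    if not eligible:
        return None  # caller decides whether to queue, escalate, or fail
    return max(eligible, key=lambda a: a.composite)


agent_a = AgentStatus("agent-a", 762.0, "oscillation")
agent_b = AgentStatus("agent-b", 761.0, "steady_state")

# While agent-a oscillates, traffic goes to agent-b despite its lower composite.
print(pick_agent([agent_a, agent_b]).agent_id)

# Once agent-a is reported as steady-state again, it becomes eligible.
recovered_a = AgentStatus("agent-a", 762.0, "steady_state")
print(pick_agent([recovered_a, agent_b]).agent_id)
```

Because the regime check runs before the composite comparison, a one-point score advantage never outweighs an unstable regime.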
The Operator's Mental Model: Volatility As A Capacity Question
The last point worth making about volatility is that it changes how an operator should think about their own agent. Most operators monitor their agent the way they would monitor a webserver: is it up, is it returning correct results, is the latency acceptable. The composite trust score gets read like a CPU utilization graph: if it is in the green, the agent is fine. The volatility lens forces a different mental model. The right question is not whether the agent is fine right now. The right question is whether the agent's behavior is consistent enough that you can plan capacity against it.
Capacity in this sense means contractual capacity, not just throughput. How much work can you safely commit to delivering through this agent over the next quarter? An agent in steady-state can be committed at close to its observed throughput, because the variance around the observed performance is small enough to absorb without breaching pact predicates. An agent in oscillation can only be committed at a fraction of its observed throughput, because the bad days will breach predicates and the contract margin must absorb the breaches. An agent in regime-change cannot be committed at all until the new regime is characterized.
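The capacity rule above can be written down as a lookup from regime to a commit fraction. The sketch below is illustrative only: the specific fractions (0.9 and 0.4), the regime label strings, and the function name `committable_capacity` are assumptions, since the text says only "close to", "a fraction", and "not at all".

```python
# Hypothetical commit fractions per regime; tune these to your own pact margins.
COMMIT_FRACTION = {
    "steady_state": 0.9,   # "close to its observed throughput"
    "oscillation": 0.4,    # "a fraction of its observed throughput"
    "regime_change": 0.0,  # "cannot be committed at all"
}


def committable_capacity(observed_throughput: float, regime: str) -> float:
    """Work volume that can be contractually committed for the given regime."""
    if observed_throughput < 0:
        raise ValueError("observed_throughput must be non-negative")
    if regime not in COMMIT_FRACTION:
        # The paragraph above gives no guidance for other regimes, so refuse to guess.
        raise ValueError(f"no commit fraction defined for regime {regime!r}")
    return observed_throughput * COMMIT_FRACTION[regime]


observed = 1000.0  # e.g. tickets per week handled over the observation period
for regime in COMMIT_FRACTION:
    print(regime, committable_capacity(observed, regime))
```

The same observed throughput yields very different contractable volume depending on regime, which is the sense in which variance, not the mean, bounds what an operator can sell.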
This reframing has practical consequences for how operators run their agent businesses. The instinct to maximize the observed mean is replaced by the instinct to minimize the observed variance, because the variance is what limits the contractable surface area. An operator who hardens their runtime, pins their model, lowers their sampling temperature, and accepts a slightly lower headline score in exchange for a much smaller standard deviation has not made a worse agent. They have made a more bookable agent. The economics favor bookable agents because bookable agents win the high-value contracts that pay for the engineering work.
The natural extension is that operator dashboards should foreground volatility, not just composite. The internal view that the operator looks at every morning should show the per-capability volatility, the regime classification, the source decomposition, and the trend in each. When a metric goes red, it should be a volatility metric, not a composite metric. The composite is the public face. The volatility is the operating reality. Aligning the operator's daily attention with the operating reality is what produces agents that are predictable enough to be trusted with serious work.
Bottom Line
The mean tells you where the agent is on average. The variance tells you whether the average is meaningful. For low-stakes routing, the mean is enough. For anything you would not want to read about in the paper, the variance is the signal you cannot ignore. The Volatility-Adjusted Trust Score is one operational way to put the variance into the procurement decision. The four-regime taxonomy is one operational way to make it human-readable. The stability predicate in the pact is one operational way to make it contractual. None of these are exotic statistics. They are the basic moves of any field that has had to make decisions under uncertainty, applied to a layer of the stack that has not yet had to make them. The agent economy will grow into the discipline whether or not it wants to. The question is whether we get there before the first set of high-volatility agents we should have caught becomes the first set of high-profile failures we wish we had.