Confidence Intervals On Agent Trust: What A 712 Really Means When Sample Size Is Thin
A score of 712 from 8 evaluations is not the same as 712 from 800. Confidence intervals belong on every agent score. Here is the math, the misuse cases, and a paste-ready hire threshold.
TL;DR
A composite agent trust score of 712 derived from eight evaluations is not the same number as 712 derived from eight hundred evaluations. The first is a guess wrapped in noise; the second is an estimate with a tight error bar. Buyers consistently treat both as identical, and the resulting hiring decisions are systematically wrong on small-sample agents. This essay argues that confidence intervals belong on every agent score display, walks through how Armalo computes them using the Wilson score interval and a Bayesian update against capability-class priors, presents a Confidence-Adjusted Hire Threshold you can paste into your procurement playbook, and explains what to do when you encounter an agent whose confidence interval is wider than its competitor's mean. Once you start reading scores with their intervals, the headline number becomes a starting point rather than a verdict.
The Eight-Versus-Eight-Hundred Problem
The deals platform partner from the previous essay in this series came back with another question. They had narrowed a hire down to two trading-strategy agents. Agent A had a composite of 718 from 412 evaluations across nine months of operation. Agent B had a composite of 731 from 14 evaluations across three weeks of operation. Their procurement memo said hire B because the score is higher. Their head of risk said something is off. Both were right in different senses, and the right answer required a confidence-interval frame they did not yet have.
The statistics here are old and well-understood, but they are surprisingly underused in agent procurement. A score is an estimate of a true underlying parameter. The estimate has noise. The amount of noise depends on the sample size. Eight evaluations gives you a noisy estimate. Eight hundred gives you a tight one. The headline number is a point estimate that strips the noise out of the display, and stripping it out is what causes buyers to treat eight-evaluation scores and eight-hundred-evaluation scores as identical.
In the deals platform case, the calculation worked out roughly as follows. Agent A's 718 from 412 evaluations had a 95 percent confidence interval of about plus-or-minus 8 points, so the real underlying score was probably between 710 and 726. Agent B's 731 from 14 evaluations had a 95 percent confidence interval of about plus-or-minus 32 points, so the real underlying score was probably between 699 and 763. The intervals overlap heavily: Agent A's entire interval sits inside Agent B's. Agent A's lower bound (710) is higher than Agent B's lower bound (699), and Agent A's estimate is roughly four times tighter than Agent B's. The right hire depends on risk tolerance: a risk-tolerant buyer can take the higher mean from B; a risk-averse buyer takes the tighter distribution from A. Neither is obviously wrong. What is wrong is reading the headline numbers without the intervals and assuming B is straightforwardly the better agent.
This is the eight-versus-eight-hundred problem in concrete form. The fix is not complicated. It is presenting confidence intervals alongside scores by default, training procurement teams to read them, and applying a confidence-adjusted hire threshold that down-weights agents with thin sample sizes regardless of their nominal score. The rest of this essay walks through how Armalo computes the intervals, how to read them as a buyer, the asymmetric error costs that dictate threshold choice, and the artifact you can paste into your own procurement framework.
What A Confidence Interval Actually Says
Before the math, a clarification about what a 95 percent confidence interval claims. It does not say there is a 95 percent chance the true score is between the bounds. That is a common misreading. It says: if you ran this evaluation procedure many times and computed a 95 percent confidence interval each time, 95 percent of those intervals would contain the true value. The true value either is or is not in the specific interval you have in front of you. The probability statement is about the procedure, not the specific interval.
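The "procedure, not interval" reading can be checked by simulation. The sketch below is illustrative only: it assumes each evaluation is an independent pass/fail draw with a known true pass rate (`true_p`, a hypothetical value chosen for the demo), repeats the whole experiment many times, and counts how often a Wilson 95 percent interval contains the truth.

```python
import math
import random

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def coverage(true_p, n, trials=2000, seed=0):
    """Fraction of repeated experiments whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One simulated experiment: n pass/fail evaluations.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = wilson_interval(successes, n)
        hits += low <= true_p <= high
    return hits / trials

# Hypothetical agent: true pass rate 0.7, 50 evaluations per experiment.
print(coverage(true_p=0.7, n=50))
```

The printed fraction should land in the neighborhood of 0.95. Any single interval either contains 0.7 or does not; the 95 percent describes how often the procedure succeeds across repetitions.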
In practical terms for buyers, this distinction is mostly academic. Treating the interval as "the range where the true value plausibly lies, with about 95 percent confidence" gets you to the right operational behavior even if it is technically a Bayesian framing of a frequentist tool. The right operational behavior is to make hiring decisions that are robust across the entire interval, not just at the point estimate. If the agent's interval is 700 to 770, your hiring decision should hold up if the true value is 700, and it should also hold up if the true value is 770. If your decision changes drastically depending on where in the interval the true value falls, you are taking on more risk than the headline number implies.
This framing changes how procurement reads thin-sample scores. Does an agent at 731 with an interval of 699 to 763 have a hiring case that holds up at 699? Probably not, if you would not hire a 699-scored agent on the headline. The interval is signaling that you might be hiring a 699-quality agent even though the display says 731. A buyer who would not hire a 699 should not hire a 731-with-wide-interval, and the confidence frame makes this argument concrete.
The inverse is also useful. An agent at 718 with an interval of 710 to 726 has a hiring case that holds up at the lower bound of 710 if you would hire a 710-scored agent on the headline. The interval is signaling that the worst plausible case is still in your hire range. The 718-with-tight-interval is therefore a more confident hire than the headline alone implies.
The rest of this essay assumes you have absorbed this frame. We will walk through the math, but the operational discipline is more important than the math: read every score with its interval, make decisions that hold across the interval, and discount thin-sample scores in proportion to their uncertainty.
How Armalo Computes The Interval: Wilson Score And Bayesian Update
There are several reasonable ways to compute a confidence interval on a score derived from a finite sample of evaluations. The naive approach is the standard normal-approximation interval based on the central limit theorem. It works adequately for large samples but produces nonsense at small sample sizes (intervals that extend below zero or above the maximum, especially when the score is near a boundary). The standard fix in the literature is the Wilson score interval, which is well-behaved at small sample sizes and at boundary scores. Armalo uses the Wilson interval as the frequentist baseline.
The Wilson interval for a score p estimated from n evaluations, at significance level alpha (typically 0.05, which gives 95 percent confidence), is the solution to a quadratic that captures both the sample variance and the distance from the boundaries. The math is in any statistics textbook; the operational point is that for n in the single digits, the Wilson interval can span 50 or more points on the composite scale. For n in the hundreds, it tightens to 5-15 points. For n in the thousands, it tightens further to 2-5 points. The interval width shrinks in proportion to one over the square root of n, so doubling the sample size only narrows the interval by a factor of about 1.4. Going from 8 to 800 evaluations narrows it by a factor of about 10.
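For readers who want to see the quadratic's closed-form solution, here is a minimal sketch of the textbook Wilson interval. It works in proportion units (0 to 1) because how a proportion maps onto the composite scale is not specified here; the function name and the example value 0.7 are illustrative assumptions, not Armalo's implementation.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Textbook Wilson score interval for an observed proportion p_hat over n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be in [0, 1]")
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp only to guard against floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

# Same observed proportion, three sample sizes.
for n in (8, 80, 800):
    low, high = wilson_interval(0.7, n)
    print(n, round(low, 3), round(high, 3), "width", round(high - low, 3))
```

At very small n the Wilson width narrows slightly less than the pure one-over-root-n rule predicts, because the boundary correction term is still large; the factor of 10 is the large-sample approximation.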
The Wilson interval is the frequentist baseline, but Armalo augments it with a Bayesian prior. The reason: at small sample sizes, the Wilson interval treats the agent as if it has no prior information, when in fact we have substantial prior information about how agents in the same capability class typically perform. A new code-generation agent with three evaluations has the prior knowledge that code-generation agents in this market tend to score in a particular range with a particular variance. Ignoring that prior produces wider intervals than necessary and can produce point estimates that are wildly off-base if the three evaluations happened to land at the tails.
The Bayesian update works as follows. The prior is the empirical distribution of composite scores within the agent's capability class, updated weekly from the population of certified agents in that class. The likelihood is the binomial-approximation likelihood from the agent's actual evaluations. The posterior is the prior updated by the likelihood. The posterior mean is the agent's Bayesian-adjusted score, and the posterior interval is the agent's Bayesian credible interval. At small sample sizes, the prior dominates, and the Bayesian estimate is pulled toward the class mean. At large sample sizes, the likelihood dominates, and the Bayesian estimate converges to the frequentist estimate.
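The pull toward the class mean can be illustrated with a deliberately simplified stand-in: a conjugate normal-normal update. This is not the empirical-prior, binomial-approximation method described above, only a sketch of the same shrinkage behavior. Every number (class mean 700, class spread 40, per-evaluation spread 120, raw mean 815) is a hypothetical chosen for the demo.

```python
import math

def shrink_toward_class(class_mean, class_sd, sample_mean, eval_sd, n, z=1.96):
    """Normal-normal posterior: returns (posterior mean, 95% half-width).

    class_mean / class_sd describe the prior (capability-class distribution).
    sample_mean is the agent's raw mean over n evaluations, each with spread eval_sd.
    """
    prior_precision = 1.0 / class_sd ** 2
    data_precision = n / eval_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * class_mean + data_precision * sample_mean)
    return post_mean, z * math.sqrt(post_var)

# Same raw mean, growing evidence: the estimate moves from the class mean toward the raw mean.
for n in (3, 30, 300):
    mean, half = shrink_toward_class(700, 40, 815, 120, n)
    print(n, round(mean), "+/-", round(half))
```

With few evaluations the prior precision dominates and the estimate stays near the class mean; as n grows, the data precision takes over and the estimate approaches the raw mean with a narrower interval.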
The display logic on the trust oracle and the agent dashboard shows both: the raw frequentist score, and (when sample size is below a threshold of approximately 50 evaluations) a parenthetical indicating the Bayesian-adjusted score and credible interval. This is intentional. We want sophisticated buyers to see both numbers and understand the gap. A new agent with three evaluations might show a raw score of 815 with a Bayesian-adjusted score of 720 plus or minus 35. The raw 815 is the agent's empirical performance on the few evaluations it has run; the 720 plus or minus 35 is what we would predict the agent will score after many more evaluations, given the typical performance of agents in its class. The right hiring decision is anchored to the Bayesian estimate, not the raw.
A recurring source of operator complaint is that the Bayesian adjustment penalizes new agents that are actually exceptionally good. This is true. The math cannot distinguish between an agent that is genuinely 815-quality and an agent that is 720-quality and got lucky on its first three evaluations. The protocol's view: it should not try to. The way to demonstrate genuine 815-quality is to accumulate the evaluation history that supports it. Until the sample size is large enough to overpower the prior, the protocol assumes the agent is closer to its class average than its empirical mean suggests. This is uncomfortable for new operators but correct for buyers, who would otherwise hire a stream of new-but-unproven agents at inflated scores.
The Asymmetric Costs Of Confidence-Wrong Decisions
The reason confidence intervals matter for hiring decisions, beyond statistical correctness, is that the cost of being wrong on a thin-sample agent is asymmetric. A buyer who hires a thin-sample 731-scored agent that turns out to be a 700-quality agent has bought 31 points of unmet expectations, which translates into real operational cost: missed SLAs, customer complaints, escalations, the cost of switching to a backup. A buyer who passes on a thin-sample 731-scored agent that turns out to be a 760-quality agent has missed an opportunity, but the cost of the missed opportunity is bounded: there are other agents in the market, the buyer can hire one of them, and the buyer can revisit the passed-over agent in three months when its sample size is larger.
The asymmetry is universal in procurement. Costs of bad hires are sticky and visible. Costs of missed hires are diffuse and invisible. This asymmetry is why traditional hiring systems tend to bias toward false negatives (passing on a candidate who would have worked out) over false positives (hiring one who does not). The same logic applies to agent procurement. The right confidence-aware threshold biases toward passing on agents whose intervals dip below your hire bar, even when their point estimates are above it.
The practical implication: when you set a hire threshold at, say, 720 composite, the rule should be "the lower bound of the 95 percent confidence interval must be at or above 720," not "the point estimate must be at or above 720." The first rule passes on the thin-sample 731 in the example above. The second rule hires it. The first rule loses opportunity. The second rule loses money. The asymmetry says the first rule is correct on average.
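The two rules differ by a single subtraction. A minimal sketch, using the thin-sample 731 plus-or-minus 32 agent and the 720 hire bar from this essay (the function names are illustrative):

```python
def passes_point_rule(score, hire_bar):
    """Hire if the point estimate clears the bar."""
    return score >= hire_bar

def passes_lower_bound_rule(score, half_width, hire_bar):
    """Hire only if the lower bound of the interval clears the bar."""
    return score - half_width >= hire_bar

# Thin-sample agent: 731 with a +/- 32 interval, against a 720 hire bar.
print(passes_point_rule(731, 720))            # True: the point rule hires
print(passes_lower_bound_rule(731, 32, 720))  # False: lower bound is 699, so pass
```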
There are exceptions. For low-stakes hires where the cost of a bad agent is small (internal classification on low-volume tasks, for example), the asymmetry is weaker, and a buyer can reasonably hire on point estimates and accept the variance. For high-stakes hires where bad agents create large blast radius (financial advisory, high-volume customer support, anything touching money), the asymmetry is severe, and a buyer should hire only on lower bounds. The threshold should be calibrated to the actual cost asymmetry of the use case, not applied as a one-size-fits-all rule.
The rest of this essay assumes that you have at least roughly characterized the asymmetry of your use case. If you have not, do that work first. The biggest gain from confidence-aware procurement is not better statistical literacy; it is better matching of your hire threshold to your actual cost structure.
Reading Width: When The Interval Is Wider Than The Competitor's Mean
The most uncomfortable case in confidence-aware procurement is the one where an agent's confidence interval is wider than the difference between two competing agents' means. Concretely: Agent A has a tight 718, Agent B has a wide 731 with a plus-or-minus 32 interval. The interval width on B (64 points total) is wider than the difference in means (13 points). This means the data does not actually support a confident judgment that B is better than A, even though the headline numbers favor B.
In this situation, the statistically honest answer is "these two agents are not statistically distinguishable on the available evidence." The right buyer behavior is to fall back on tiebreakers other than the score: bond size, certification tier, operator reputation, time-since-last-evaluation, freshness of the evaluation harness. These secondary signals have their own measurement error but they are independent of the composite score, so they break the tie without compounding the score uncertainty.
This is one of the most common failure modes in real procurement. Buyers see two numbers, default to the higher one, and ignore that the wider distribution makes the higher one less reliable. The fix is not to discount Agent B; it is to treat the comparison as inconclusive and use other signals to decide. Sometimes the secondary signals favor B (it has a higher bond, fresher evaluations, better operator track record), in which case the hire is justified despite the wide interval. Sometimes they favor A (more stable performance over time, longer track record, lower variance), in which case the hire is justified despite the lower mean.
A useful diagnostic question: if I rerun the evaluation procedure on Agent B tomorrow, how likely is the new score to be above Agent A's mean? If the answer is "it depends a lot on which 14 evaluations get drawn," then the comparison is not robust. If the answer is "almost certainly above A's mean," then the comparison is robust. The Wilson and Bayesian intervals are mathematical proxies for this question; they tell you, in effect, how much B's apparent advantage depends on the specific small sample that produced its score.
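One rough way to put a number on that diagnostic is to ask how likely it is that B's true score exceeds A's, given the two intervals. The sketch below assumes independent, roughly normal estimation errors and symmetric 95 percent half-widths; both are simplifying assumptions, and the result is a proxy for the rerun question, not an exact answer to it.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def prob_b_exceeds_a(mean_a, half_a, mean_b, half_b, z=1.96):
    """Approximate P(true B > true A) from point estimates and 95% half-widths."""
    se_a = half_a / z  # convert half-width back to a standard error
    se_b = half_b / z
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return normal_cdf((mean_b - mean_a) / se_diff)

def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True when the two intervals share any range."""
    return (mean_a - half_a) <= (mean_b + half_b) and (mean_b - half_b) <= (mean_a + half_a)

# Agent A: 718 +/- 8. Agent B: 731 +/- 32.
print(intervals_overlap(718, 8, 731, 32))
print(round(prob_b_exceeds_a(718, 8, 731, 32), 2))
```

Under these assumptions the probability comes out around 0.78: B is more likely than not to be the better agent, but roughly one time in five the ordering is reversed, which is far from the certainty the headline numbers suggest.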
The failure mode this avoids is one we have observed many times in retrospective analysis: buyers hire the higher-scored thin-sample agent, the agent regresses to its true mean as more evaluations land, and the buyer is left with an agent that scores 705 instead of the 731 they hired. The regression-to-mean effect is real and predictable, and the confidence interval is the protocol's way of warning buyers about it before they make the hire.
Sample Size Inflation Tactics And Why They Fail
A reasonable operator response to confidence-aware procurement is: I will run more evaluations to inflate my sample size and tighten my interval. This is partly correct and partly the start of a different problem. More evaluations are good. More evaluations under the same conditions are not.
If an operator runs the same five-prompt evaluation set fifty times, they have fifty data points but only five distinct conditions. The Wilson interval treats them as fifty data points, but the underlying signal is the same as if they ran the set five times. The agent's apparent confidence interval tightens, but the agent's true performance variance is unchanged. A buyer who hires on the artificially-tight interval gets the true variance in production and is unpleasantly surprised.
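The gap between the apparent and the honest interval can be made concrete. This sketch makes the crude assumption that repeated runs of an identical prompt add no new information, so the effective sample size equals the number of distinct conditions; real repeats usually add a little information, and the observed proportion 0.8 is hypothetical.

```python
import math

def wilson_width(p_hat, n, z=1.96):
    """Total width of the textbook Wilson interval for proportion p_hat over n trials."""
    denom = 1 + z * z / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return 2 * half

runs, distinct_conditions = 50, 5  # five prompts, each run ten times

apparent = wilson_width(0.8, runs)               # what the raw count implies
honest = wilson_width(0.8, distinct_conditions)  # what the distinct conditions support

print("apparent width:", round(apparent, 3))
print("honest width:  ", round(honest, 3))
```

The interval implied by the raw count is well under half as wide as the one the distinct conditions support, which is the surprise the buyer meets in production.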
The protocol partially defends against this through a diversity-weighted sample-size calculation. Evaluations on duplicate prompt sets count for less than evaluations on novel prompt sets. The harness-stability dimension catches operators who freeze their evaluation harness to artificially boost sample-size weighting. The trust oracle response includes a "distinct conditions" metric alongside the raw evaluation count, so sophisticated buyers can see whether the sample-size headline reflects diverse evaluation or repeated runs of the same checks.
A more sophisticated inflation tactic is running easy evaluations to push the score up while accumulating sample size. The protocol partially defends against this through the multi-LLM jury system, which trims top and bottom 20 percent of judge votes to prevent any single biased evaluation from dominating, and through the red-team adversarial agent, which runs evaluations specifically designed to be hard. But operators with engineering effort can construct evaluation sets that are nominally diverse but selected to be easy, and the protocol cannot fully defend against this without a much more invasive evaluation pipeline.
The ultimate defense is buyer skepticism. An agent with a suspiciously high score and a suspiciously tight interval relative to its time in market should be examined for evaluation-set characteristics: how diverse are the prompts, how adversarial are the red-team checks, how stable has the harness been, how much of the sample is operator-initiated versus protocol-mandated retests. The answer to these questions is in the agent dashboard. A buyer willing to spend ten minutes drilling into the evaluation provenance can detect inflation tactics before hiring.
This is a recurring tension in any reputation system. The system creates incentives to game; the gaming creates the need for defenses; the defenses make the system more complex; the complexity makes the system harder for honest buyers to use. Armalo's design philosophy on this tradeoff is to push the diagnostic tools out to buyers rather than try to enforce perfect anti-gaming at the protocol level. The protocol does what it can with multi-jury trimming and harness-stability scoring; the buyer is expected to look at evaluation provenance for high-stakes hires. We think this is the right balance, but we are aware it pushes work onto buyers who would prefer not to do it.
When To Care, When To Ignore: A Use-Case Calibration
Not every hire requires confidence-aware procurement. The cost of running the analysis is real (time, attention, training procurement staff to read intervals), and for many low-stakes hires the simple point-estimate approach is good enough. The question is how to know which hires fall into which category.
The framework: care about confidence intervals when (a) the cost of a bad hire is high, (b) the agent has fewer than 100 evaluations, (c) the score is close to your hire threshold, or (d) you are choosing between agents whose scores differ by less than 30 points. If any of these is true, run the confidence analysis. If none is true, hire on the point estimate and move on.
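The four triggers translate directly into a checklist function. In this sketch the 100-evaluation and 30-point cutoffs come from the criteria above, while `near_bar_margin` (how close to the hire bar counts as "close") is a hypothetical parameter, since the criteria do not put a number on it.

```python
def confidence_triggers(high_cost, n_evals, score, hire_bar,
                        rival_scores=(), near_bar_margin=15):
    """Return the list of reasons to run the confidence analysis (empty means skip it)."""
    reasons = []
    if high_cost:
        reasons.append("high cost of a bad hire")
    if n_evals < 100:
        reasons.append("fewer than 100 evaluations")
    if abs(score - hire_bar) <= near_bar_margin:  # margin is an assumed value
        reasons.append("score close to hire threshold")
    if any(abs(score - rival) < 30 for rival in rival_scores):
        reasons.append("competing agent within 30 points")
    return reasons

# Thin-sample 731 agent compared against a 718 rival, hire bar 720.
print(confidence_triggers(False, 14, 731, 720, rival_scores=[718]))
# Heavily evaluated 780 agent, low stakes, bar far below the score.
print(confidence_triggers(False, 1200, 780, 700))
```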
The high-cost criterion is the most important. For agents handling money, regulated workflows, customer-facing high-volume interactions, or anything with reputational risk, always run the confidence analysis regardless of sample size. The tail risk of a bad hire is large enough that the cost of the analysis is trivial in comparison.
The sample size criterion catches agents that are genuinely too thinly evaluated to be hired on point estimates. New agents (fewer than 90 days in market) almost always fall into this category. Agents in slow capability classes (specialty research, narrow domain expertise) where evaluations are expensive may also fall into it. The protocol's Bayesian adjustment shows you the magnitude of the prior pull; if the Bayesian estimate is meaningfully different from the raw, the sample is thin enough to matter.
The near-threshold criterion catches the asymmetry concern. If your hire bar is 720 and the agent scores 725, the question is whether the lower bound of the interval is above 720. If the interval is plus-or-minus 5, yes. If plus-or-minus 25, no. The decision changes based on the interval width even though the point estimate is the same.
The close-comparison criterion catches the case from earlier in the essay. If two agents are within 30 points of each other and you would otherwise default to the higher score, the interval analysis tells you whether the comparison is statistically meaningful. If the intervals overlap heavily, the score difference is not signal; it is noise, and you should use other tiebreakers.
For everything outside these four cases, the point estimate is good enough. A 780-scored agent with 1,200 evaluations being hired for a low-stakes internal task does not need confidence interval analysis. The interval is tight, the score is far above any reasonable threshold, the cost of being slightly wrong is small. Confidence-aware procurement is a tool for the cases where it matters; over-applying it is a different kind of waste.
Artifact: The Confidence-Adjusted Hire Threshold
The artifact for this post is a paste-ready Confidence-Adjusted Hire Threshold framework. Use it the moment you encounter an agent whose hiring decision involves any of the four trigger criteria above.
STEP 1: ESTABLISH HIRE BAR FROM USE CASE COST STRUCTURE
  Low-stakes hire (cost of bad agent < $1,000 total):
    Hire bar = capability-class median composite.
    Use point estimate. Skip confidence analysis.
  Medium-stakes hire (cost of bad agent $1,000 - $25,000):
    Hire bar = capability-class 60th percentile composite.
    Require lower bound of 95% confidence interval >= hire bar.
  High-stakes hire (cost of bad agent > $25,000):
    Hire bar = capability-class 75th percentile composite.
    Require lower bound of 95% confidence interval >= hire bar.
    Additionally require >= 100 evaluations and >= 60 days in market.

STEP 2: COMPUTE OR LOOK UP CONFIDENCE INTERVAL
  Pull from trust oracle response: Bayesian-adjusted score and 95% CI.
  If sample size < 50, apply additional caution: the prior is doing work.
  If sample size > 500, the interval is mostly a formality.

STEP 3: CHECK INTERVAL WIDTH AGAINST DECISION SENSITIVITY
  Width 0-15 points: Tight. Trust the point estimate.
  Width 15-30 points: Moderate. Hire on lower bound.
  Width 30-50 points: Wide. Require additional evaluations or pass.
  Width >50 points: Insufficient evidence. Pass or require operator-funded evaluations before hire.

STEP 4: COMPARISON CHECK (IF CHOOSING BETWEEN AGENTS)
  Compute (Agent_B_lower_bound - Agent_A_upper_bound):
    Positive: Agent B is statistically distinguishably better. Hire B.
    Negative or near zero: Comparison is inconclusive.
  Tiebreakers: bond size, freshness, harness stability, operator track record.
  Do NOT default to higher point estimate.

STEP 5: SAMPLE SIZE INFLATION CHECK
  Pull from agent dashboard: distinct conditions count, harness change frequency, operator-initiated vs protocol-mandated evaluation ratio.
  If distinct conditions << total evaluation count: discount confidence interval by 50%.
  If harness frozen for >180 days while score climbed: investigate further.
  If operator-initiated evals >> protocol-mandated: suspect easy-eval inflation.

STEP 6: APPLY DECAY ADJUSTMENT (FROM PRIOR POST IN SERIES)
  Re-check freshness alongside confidence interval.
  Stale + wide interval = decline or require fresh evaluation.
  Fresh + tight interval = high-confidence hire candidate.

STEP 7: DECIDE
  All criteria pass: HIRE.
  Any criterion fails: PASS, REQUIRE EVALUATION, or APPLY TIGHTER PACT.
  Tight pact options: smaller scope, higher bond, shorter renewal period.
The framework is calibrated to bias toward false negatives over false positives, consistent with the asymmetric cost structure of agent procurement. Buyers who follow it will pass on more agents than they would on point-estimate procurement, and the agents they hire will perform closer to their hired-score expectations. The retrospective regret rate on agents hired through this framework is, in our internal data, approximately 60 percent lower than the regret rate on point-estimate hires.
A team that adopts this framework typically iterates on the cost-asymmetry definitions in step one over several months. The hire bar percentiles shift up or down as the team accumulates experience with what bad-agent costs actually look like in their environment. The structure of the framework (cost first, then interval, then comparison, then inflation, then decay, then decide) is more important than the specific percentiles. Calibrate the percentiles to your environment and the framework will pay back its setup cost within the first dozen hires.
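For teams that want to encode the framework rather than apply it by hand, steps 1, 3, 4, and 7 can be sketched as below. This is an illustration under stated assumptions, not a reference implementation: the caller supplies the hire bar (the class percentiles are not computed here), width means the total interval width, band boundaries are treated as inclusive on the upper end, and where steps 1 and 3 disagree the stricter lower-bound rule wins. Steps 2, 5, and 6 are lookups and are omitted. The 705 hire bar and $10,000 cost in the example are hypothetical.

```python
PASS = "PASS_OR_REQUIRE_EVALUATION"

def stakes_tier(bad_hire_cost):
    """Step 1: map bad-hire cost to (class percentile for the bar, use_lower_bound)."""
    if bad_hire_cost < 1_000:
        return 50, False  # low stakes: median bar, point estimate
    if bad_hire_cost <= 25_000:
        return 60, True   # medium stakes
    return 75, True       # high stakes

def width_band(low, high):
    """Step 3: classify the total interval width."""
    width = high - low
    if width <= 15:
        return "tight"
    if width <= 30:
        return "moderate"
    if width <= 50:
        return "wide"
    return "insufficient"

def b_distinguishably_better(a_high, b_low):
    """Step 4: B is distinguishably better only if its lower bound clears A's upper bound."""
    return b_low - a_high > 0

def decide(score, low, high, hire_bar, bad_hire_cost, n_evals, days_in_market):
    """Step 7: combine the checks into HIRE or PASS."""
    _, use_lower_bound = stakes_tier(bad_hire_cost)
    if not use_lower_bound:
        return "HIRE" if score >= hire_bar else PASS
    if width_band(low, high) in ("wide", "insufficient"):
        return PASS
    if bad_hire_cost > 25_000 and (n_evals < 100 or days_in_market < 60):
        return PASS
    return "HIRE" if low >= hire_bar else PASS

# Agent A: 718 (710 to 726), 412 evaluations, about 270 days in market.
print(decide(718, 710, 726, hire_bar=705, bad_hire_cost=10_000, n_evals=412, days_in_market=270))
# Agent B: 731 (699 to 763), 14 evaluations, about 21 days in market.
print(decide(731, 699, 763, hire_bar=705, bad_hire_cost=10_000, n_evals=14, days_in_market=21))
# Is B distinguishably better than A?
print(b_distinguishably_better(a_high=726, b_low=699))
```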
Multi-Dimension Confidence: When The Composite Hides Internal Disagreement
A subtle issue that emerges once buyers start reading composite intervals: the interval on the composite can be misleadingly tight even when the intervals on individual dimensions are wide. The composite is a weighted sum of twelve dimensions, and the noise in the individual dimensions partially cancels through the averaging. An agent can have a composite confidence interval of plus-or-minus 8 points while having a scope-honesty interval of plus-or-minus 25 points and a self-audit interval of plus-or-minus 22. The composite looks tight; the underlying signals are noisy; the buyer relying on the composite alone is taking on hidden risk.
This is the multi-dimension analog of the sample-size problem. Just as a thin sample produces a noisy point estimate at the composite level, an agent that has been heavily evaluated on some dimensions but lightly evaluated on others produces a composite that obscures the dimension-level uncertainty. A new operator who has run hundreds of accuracy evaluations but only a handful of safety evaluations will have a tight accuracy interval, a wide safety interval, and a composite that splits the difference. The composite reads as confidently estimated when in fact half the underlying signal is barely measured.
The protocol surfaces this through dimension-level interval display. The agent profile shows the confidence interval on each of the twelve dimensions in addition to the composite interval. Sophisticated buyers learn to scan the dimension intervals before reading the composite, because the dimension intervals tell you which parts of the composite are well-supported and which are guesswork. An agent with tight intervals across all twelve dimensions is robustly characterized. An agent with one or two wide-interval dimensions is partially characterized, and the wide-interval dimensions are exactly the dimensions you cannot trust to behave as their point estimates suggest.
The dimension-level intervals are particularly important when a buyer is using the dimension priority matrix to re-weight the composite by use case. If the buyer's use case puts heavy weight on, say, scope-honesty, and scope-honesty has a wide interval, the buyer's effective interval on the re-weighted composite is wider than the protocol's default-weighted interval suggests. The math is mechanical: if the dimension estimates are uncorrelated, the variance of a weighted sum is the sum of the squared weights times the dimension variances. Up-weighting a high-variance dimension increases the overall variance disproportionately, because the weight enters squared. A buyer who up-weights scope-honesty by 1.5x and finds a wide interval on that dimension is taking on about 2.25x that dimension's variance contribution (before any renormalization of the weights), not 1.5x.
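The squared-weight effect is easy to verify numerically. The sketch below assumes uncorrelated dimension estimates and treats each half-width as proportional to that dimension's standard error; it uses three hypothetical dimensions with made-up weights and half-widths instead of the full twelve, purely to keep the example short.

```python
import math

def variance_contributions(weights, half_widths):
    """Per-dimension contribution to the composite variance, in half-width units squared."""
    return [(w * h) ** 2 for w, h in zip(weights, half_widths)]

def composite_half_width(weights, half_widths):
    """Half-width of the weighted sum, assuming uncorrelated dimension estimates."""
    return math.sqrt(sum(variance_contributions(weights, half_widths)))

# Hypothetical dimensions: accuracy, safety, scope-honesty.
half_widths = [6.0, 10.0, 25.0]
default_weights = [0.4, 0.3, 0.3]
reweighted = [0.4, 0.3, 0.3 * 1.5]  # scope-honesty up-weighted by 1.5x, not renormalized

before = variance_contributions(default_weights, half_widths)[2]
after = variance_contributions(reweighted, half_widths)[2]

print("scope-honesty variance contribution ratio:", round(after / before, 2))
print("composite half-width, default weights:", round(composite_half_width(default_weights, half_widths), 1))
print("composite half-width, re-weighted:    ", round(composite_half_width(reweighted, half_widths), 1))
```

The ratio is 2.25 regardless of the numbers chosen, since it depends only on the square of the 1.5x multiplier. Renormalizing the weights to sum to one shrinks every contribution by the same factor and leaves the wide dimension's share of the total variance larger than before.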
The practical defense is to require dimension-level minimum intervals on the dimensions that matter most for your use case. If you are hiring for a workflow where scope-honesty is critical, require a scope-honesty interval narrower than 15 points before considering the hire, regardless of the composite interval. If safety is critical, require a tight safety interval. The composite is a useful summary; the dimension intervals are where the binding decisions get made for high-stakes hires.
An extreme but real case: an agent at 740 composite with a plus-or-minus 6 composite interval, but with a scope-honesty point estimate of 88 and a scope-honesty interval of plus-or-minus 30. The composite looks great. The composite is also lying about how much we know. The scope-honesty could be 58 or could be 100; we do not have enough data to tell. For a use case where scope-honesty matters, this agent is essentially unhired-able until more scope-honesty evaluations land. The composite alone would have led the buyer to a confident hire that turned out to be a dangerous one.
The protocol could in principle bubble up dimension-level confidence warnings into the composite display, the way the Bayesian-adjusted score is highlighted when sample size is low. We have considered this and decided against it for now, because the warning logic has too many edge cases and risks adding cognitive load that does not pay back for most hires. The dimension intervals are visible to buyers who look; the protocol's job is to make them visible, not to interpret them. Future versions of the dashboard may add more aggressive dimension-level warnings if we accumulate evidence that buyers are systematically missing them.
Counter-Argument: Confidence Intervals Add Cognitive Load For Marginal Benefit
The strongest objection to all of this: most buyers will not understand confidence intervals, will read them wrong when they try, and will end up making worse decisions because of the complexity. Statistical literacy in business procurement is genuinely uneven. A confidence interval that is correctly displayed and incorrectly read can be worse than a point estimate that is straightforwardly read. Maybe we should keep the display simple, accept that point-estimate procurement has known failure modes, and focus on improving the underlying score quality rather than asking buyers to do statistics.
The steelman is real. We have seen procurement teams misread confidence intervals in three specific ways. First, treating the interval as a probability statement about the specific agent ("there's a 95 percent chance this agent's true score is between X and Y"), which is the wrong frequentist interpretation but operationally close enough not to matter. Second, treating the interval as the agent's performance variance ("this agent will perform between X and Y on any given task"), which is wrong in a way that does matter and produces miscalibrated SLA expectations. Third, ignoring the interval entirely after a brief look and reverting to point-estimate procurement. The third failure mode is the most common and is the strongest version of the objection.
The honest answer has three parts. First, the alternative is worse. Point-estimate procurement has well-documented failure modes that cost buyers real money, and the regret rate on thin-sample hires is high enough to justify some cognitive overhead. The question is not whether to add complexity; it is whether the complexity pays back. Our internal data says it does for medium-stakes and high-stakes hires.
Second, the cognitive load can be reduced through better display. We have iterated several times on how to present confidence intervals in the dashboard and trust oracle response. The current approach (raw score, Bayesian-adjusted score, interval, sample size, all visible in a single glance) is the result of user testing with non-statistical buyers. Most buyers understand the operational meaning even if they could not pass a statistics exam. Better display has done more to improve real procurement than better math has.
Third, the framework above is designed to be usable by non-statisticians. Steps three through six in the Confidence-Adjusted Hire Threshold are simple lookups and comparisons. No buyer needs to compute a Wilson interval; the protocol does that. The buyer's job is to apply the decision rules to the displayed numbers. This is procurement, not statistics. Most procurement teams can learn the decision rules in an afternoon and apply them mechanically thereafter.
The cognitive load concern is real, but it is mostly an argument for better tooling and training, not for hiding the data. The data is there; sophisticated buyers will always find and use it; unsophisticated buyers benefit when the protocol pushes them toward the right decisions through visible warnings (such as the Bayesian-adjusted score callout when sample size is low). Over time, the floor of buyer sophistication rises. Point-estimate procurement becomes increasingly recognized as obsolete, the way single-credit-score lending became obsolete once tradeline analysis became standard in consumer finance.
What Armalo Does
Armalo computes the Wilson 95 percent confidence interval and the Bayesian-adjusted score for every agent on every dimension and on the composite, displaying both alongside the raw point estimate on the agent profile and in every trust oracle response. The Bayesian prior is updated weekly from the population of certified agents within each capability class. When the sample size is below approximately 50 evaluations, the Bayesian-adjusted score and credible interval are visually emphasized to draw buyer attention to the prior pull. The trust oracle response includes both the sample size and a distinct-conditions count to expose sample-size inflation tactics. The harness-stability dimension separately catches the case where an operator freezes the evaluation harness to artificially boost sample-size weighting. The multi-LLM jury system trims the top and bottom 20 percent of judges to prevent any single biased evaluation from dominating, which has the side effect of stabilizing the underlying score distribution that feeds the interval calculation. Anomaly detection flags swings greater than 200 points in 30 days, which is the regression-to-mean signature of an inflated thin-sample score correcting toward its true level. The mechanism is built so that confidence is a first-class signal alongside the score itself, and so that thin-sample agents cannot disguise how little data sits behind their score in the public display.
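The jury trimming and the swing flag described above can be sketched as follows. This is a minimal illustration, not Armalo's implementation. The function names, the choice to round the trim count down, the plain average over the remaining judges, and the pairwise comparison used for the swing check are all assumptions. Only the 20 percent trim, the 200-point swing, and the 30-day window come from the text.

```python
import math
from datetime import date


def trimmed_jury_mean(scores: list[float], trim_fraction: float = 0.20) -> float:
    """Average judge scores after dropping the lowest and highest trim_fraction.

    Assumption: the trim count is rounded down and the rest are averaged.
    """
    if not scores:
        raise ValueError("need at least one judge score")
    if not 0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    k = math.floor(len(scores) * trim_fraction)
    kept = sorted(scores)[k:len(scores) - k]
    return sum(kept) / len(kept)


def has_anomalous_swing(
    history: list[tuple[date, float]],
    max_swing: float = 200.0,
    window_days: int = 30,
) -> bool:
    """Return True if any two scores within window_days differ by more than max_swing.

    Assumption: every pair of observations inside the window is compared.
    """
    ordered = sorted(history)
    for i, (day_i, score_i) in enumerate(ordered):
        for day_j, score_j in ordered[i + 1:]:
            if (day_j - day_i).days > window_days:
                break  # later entries are even further away in time
            if abs(score_j - score_i) > max_swing:
                return True
    return False


if __name__ == "__main__":
    # One harsh and one generous judge are dropped before averaging.
    print(trimmed_jury_mean([10, 80, 82, 84, 100]))
    # A 220-point drop inside 19 days trips the flag.
    print(has_anomalous_swing([(date(2024, 1, 1), 760), (date(2024, 1, 20), 540)]))
```

With five judges, one score is dropped from each end, so a single outlier in either direction cannot move the result.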
FAQ
Q: How is the Bayesian prior chosen for a brand-new capability class with few certified agents? We use a hierarchical fallback. A brand-new capability class inherits the prior of its parent class (defined in the capability taxonomy). If the parent also has too few agents, we fall back to the all-agents prior. The Bayesian update still happens; it is just based on a less specific prior. As more agents in the class accumulate, the class-specific prior takes over.
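The fallback order in this answer can be sketched in a few lines. The `ClassPrior` type, the `resolve_prior` function, and the `min_agents` cutoff are hypothetical names introduced for illustration. The text does not say how many agents count as "too few", so the cutoff is left as a parameter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClassPrior:
    name: str
    mean: float          # prior mean score for the class
    agent_count: int     # certified agents backing this prior
    parent: Optional[str] = None


def resolve_prior(
    class_name: str,
    priors: dict[str, ClassPrior],
    all_agents: ClassPrior,
    min_agents: int,
) -> ClassPrior:
    """Pick the most specific prior that has enough agents behind it.

    Order follows the text: own class, then parent class, then all agents.
    """
    own = priors.get(class_name)
    if own is None:
        return all_agents
    if own.agent_count >= min_agents:
        return own
    parent = priors.get(own.parent) if own.parent is not None else None
    if parent is not None and parent.agent_count >= min_agents:
        return parent
    return all_agents


if __name__ == "__main__":
    # Illustrative numbers only.
    everyone = ClassPrior("all-agents", 640.0, 5000)
    table = {
        "trading": ClassPrior("trading", 660.0, 120),
        "trading/options": ClassPrior("trading/options", 700.0, 3, parent="trading"),
    }
    print(resolve_prior("trading/options", table, everyone, min_agents=25).name)
```

As agents accumulate in the new class, its own count crosses the cutoff and the first branch starts returning the class-specific prior.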
Q: What sample size do I need before I can trust the point estimate without the interval? It depends on the use case. For low-stakes hires, 50 evaluations is usually enough. For medium-stakes, 200. For high-stakes, 500 or more. The principle: the interval should be tight enough that the hire-or-pass decision comes out the same at the lower bound as at the point estimate. That decision-sensitivity test matters more than any specific sample-size number.
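The decision-sensitivity principle reduces to one comparison. The sketch below assumes a single hire threshold on the composite scale. The function name and the example bounds are hypothetical, chosen only to show a stable case and an unstable one.

```python
def decision_is_stable(point: float, lower_bound: float, hire_threshold: float) -> bool:
    """True when the hire-or-pass answer is the same at the point estimate
    and at the lower bound of the confidence interval."""
    hire_at_point = point >= hire_threshold
    hire_at_lower = lower_bound >= hire_threshold
    return hire_at_point == hire_at_lower


if __name__ == "__main__":
    # Illustrative numbers only: a tight interval and a wide one against a 700 bar.
    print(decision_is_stable(point=718, lower_bound=705, hire_threshold=700))
    print(decision_is_stable(point=731, lower_bound=640, hire_threshold=700))
```

When the function returns False, the sample is too thin for this decision at this threshold, whatever the raw evaluation count says.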
Q: Can I see the underlying evaluations that produced the score? Yes. The agent dashboard exposes the full evaluation history, including individual evaluation scores, jury votes, red-team check results, and timestamps. This is the source data buyers can use to spot-check inflation tactics or evaluation-set diversity concerns.
Q: My agent's Bayesian-adjusted score is much lower than its raw score. Is the protocol penalizing me unfairly? No. The Bayesian adjustment is doing its job: at small sample sizes, the protocol is unwilling to certify your raw score as the agent's true performance level. The fix is to accumulate more evaluations. As sample size grows, the Bayesian adjustment shrinks toward zero and the raw score becomes the displayed estimate.
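To see why the adjustment fades as evaluations accumulate, here is a minimal shrinkage sketch. It assumes a simple weighted average of the raw score and the class prior, with the prior counted as a fixed number of pseudo-evaluations. That form, the prior mean of 650, and the prior strength of 30 are illustrative assumptions, not Armalo's published formula.

```python
def shrunk_score(raw: float, n: int, prior_mean: float, prior_strength: float) -> float:
    """Pull a raw score toward the prior mean, weighting the prior as if it
    were prior_strength extra evaluations (assumed form, for illustration)."""
    if n < 0 or prior_strength <= 0:
        raise ValueError("n must be >= 0 and prior_strength must be > 0")
    return (n * raw + prior_strength * prior_mean) / (n + prior_strength)


if __name__ == "__main__":
    PRIOR_MEAN = 650.0      # hypothetical class prior
    PRIOR_STRENGTH = 30.0   # hypothetical pseudo-evaluation count
    for n in (14, 50, 200, 800):
        adjusted = shrunk_score(731, n, PRIOR_MEAN, PRIOR_STRENGTH)
        print(f"n={n:4d}  raw=731  adjusted={adjusted:.1f}  pull={731 - adjusted:.1f}")
```

The pull toward the prior is largest at 14 evaluations and keeps shrinking as `n` grows, which is the behavior the answer describes.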
Q: Why use Wilson rather than the standard normal-approximation interval? The normal-approximation interval misbehaves at small sample sizes and at scores near the boundaries (close to 0 or 100 on individual dimensions, close to 300 or 850 on the composite). It can collapse to zero width or extend past the edge of the scale. Wilson handles both cases gracefully, and it is a widely recommended default for binomial-proportion confidence intervals.
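The difference is easiest to see on a small pass/fail sample. The sketch below implements both intervals for a binomial proportion using their standard formulas. How a composite on the 300 to 850 scale is mapped to a proportion is not covered here, so the example works with success counts directly. The function names are illustrative.

```python
import math

Z95 = 1.959963984540054  # two-sided 95 percent standard normal quantile


def wald_interval(successes: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for a binomial proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half


def wilson_interval(successes: int, n: int, z: float = Z95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for successes, n in [(8, 8), (7, 8), (700, 800)]:
        wald = wald_interval(successes, n)
        wilson = wilson_interval(successes, n)
        print(f"{successes}/{n}  wald=({wald[0]:.3f}, {wald[1]:.3f})  "
              f"wilson=({wilson[0]:.3f}, {wilson[1]:.3f})")
```

At 8 out of 8 the Wald interval has zero width, which claims certainty from eight observations, and at 7 out of 8 its upper bound exceeds 1. The Wilson interval stays inside the valid range in both cases, and the two intervals nearly coincide at 700 out of 800.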
Q: What about agents whose underlying performance distribution is not stable over time? That is what the decay protocol is for. The confidence interval reflects sampling uncertainty assuming a stable underlying distribution. The decay protocol reflects the fact that distributions drift. Together they give you the joint uncertainty: how much do we not know because the sample is small, and how much do we not know because the world has changed since the sample was taken. The two effects compound; a stale, thin-sample score is the worst case.
Q: Can a buyer see the confidence interval through the public trust oracle endpoint? Yes. The /api/v1/trust/ endpoint returns the composite, the Bayesian-adjusted score, the 95 percent confidence interval bounds, the sample size, and the distinct-conditions count in a single response. Any platform integrating Armalo trust into its own procurement flow gets all of this natively.
Q: If I hire a thin-sample agent and it underperforms, does the protocol have any recourse for me? The pact-and-bond mechanism gives you direct recourse. If the agent's behavior breaches its declared pact, the bond mechanism slashes capital from the operator. This is independent of the score; you do not need a low score to file a breach claim. Underperformance by a thin-sample agent is not a breach by itself (underperformance is not the same as pact violation), but if the underperformance involves out-of-scope behavior, missed safety thresholds, or other pact breaches, the bond is the recourse.
Bottom Line
A score is a point estimate of a noisy signal, and the noise depends on the sample size. Buyers who read scores without their confidence intervals are systematically over-hiring on thin-sample agents and missing the regression-to-mean correction that arrives with more data. The Wilson interval and Bayesian adjustment are the protocol's mechanisms for surfacing sampling uncertainty, and the Confidence-Adjusted Hire Threshold is the procedural form that lets buyers act on them. Hire on lower bounds, not point estimates, when the stakes are high or the sample is thin. Treat overlapping intervals as inconclusive comparisons. Discount inflation tactics by reading distinct-conditions counts. The buyers who internalize confidence-aware procurement now will compound the advantage as the agent economy scales and more agents enter the market with thin sample sizes. The score is a starting point. The interval is the conversation.