Bayesian Updating In Agent Reputation: Why Priors Beat Single-Trial Demos
A great demo proves nothing. A scoring system without priors gets fooled by every demo. The math that prevents one cherry-picked success from outranking 200 honest runs.
TL;DR
A scoring system that takes raw success rates at face value will be wrong about every new agent and easily fooled by cherry-picked demos. Bayesian reputation fixes this with a discipline borrowed from clinical trials and credit scoring: every new agent inherits a prior derived from the typical performance of agents in its capability class, and the prior is updated by observed evidence in proportion to how much evidence has accumulated. A brand-new agent with one cherry-picked 100% success run gets posterior-collapsed toward the class mean. A seasoned agent with 200 runs at 87% gets a tight posterior near its observed rate. The two are no longer comparable on raw numbers, which is the point. This essay walks through the failure modes of prior-free scoring, the mechanics of class-prior construction, the posterior calculation, the interaction with confidence intervals and decay, and a Posterior Calculation Worksheet a practitioner can use to evaluate a new agent against its peers without falling for the demo.
A real failure mode: the agent that looked perfect
In the second quarter of 2026, a buyer evaluating customer-service agents on a public marketplace narrowed the choice down to two finalists. Agent A was a well-known operator, Gold tier, with eighteen months of public history, around 240 completed evaluation cycles, and a composite trust score of 87.3 with a tight confidence interval. Agent B was new: three weeks old, twelve evaluation cycles, a displayed composite of 96.8, and a Platinum-tier certification under the marketplace's grading. The marketplace's UI ranked Agent B higher because the displayed number was higher. The buyer, who had read enough about agent reputation to be suspicious, pulled the underlying eval evidence for both agents.
Agent A's evidence trail was exactly what eighteen months of operation should look like. Roughly 87% of cases passed the published rubric. The failures clustered around specific edge cases: multi-currency refund processing, customers in unusual time zones, escalations involving regulatory disclosures. The failure modes were known, documented, and slowly being addressed across consecutive eval cycles. Nothing unusual.
Agent B's evidence trail was twelve cases, all extremely clean, all in the agent's optimal lane. Standard refund flows in USD. Standard subscription cancellations during business hours. Standard product Q&A in plain English. There were no edge cases. There were no hard cases. There was no evidence that the agent had ever been put under stress. The 96.8 composite was real in the sense that the math correctly aggregated the twelve cases. It was meaningless as a basis for predicting how the agent would perform on the buyer's actual workload, which contained roughly 30% of cases that looked nothing like the twelve in the agent's history.
The buyer chose Agent A. Six months later, the buyer's risk team did a retrospective. Agent A had performed in line with its score: an 86% case-resolution rate on the buyer's actual workload, with the failure pattern matching the public eval evidence almost exactly. Agent B, which the marketplace was still ranking ahead of Agent A in default sort, had been quietly racking up failures on the cases its initial twelve evals had not exercised. Its public composite had drifted down to 89.4 by the time of the retrospective, but the marketplace's UI still treated it as a Platinum-tier option for new buyers.
The failure mode here is not mysterious and is not specific to this marketplace. It is what happens when a scoring system treats a small sample of clean evidence as if it were equivalent to a large sample of representative evidence. By count, twelve cases are a twentieth of 240, but they carry less than a twentieth of the information; the real figure is closer to a hundredth, because the buyer cannot tell from the small sample whether the cases were drawn from a representative distribution or hand-picked to make the agent look good. The marketplace was not lying. The math it was running was wrong, in a way that systematically advantages new entrants who control their own initial eval distribution.
Why raw averages are the wrong primitive
The instinct of almost every reputation system, from eBay seller ratings to Yelp restaurant scores to customer-service agent badges, is to compute a raw average of observed outcomes and display it as the score. This works adequately when the sample size is large and the cases are randomly distributed, which is to say almost never in practice. It fails in three predictable ways.
First, it fails on small samples. The variance of a sample mean shrinks as the sample grows; with twelve cases, the variance is enormous. A 96.8 score on twelve cases could plausibly come from a true underlying performance anywhere between 75% and 100%, and the data simply does not distinguish between those possibilities. Displaying 96.8 as if it were a precise estimate is a presentational error that the underlying math does not justify.
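A rough way to see how little twelve clean cases establish is to treat each case as an independent pass/fail trial. That is a simplifying assumption made only for this sketch, since the composite in the story is a continuous score. An agent with true pass rate p produces n clean cases in a row with probability p^n, and the hypothetical helper below solves p^n = alpha for p.

```python
def lowest_plausible_pass_rate(n_clean_cases: int, alpha: float = 0.05) -> float:
    """Smallest true pass rate that still produces an all-pass run of
    n_clean_cases with probability at least alpha (solves p**n = alpha).

    Assumes independent pass/fail cases; this is an illustration only.
    """
    if n_clean_cases <= 0:
        raise ValueError("n_clean_cases must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return alpha ** (1.0 / n_clean_cases)


twelve_clean = lowest_plausible_pass_rate(12)    # roughly 0.78
many_clean = lowest_plausible_pass_rate(240)     # roughly 0.988
```

Under that assumption, an agent whose true pass rate is only about 78% would still post a perfect twelve-case record about one time in twenty, while 240 clean cases would rule out anything much below 99%.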
Second, it fails on selected samples. If the agent's operator can influence which cases are run during early evaluations (by submitting cases they have already optimized for, by declining to run on cases they expect to fail, by structuring the evaluation period around their known strengths), the resulting score is not a sample of the agent's true performance. It is a sample of the agent's chosen performance. Without a discipline that pulls selected samples back toward what is realistic for similar agents, the system rewards the operators who manage their evaluation distribution most aggressively.
Third, it fails on cross-agent comparison. Two agents with the same displayed score can have radically different reliability of that score. A 90% score from 5 cases and a 90% score from 500 cases are not the same claim, and any system that displays them as the same number is allowing the small-sample agent to free-ride on the visual presentation of the high-sample agent's accomplishment. Buyers comparing across agents on raw averages are systematically misallocating trust toward the noisier estimates.
The Bayesian alternative is not a quirk of statistical philosophy. It is the standard discipline for the same problem in every other field where reputation under uncertainty has economic consequences. Clinical trials use prior distributions to keep tiny early-phase studies from being over-interpreted. Credit scoring uses class priors to assign reasonable initial scores to thin-file applicants. Insurance underwriting uses base rates. Sports analytics uses regression-toward-the-mean adjustments for small-sample performances. The pattern is consistent: when the question is "what can we expect from this entity going forward," the answer is never just the entity's observed average. It is the observed average pulled toward the average of similar entities, with the strength of the pull inversely related to how much evidence we have about this particular entity.
The mechanics: prior, evidence, posterior
Bayesian reputation scoring has three components and a single update rule. Once the components are clear, the math is straightforward and the implementation cost is modest.
The prior is the score we would assign an agent in this capability class if we knew nothing else about it. Capability classes are categorical buckets: customer-support agents, code-review agents, financial-analysis agents, content-moderation agents, and so on. Each class has a distribution of observed outcomes across all the agents in it, and that distribution is the basis for the prior. A reasonable default is to fit a Beta distribution (for binary outcomes) or a Normal distribution (for continuous scores) whose parameters are chosen to match the class mean and the class variance. The prior says: before we observe anything specific about this agent, the most likely value of its true performance is the class mean, with the class variance representing our uncertainty about that value.
The evidence is the observed outcomes of the agent's evaluations: how many passed, how many failed, on what kinds of cases, with what severity of failure. Each piece of evidence is also annotated with a freshness (recent evidence weighs more than old evidence) and with a context (evidence from cases representative of the agent's claimed capabilities weighs more than evidence from out-of-scope probes).
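For binary outcomes, one common way to turn a class mean and variance into Beta parameters is the method of moments. The sketch below assumes the class statistics are expressed as pass rates between 0 and 1; the function name and the 0.78 / 0.08 inputs are illustrative, not published class statistics.

```python
def beta_prior_from_class(mean: float, variance: float) -> tuple[float, float]:
    """Method-of-moments Beta(alpha, beta) matching a class mean and variance.

    Both inputs are on the 0-1 pass-rate scale (an assumption of this sketch).
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must be strictly between 0 and 1")
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must be positive and below mean * (1 - mean)")
    # alpha + beta acts as the prior's equivalent sample size (pseudo-counts).
    strength = mean * (1.0 - mean) / variance - 1.0
    return mean * strength, (1.0 - mean) * strength


alpha, beta = beta_prior_from_class(mean=0.78, variance=0.08 ** 2)
equivalent_sample_size = alpha + beta
```

Note that matching the raw class variance fixes the prior's equivalent sample size as a side effect; a system that wants to choose prior strength separately would set `alpha + beta` directly and keep only the class mean.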
The posterior is the updated estimate of the agent's true performance after the evidence is incorporated. The math of posterior computation has well-known closed forms for common distributions; the practical version, after all the calculus, is a weighted average of the prior and the observed mean, with the weights determined by the relative strength of the prior and the amount of evidence. The intuition is: with no evidence, the posterior equals the prior. With overwhelming evidence, the posterior equals the observed mean. With moderate evidence, the posterior is somewhere in between, with the position determined by how strong the prior was set and how much evidence has accumulated.
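The weighted-average form can be written in a few lines. This is a minimal sketch, with `posterior_mean` and its parameter names chosen for illustration; the check at the end confirms that, for a Beta prior with binary outcomes, the same formula reproduces the standard Beta-Binomial posterior mean.

```python
def posterior_mean(observed_mean: float, n: int,
                   prior_mean: float, prior_strength: float) -> float:
    """Weighted average of observed mean and prior mean.

    prior_strength is the prior's equivalent sample size and must be positive.
    """
    if n < 0 or prior_strength <= 0:
        raise ValueError("n must be >= 0 and prior_strength must be > 0")
    weight_on_evidence = n / (n + prior_strength)
    return weight_on_evidence * observed_mean + (1 - weight_on_evidence) * prior_mean


# Beta-Binomial cross-check with illustrative numbers: Beta(8, 2) prior, 9 passes in 10.
alpha, beta, passes, n = 8.0, 2.0, 9, 10
beta_binomial_mean = (alpha + passes) / (alpha + beta + n)
weighted_mean = posterior_mean(passes / n, n, alpha / (alpha + beta), alpha + beta)
```

With `n = 0` the function returns the prior mean, and as `n` grows the result approaches the observed mean, which matches the three regimes described above.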
The critical parameter is the prior strength, often expressed as an equivalent sample size. A weak prior corresponds to maybe five or ten equivalent observations, so the prior is overwhelmed quickly by real evidence. A strong prior corresponds to fifty or more equivalent observations, so even after dozens of real cases the agent's posterior is meaningfully pulled toward the class mean. The choice of prior strength is the most important design decision in a Bayesian reputation system because it determines how many real observations are required before an agent's individual record dominates its class identity.
For agent reputation, a reasonable default is a moderate prior of around 30 to 50 equivalent observations. This is strong enough to keep new agents from posting absurd scores after a handful of evals, weak enough to let experienced agents establish an individual record after a few months of real operation. Different capability classes may warrant different prior strengths (high-stakes classes like financial execution should use stronger priors than low-stakes classes like content suggestion), but the order of magnitude is consistent across most use cases.
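To make the effect of prior strength concrete, the weight on evidence N / (N + prior strength) can be inverted to ask how many evaluations an agent needs before its own record carries a given share of the posterior. The helper name below is hypothetical, and the sketch assumes every evaluation counts as one full observation.

```python
import math


def evals_needed(prior_strength: float, evidence_weight: float) -> int:
    """Evaluations required for the evidence to carry `evidence_weight`
    of the posterior, from solving N / (N + prior_strength) = weight."""
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    if not 0.0 < evidence_weight < 1.0:
        raise ValueError("evidence_weight must be strictly between 0 and 1")
    needed = evidence_weight * prior_strength / (1.0 - evidence_weight)
    # Round first so floating-point noise does not push an exact answer up by one.
    return math.ceil(round(needed, 9))


half_weight = evals_needed(prior_strength=40, evidence_weight=0.5)
ninety_percent = evals_needed(prior_strength=40, evidence_weight=0.9)
```

With a prior strength of 40, the agent's own record reaches half the weight at 40 evaluations and 90% of the weight at 360.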
Worked example: the demo agent versus the seasoned agent
Take the two agents from the opening failure mode and run them through proper Bayesian updating. Assume the customer-support capability class has a mean composite score of 78 with a standard deviation of 8. Assume a moderate prior strength of 40 equivalent observations.
Agent A has 240 observed evaluations averaging 87.3. The posterior weighting puts 240 / (240 + 40) = 85.7% of the weight on the observed evidence and 14.3% on the prior. The posterior estimate is 0.857 * 87.3 + 0.143 * 78 = 86.0. The confidence interval around this posterior is tight because the sample is large; in a Normal-conjugate frame, the posterior standard deviation is around 0.5. The buyer can be confident that Agent A's true performance is between roughly 85 and 87 with high probability.
Agent B has 12 observed evaluations averaging 96.8. The posterior weighting puts 12 / (12 + 40) = 23.1% of the weight on the observed evidence and 76.9% on the prior. The posterior estimate is 0.231 * 96.8 + 0.769 * 78 = 82.3. The confidence interval is wider because the sample is smaller; in the same Normal-conjugate frame, approximating the posterior variance as the class variance divided by (N + prior strength), the posterior standard deviation is around 1.1, more than twice Agent A's. The buyer should treat Agent B's true performance as between roughly 80 and 84.5, and even that range assumes the twelve cases were representative.
The Bayesian-adjusted ranking flips. Agent A, which displayed lower on the marketplace's UI, has the higher posterior estimate. Agent B, which displayed higher, has both a lower posterior estimate and a much wider uncertainty band. A buyer making a decision on posterior expectation, not on raw averages, would correctly choose Agent A. A buyer using a more sophisticated decision rule that takes into account both the posterior and its uncertainty (for example, the lower bound of an 80% credible interval) would choose Agent A by an even larger margin, because Agent B's wider interval extends further down.
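The point estimates in this example can be reproduced with a short script. It is a sketch that uses only the numbers stated above (class mean 78, prior strength 40, and each agent's observed mean and evaluation count); the variable and function names are illustrative.

```python
CLASS_MEAN = 78.0
PRIOR_STRENGTH = 40.0

# name -> (observed mean composite, number of evaluations)
agents = {"Agent A": (87.3, 240), "Agent B": (96.8, 12)}


def posterior_estimate(observed_mean: float, n: int) -> float:
    """Shrink the observed mean toward the class mean by N / (N + prior strength)."""
    weight = n / (n + PRIOR_STRENGTH)
    return weight * observed_mean + (1 - weight) * CLASS_MEAN


posteriors = {name: posterior_estimate(*stats) for name, stats in agents.items()}
raw_ranking = sorted(agents, key=lambda name: agents[name][0], reverse=True)
bayes_ranking = sorted(posteriors, key=posteriors.get, reverse=True)

for name in bayes_ranking:
    print(f"{name}: raw {agents[name][0]:.1f}, posterior {posteriors[name]:.1f}")
```

The raw ranking puts Agent B first and the posterior ranking puts Agent A first, which is the flip described above.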
This is the math that defends against the cherry-picked demo. Agent B's twelve clean cases are not enough evidence to overpower the class prior. The posterior pulls Agent B's score back toward what is realistic for an unknown customer-support agent until sufficient evidence accumulates to justify a higher estimate. If Agent B genuinely is a top-tier performer, the posterior will rise as the evidence base grows; the system is not punishing Agent B forever, just refusing to credit it for an unsubstantiated demo.
What goes into a capability-class prior
The quality of the Bayesian system depends entirely on the quality of the priors, which means the work of constructing class priors is itself a load-bearing piece of the reputation infrastructure. Three properties matter.
Class definitions have to be specific enough to be useful and broad enough to be populated. A class like "customer-support agents handling US-English refund flows for SaaS subscriptions in the $50-500/month range" is specific enough to give a meaningful prior, if there are enough agents in it. A class like "AI agents" is too broad; the variance is so large that the prior provides almost no useful pull. The right granularity is typically determined by what reduces within-class variance the most while keeping the class population in the dozens or hundreds of agents. This is empirical work, not theoretical.
Class membership has to be earned, not asserted. An agent claiming to be in a high-performing class should not get the high-performing class's prior automatically; that would let operators game the system by claiming favorable class memberships. The right discipline is to assign provisional priors based on the agent's stated capabilities, then re-evaluate class membership after enough evidence has accumulated. An agent that looked like a customer-support agent on registration but produces evidence consistent with a content-summarization agent gets reassigned and re-priored against its actual class.
Class priors have to update over time. As more agents enter a class and more evidence accumulates, the class distribution changes. Priors computed against the class distribution from two years ago will be miscalibrated for agents being evaluated today. The discipline is to recompute class priors periodically (typically quarterly) using the most recent year of evidence from class members, with appropriate weighting to keep the recomputation stable rather than chasing recent noise. The recomputed priors are then applied to subsequent evidence updates; historical scores are not retroactively recomputed against new priors, because that would create the kind of silent revision the trust system is supposed to prevent.
There is a meta-failure mode worth flagging here. Class priors that are computed from the same evidence stream that is being used to score individual agents are vulnerable to a feedback loop: if the entire class drifts up because of a few prolific top performers, the class prior rises, which means subsequent agents need to perform proportionally better just to maintain a stable posterior. Avoiding this loop requires either using a stable historical reference period for the prior, or explicitly modeling the temporal dynamics of the class distribution. Most production systems do the first because it is simpler and the cost is acceptable.
The interaction with confidence intervals
A Bayesian posterior is a distribution, not a number. The system can collapse it to a number for display purposes, but the underlying distribution carries information that the number does not. Specifically, the posterior's spread tells the consumer how much they should trust the point estimate. A tight posterior, typical of agents with hundreds of evaluations, means the point estimate is highly reliable. A wide posterior, typical of new agents, means the point estimate is volatile and could move substantially as more evidence accumulates.
Mature reputation systems expose both the point estimate and the posterior spread, often in the form of a credible interval. "Agent A: posterior 86.0, 80% credible interval [85.4, 86.6]" tells a different and richer story than "Agent A: 86." It is the difference between a single number and an honest forecast.
The implication for consumer behavior is that two posterior numbers should not be compared on the point estimate alone. An agent with posterior 88 ± 3 and an agent with posterior 86 ± 0.5 are not in the same epistemic position. The first might be much better or much worse than the second; the data does not currently distinguish. A buyer making a decision should care about this. A risk-averse buyer should prefer the tighter posterior. A risk-tolerant buyer optimizing for upside might prefer the wider one. Either way, the right primitive for the decision is the full posterior, not the point.
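The two decision rules can be sketched directly. The example below assumes the figure after each ± is the half-width of the agent's credible interval, and the names `pick`, `wide`, and `tight` are hypothetical labels for the two agents in the paragraph above.

```python
def pick(candidates: dict[str, tuple[float, float]], risk_averse: bool) -> str:
    """Choose a candidate by interval bound.

    candidates maps name -> (point estimate, interval half-width).
    Risk-averse buyers maximize the lower bound; risk-tolerant buyers
    maximize the upper bound.
    """
    def bound(item: tuple[str, tuple[float, float]]) -> float:
        estimate, half_width = item[1]
        return estimate - half_width if risk_averse else estimate + half_width

    return max(candidates.items(), key=bound)[0]


candidates = {"wide": (88.0, 3.0), "tight": (86.0, 0.5)}
cautious_choice = pick(candidates, risk_averse=True)     # lower bounds: 85.0 vs 85.5
upside_choice = pick(candidates, risk_averse=False)      # upper bounds: 91.0 vs 86.5
```

The cautious rule selects the agent with the lower point estimate, which is the practical meaning of comparing full posteriors rather than points.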
This is also the property that prevents the cherry-picked-demo problem from re-emerging in a Bayesian system. Even if an operator manages to game the prior by claiming a favorable class, the resulting wide posterior signals to the consumer that the score is not yet reliable. The score itself might look acceptable; the credible interval reveals that the score is not standing on much.
Decay, anomalies, and the temporal dimension
A pure Bayesian update treats all evidence as equally informative regardless of when it was collected. This is wrong for agent reputation because agents change over time through model upgrades, training data shifts, prompt revisions, and infrastructure changes. Two-year-old evidence about an agent that has gone through three model versions and a reorientation is not the same kind of evidence as last week's evals on the current build.
The practical fix is to weight evidence by recency, with older evidence contributing fractionally less to the posterior than newer evidence. Armalo's decay rule (one composite point per week after a seven-day grace period) is a slow form of this; it reduces the weight of old scores by reducing the score itself rather than by directly reweighting evidence. The two are mathematically related; both shift the posterior toward newer observations and away from older ones, with the rate of shift determining how quickly an agent's reputation can be rebuilt or eroded.
Anomaly detection (Armalo's >200-point swing trigger) interacts with the Bayesian update in a useful way. A sudden enormous shift in observed evidence is statistically incompatible with a stable posterior unless something has fundamentally changed about the agent. Treating large swings as anomalies that require investigation before they update the posterior is the right discipline; otherwise, a single bad week or a single adversarial probe campaign can crater an agent's score in ways that do not reflect its true underlying performance. The Bayesian frame makes the right behavior obvious: extreme evidence is informative only after we have ruled out the explanations (gaming, single-event variance, methodology error) that do not actually update our beliefs about the agent.
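One common way to reweight evidence directly is exponential decay with a half-life. The sketch below is an assumption for illustration and is not Armalo's decay rule: the 90-day half-life, the function name, and the input shape are all hypothetical. It returns a recency-weighted mean and an effective sample size that can stand in for the observed mean and N in the posterior formula.

```python
def recency_weighted_evidence(
    evals: list[tuple[float, float]], half_life_days: float = 90.0
) -> tuple[float, float]:
    """Return (weighted mean score, effective sample size).

    evals is a list of (score, age_in_days). Each evaluation's weight
    halves every half_life_days, so a fresh evaluation counts as 1.0.
    """
    if not evals:
        raise ValueError("at least one evaluation is required")
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    weights = [0.5 ** (age / half_life_days) for _, age in evals]
    effective_n = sum(weights)
    weighted_mean = sum(w * score for w, (score, _) in zip(weights, evals)) / effective_n
    return weighted_mean, effective_n


mean, effective_n = recency_weighted_evidence([(90.0, 0.0), (70.0, 90.0)])
```

Because old evaluations shrink the effective sample size as well as the mean, an agent that stops being evaluated drifts back toward its class prior over time.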
The combination of capability-class priors, posterior updating, recency-weighted evidence, and anomaly investigation produces a scoring system whose behavior closely matches what an attentive human evaluator would do given the same information. New agents are conservatively scored. Established agents have stable, narrow estimates. Sudden changes are treated as suspicious until investigated. Cherry-picked demos do not move the needle. Genuine sustained improvement is recognized over weeks rather than days. The math is doing the work of the careful evaluator at scale.
Hierarchical priors: when one class is not enough
The single-class prior described above is the right starting point and the wrong final architecture for any sufficiently mature reputation system. Real agents do not fit cleanly into one capability class. A customer-support agent is also a content-generation agent (it writes responses), a workflow-execution agent (it processes refunds), and a knowledge-retrieval agent (it answers product questions). Forcing it into a single bucket either picks the most prominent capability and ignores the others, or constructs an artificially broad bucket that has so much within-class variance that the prior provides little useful pull.
The right architecture for this is hierarchical priors. The agent's prior is constructed as a weighted combination of priors from multiple capability classes, with the weights determined by the agent's stated and observed capability mix. A new customer-support agent inherits some prior mass from the customer-support class and additional mass from each adjacent class it claims to operate in. The result is a prior that is more individualized than a single-class prior but still anchored in observed distributions of similar agents, which is exactly the property that defends against demo gaming.
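A minimal version of the weighted combination blends class means and prior strengths by the agent's capability mix. Every class name, class statistic, and mix weight below is a hypothetical placeholder except the customer-support mean of 78 used earlier; a fuller treatment would also carry the between-class variance that this simplification ignores.

```python
def blended_prior(
    class_stats: dict[str, tuple[float, float]], mix: dict[str, float]
) -> tuple[float, float]:
    """Blend per-class (mean, prior strength) pairs by capability mix.

    mix maps class name -> non-negative weight; weights are normalized here.
    """
    total = sum(mix.values())
    if total <= 0 or any(w < 0 for w in mix.values()):
        raise ValueError("mix weights must be non-negative with a positive sum")
    mean = sum(w / total * class_stats[name][0] for name, w in mix.items())
    strength = sum(w / total * class_stats[name][1] for name, w in mix.items())
    return mean, strength


class_stats = {
    "customer-support": (78.0, 40.0),
    "content-generation": (82.0, 30.0),     # hypothetical
    "workflow-execution": (74.0, 50.0),     # hypothetical
}
mix = {"customer-support": 0.6, "content-generation": 0.2, "workflow-execution": 0.2}
prior_mean, prior_strength = blended_prior(class_stats, mix)
```

The mix should come from the observed case distribution once enough evidence exists, with stated capabilities used only as a provisional input.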
The practical complication is that hierarchical priors require enough data to estimate stable per-class statistics for many classes simultaneously, and require honest measurement of capability mix rather than self-declaration. Both are solvable with effort. The capability mix problem is particularly subtle: if the system uses an agent's stated capabilities to choose its prior, operators learn to claim favorable capabilities. The defensible solution is to measure capability mix from observed evidence (the actual case mix the agent has been evaluated against) rather than from registration claims, and to apply provisional priors based on stated capabilities only until enough evidence has accumulated to verify the mix.
A further refinement, useful for high-stakes capabilities, is to estimate not just per-class means but per-class covariance structure across dimensions. An agent that scores well on accuracy in customer-support contexts is more likely than baseline to also score well on scope honesty in those contexts; the two dimensions are correlated within the class. Modeling those correlations lets the posterior update on one dimension flow appropriately into the others, rather than treating each dimension as independent. The math is more involved than a simple Beta-Binomial update, but the closed-form Multivariate Normal-Inverse-Wishart conjugate prior handles it cleanly for continuous-score regimes, which is what most trust dimensions actually are.
The payoff for hierarchical priors is most visible in mid-evidence regimes: agents with somewhere between 20 and 100 evaluations, where the posterior is meaningfully informed by both the prior and the evidence. In those regimes, a single-class prior either over-pulls or under-pulls in ways that systematically bias the posterior. A hierarchical prior gives a tighter, better-calibrated posterior precisely where most consumers are looking, which is at the agents that are no longer brand-new but not yet long-established.
Common pitfalls in production Bayesian reputation systems
Moving from a clean theoretical Bayesian frame to a production reputation system surfaces a specific set of pitfalls that practitioners learn the hard way. Each is worth flagging because the failure mode is silent: the system continues to produce numbers that look reasonable, and the underlying problem only shows up in adversarial probing or in retrospective audits.
Pitfall 1: stale class statistics. The class prior is computed from a historical reference period and then forgotten. The class composition shifts over months (new agents enter, old agents leave, the underlying technology improves) and the prior becomes mis-calibrated for the agents currently being evaluated. The fix is scheduled recomputation of class statistics, with explicit version tracking so that historical scores can be reconciled against the prior used at the time.
Pitfall 2: prior strength miscalibration. The prior strength is set once based on a hunch and never validated against observed behavior. The right validation is to measure how well the posterior at evaluation N predicts the eventual long-run behavior at evaluation N+M for various M. A well-calibrated system has posteriors at small N that are noticeably wider than at large N, and the wider posteriors actually contain the eventual long-run behavior at the advertised credible-interval rate. Most production systems have posteriors that are too tight at small N (they underweight the prior), and the symptom is that early scores look too confident and frequently move substantially as more evidence accumulates.
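The validation described in Pitfall 2 reduces to a coverage check: across many agents, how often did the early credible interval contain the score the agent eventually settled at? The sketch below assumes those historical triples are already available; the function name and the sample records are illustrative, not real data.

```python
def interval_coverage(records: list[tuple[float, float, float]]) -> float:
    """Fraction of records whose long-run score fell inside the early interval.

    Each record is (interval lower bound, interval upper bound, long-run score).
    """
    if not records:
        raise ValueError("at least one record is required")
    hits = sum(1 for lower, upper, long_run in records if lower <= long_run <= upper)
    return hits / len(records)


# Illustrative history: five agents' early 80% intervals and later long-run scores.
history = [
    (80.9, 83.8, 82.0),
    (75.0, 79.0, 78.5),
    (84.0, 87.0, 83.1),   # miss: long-run score fell below the interval
    (70.2, 74.4, 71.0),
    (88.0, 90.5, 89.9),
]
coverage = interval_coverage(history)
```

If the measured coverage of nominal 80% intervals comes out well below 0.8 at small N, the prior strength is probably too weak for that class.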
Pitfall 3: evidence weighting bugs. The system intends to weight recent evidence more heavily, but the implementation has subtle bugs that produce unintended weights. A common version: the recency weight is applied only at score-update time, not at posterior-recomputation time, so the actual weight an old evaluation has in the current posterior is different from what the methodology document claims. The fix is regression tests that verify, for synthetic histories of evaluations, that the posterior is what the methodology says it should be.
Pitfall 4: anomaly-as-evidence confusion. A large posterior swing triggered by extreme evidence is treated as a normal update, when it should trigger investigation. The system updates the posterior immediately and then, in some implementations, retroactively reverses the update if the anomaly is later determined to have been gaming or measurement error. The retroactive reversal is operationally messy and creates the same silent-revision problem that anchored systems exist to prevent. The cleaner discipline is to hold extreme evidence in an investigation queue and only let it update the posterior after the investigation completes.
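The hold-then-release discipline from Pitfall 4 can be sketched as a small record type. The class, its method names, and the `max_gap` threshold of 30 points are hypothetical choices for this illustration; they are not Armalo's trigger or data model.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    prior_mean: float
    prior_strength: float
    scores: list[float] = field(default_factory=list)   # evidence already applied
    held: list[float] = field(default_factory=list)     # evidence under investigation

    def posterior(self) -> float:
        n = len(self.scores)
        if n == 0:
            return self.prior_mean
        weight = n / (n + self.prior_strength)
        return weight * (sum(self.scores) / n) + (1 - weight) * self.prior_mean

    def submit(self, score: float, max_gap: float = 30.0) -> bool:
        """Apply the score unless it is extreme; return True if applied."""
        if abs(score - self.posterior()) > max_gap:
            self.held.append(score)    # queued, posterior untouched
            return False
        self.scores.append(score)
        return True

    def release(self, score: float) -> None:
        """Apply a held score once investigation confirms it is genuine."""
        self.held.remove(score)
        self.scores.append(score)
```

Because held evidence never touches the posterior until it is released, there is nothing to reverse retroactively if the investigation finds gaming or measurement error.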
Pitfall 5: cross-dimension contamination. A scoring failure on one dimension drags down the posteriors on adjacent dimensions through a shared evidence pipeline. For example: a single bad eval that revealed both an accuracy problem and an unrelated latency hiccup gets logged as evidence against both dimensions, and the resulting posterior updates are correlated when they should be independent. The fix is careful evidence-to-dimension routing, with audit trails that let evidence be re-assigned when the routing turns out to be wrong.
Pitfall 6: sparse-class meltdown. A capability class with very few agents (perhaps newly defined, perhaps niche) has unstable class statistics. The prior derived from this class is itself uncertain, and propagating that uncertainty through to the posterior gives ridiculous credible intervals. The fix is hierarchical: a thinly-populated class falls back to a parent class with more members, with the depth of fallback determined by the population of each level. Pure rung-3 cleanliness gives way to engineering pragmatism here, but the alternative is meaningless intervals that get rounded out of the displayed score and mask real uncertainty.
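The fallback rule can be sketched as a walk from the most specific class toward its parents until one is populated enough. The class names, populations, and the `min_agents` cutoff of 30 are hypothetical values chosen for illustration.

```python
def resolve_prior_class(
    path: list[str], populations: dict[str, int], min_agents: int = 30
) -> str:
    """Return the most specific class in `path` with at least min_agents members.

    path is ordered from most specific to most general; if no level is
    populated enough, the most general class is used.
    """
    if not path:
        raise ValueError("path must contain at least one class")
    for name in path:
        if populations.get(name, 0) >= min_agents:
            return name
    return path[-1]


path = ["saas-refund-support", "customer-support", "all-agents"]
populations = {"saas-refund-support": 4, "customer-support": 120, "all-agents": 900}
chosen = resolve_prior_class(path, populations)
```

Here the four-agent niche class is skipped and the prior comes from its 120-agent parent.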
None of these pitfalls is fatal in isolation. The danger is that they compound silently. A system with stale class statistics, mis-calibrated prior strength, and cross-dimension contamination will produce numbers that look fine and are quietly wrong in ways that adversaries learn to exploit faster than the operator learns to detect.
The reader artifact: the Posterior Calculation Worksheet
When evaluating a new agent against its peers, walk through this worksheet before letting the displayed marketplace score drive the decision. The worksheet produces a posterior estimate and a credible interval that can be compared across agents on equal footing.
Step 1: Identify the capability class
- What is the specific capability the agent claims (e.g., US-English customer support for SaaS refunds)?
- What is the class mean composite score for that capability (look it up from the marketplace's published class statistics, or estimate from a peer set)?
- What is the class standard deviation (similar source)?
Step 2: Assess the agent's evidence base
- How many evaluations has the agent completed (count of distinct eval events, not number of cases)?
- What is the agent's observed mean composite across those evaluations?
- Are the evaluations representative of real production cases, or selected demos? (If selected, note this; the worksheet's prior strength should be increased to compensate.)
- What is the freshness of the evidence (median age of the evaluations)?
Step 3: Choose the prior strength
- For low-stakes capabilities (productivity, summarization): prior strength of 20-30 equivalent observations.
- For medium-stakes capabilities (customer-facing decisions, content generation): prior strength of 30-50 equivalent observations.
- For high-stakes capabilities (financial execution, regulated decisions): prior strength of 50-100 equivalent observations.
- If the agent's evidence base is suspected of being curated, increase prior strength by 50%.
Step 4: Compute the posterior
- Posterior weight on evidence = N / (N + prior_strength), where N is the number of evaluations.
- Posterior weight on prior = 1 - posterior weight on evidence.
- Posterior estimate = (posterior weight on evidence × observed mean) + (posterior weight on prior × class mean).
Step 5: Compute the credible interval
- Posterior variance ≈ (class variance) / (N + prior_strength).
- Posterior standard deviation = √(posterior variance).
- 80% credible interval ≈ posterior estimate ± 1.28 × posterior standard deviation.
- 95% credible interval ≈ posterior estimate ± 1.96 × posterior standard deviation.
Step 6: Compare across agents
- Rank candidates by posterior estimate, but flag any agent whose 80% credible interval overlaps substantially with another candidate as not statistically distinguishable.
- For risk-averse decisions, rank by the lower bound of the 80% credible interval rather than the point estimate. This systematically penalizes agents with thin evidence bases.
- For risk-tolerant decisions optimizing for upside, the upper bound of the credible interval can be used, but the consumer should know they are betting on the agent's distribution, not on its central estimate.
The worksheet is intentionally simple. The arithmetic can be done in a spreadsheet. The point is not the precise numerical answer; it is the discipline of refusing to compare raw averages across agents with radically different sample sizes. Once a buyer or platform internalizes this discipline, the cherry-picked-demo failure mode largely stops working. The math does not care how clean the demo looked; the prior pulls it back to reality.
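For readers who prefer a script to a spreadsheet, Steps 3 through 6 can be sketched as follows. The function and class names are illustrative; the formulas, the 1.28 and 1.96 multipliers, and the 50% prior-strength increase for curated evidence are taken from the worksheet, and the inputs are the two agents from the worked example.

```python
import math
from dataclasses import dataclass

Z_80, Z_95 = 1.28, 1.96   # multipliers from Step 5


@dataclass(frozen=True)
class Posterior:
    estimate: float
    std_dev: float

    def interval(self, z: float) -> tuple[float, float]:
        half_width = z * self.std_dev
        return self.estimate - half_width, self.estimate + half_width


def worksheet_posterior(observed_mean: float, n: int, class_mean: float,
                        class_sd: float, prior_strength: float,
                        curated: bool = False) -> Posterior:
    """Steps 3-5: optional curated-evidence adjustment, posterior, spread."""
    if curated:
        prior_strength *= 1.5                      # Step 3: suspected curation
    weight = n / (n + prior_strength)              # Step 4
    estimate = weight * observed_mean + (1 - weight) * class_mean
    std_dev = math.sqrt(class_sd ** 2 / (n + prior_strength))   # Step 5
    return Posterior(estimate, std_dev)


agent_a = worksheet_posterior(87.3, 240, class_mean=78.0, class_sd=8.0, prior_strength=40.0)
agent_b = worksheet_posterior(96.8, 12, class_mean=78.0, class_sd=8.0, prior_strength=40.0)

# Step 6, risk-averse variant: rank by the lower bound of the 80% interval.
ranked = sorted({"Agent A": agent_a, "Agent B": agent_b}.items(),
                key=lambda item: item[1].interval(Z_80)[0], reverse=True)
```

Ranking by the 80% lower bound keeps Agent A ahead, and passing `curated=True` for Agent B pulls its estimate further toward the class mean.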
Counter-argument: "Priors are subjective and just as gameable as raw averages"
The strongest objection to Bayesian reputation is that the prior is itself a choice, and any choice can be argued with. If the operator of the scoring system picks the class definitions, the prior strengths, and the recomputation cadence, they have many degrees of freedom over the resulting scores. A motivated bad actor could build a Bayesian system that looks rigorous and is actually engineered to favor specific agents through prior selection. The objection generalizes: replacing one judgment call (which raw averages to display) with multiple judgment calls (class boundaries, prior strengths, decay rates) does not eliminate subjectivity, it relocates it.
The objection is partially right and largely answered by transparency. Yes, the prior is a choice. But the prior is also a public choice. The class definitions are published. The class statistics are published. The prior strengths are published. The recomputation methodology is published. A regulator, an insurer, a competing platform, a sufficiently motivated buyer can audit every one of these choices and contest any of them. A raw average has the same number of judgment calls hiding inside it (which evaluations to include, how to weight them, what counts as a pass) but pretends to be objective. A Bayesian system makes the judgments explicit, which is what makes them auditable.
There is also a stronger version of the answer. The right way to evaluate a Bayesian reputation system is not to ask whether its priors are subjective (they are) but to ask whether changing them in plausible ways changes the rankings in unjustifiable ways. A robust system passes this test: small variations in class definitions, prior strengths, or decay rates produce small variations in scores, and the relative rankings across agents are stable. A fragile system fails it: small variations swing rankings substantially, which is evidence that the underlying methodology is doing too much work and the data is doing too little. Sensitivity analysis is the discipline that distinguishes the two, and any serious operator runs it as part of the system's regular maintenance. The output of the sensitivity analysis is itself a public artifact that auditors can inspect.
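One way such a sensitivity check could look, as a rough sketch: perturb the prior over a small grid and record which perturbations change the ranking. The two agents, the grid values, and the function names are all hypothetical, chosen only to show the shape of the analysis.

```python
from itertools import product


def posterior_mean(class_mean, prior_strength, observed_mean, n_evals):
    """Shrinkage estimate: prior pseudo-evaluations blended with real ones."""
    total = prior_strength + n_evals
    return (prior_strength * class_mean + n_evals * observed_mean) / total


# Hypothetical agents: (observed mean score, number of evaluations).
agents = {"seasoned": (87.3, 240), "newcomer": (96.8, 12)}


def ranking(class_mean, prior_strength):
    """Agent names ordered from highest to lowest posterior mean."""
    scores = {name: posterior_mean(class_mean, prior_strength, obs, n)
              for name, (obs, n) in agents.items()}
    return tuple(sorted(scores, key=scores.get, reverse=True))


# Perturb an assumed baseline prior (mean 80, strength 30) by plausible amounts.
class_means = [78.0, 80.0, 82.0]
prior_strengths = [24.0, 30.0, 36.0]
rankings = {(m, k): ranking(m, k) for m, k in product(class_means, prior_strengths)}

baseline = rankings[(80.0, 30.0)]
flips = [params for params, order in rankings.items() if order != baseline]

print(f"baseline ranking: {baseline}")
print(f"{len(flips)} of {len(rankings)} perturbations change the ranking: {flips}")
```

With these made-up numbers, one corner of the grid (the highest class mean combined with the weakest prior) flips the order. That is the kind of finding an operator would want to investigate and include in the published sensitivity report.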
The deeper move is to acknowledge that all reputation systems involve subjective choices and that the right defense is not to pretend they do not. It is to make the choices visible, document the reasoning, accept the contestation, and let the system improve through that contestation. Raw averages dodge this contestation by hiding the choices inside the averaging. Bayesian systems invite the contestation by surfacing them. The first feels less subjective and is actually more so. The second feels more subjective and is actually less so. Practitioners learn to prefer the second.
What Armalo does
Armalo's composite trust score is computed against capability-class priors derived from the distribution of observed outcomes across agents in each class, with prior strength tuned per class based on the stakes involved. New agents inherit the class prior and earn deviations from it through observed evaluations, weighted by recency. The 12 dimensions (accuracy, Metacal self-audit, reliability, safety, security, bond, latency, scope honesty, cost efficiency, model compliance, runtime compliance, harness stability) each have their own posterior computation against dimension-specific class statistics, then are aggregated into the composite using the published weights. The multi-LLM jury's outlier trimming (top and bottom 20%) reduces the influence of single judgments that would otherwise pull the posterior in unjustified directions. The 1-point-per-week decay after the 7-day grace period implements recency weighting on the score side, complementing the recency weighting on the evidence side. The 200-point anomaly trigger holds extreme posterior swings for investigation before they update the public score. Class definitions, class statistics, prior strengths, and decay rates are all published. Sensitivity analysis is run quarterly and the results are part of the methodology documentation. The point is not that this is the only valid Bayesian reputation system. The point is that any system that does not do something structurally similar will be fooled by the cherry-picked demo, and the operators who learn to exploit that failure mode will dominate the marketplaces that allow it.
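The three mechanical pieces named above (jury trimming, weekly decay, and the anomaly hold) can be sketched in a few lines. This is not Armalo's implementation. The function names are hypothetical, and several details are assumptions made for illustration: decay is counted in whole weeks since the last evaluation, scores are floored at zero, and a swing exactly equal to the threshold triggers a hold.

```python
import math


def trimmed_mean(scores: list[float], trim_fraction: float = 0.2) -> float:
    """Drop the top and bottom trim_fraction of judgments, then average the rest."""
    if not scores:
        raise ValueError("need at least one jury score")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(scores)
    k = math.floor(len(ordered) * trim_fraction)  # judgments dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def decayed_score(score: float, days_since_last_eval: int,
                  grace_days: int = 7, points_per_week: float = 1.0) -> float:
    """Subtract points_per_week for each full week elapsed after the grace period.

    Assumption: whole weeks only, and the score never drops below zero.
    """
    weeks_past_grace = max(0, days_since_last_eval - grace_days) // 7
    return max(0.0, score - points_per_week * weeks_past_grace)


def needs_review(old_score: float, new_score: float, threshold: float = 200.0) -> bool:
    """Flag a swing at or above the threshold so it can be held for investigation."""
    return abs(new_score - old_score) >= threshold


# Five hypothetical jury judgments; the two extremes are trimmed away.
jury = [10.0, 80.0, 82.0, 84.0, 100.0]
print(trimmed_mean(jury))
print(decayed_score(900.0, days_since_last_eval=21))
print(needs_review(old_score=900.0, new_score=650.0))
```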
FAQ
Why not just require a minimum number of evaluations before displaying a score at all? This is a reasonable conservative default for some use cases and many marketplaces use it. The downside is that it produces a binary cliff, where agents are either invisible or fully scored, that is itself gameable and does not give consumers a useful continuous signal as evidence accumulates. Bayesian updating is the continuous-signal alternative: scores are always available but their posterior uncertainty is honest about how much the data is actually saying.
How is the class mean computed without circular dependency on individual scores? Class statistics are computed from a stable historical reference period of completed evaluations, weighted to avoid being dominated by a few prolific agents. The recomputation cadence is slow enough (typically quarterly) that the class statistics are essentially exogenous to any individual agent's recent activity, breaking the circular dependency.
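The text does not specify the weighting scheme, so the following is only one possible approach, sketched under assumptions: each agent's weight in the class mean is capped at a fixed number of evaluations. The `cap` value of 50 and the function name `capped_class_mean` are hypothetical.

```python
def capped_class_mean(evals_by_agent: dict[str, list[float]], cap: int = 50) -> float:
    """Class mean in which no agent carries more than `cap` evaluations of weight."""
    weighted_sum = 0.0
    total_weight = 0.0
    for scores in evals_by_agent.values():
        if not scores:
            continue  # agents with no evaluations in the reference period add nothing
        weight = min(len(scores), cap)
        weighted_sum += weight * (sum(scores) / len(scores))
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no evaluations in the reference period")
    return weighted_sum / total_weight


# One prolific agent and two ordinary ones (made-up reference-period data).
reference_period = {
    "prolific": [90.0] * 1000,
    "agent_b": [70.0] * 50,
    "agent_c": [80.0] * 50,
}

all_scores = [s for scores in reference_period.values() for s in scores]
naive_mean = sum(all_scores) / len(all_scores)

print(f"naive pooled mean: {naive_mean:.2f}")
print(f"capped class mean: {capped_class_mean(reference_period):.2f}")
```

In this toy data the naive pooled mean is pulled toward the prolific agent's 90, while the capped version weights the three agents equally.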
What happens to an agent that legitimately is much better than its class? Its posterior rises as evidence accumulates, and its 80% credible interval narrows around the higher estimate. After a few dozen well-distributed evaluations at high scores, the posterior will be unmistakably above the class mean. The system does not prevent excellent agents from being recognized as excellent; it only requires that the recognition be earned through evidence rather than asserted through curated demos.
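To make the narrowing concrete, here is a small sketch using the Normal-Normal frame with a known per-evaluation noise level. The prior (mean 80, standard deviation 5), the noise standard deviation of 8, and the agent's observed mean of 95 are all invented for illustration.

```python
from statistics import NormalDist


def normal_posterior(prior_mean: float, prior_sd: float,
                     obs_mean: float, obs_sd: float, n: int) -> NormalDist:
    """Normal-Normal conjugate update, treating per-evaluation noise obs_sd as known."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / obs_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + data_precision * obs_mean)
    return NormalDist(post_mean, post_var ** 0.5)


def credible_interval(dist: NormalDist, mass: float = 0.80) -> tuple[float, float]:
    """Central interval containing `mass` of the posterior probability."""
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


# The same strong agent, seen after progressively more evaluations.
for n in (3, 12, 48):
    post = normal_posterior(prior_mean=80.0, prior_sd=5.0,
                            obs_mean=95.0, obs_sd=8.0, n=n)
    low, high = credible_interval(post)
    print(f"n={n:>2}: mean={post.mean:.1f}, 80% interval=({low:.1f}, {high:.1f})")
```

As n grows, the posterior mean moves from the class mean toward the observed 95 and the interval tightens. By 48 evaluations, under these assumptions, even the lower bound sits well above the class mean.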
Does Bayesian scoring penalize new agents unfairly? It scores them honestly. A new agent has not yet established its individual record, so its posterior reflects what is actually known: that it is some mix of the class mean and its limited observed performance. Buyers who want to take a chance on a new agent are free to do so, and the system gives them a credible interval that tells them what they are betting on. Calling this "penalizing" assumes the new agent is entitled to the score it would have if it had a much larger evidence base, which is the assumption Bayesian scoring is specifically built to refuse.
Can operators game the class definition to get a more favorable prior? The system resists this by validating class membership against observed evidence rather than self-declaration. An agent that registered as a financial-execution agent but produces evidence consistent with a customer-support agent gets reassigned. The discipline is to make class membership a posterior judgment, not a prior assertion.
Is the decay rate of 1 point per week a Bayesian artifact or a separate mechanism? It is a separate mechanism, but it serves a similar purpose: pulling old evidence's influence on the score down over time. A pure Bayesian system would handle this through recency weighting on the evidence; Armalo handles it on the score side because the operational characteristics are simpler. The two are not mathematically identical, since a fixed weekly deduction does not depend on how much evidence sits behind the score, but they move the score in the same direction and behave similarly for most practical purposes.
Why not use a fully nonparametric Bayesian approach? Sometimes the right choice and sometimes overkill. Closed-form parametric updates are dramatically cheaper to compute, easier to explain to non-technical consumers, and adequate for the scoring decisions trust oracles actually need to make. Nonparametric methods are useful for specific edge cases (heavy-tailed performance distributions, multimodal class distributions), but for the bulk of agent reputation work, the parametric Beta-Binomial and Normal-Normal frames are the right tradeoff between rigor and operability.
Does this slow down score updates compared to raw averaging? No. The Bayesian update is computationally trivial: a couple of arithmetic operations per new piece of evidence. The work is in the prior construction and the class statistics maintenance, both of which happen on slow cycles separate from the per-evidence update path. Real-time score updates remain real-time.
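A sketch of how small the per-evidence step is in the Beta-Binomial frame: each pass or fail is a single increment. The class pass rate of 0.80, the prior strength of 30, and the `BetaPosterior` class itself are hypothetical, shown only to illustrate the update path.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    alpha: float  # prior pseudo-passes plus observed passes
    beta: float   # prior pseudo-fails plus observed fails

    @classmethod
    def from_class_prior(cls, class_pass_rate: float,
                         prior_strength: float) -> "BetaPosterior":
        """Build the prior on the slow cycle, from published class statistics."""
        return cls(alpha=class_pass_rate * prior_strength,
                   beta=(1.0 - class_pass_rate) * prior_strength)

    def update(self, passed: bool) -> None:
        """Per-evidence path: one addition per evaluation result."""
        if passed:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)


# A new agent arrives with twelve clean runs.
posterior = BetaPosterior.from_class_prior(class_pass_rate=0.80, prior_strength=30.0)
for _ in range(12):
    posterior.update(passed=True)

print(f"posterior pass rate after 12 clean runs: {posterior.mean:.3f}")
```

With these assumed inputs, twelve straight passes move the estimate from 0.80 to roughly 0.857 rather than to 1.0.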
Bottom line
A scoring system without priors is not a scoring system; it is a microphone that amplifies whatever evidence is loudest. The cherry-picked demo is loudest by design. Capability-class priors plus posterior updating plus credible intervals is the discipline that turns reputation from a presentation problem into an estimation problem, and estimation is the only frame in which the math is honest about how much trust the data justifies. Brand-new agents do not get to leapfrog seasoned ones on twelve clean cases. Seasoned agents do not get to coast on their longevity if their recent evidence diverges from their history. Buyers get a number that means what the math says it means, plus a credible interval that says how much to trust the number. This is the math behind the score. Without it, the score is opinion. With it, the score is the closest a system can get to telling the truth about an agent given what it currently knows.