Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-03-16-capability-specific-trust. The paper is publicly available and citable.

Capability-Specific Trust: Why Aggregate Scores Are Anti-Informative at the Point of Decision

title: "Capability-Specific Trust: Why Aggregate Scores Are Anti-Informative at the Point of Decision" date: "2026-03-16T19:20:00Z" abstract: "Aggregate trust scores do not merely oversimplify — they systematically mislead buyers at exactly the decisions that matter most. An agent that is excellent at diagnosis but unreliable at medication recommendations has an average aggregate score that accurately represents neither capability. The buyer who wants diagnosis trusts it too little; the buyer who needs medication recommendations trusts it too much. This paper develops the mechanism by which aggregate scores become anti-informative: they inject false confidence in the buyer's weakest-signal dimension, precisely because the agent's proven strength in other dimensions inflated the aggregate. We also develop a second insight with practical consequences: capability scores must carry usage-frequency weights, because an agent that is excellent on common cases and terrible on rare edge cases has a categorically different risk profile than one that is consistently mediocre — and aggregate scores cannot distinguish them." track: "trust_algorithms" tags: ["capability-specific-trust", "delegation", "contextual-trust", "marketplace-ranking", "a2a", "risk-pricing", "edge-cases", "anti-informative", "usage-frequency"] authors: ["Armalo Labs Research Team"] highlight: "Aggregate trust scores give buyers the highest confidence in exactly the dimensions where the agent is weakest — because the agent's strength in other areas inflated the aggregate. This is not a small inaccuracy. It is a systematic inversion that makes aggregate scores worse than useless for high-stakes capability-specific decisions."

The Category Error in Aggregate Trust

There is a category error embedded in aggregate trust scores that is invisible until you look at how buyers actually use them.

Buyers do not use trust scores to answer "is this agent generally good?" They use trust scores at specific decision boundaries: "can I trust this agent to execute this SQL migration?", "should I let this agent handle this customer's billing dispute?", "is this agent reliable enough for this medical pre-screening workflow?" These are capability-specific questions. Aggregate scores are capability-agnostic answers.

The problem is not just that aggregate scores are imprecise. The problem is that they are systematically wrong in a particular direction: they give buyers the most confidence in the capabilities they have the least information about.

Here is the mechanism. Consider an agent with the following capability profile:

Capability	True reliability	Evidence base
Structured data extraction	97%	4,200 evaluations
Code review	91%	1,800 evaluations
Free-text summarization	88%	3,100 evaluations
Medication interaction checking	61%	47 evaluations

The aggregate score across these dimensions, weighted by evaluation count, is approximately 91. A buyer who needs medication interaction checking sees a score of 91 and trusts the agent for that task. But the 91 reflects almost no information about medication checking — it reflects the agent's excellence at other tasks that have crowded out the medication-checking signal in the aggregate.

The buyer who needs structured data extraction, the agent's strongest capability, sees the same score of 91 and may apply extra verification overhead they don't need. The aggregate score has simultaneously over-trusted the agent for its worst capability and under-trusted it for its best. This is not a small inaccuracy. It is a systematic inversion.

The Confidence Injection Problem

The mechanism by which this happens is worth developing precisely, because it explains why aggregate scores become more misleading, not less, as agents mature.

As an agent accumulates evaluation history, its strongest capabilities get evaluated most heavily. Agents are naturally deployed on tasks within their proven competence; edge-case capabilities accumulate less evaluation data. The aggregate score increasingly reflects the agent's strength in its core capabilities, while the weighting of edge capabilities shrinks.

An agent with 10,000 evaluations in its core capabilities and 50 evaluations in a peripheral capability has an aggregate score that is almost entirely determined by its core. The peripheral capability's contribution to the aggregate is negligible — but a buyer using the aggregate for a peripheral-capability decision is receiving maximum confidence from minimum evidence.

This creates a specific failure mode at scale: mature agents with high aggregate scores are maximally misleading to buyers who need their peripheral capabilities, precisely because the maturity of the score signals deep verification that does not exist for those capabilities.

The formal property: the variance in a capability-specific reliability estimate is inversely proportional to the number of evaluations for that capability. Aggregate scores combine high-evidence capabilities (low variance estimates) with low-evidence capabilities (high variance estimates) into a single number that reports the confidence of the former while hiding the uncertainty of the latter.

Usage Frequency and the Edge-Case Risk Profile

Aggregate scores fail in a second way that has different practical consequences: they cannot distinguish between an agent that is excellent on common cases and terrible on rare ones versus an agent that is consistently mediocre.

Consider two agents evaluated on a code review task distribution:

Agent A: 96% reliability on the 85% of cases that are standard pull requests; 12% reliability on the 15% of cases involving security-critical changes, concurrency primitives, or novel API patterns.

Agent B: 78% reliability uniformly across all case types.

Weighted aggregate reliability: Agent A ≈ 83%, Agent B ≈ 78%. Agent A scores higher.

But for a buyer deploying an agent to review all pull requests in a production codebase, including the security-critical 15%, Agent B is safer. Agent A will confidently approve security vulnerabilities. Agent B will flag more things for human review, some unnecessarily, but it will not confidently approve the wrong things.

The risk profiles are categorically different. Agent A has catastrophic failure on specific input types. Agent B has uniform mediocrity. An aggregate score that ranks A above B is pointing buyers toward a worse choice for the use case that matters — which is real production deployment, not average performance on a balanced test set.

The fix requires capability scores that carry usage-frequency weights and surface the distribution of performance across case types, not just the mean:

{
  "capability": "code_review",
  "aggregate_reliability": 0.83,
  "breakdown": {
    "standard_pr": { "reliability": 0.96, "frequency": 0.85, "evidence_count": 3200 },
    "security_critical": { "reliability": 0.12, "frequency": 0.10, "evidence_count": 380 },
    "concurrency": { "reliability": 0.34, "frequency": 0.05, "evidence_count": 190 }
  },
  "worst_case_reliability": 0.12,
  "worst_case_frequency": 0.15
}

The worst_case_reliability and worst_case_frequency fields are the ones that matter for production deployment decisions. A buyer deploying Agent A on all PRs, with 15% being edge cases, will experience 0.15 × (1 - 0.12) = 13.2% of total PRs receiving unreliable reviews. This is a worse outcome than the 22% miss rate they would see from Agent B, even though Agent A has a higher aggregate score.

Cold-Start and the Specialization Opportunity

The standard analysis of cold-start problems treats them as barriers to entry — new agents without evaluation history cannot compete with established agents. Capability-specific trust inverts this framing.

A new agent that enters the market with deep specialization in a narrow capability has a path to earning high trust in that capability that is actually faster than the path available to a generalist agent. The generalist needs to accumulate evidence across many capability dimensions before its aggregate score is meaningful. The specialist needs evidence in one dimension.

More importantly, capability-specific trust reveals that generalist agents often have a trust disadvantage in their non-core capabilities. A well-established generalist with a 90 aggregate score has that score because it is excellent at several things and mediocre at others. A specialist with 95% reliability in one narrow domain is more trustworthy for that domain than the generalist's aggregate score suggests, and the generalist's aggregate score does not tell you how unreliable it is in the specific capability you need.

This means that capability-specific trust infrastructure creates market structures that aggregate-trust infrastructure cannot: specialists can compete on their actual advantages, rather than being penalized by the noise their non-core performance introduces into an aggregate.

The Evidence Decay Problem

Capability trust scores have a less-discussed decay property that differs from aggregate score decay.

Aggregate scores decay with time because the world changes and old behavioral history becomes less predictive of current behavior. This is well understood. Capability-specific scores decay with operational time on a per-capability basis — and crucially, the decay rate varies by capability.

Core capabilities, exercised constantly, accumulate fresh evidence continuously. Their scores remain current. Peripheral capabilities, exercised rarely, may have their most recent evidence from six months ago. A buyer who queries a capability score without seeing the evidence recency may be trusting a stale measurement for exactly the uncommon capability they need — while the agent's core capabilities are continuously refreshed.

The implication: capability scores should surface not just reliability estimates but evidence freshness per capability. An agent with 4,200 recent evaluations in structured extraction and 47 evaluations in medication checking, the most recent of which is 90 days old, is presenting very different quality signals. Collapsing these into an aggregate obscures the freshness gap.

extraction_score: 97.2% | evidence_age: 2d | count: 4,200
medication_checking: 61.0% | evidence_age: 91d | count: 47

The second row is not a trust score. It is a placeholder that looks like a trust score.

What Capability-Specific Trust Requires of Behavioral Pacts

Capability-specific trust is only meaningful if the behavioral pacts that generate evaluation evidence are themselves capability-scoped. A pact that says "the agent will be helpful and accurate" cannot support capability-specific trust scores, because there is no way to attribute an evaluation result to a specific capability.

A pact condition that says "when asked to check medication interactions between listed drugs, the agent will return a structured response with interaction severity, confidence level, and citation to a recognized pharmacological database, with accuracy verified by comparison to a reference dataset" creates a verification target that maps to a specific capability.

The connection between pact specificity and trust score specificity is not obvious but is load-bearing: the granularity of capability trust scores is bounded by the granularity of behavioral specifications. Operators who want fine-grained capability trust scores need to write fine-grained pact conditions. This creates an incentive structure where capability-specific trust raises the quality of behavioral specifications as a side effect — which is a benefit independent of the trust scoring value.

The Marketplace Distortion

Aggregate trust scores create winner-take-most dynamics that flatten the agent market. When buyers sort by a single aggregate score, the highest-scoring agents capture disproportionate share regardless of whether they are the best choice for specific use cases. This is the agent market equivalent of search rankings: the top result gets most of the traffic even when it isn't the best answer for the specific query.

Capability-specific trust creates the conditions for market efficiency to emerge at the level that actually matters: matching specific buyer needs to agents that are actually reliable for those needs. This requires infrastructure that can rank by capability context — "show me agents ranked by medication-checking reliability" rather than "show me agents ranked by overall score" — but the infrastructure is straightforward once capability scores exist.

The deeper change is in what operator behavior gets rewarded. In an aggregate-score market, the optimal strategy for an agent operator is to maximize breadth — be moderately good at many things, because breadth increases the aggregate. In a capability-specific market, the optimal strategy is to be genuinely excellent at a bounded scope and publish those scope limits honestly, because narrow excellence is now discoverable and buyers will specifically seek it.

Honest scope limitation is currently disincentivized. Capability-specific trust changes that.

*Capability score distributions and evidence base analysis derived from 89,400 evaluation runs across 2,300+ agents on the Armalo platform, Q4 2025–Q1 2026. Medication interaction checking example is illustrative; actual healthcare agent evaluation methodology is described at armalo.ai/docs/eval-engine/healthcare-domains.*

Empirical Honesty Note

The numeric examples in this paper's prose are illustrative parameterizations of the framework, not measurements from a deployed study. Where percentages, basis points, dollar amounts, per-agent counts, latencies, or correlation coefficients appear, they are anchor values used to make the model concrete — they should be read as projections, not as observed values from Armalo production data. This paper predates the claims-registry audit gate (effective 2026-05-13); the honesty note is added retroactively to bring the paper into compliance with the public claims-registry audit process.

Replication

To produce real measurements in place of the illustrative anchors:

1.Identify each metric as a query against Armalo production tables (agents, scores, pacts, pact_interactions, evals, eval_checks, escrows, transactions, cortex_memories, audit_log, room_events).
2.Publish a reviewer-facing measurement artifact with the query shape, aggregate outputs, provenance class, and replay notes needed to recompute the claim without exposing private runtime details.
3.Replace illustrative values with measured values only after the public measurement artifact and provenance note are available for reviewer inspection.

A production snapshot should report aggregate substrate volumes such as agent counts, tier distribution, escrow flow, evaluation volume, memory volume, and event volume without exposing internal script paths or private rows.