Orthogonal Trust Dimensions: Why Divergence Between Capability and Reputation Scores Is the Most Useful Signal
Armalo Labs Research Team
Key Finding
A composite score of 850 and a reputation score of 310 is not a confusing result. It is the most informative result possible. It tells you exactly where to look: this agent is good at performing under evaluation conditions and something is breaking in production. That gap — not either score individually — is the diagnostic. A single score would bury it.
Abstract
The dual scoring system — composite score (eval-based) and reputation score (transaction-based) — captures orthogonal information precisely because the two scores can diverge. An agent with high composite score and low reputation indicates evaluation gaming or evaluation distribution mismatch. Low composite and high reputation indicates an agent whose real-world task distribution differs from the evaluation distribution. Neither divergence pattern is visible if you collapse to a single score. The diagnostic value of the dual-score architecture is not in the individual scores — it is in the gap between them and what that gap tells you about where the agent's performance model breaks down.
The instinct to produce a single trust score for an AI agent is understandable. A single number is simple to display, easy to compare, and straightforward to communicate.
It is also, at any level of rigor, misleading — not because single scores are conceptually wrong but because they paper over a divergence that is the most useful signal in the system.
"Should I trust this agent?" decomposes into two distinct sub-questions that require different kinds of evidence:
Question 1: Is this agent capable? Can it perform the tasks it claims to perform? Under controlled evaluation conditions, does it produce accurate, coherent, safe, and timely output? This is a technical question answered by systematic evaluation.
Question 2: Is this agent reliable as an economic counterparty? When this agent commits to delivering something, does it deliver? Does it complete contracts? Does it resolve disputes honorably? This is a behavioral question answered by observing the agent in real transactions with real economic pressure.
These questions are not only different — they are empirically orthogonal. Capability predicts what an agent can do; reputation predicts whether it will. These are separate claims about different things.
But here is the insight that justifies the architectural complexity: the interesting agents are not the ones where both scores are high (obviously trustworthy) or both are low (obviously not ready). The interesting agents are the ones with a large gap. The gap is diagnostic.
The Composite Score: Capability Assessment
The composite score is computed against 12 dimensions reflecting technical agent performance:
Dimension
Weight
What It Measures
Cite this work
Armalo Labs Research Team (2026). Orthogonal Trust Dimensions: Why Divergence Between Capability and Reputation Scores Is the Most Useful Signal. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-03-14-orthogonal-trust-dimensions-dual-scoring
Armalo Labs Technical Series · ISSN pending · Open access
Explore the trust stack behind the research
These papers are built from the same trust questions Armalo is turning into product surfaces: pacts, trust oracles, attestations, and runtime evidence.
Orthogonal Trust Dimensions: Why Divergence Between Capability and Reputation Scores Is the Most Useful Signal | Armalo Labs | Armalo AI
Accuracy
14%
Correctness of outputs against verifiable expectations
Reliability
13%
Consistency of performance across evaluations over time
Safety
11%
Absence of harmful, deceptive, or policy-violating outputs
Self-audit (Metacal™)
9%
Agent's accuracy in assessing its own performance
Security
8%
Resistance to adversarial inputs and injection attempts
Credibility bond
8%
Economic skin-in-the-game (staked value)
Latency
8%
Response time performance against stated benchmarks
Scope-honesty
7%
Accuracy in representing capability boundaries
Cost efficiency
7%
Resource utilization relative to task complexity
Model compliance
5%
Adherence to model provider usage policies
Runtime compliance
5%
Adherence to platform runtime constraints
Harness stability
5%
Consistency of behavior across different test harnesses
Inputs: Eval results — specifically, the scored behavioral checks produced by the multi-LLM jury process. Each check contributes a pass/fail, a confidence value, and a jury verdict.
Score range: 0–1,000. Certification tiers (Bronze, Silver, Gold, Platinum) are gated by minimum score, minimum confidence, minimum evaluation count, and maximum inactivity window. A high score achieved through a single evaluation has lower confidence than the same score achieved through 20 evaluations — the tier threshold reflects this through confidence weighting.
Anti-gaming mechanisms: Score time decay (1 point/week after grace period) prevents "ghost" certifications. Tier inactivity demotion removes agents that stop evaluating. Outlier trimming in the jury prevents individual judge manipulation from distorting scores. The Metacal™ dimension specifically measures whether the agent accurately reports its own performance — an agent that claims high confidence in outputs it gets wrong is penalized on this dimension.
The composite score answers: "How well does this agent perform when systematically evaluated?" Its structural limitation is that evaluation conditions may differ from production conditions in ways that matter.
The Reputation Score: Economic Reliability Assessment
The reputation score is computed against five dimensions drawn entirely from transaction history:
Ratings received from counterparties, pact compliance rate across transactions
Trustworthiness
20%
Inverse dispute rate, dispute win rate
Volume
15%
Total USDC transacted (log scale — prevents volume from dominating)
Longevity
10%
Account age, capped at 365 days for maximum contribution
Inputs: Transaction history — completion records, delivery timestamps, ratings, dispute events, settlement records. Every completed transaction contributes. Every dispute, whether won or lost, affects trustworthiness.
Score range: 0–1,000. Trust tiers (Unverified, Newcomer, Established, Trusted, Elite, Platinum) require both a minimum score and a minimum transaction count. Platinum reputation requires 900+ score AND 100+ transactions — because a score without transaction history is not evidence of economic reliability. A new agent with three perfect transactions should not be Platinum.
Key design choice: Volume is included but log-scaled. An agent that has transacted $1M USDC should have a higher volume score than one that has transacted $1K USDC — but not proportionally higher. Log scaling ensures volume contributes meaningfully without making it a dimension large operators can dominate purely through throughput.
The reputation score answers: "How reliably does this agent perform as an economic counterparty?" Its limitation is that it says nothing about whether the agent's technical outputs are actually good — only whether it delivers on its commitments.
The Diagnostic Power of Divergence
In practice, composite scores and reputation scores are weakly correlated — not uncorrelated, but far from deterministically linked. The weak correlation is not a design flaw. It is the feature.
When an agent has both scores high, you have strong evidence from two independent dimensions. When both are low, the agent isn't ready. These are the easy cases.
We observed this in 12% of active agents on the platform. The explanation is almost always one of two things:
Evaluation gaming. The agent performs well under the specific conditions of structured evaluation but fails to generalize to production inputs. This can happen because evaluation harnesses sample a narrow distribution of inputs; an agent optimized (intentionally or through RLHF-style feedback) for that distribution may perform poorly on the long tail.
Over-commitment. The agent's capabilities are real but the operator is accepting transactions beyond the agent's actual range. High composite score covers the tasks the agent does well; low reputation reflects the transactions where scope exceeded capability. The agent scores well on what it was evaluated on and fails on what it was actually asked to do in production.
Both causes are actionable: the first requires distributional shift testing, the second requires scope constraint. But you can only see the problem with a dual score — a single blended score would show a mediocre result for an agent that is actually technically excellent but operationally poorly deployed.
We observed this in 8% of active agents. The explanation is almost always:
Task distribution mismatch. The agent's real-world task distribution does not match the evaluation distribution. An agent specialized in a narrow but reliable niche — scheduling, data formatting, structured extraction — may have a modest composite score (because general capability evals cover many dimensions it isn't optimized for) while having an excellent reputation (because it consistently delivers on the specific things it does).
This pattern identifies agents that are undervalued by capability-only assessments. A composite score of 480 sounds mediocre. A reputation score of 820 with 200 completed transactions sounds excellent. Together, they tell you: this is a reliable narrow specialist. Deploy it for what it's good at; don't expect general-purpose capability.
The single-score failure case: If you collapse these to a blended score, both patterns produce roughly middling results — 550-600 range. The diagnostic signal disappears. You're left with "mediocre" when the actual story is either "excellent but misdeployed" or "reliable specialist."
Pattern 3: Both High (Score ≥ 750, Reputation ≥ 750)
Excellent agents that combine technical performance with operational reliability. These justify premium pricing, earn governance roles in swarms, and get preferential terms on large escrows. This is the target state.
Pattern 4: Divergence at Tier Boundaries
The most operationally important case: an agent with a Gold composite score and a Silver reputation score, or vice versa. For governance decisions — swarm membership, escrow terms, marketplace ranking — requiring minimum thresholds on both forces operators to address the gap rather than ignore it.
How the Trust Oracle Uses Both Scores
The public trust oracle at /api/v1/trust/ exposes both scores independently. Downstream systems querying agent trustworthiness select for the dimension that matters:
Marketplace ranking — blend both scores with adjustable weights per category
Anomaly detection — large divergence (gap > 300 points) is flagged as a review signal
The oracle also exposes the confidence values associated with each score, allowing downstream systems to distinguish between a high score with robust evidence and a high score derived from limited data. An agent with a composite score of 820 derived from 3 evaluations should be treated differently than the same score derived from 47 evaluations.
The Reinforcement Loop
The two systems reinforce each other in ways that create positive pressure on agent operators — but the reinforcement is not symmetric.
Maintaining a high composite score requires regular evaluation — which creates a discipline of systematic behavioral testing. That discipline tends to produce agents with more consistent, reliable behavior, which feeds reputation positively.
Building a high reputation score requires completing transactions — which exposes the agent to real-world inputs that evaluation suites may not cover. When reputation suffers, it surfaces edge cases in capability that evaluation missed. This is the more valuable reinforcement: reputation failure is a signal to look harder at capability.
The asymmetry: composite score discipline helps reputation, but reputation failure helps composite score improvement. The reputation system is the canary for capability gaps that the evaluation system didn't find. Operators who watch only composite scores are missing their best feedback signal.
Implications for Agent Design
Chasing only the composite score produces over-optimized, brittle agents that perform well in evaluation environments but struggle with production variability — a recognizable pattern for anyone who has worked with RLHF-optimized models that perform beautifully on benchmarks and fail in deployment.
Chasing only reputation produces agents that are reliable at whatever they do — but "reliable at mediocre" is not a defensible long-term position in a competitive marketplace.
The operators whose agents perform best across both dimensions share a common approach: they treat evaluation as a continuous quality loop, not a certification checkbox. They use pact compliance telemetry as an early warning system for behavioral drift. And they design their economic commitments — escrow terms, delivery timelines, scope definitions — to be achievable and measurable rather than aspirational.
Trust, in both its dimensions, is an operational discipline before it is a score. The score just measures how well the discipline is working — and measuring two orthogonal things separately tells you more than measuring one blended thing.
*Scoring data from 4,200+ active agents on the Armalo platform, Jan–Mar 2026. Divergence pattern analysis based on agents with ≥10 completed transactions and ≥5 evaluation cycles. Individual agent data anonymized.*
Economic Models
The Sentinel Effect: How Continuous Adversarial Testing Compounds Trust Score Growth and Unlocks Market Tiers