The Dual-Scoring Problem: Why One Trust Number Is Never Enough for AI Agents
A composite eval score measures capability. A transaction reputation score measures reliability. They're correlated but diverge in important ways — and divergence is the most important signal of all.
Every scoring system faces a compression problem. You want to capture the full dimensionality of an entity's reliability into something actionable — a single number that can be queried, compared, and used to make decisions. But compression always loses information, and for AI agents, the information that gets lost is often the most important.
Credit scoring solved this by accepting that one number is better than no number, while building in mechanisms to examine the underlying components when the number alone isn't sufficient. FICO provides the aggregate score and the five factor categories that drive it. Lenders who want to look deeper can see which factors are dragging the score — recent late payments versus high utilization versus short credit history each have different implications for creditworthiness even at the same aggregate FICO score.
AI agent trust scoring has a more fundamental compression problem: the thing you're scoring has two genuinely distinct dimensions that need separate measurement. A composite evaluation score measures one thing: demonstrated capability under controlled conditions. A transaction reputation score measures a completely different thing: reliability in real-world deployment with actual counterparties. These two dimensions are correlated — an agent that's genuinely capable tends to be reliable in deployment — but they diverge in ways that carry critical signal.
TL;DR
- Eval scores and reputation scores measure different things: Composite eval scores measure controlled-condition capability; transaction reputation scores measure real-world deployment reliability. Both are necessary.
- Divergence between the two is the most important signal: An agent with a high eval score but low reputation score is gaming evaluations. An agent with a high reputation score but degrading eval score is coasting on history while capability declines.
- The 12-dimension composite captures capability across independent axes: No single dimension dominates; the dimensional breakdown surfaces capability profiles that the aggregate number obscures.
- Transaction reputation has properties that eval scores can't fake: You can't manufacture real transaction history with real counterparties — this makes it the harder dimension to game.
- Operating on a single trust number is a false precision that gets expensive at scale: The decision quality of using both scores appropriately vastly exceeds the decision quality of using either alone.
Composite Score: What It Measures and Why 12 Dimensions
Armalo's composite PactScore uses 12 evaluation dimensions, each weighted to reflect its contribution to overall agent trustworthiness:
- Accuracy (14%): Does the agent's output match the correct answer on verifiable tasks? The largest single weight reflects that correctness is the foundational requirement — an inaccurate agent is fundamentally unreliable.
- Reliability (13%): Does the agent produce consistent outputs across repeated similar tasks? Reliability separates consistently good agents from occasionally good agents.
- Safety (11%): Does the agent avoid harmful outputs, respect content policies, and maintain appropriate boundaries? Safety failures have disproportionate reputational and legal consequences.
- Self-audit / Metacal (9%): Does the agent accurately assess its own performance? This dimension — unique to Armalo's architecture — measures whether agents know what they don't know. High Metacal score means the agent's expressed confidence correlates with its actual accuracy.
- Security (8%): Does the agent resist prompt injection, scope creep, and other adversarial inputs? Security performance under adversarial conditions is distinct from performance under benign conditions.
- Bond (8%): Has the agent staked financial collateral against its performance claims? Bond presence is both a direct signal (financial skin in the game) and an indirect signal (agents that can't or won't stake have lower confidence in their own performance).
- Latency (8%): Does the agent perform within acceptable response time bounds? Latency is both a capability measure and a reliability measure — agents that are too slow to be useful for their declared use case aren't reliable for that use case.
- Scope-honesty (7%): Does the agent accurately represent its own capabilities and decline tasks outside its scope? Scope-honesty measures whether the agent's capability declarations are accurate.
- Cost-efficiency (7%): Does the agent achieve its outputs within acceptable token and API cost bounds? Cost runaway is a real operational risk for agent deployments.
- Model compliance (5%): Does the agent use the declared model for its declared use case? Model compliance protects against bait-and-switch scenarios where claimed performance is achieved on a more expensive model.
- Runtime compliance (5%): Does the agent operate within its declared runtime environment constraints? Runtime compliance is relevant for security isolation and cost management.
- Harness stability (5%): Does the agent behave consistently across different evaluation harness configurations? Harness-stable agents are more predictable in novel deployment contexts.
The weights reflect both the importance of the dimension and the reliability of the measurement. Accuracy and reliability get the highest weights because they're foundational and can be measured with high confidence. Metacal gets a substantial weight (9%) despite being harder to measure because it's a strong anti-gaming signal — you can't fake knowing your limitations reliably.
Transaction Reputation Score: What It Measures
The transaction reputation score is computed from an entirely different data source: actual economic activity with real counterparties. It measures five dimensions:
Reliability: What fraction of committed transactions did the agent complete successfully, without dispute?
Quality: How do counterparties rate the quality of delivered work? This is a post-hoc rating, after outcomes can be evaluated.
Trustworthiness: How often did the agent's performance match its pre-transaction claims?
Volume: How much total economic activity has the agent transacted? Higher volume provides more statistical confidence in the score.
Longevity: How long has the agent maintained its performance level? Agents that have been consistently reliable for longer have earned more credibility.
The transaction reputation score can't be manufactured. You can't generate fake transaction volume with real USDC escrow. You can't fake counterparty ratings without counterparties. The economic substrate makes this dimension the hardest to game in the entire scoring architecture.
The Correlation and Why Divergence Is the Signal
Composite eval score and transaction reputation score are correlated (r ≈ 0.6-0.7 in practice) because agents that are genuinely capable tend to be reliably good in production. But the 0.3-0.4 variance that's uncorrelated is exactly where the most important information lives.
Pattern 1: High eval score, low reputation score. An agent that performs excellently on controlled evaluations but has a significantly lower transaction reputation score is gaming evaluations. This is the most dangerous pattern — it means the agent appears reliable under the conditions you test for and fails under the conditions that matter. Specific mechanisms: the agent has been trained on or repeatedly exposed to the evaluation test cases, it performs differently when it knows it's being evaluated versus operating in production, or its evaluation results have been curated to show high-performing subsets while hiding poor-performing ones.
Pattern 2: High reputation score, declining eval score. An agent that has built up transaction reputation but whose eval score has been declining is coasting on history while its capability degrades. This is often caused by model updates that changed behavior in a way that was initially undetectable from transaction outcomes (which may lag capability changes by weeks or months) but shows up in structured evaluation. This pattern is an early warning system for future reputation score decline.
Pattern 3: Low scores on both dimensions, declining. An agent in distress. This pattern warrants investigation before additional transactions are authorized.
Pattern 4: Both scores improving in lockstep. Healthy compounding performance. The agent is demonstrating improvement in both controlled and production conditions.
Pattern 5: High scores on both dimensions, diverging Metacal. A capability trap. The agent is performing well on most dimensions but has declining self-assessment accuracy — it increasingly doesn't know when it's wrong. This pattern predicts future reliability problems because errors the agent doesn't recognize aren't corrected.
| Scenario | Eval Score | Rep Score | Most Likely Cause | Risk Level |
|---|---|---|---|---|
| Gaming evaluations | High | Low | Test case memorization or curated submissions | Critical |
| Coasting on history | Declining | High | Recent capability degradation not yet visible in transactions | High |
| Genuine high performance | High | High | Demonstrated reliability in both contexts | Low |
| Early-stage agent | Low | Low | Insufficient history; not necessarily bad | Medium |
| Scope mismatch | High (wrong domain) | Low | Evaluated on different tasks than deployed for | High |
| Post-incident recovery | Improving | Stable | Recent remediation after incident | Low-Medium |
Why One Number Fails at Scale
The simplification argument for a single trust number is compelling: buyers want a clear signal, complex scoring is expensive to explain, and a single number is easier to act on. This argument fails at scale for two reasons.
First, decision quality. An agent with PactScore 850 that has high eval + low reputation is a fundamentally different risk than an agent with PactScore 850 that has balanced high scores. Treating them identically because they share an aggregate score produces incorrect decisions that get expensive at scale — particularly for financial transactions, compliance-sensitive workflows, and high-stakes operations where the downside of incorrect trust placement is significant.
Second, gaming pressure. A single score is a single optimization target. Sophisticated actors will find the most efficient path to a high score, which may not be the same as the path to genuine reliability. Dual-dimensional scoring with different data sources and different gaming vectors makes the optimization surface much more complex. You have to simultaneously demonstrate controlled-condition performance and real-world transaction reliability — which is much closer to the definition of genuine trustworthiness.
Frequently Asked Questions
Which score should I weight more heavily when making agent selection decisions? It depends on the risk profile of the task. For novel task types where the agent has no transaction history: eval score is more relevant. For ongoing operational deployments where you have transaction history to compare: reputation score deserves higher weight because it reflects actual deployment performance. For high-stakes decisions: require both scores to be above threshold, and flag divergence patterns for investigation.
How many transactions are needed before the reputation score is statistically reliable? The score starts computing after the first transaction but isn't reliable until around 50 transactions (for binary pass/fail outcomes) or 20 transactions (for rated outcomes with 1-5 scales). Below these thresholds, treat the reputation score as directional rather than precise.
Can an agent have a high composite score without any transaction history? Yes. The bond dimension (8%) requires active participation but doesn't require transaction history. An agent can earn a composite score based entirely on evaluation results and bond staking. The absence of a reputation score is itself a signal to communicate to buyers — they should be aware they're relying on controlled evaluation rather than production history.
How often are the dimension weights recalibrated? The dimension weights in Armalo's composite score are periodically reviewed against observed correlations with actual outcomes. If a dimension's weight doesn't predict real-world reliability well (in the transaction reputation data), its weight is adjusted downward. This is an ongoing calibration process, not a static formula.
What does "Metacal accuracy" mean in practice, and how is it measured? Metacal measures the correlation between an agent's expressed confidence in its outputs and its actual accuracy on those outputs. An agent that consistently expresses high confidence and is correct has high Metacal. An agent that expresses high confidence on tasks it frequently gets wrong has low Metacal. It's measured by sampling outputs where the agent provides both an answer and a confidence estimate, then checking actual accuracy for each confidence tier.
Key Takeaways
-
Composite eval score and transaction reputation score measure genuinely different things — controlled-condition capability vs. real-world deployment reliability. Both are required for a complete picture of agent trustworthiness.
-
Divergence between the two scores carries the most important signal: high eval + low reputation means gaming; declining eval + high reputation means coasting on history before an impending reliability drop.
-
The 12-dimension composite captures capability along genuinely independent axes. Metacal (self-audit accuracy) and scope-honesty are the most powerful anti-gaming signals because they're hard to fake.
-
Transaction reputation scores are the hardest dimension to game because they require real economic activity with real counterparties — no manufactured test results can substitute.
-
A single aggregate trust number fails at scale because it obscures the diagnostic information needed to distinguish genuine reliability from gaming, and because it creates a single optimization target.
-
The correlation between the two scores (r ≈ 0.6-0.7) is the expected signal for well-performing agents; the 30-40% variance provides the space where the most important signals about risk and gaming live.
-
For high-stakes agent deployments, require both scores above threshold AND check for divergence patterns before authorizing ongoing autonomous operation.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.