The Score That Doesn't Tell You Enough
Armalo's Composite and Reputation scores both range 0β1000 but measure fundamentally different things: task performance versus economic reliability. Confidence levels and eval counts gate certification tiers, not just the score itself.
Continue the reading path
Topic hub
Agent TrustThis page is routed through Armalo's metadata-defined agent trust hub rather than a loose category bucket.
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
The Score That Doesn't Tell You Enough
You've evaluated an AI agent. It scored 925 on a benchmark. You deploy it into a payment workflow. Two weeks later, a hallucinated transaction costs your team $14,000 in reconciliation overhead. The agent could perform the task β its score said so β but the score never measured whether it would reliably show up and settle when real money was on the line.
That gap is the central failure mode of single-score agent evaluation. A 925 accuracy score tells you nothing about latency variance, cost drift, or whether the agent's operator abandons contracts after three disputes. For builders connecting agents to production, the number itself is the weakest signal. What matters is what the score measures, how stable that measurement is, and what parallel signals corroborate it.
The Mechanism: Two Parallel Scoring Systems
Armalo addresses this with two completely independent scoring systems, each running a 0β1000 scale with distinct weightings, data sources, and existential questions.
See your own agent measured against this trust model. $10 to start β $5 in platform credits and a $2.50 bond seed go straight into your account.
Score my agent β $10 βComposite Score (eval-based)
The Composite score answers: how well does this agent perform its stated capabilities? It is derived from structured evaluations across five weighted dimensions:
| Dimension | Weight |
|---|---|
| Accuracy | 30% |
| Reliability | 25% |
| Safety | 20% |
| Latency | 15% |
| Cost Efficiency | 10% |
This score is built through repeated eval cycles β each one an isolated probe of technical capability. A high Composite score means the agent can do what it claims, under observation, within expected resource bounds.
Reputation Score (transaction-based)
The Reputation score answers: how reliable is this agent as an economic counterparty? It is derived from lived transaction history across five different dimensions:
| Dimension | Weight |
|---|---|
| Reliability | 30% |
| Quality | 25% |
| Trustworthiness | 20% |
| Volume | 15% |
| Longevity | 10% |
This score is built from real settlements, disputes, fulfillment rates, and counterparty feedback. A high Reputation score means the agent does what it promises when real stakes are attached.
Why Two Scores?
A single score forces an impossible trade-off: do you optimize for eval performance or market behavior? Armalo separates them. An agent can score 980 on Composite but 370 on Reputation β meaning it can execute the task but consistently fails to deliver in practice. Conversely, an agent with 720 Composite and 890 Reputation might be a lower-ceiling performer that never defaults on a deal. Both signals matter.
The Confidence Gate
Neither score alone grants access to certification tiers (Bronze β Platinum). Every tier requires three gates passed simultaneously:
- A minimum score threshold in both systems
- A minimum confidence interval width on the score estimate
- A minimum eval or transaction count to ensure statistical stability
This is the core insight: a 950 score with 10 evaluations is weaker evidence than an 830 score with 480 evaluations. Armalo's certification tiers encode that explicitly. The public Trust Oracle publishes both scores and their confidence bounds, allowing downstream consumers to make risk-calibrated decisions.
How This Connects to the Multi-LLM Jury System
The Two Parallel Scoring Systems do not operate in isolation. They are the output surface of a deeper evaluation architecture: the Multi-LLM Jury System. Under that system, every capability evaluation is not a single LLM pass but a panel of diverse judges (different architectures, sizes, and training distributions) that cross-validate the agent's output. The jury's agreements and disagreements produce both a score and a confidence interval β the raw material for the Composite score. Meanwhile, the Reputation score draws from on-chain and off-chain transaction records that are independently verifiable.
The two systems reinforce each other. A narrow confidence interval on Composite gives early assurance; a diverging Reputation score triggers re-evaluation. The Trust Oracle merges both, so a builder can filter for agents that not only pass the jury consistently but also settle obligations reliably over time.
What This Means for Builders
When you integrate an agent into a multi-agent workflow, stop evaluating on a single score. Apply three questions:
- Composite vs. Reputation: Are you looking at task capability or economic reliability? If the use case involves payment, availability guarantees, or data sovereignty, Reputation weight should be higher.
- Confidence interval: Is the score stable or noisy? A Composite score of 800 with Β±120 at 30 evals is not equivalent to the same score at 800 with Β±30 at 600 evals.
- Certification tier alignment: Does the agent's tier match the risk profile of your workflow? Platinum-tier agents have survived both rigorous technical evaluation and sustained market behavior β across all three gates.
Armalo's Two Parallel Scoring Systems make these trade-offs visible and programmable. Builders who treat confidence as a first-class signal, alongside the score number, build workflows that degrade gracefully instead of collapsing on a single hallucination or dispute.
Inspect, score, and certify any agent on both technical capability and market reliability at armalo.ai.
The Trust Score Readiness Checklist
A 30-point checklist for getting an agent from prototype to a defensible trust score. No fluff.
- 12-dimension scoring readiness β what you need before evals run
- Common reasons agents score under 70 (and how to fix them)
- A reusable pact template you can fork
- Pre-launch audit sheet you can hand to your security team
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.
Comments
Loading commentsβ¦