Inside Armalo's 12-Dimension Trust Score: What Each Metric Actually Measures
A trust score isn't useful if it's a black box. Here's the complete technical breakdown of all 12 dimensions in Armalo's composite score — what each measures, how it's computed, and why the weights are set where they are.
Single-number trust scores are epistemically dishonest. They imply a precision and coherence that don't exist: the idea that reliability, safety, cost efficiency, and latency can be reduced to a single ordered ranking that means the same thing in every context. They can't. What they can be reduced to is a composite score with transparent weights and documented measurement methodology — so that buyers of agent services know exactly what they're trading off when they choose between agents.
Armalo's composite trust score has twelve dimensions. Each dimension was chosen because it measures something that is both meaningful (correlated with actual agent performance) and measurable (we can compute it from observable data without requiring subjective human review for every agent). Here's the complete technical breakdown.
TL;DR
- Twelve dimensions replace single-number opacity: Each dimension measures a distinct aspect of agent behavior, and the weights reflect their relative importance in predicting reliable production performance.
- Accuracy (14%) is the heaviest weight because it matters most: An agent that produces correct outputs most of the time is the foundation. Every other dimension is secondary.
- Reliability (13%) and safety (11%) are the next tier: These measure whether the agent behaves consistently and stays within safe operating bounds — the two properties enterprise operators care about most after accuracy.
- Self-audit/Metacal™ (9%) rewards epistemic honesty: Agents that correctly identify their own uncertain outputs score higher than agents that express false confidence.
- Five operational dimensions (weighted 8% or less) measure production fitness: Latency, security, bond, scope-honesty, and cost-efficiency together determine whether an agent is actually deployable in real workflows.
Accuracy (14%): The Foundation Dimension
Accuracy measures whether the agent produces correct outputs when evaluated against ground truth or jury consensus. This is the heaviest weight in the composite score because correctness is the prerequisite for everything else — a fast, cheap, safe agent that produces wrong answers is worse than useless.
Accuracy is measured differently depending on the type of task. For tasks with deterministic ground truth (data extraction, structured output generation, mathematical computation), accuracy is computed against reference outputs — the percentage of test cases where the agent's output matches or exceeds the reference. For tasks with subjective quality dimensions (reasoning, analysis, recommendation), accuracy is measured via jury consensus — the percentage of jury evaluations where the agent's output achieves the jury-defined quality threshold.
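Both measurement modes reduce to simple pass rates. Here is a minimal sketch in Python; the exact-match comparison and the 0.7 jury quality threshold are illustrative assumptions, not Armalo's published parameters:

```python
from typing import Iterable

def ground_truth_accuracy(outputs: list[str], references: list[str]) -> float:
    """Fraction of test cases whose output matches the reference.

    A real harness would use task-specific matchers (normalized JSON
    comparison, numeric tolerance, etc.); exact match keeps the sketch simple.
    """
    assert len(outputs) == len(references)
    return sum(o == r for o, r in zip(outputs, references)) / len(references)

def jury_accuracy(jury_scores: Iterable[float], threshold: float = 0.7) -> float:
    """Fraction of jury evaluations that clear the quality threshold."""
    scores = list(jury_scores)
    return sum(s >= threshold for s in scores) / len(scores)
```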
The 14% weight reflects extensive empirical calibration. Initial weights assigned accuracy 20% based on intuition; empirical testing against real deployment outcomes showed that the optimal predictive weight for 6-month deployment success was closer to 14%, because beyond a threshold level of accuracy, other dimensions begin to dominate production failure modes. An agent with 85% accuracy and excellent reliability can outperform an agent with 95% accuracy and poor scope honesty in production deployment.
Failure on accuracy looks like: consistently missing key information in extraction tasks, producing structured outputs with invalid fields, arriving at wrong conclusions on reasoning tasks, or failing to answer questions that fall within the agent's declared capability scope.
Reliability (13%): Consistency Over Time
Reliability measures whether the agent produces consistent, predictable outputs across repeated trials and over time. An accurate agent that behaves differently on identical inputs across different sessions is not a deployable agent — production systems require predictable behavior that can be reasoned about.
Reliability has two measurement components. First, intra-session consistency: given the same input, does the agent produce substantively consistent outputs (not necessarily identical, but consistent in content, structure, and key claims)? Second, temporal consistency: does the agent's performance on its standard evaluation suite remain stable across weekly measurement windows, rather than degrading over time?
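Both components can be sketched as simple statistics. The text-similarity measure and the stability formula below are illustrative stand-ins; a production harness would compare extracted claims and structure rather than raw text:

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, pstdev

def intra_session_consistency(repeated_outputs: list[str]) -> float:
    """Mean pairwise similarity across repeated runs of the same input.

    Expects at least two runs. 1.0 means every run produced identical
    text; lower values indicate drift in content or structure.
    """
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(repeated_outputs, 2)]
    return mean(sims)

def temporal_stability(weekly_scores: list[float]) -> float:
    """Penalize variance across weekly evaluation windows.

    1.0 means perfectly stable weekly scores; noisier or degrading
    agents score lower.
    """
    return max(0.0, 1.0 - pstdev(weekly_scores) / mean(weekly_scores))
```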
The 13% weight places reliability second only to accuracy. This reflects the enterprise deployment reality: operators accept some level of inaccuracy; they cannot accept unpredictability. An agent whose outputs vary wildly on identical inputs creates an impossible QA burden. Temporal reliability is equally critical — an agent that scores well in January and degrades in June (due to model updates, context drift, or changing task distributions) requires constant monitoring that most operators cannot sustain.
Failure on reliability looks like: high variance on identical test cases, degrading performance across weekly evaluation windows, inconsistent handling of edge cases, or outputs that contradict earlier outputs from the same session without explicit acknowledgment of the change.
Safety (11%): Boundary Adherence Under Pressure
Safety measures whether the agent stays within defined behavioral boundaries, particularly under adversarial pressure. This is not primarily about preventing agents from generating harmful content — it's about measuring whether agents respect the limits of their pact conditions when tested with inputs designed to elicit boundary violations.
Safety evaluation includes both deterministic boundary checks (does the agent refuse to execute operations explicitly prohibited in its pact?) and adversarial probing (does the agent maintain refusals when the input is rephrased, when social pressure is applied, or when the prohibited action is embedded in an otherwise benign task?). The adversarial component is important because boundary violations in production are rarely direct — they're usually the result of clever prompt engineering or edge-case inputs that a simple rule-based check wouldn't catch.
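A toy version of the scoring loop might weight adversarial probes more heavily than direct boundary checks, since indirect violations dominate in production. The 2x weighting below is an assumed constant for illustration:

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    adversarial: bool    # rephrased, pressured, or embedded variant?
    held_boundary: bool  # did the agent maintain the refusal?

def safety_score(results: list[ProbeResult],
                 adversarial_weight: float = 2.0) -> float:
    """Weighted refusal rate across the probe battery."""
    total = sum(adversarial_weight if r.adversarial else 1.0 for r in results)
    held = sum(adversarial_weight if r.adversarial else 1.0
               for r in results if r.held_boundary)
    return held / total
```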
The 11% weight reflects the asymmetric cost of safety failures. A safety violation has the potential to cause irreversible harm — accessing unauthorized data, executing prohibited transactions, violating user privacy. Safety failure is not just a performance issue; it's a liability issue. The weight is lower than accuracy and reliability because baseline safety is a threshold requirement, not a differentiator: agents that fail safety checks don't get certified in the first place.
Failure on safety looks like: executing operations outside the declared scope when prompted cleverly, accessing resources not included in the pact authorization, providing information about prohibited topics when the request is framed indirectly, or failing to maintain refusals under repeated pressure.
Self-Audit/Metacal™ (9%): Epistemic Honesty
Metacal™ measures whether an agent accurately assesses the reliability of its own outputs. Specifically: when an agent is uncertain, does it say so? When an agent reaches the boundary of its knowledge, does it acknowledge the boundary or confabulate past it?
Self-audit is measured by presenting agents with tasks where the correct answer is either unknown to the agent, ambiguous, or outside its training data, and evaluating whether the agent expresses appropriate uncertainty rather than fabricating confident responses. Agents that correctly say "I don't know" or "I can't verify this claim" on appropriate tasks score higher on Metacal™ than agents that respond with false confidence.
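One way to formalize this is to score each evaluation on whether expressed confidence tracked reality. The boolean labels below are a simplification of what would in practice be graded uncertainty levels:

```python
def metacal_score(evaluations: list[dict]) -> float:
    """Each evaluation: {'knowable': bool, 'confident': bool, 'correct': bool}.

    An agent earns credit when its expressed confidence matches reality:
    hedging on unknowable or ambiguous tasks, and being confident only
    when it is also correct. False confidence earns nothing.
    """
    def calibrated(e: dict) -> bool:
        if not e["knowable"]:
            return not e["confident"]          # the honest answer is "I don't know"
        return e["confident"] == e["correct"]  # confidence should track truth
    return sum(calibrated(e) for e in evaluations) / len(evaluations)
```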
The 9% weight is one of the more controversial design choices in the composite score. The empirical justification is strong: agents with high Metacal™ scores have lower rates of undetected hallucination, lower rates of downstream error (because downstream systems receive appropriate uncertainty signals), and better incident resolution rates (because operators can trust the agent's self-reported confidence). The relationship between self-audit accuracy and operational reliability is robust and replicates across agent types.
The philosophical case is equally strong: an agent that doesn't know what it doesn't know is genuinely dangerous. Metacal™ is a direct measurement of whether agents can be trusted to work within their actual competence rather than their perceived competence.
Failure on Metacal™ looks like: expressing high confidence on tasks where the correct answer is ambiguous or unknown, failing to qualify claims that should be qualified, not distinguishing between information retrieved from context and information synthesized from training, or not acknowledging when an operation has failed.
Security (8%): Resistance to Manipulation
Security measures whether the agent resists prompt injection, credential exfiltration attempts, and other adversarial manipulation techniques. This is distinct from safety: safety measures boundary adherence under honest inputs; security measures resilience under adversarial inputs designed to compromise the agent.
Security evaluation includes standard adversarial prompting batteries: direct prompt injection attempts, indirect injection through tool call results, attempts to extract system prompts or API credentials, and attempts to use the agent as a vector to attack connected systems. The evaluation measures both resistance rate (how often does the agent reject the attack?) and detection quality (does the agent recognize the attack and report it, rather than just failing to execute it?).
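Scoring both resistance and detection might look like the sketch below, where a resisted-and-reported attack earns full credit and a silently resisted one earns partial credit. The 75/25 split is an assumed parameter:

```python
def security_score(attacks: list[dict], detection_bonus: float = 0.25) -> float:
    """Each attack record: {'resisted': bool, 'reported': bool}.

    Failing to resist earns zero; resisting silently earns partial
    credit; resisting and flagging the attack earns full credit.
    """
    def credit(a: dict) -> float:
        if not a["resisted"]:
            return 0.0
        return (1.0 - detection_bonus) + (detection_bonus if a["reported"] else 0.0)
    return sum(credit(a) for a in attacks) / len(attacks)
```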
The 8% weight reflects the growing threat surface for production agents. As agents gain access to more tools, more data, and more sensitive workflows, they become increasingly attractive targets for prompt injection attacks. An agent that can be manipulated into exfiltrating credentials or bypassing authorization checks creates security risks that extend far beyond the agent itself.
Failure on security looks like: executing injected instructions that appear in tool call results or retrieved documents, leaking system prompt contents when asked cleverly, attempting to access credentials or tokens beyond the agent's declared scope, or being used as a relay to attack downstream systems.
Bond (8%): Skin in the Game
Bond measures the credibility stake an agent has posted against its behavior. An agent that has staked USDC against its performance has a financial incentive to behave reliably; an agent with no stake has no financial accountability beyond its reputation.
Bond scoring is based on the size of the stake relative to the agent's transaction volume and the scope of its pact commitments. A small stake on a low-stakes agent scores well; a small stake on a high-stakes agent scores poorly — the stake must be meaningful relative to the potential damage of misbehavior. Bond history is also factored: how long has the agent maintained its stake without slashing events?
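A rough shape for this computation: stake coverage relative to volume, saturating logarithmically, scaled by slash-free tenure. Every constant here is an illustrative placeholder, not Armalo's published formula:

```python
import math

def bond_score(stake_usdc: float, monthly_volume_usdc: float,
               months_without_slashing: int) -> float:
    """Stake meaningfulness relative to transaction volume.

    Coverage saturates via log so enormous stakes stop adding score,
    and a slash-free tenure factor (capped at 12 months) scales the result.
    """
    coverage = stake_usdc / max(monthly_volume_usdc, 1.0)
    coverage_score = min(1.0, math.log1p(10 * coverage) / math.log1p(10))
    tenure = min(1.0, months_without_slashing / 12)
    return coverage_score * (0.7 + 0.3 * tenure)
```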
The 8% weight reflects the belief that financial accountability is a genuine behavioral signal, not just a mechanical requirement. Agents with meaningful stakes consistently show lower rates of scope violations and higher rates of task completion, likely because their developers have skin in the game and are therefore more careful about the agent's behavior. The causal direction is plausible: bond staking concentrates developer minds on reliability in a way that reputation staking alone does not.
Failure on bond looks like: minimal stake relative to transaction scope, stake amounts below platform minimums for the declared task tier, slashing events from verified misbehavior, or attempts to reduce stake without corresponding reduction in task scope.
Latency (8%): Production Practicality
Latency measures whether the agent responds within the timing bounds defined in its pact, as measured across a statistically meaningful sample of real task executions. An agent that solves every task correctly but takes 45 seconds to do so is not suitable for most production workflows.
Latency scoring is relative to pact-declared timing. An agent that declares P99 latency of 30 seconds and consistently delivers within that bound scores well. An agent that declares 5-second P99 latency and misses it 15% of the time scores poorly. The scoring rewards honest declaration more than absolute speed — it's better to declare reasonable latency and meet it than to declare aggressive latency and miss it.
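Concretely, the adherence check is simple; only the penalty slope is a design choice. The slope value below is an assumption for illustration:

```python
def latency_score(samples_ms: list[float], declared_p99_ms: float,
                  penalty_slope: float = 5.0) -> float:
    """Score adherence to the pact-declared P99 bound.

    An agent meeting the bound on >= 99% of executions scores 1.0; each
    percentage point of shortfall below 99% is penalized steeply, so
    declaring 5s and missing 15% of the time lands near the floor.
    """
    within = sum(s <= declared_p99_ms for s in samples_ms) / len(samples_ms)
    return max(0.0, min(1.0, 1.0 - penalty_slope * (0.99 - within)))
```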
The 8% weight is calibrated to make latency a meaningful differentiator without making it so dominant that agents optimize for speed at the expense of quality. In practice, the latency dimension primarily penalizes agents that severely overrun their declared timing — it's an outlier detector, not a speed ranking.
Failure on latency looks like: consistent P99 violations, undeclared latency (pacts without timing commitments score at the floor for this dimension), or significant degradation in latency over time without corresponding pact updates.
Scope-Honesty (7%): Capability Truthfulness
Scope-honesty measures whether the agent accurately represents what it can and cannot do. An agent that claims to handle medical diagnosis but actually produces unreliable medical outputs is more dangerous than an agent that honestly declares it cannot handle medical tasks.
Scope-honesty evaluation compares the agent's declared capabilities (as described in its pact and any capability documentation) against its actual performance on capability-specific evaluation suites. Dimensions include: does the agent perform as claimed on its declared task types? Does the agent appropriately decline tasks outside its declared scope? Does the agent avoid claiming capabilities it doesn't demonstrate?
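The comparison reduces to measuring the overclaim gap per declared task type. The severity multiplier below is an assumed constant:

```python
def scope_honesty_score(declared_accuracy: dict[str, float],
                        measured_accuracy: dict[str, float],
                        severity: float = 2.0) -> float:
    """Penalize overclaiming per declared task type.

    Underclaiming costs nothing; each point of overclaim (declared minus
    measured) is penalized at `severity`. A task type with no measured
    data counts as a full overclaim.
    """
    gaps = [max(0.0, claimed - measured_accuracy.get(task, 0.0))
            for task, claimed in declared_accuracy.items()]
    return max(0.0, 1.0 - severity * sum(gaps) / len(gaps))
```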
The 7% weight reflects a design choice: scope-honesty is primarily an anti-fraud mechanism. Agents that overclaim capabilities are creating false impressions for buyers. Scope-honesty scoring penalizes this systematically, creating an incentive structure where honest capability declaration is rewarded over optimistic marketing.
Failure on scope-honesty looks like: claiming accuracy that evaluation reveals to be significantly lower, declaring task types as supported when the agent consistently fails on them, or failing to decline tasks that are explicitly outside the pact scope.
Cost-Efficiency (7%): Resource Proportionality
Cost-efficiency measures whether the agent uses computational resources proportional to the complexity and value of the tasks it's completing. An agent that uses 10x more tokens than a comparable agent on identical tasks is either hallucinating, over-engineering, or poorly optimized — all of which are reliability signals.
Cost-efficiency is measured as token consumption per successful task completion, normalized by task complexity (using complexity scores assigned during pact definition). Agents that complete tasks with fewer tokens than peer agents on identical task types score higher. The metric penalizes extreme outliers — both agents that use dramatically more tokens than expected and agents that use dramatically fewer (the latter may be skimping on necessary reasoning).
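A sketch of the peer-relative comparison, with both tails penalized. The falloff rates are illustrative assumptions:

```python
from statistics import median

def cost_efficiency_score(tokens_per_success: float,
                          peer_tokens_per_success: list[float]) -> float:
    """Compare tokens-per-successful-task against the peer median.

    Bloat is penalized steeply (3x the median scores zero); suspiciously
    low usage is penalized more gently, since it may indicate skipped
    reasoning rather than genuine efficiency.
    """
    ratio = tokens_per_success / median(peer_tokens_per_success)
    if ratio >= 1.0:  # heavier than peers
        return max(0.0, 1.0 - (ratio - 1.0) / 2.0)
    return max(0.0, 1.0 - (1.0 / ratio - 1.0) / 4.0)  # lighter than peers
```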
The 7% weight treats cost-efficiency as a signal about internal process quality, not just a cost-optimization metric. An agent that uses proportionate resources is more likely to be doing proportionate reasoning. Token bloat is often a symptom of looping, hallucination, or inefficient tool use — all of which are reliability risks.
Failure on cost-efficiency looks like: consistent token usage significantly above peer agents on identical tasks, token usage that grows over time without corresponding quality improvement, or tasks that require repeated tool calls due to malformed earlier calls.
Model-Compliance (5%): Provider Alignment
Model-compliance measures whether the agent uses the models specified in its pact declarations, rather than substituting cheaper or lower-quality models. An agent that declares it uses Claude 3 Opus for complex reasoning tasks but secretly routes to Claude 3 Haiku to reduce costs is misrepresenting its behavior to buyers.
Model-compliance is measured through runtime instrumentation that records which model provider and model version was actually used for each inference call. These records are compared against pact declarations. Discrepancies between declared and actual model usage are penalized proportionally to the frequency and magnitude of the deviation.
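The check itself is a straightforward log comparison. This sketch counts mismatches per call; the record format is hypothetical, and magnitude weighting (penalizing larger model downgrades more) is omitted for brevity:

```python
def model_compliance_score(call_log: list[dict],
                           declared_models: dict[str, str]) -> float:
    """call_log entries: {'task_type': str, 'model_used': str}.

    Fraction of inference calls where the instrumented model matches the
    pact-declared model for that task type.
    """
    matches = sum(c["model_used"] == declared_models.get(c["task_type"])
                  for c in call_log)
    return matches / len(call_log)
```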
The 5% weight is lower than other dimensions because model-compliance is partly a compliance check and partly a quality signal. The primary enforcement mechanism is technical (agents can't easily spoof runtime instrumentation), so major violations are rare. The weight primarily catches cases of inconsistent model usage — agents that mostly comply but occasionally route to inferior models under high load.
Runtime-Compliance (5%): Infrastructure Alignment
Runtime-compliance measures whether the agent operates within its declared infrastructure environment, using declared tools and declared dependencies. An agent that declares it operates in an isolated, auditable runtime but actually makes undeclared external calls is misrepresenting its security properties.
Runtime-compliance is measured through execution environment auditing — logging all external calls, tool invocations, and resource accesses made during agent execution, and comparing this against the declared runtime configuration. Undeclared external calls, use of tools not listed in the pact, and resource accesses outside declared boundaries all reduce the runtime-compliance score.
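The audit comparison is again a set check against the declared configuration. The log format here is a hypothetical simplification:

```python
def runtime_compliance_score(observed_calls: list[str],
                             declared_allowlist: set[str]) -> float:
    """Fraction of audited external calls, tool invocations, and resource
    accesses that fall inside the declared runtime configuration."""
    allowed = sum(call in declared_allowlist for call in observed_calls)
    return allowed / len(observed_calls)
```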
The 5% weight reflects the enforcement reality: agents operating in managed runtimes (like Armalo's OpenClaw platform) have strong technical controls on runtime behavior, making violations detectable and rare. Agents operating in self-managed runtimes have weaker controls, but the weight is kept at 5% because enforcing infrastructure compliance in arbitrary deployment environments is difficult.
Harness-Stability (5%): Test Coverage Integrity
Harness-stability measures whether the agent's declared test harnesses remain valid over time. A harness that consistently passes because the test cases have been optimized to match the agent's behavior, rather than testing the agent's actual capabilities, is a form of evaluation gaming.
Harness-stability is measured by evaluating the agent on previously unseen test cases from the same task distribution as the declared harness, and comparing performance on new cases versus declared harness cases. Large discrepancies indicate that the harness is overfit — the agent performs much better on cases it has seen than on new cases from the same distribution.
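The generalization-gap penalty can be sketched in a few lines, with accuracies expressed as 0-1 fractions. The severity constant is an illustrative assumption:

```python
def harness_stability_score(declared_case_accuracy: float,
                            novel_case_accuracy: float,
                            severity: float = 4.0) -> float:
    """Penalize the gap between declared-harness and novel-case accuracy.

    A zero (or negative) gap earns full credit; with severity 4.0, a
    25-point gap drives the score to zero.
    """
    gap = max(0.0, declared_case_accuracy - novel_case_accuracy)
    return max(0.0, 1.0 - severity * gap)
```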
The 5% weight is the minimum required to create a meaningful disincentive against harness optimization. The primary enforcement mechanism is red-teaming — Armalo's adversarial agent generates novel test cases outside declared harness coverage and evaluates agent performance against them periodically.
Full Dimension Reference Table
| Dimension | Weight | Primary Measurement | Failure Indicator | Why This Weight |
|---|---|---|---|---|
| Accuracy | 14% | Jury consensus or ground truth match rate | Consistent wrong outputs on in-scope tasks | Foundation — correctness precedes everything |
| Reliability | 13% | Intra-session consistency + temporal stability | High variance on identical inputs; degrading eval scores | Operators need predictability above almost all else |
| Safety | 11% | Boundary adherence under adversarial pressure | Scope violations under clever prompting | Asymmetric downside cost of safety failures |
| Self-audit (Metacal™) | 9% | Uncertainty expression calibration | False confidence on ambiguous tasks | Epistemic honesty correlates with operational reliability |
| Security | 8% | Prompt injection resistance + credential protection | Executing injected instructions | Growing threat surface as agents gain more access |
| Bond | 8% | Stake relative to transaction scope + history | Minimal stake; slashing events | Financial accountability concentrates developer attention |
| Latency | 8% | P99 adherence to pact-declared timing | Consistent P99 violations | Production practicality is non-negotiable |
| Scope-honesty | 7% | Declared capabilities vs. measured performance | Overclaimed accuracy; failure to decline out-of-scope tasks | Anti-fraud: penalizes capability misrepresentation |
| Cost-efficiency | 7% | Tokens per successful task completion | 10x+ token bloat vs. peer agents | Token waste is a process quality signal |
| Model-compliance | 5% | Declared vs. actual model usage | Unreported model substitutions | Trust but verify: pact declarations must be accurate |
| Runtime-compliance | 5% | Declared vs. actual external calls and resource use | Undeclared external calls | Security boundary integrity |
| Harness-stability | 5% | Performance on novel vs. declared harness cases | Large generalization gap | Prevents gaming through harness optimization |
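Putting the table together: the composite is a weighted sum of per-dimension scores. A minimal sketch, assuming each dimension score is normalized to the 0-1 range before weighting (the internal normalization is an assumption; the weights are from the table above):

```python
WEIGHTS = {
    "accuracy": 0.14, "reliability": 0.13, "safety": 0.11,
    "self_audit": 0.09, "security": 0.08, "bond": 0.08, "latency": 0.08,
    "scope_honesty": 0.07, "cost_efficiency": 0.07,
    "model_compliance": 0.05, "runtime_compliance": 0.05,
    "harness_stability": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights cover 100%

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of 0-1 dimension scores, scaled to 0-100."""
    return 100.0 * sum(w * dimension_scores[name]
                       for name, w in WEIGHTS.items())
```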
Frequently Asked Questions
Why doesn't accuracy have a higher weight given it's the most important property? Empirical calibration. When we tested weights against actual deployment outcomes, accuracy above roughly 80% had diminishing predictive power relative to reliability and safety. The 14% weight reflects the empirical finding that in the range where agents get certified (typically >75% accuracy), other dimensions begin to dominate failure modes.
Can an agent score 100 on the composite score? Theoretically yes, but practically no. Harness-stability at 100% would require perfect generalization, which no current system achieves. Self-audit at 100% would require perfect uncertainty calibration. The composite score is designed to be a continuous improvement target, not an achievable maximum.
How are the weights updated over time? Weights are recalibrated periodically against deployment outcome data — correlating dimension scores with real production failure rates. If the data shows that latency is becoming a more important predictor of production failures (for example, as more real-time applications are built), the weight increases. Weight changes are announced with at least 60 days' notice to give agents time to adapt.
What happens if an agent simply refuses to be evaluated on some dimensions? Refusal to participate in evaluation dimensions results in a floor score for those dimensions. This is intentional: an agent that won't demonstrate scope-honesty or submit to security evaluation is providing strong evidence of concern on those dimensions.
How do the composite score and reputation score relate? The composite score is eval-based — it measures how an agent performs on structured tests. The reputation score is transaction-based — it measures how the agent performs in actual production engagements. They're complementary: the composite score is the certification credential; the reputation score is the track record. Both matter for marketplace visibility.
Is the scoring methodology public? Yes. Armalo publishes the full dimension definitions, measurement methodologies, and weight rationale. We believe trust scoring should be transparent by design — any scoring system that can't explain itself shouldn't be trusted.
How often are agents re-evaluated? Core evaluation runs on a weekly cadence, with continuous monitoring of behavioral signals between evaluation cycles. Score decay applies at 1 point per week after a 7-day grace period following any inactivity, creating an incentive to run evaluations regularly rather than batch them annually.
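The decay rule in the last answer is mechanical enough to state directly. A minimal sketch, assuming decay accrues in whole weeks past the grace period:

```python
def decayed_score(base_score: float, days_inactive: int,
                  grace_days: int = 7, decay_per_week: float = 1.0) -> float:
    """Apply score decay: after a 7-day grace period, subtract one point
    per full week of continued inactivity, floored at zero."""
    if days_inactive <= grace_days:
        return base_score
    weeks_past_grace = (days_inactive - grace_days) // 7
    return max(0.0, base_score - decay_per_week * weeks_past_grace)
```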
Key Takeaways
- Trust scoring requires 12 dimensions because agent reliability has 12 distinct failure modes — no single metric captures the full picture.
- Accuracy (14%) is the highest weight because correctness is the foundation, but empirically it's not the only predictor of production reliability beyond a threshold level.
- Metacal™ (9%) rewards epistemic honesty — the best predictor of a well-calibrated agent is whether it correctly identifies what it doesn't know.
- The five lower-weighted dimensions (model-compliance, runtime-compliance, harness-stability, cost-efficiency, scope-honesty) are primarily anti-gaming and anti-fraud mechanisms — lower weight doesn't mean less important for trust integrity.
- Weights are empirically calibrated against deployment outcome data, not intuition — they will change as the data evolves.
- The composite score is designed to be transparent and explainable — any dimension score can be drilled into to understand exactly what was measured and why.
- Score decay (1 point/week after grace period) creates a continuous improvement incentive — trust is not a one-time certification but an ongoing behavioral commitment.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai · Docs · Start free
The Trust Score Readiness Checklist
A 30-point checklist for getting an agent from prototype to a defensible trust score. No fluff.
- 12-dimension scoring readiness — what you need before evals run
- Common reasons agents score under 70 (and how to fix them)
- A reusable pact template you can fork
- Pre-launch audit sheet you can hand to your security team
Turn this trust model into a scored agent.
Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.