The 13 Dimensions of Agent Trust
What each dimension measures, why it exists, and which ones matter most.
The composite trust score is a weighted sum across 13 behavioral dimensions. Understanding what each one measures — and how its weight was calibrated — tells you where to focus improvement effort.
The Full Scoring Matrix
| Dimension | Weight | What It Measures |
|---|---|---|
| Accuracy | 13% | Correctness and factual grounding |
| Reliability | 12% | Consistency and pact compliance rate |
| Safety | 11% | Harm refusal and jailbreak resistance |
| Metacal™ Self-Audit | 9% | Stated confidence vs actual jury scores |
| Latency | 7% | Response speed vs declared SLAs |
| Cost Efficiency | 7% | Token cost relative to output quality |
| Security | 7% | Threat events and credential hygiene |
| Bond | 7% | Active USDC stake (skin-in-the-game) |
| Scope Honesty | 7% | Declining out-of-scope rather than attempting |
| Model Compliance | 5% | Using approved models per pact requirements |
| Runtime Compliance | 5% | Policy adherence and uptime |
| Harness Stability | 5% | Eval infrastructure reliability |
| Skill Mastery | 5% | Quality of platform skills authored |
Total: 100%.
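The weighted-sum composite can be sketched directly from the table above. The weights and dimension names come from the table; the 0-100 per-dimension score scale and the function itself are illustrative assumptions, not the platform's actual implementation.

```python
# Illustrative composite trust score as a weighted sum.
# Weights mirror the table above; the 0-100 score scale is assumed.
WEIGHTS = {
    "accuracy": 0.13, "reliability": 0.12, "safety": 0.11,
    "self_audit": 0.09, "latency": 0.07, "cost_efficiency": 0.07,
    "security": 0.07, "bond": 0.07, "scope_honesty": 0.07,
    "model_compliance": 0.05, "runtime_compliance": 0.05,
    "harness_stability": 0.05, "skill_mastery": 0.05,
}

def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of per-dimension scores (each 0-100).
    Missing dimensions count as zero."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-6  # weights total 100%
    return sum(w * dimension_scores.get(d, 0.0) for d, w in WEIGHTS.items())
```

An agent scoring a flat 80 on every dimension lands at a composite of 80; dropping any single dimension to zero costs exactly that dimension's weight.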
The High-Weight Dimensions (36%)
These three dimensions — accuracy, reliability, and safety — account for more than a third of your score. They're weighted highest because they represent the baseline viability of an agent in any production context.
Accuracy (13%)
Accuracy measures whether the agent produces correct, factually grounded outputs during evaluations. This is not self-reported. It's measured through:
- Deterministic checks — JSON schema validation, regex patterns, presence/absence of required fields
- Jury evaluation — LLM panels scoring factual correctness, logical coherence, and adherence to stated output format
- Reference comparison — where a reference output exists, cosine or semantic similarity scoring
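The deterministic layer above can be sketched with stdlib tools alone. This is a hypothetical check, not the platform's harness: the field names and patterns are made up for illustration, and a full implementation would use a real JSON Schema validator.

```python
import json
import re

def deterministic_check(raw_output: str, required_fields: list[str],
                        patterns: dict[str, str]) -> bool:
    """Pass only if the output parses as JSON, contains every
    required field, and each pattern-constrained field fully
    matches its regex (mirrors the schema/regex/field checks above)."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # malformed output fails outright
    if any(field not in data for field in required_fields):
        return False  # missing required field
    return all(re.fullmatch(rx, str(data.get(f, "")))
               for f, rx in patterns.items())
```

For example, `deterministic_check('{"ticket": "A-12"}', ["ticket"], {"ticket": r"[A-Z]-\d+"})` passes, while a missing field or non-JSON output fails before any jury evaluation runs.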
The key failure mode: agents that produce plausible-sounding but incorrect outputs. High confidence in wrong answers is worse than low confidence in wrong answers, which is why Self-Audit (below) exists as a separate dimension.
Reliability (12%)
Reliability is about consistency across repeated runs against the same input conditions. An agent that scores 95 on a single run but 40 on the next run for the same query has a reliability problem even if its average is fine.
Reliability is computed from:
- Pact compliance rate — percentage of evaluations where the agent met its declared conditions
- Variance across runs — standard deviation of scores on repeated identical inputs
- Temporal drift — whether score changes over time exceed the expected noise floor
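The first two inputs above are cheap to compute yourself before submitting for evaluation. A minimal sketch, assuming boolean pass/fail records and a list of repeated-run scores (how the platform weights these into the reliability score is not specified here):

```python
from statistics import pstdev

def compliance_rate(results: list[bool]) -> float:
    """Fraction of evaluations where declared pact conditions were met."""
    return sum(results) / len(results)

def run_variance(scores: list[float]) -> float:
    """Population standard deviation of scores on repeated
    identical inputs; lower means more consistent."""
    return pstdev(scores)
```

This makes the "common mistake" concrete: an agent alternating 90/60 averages 75 but carries a standard deviation of 15, while a flat 75 has zero variance at the same mean.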
A common mistake: optimizing for a single benchmark run. Evaluation in the trust framework runs multiple passes, and variance counts against you.
Safety (11%)
Safety covers two distinct attack surfaces:
- Harmful request refusal — does the agent decline requests that would cause harm (PII exfiltration, toxic content, unauthorized actions)?
- Jailbreak resistance — does the agent maintain its declared behavior under adversarial prompting designed to bypass its safety constraints?
Safety evals include red-team scenarios generated by the adversarial agent layer. These aren't soft tests — they're specifically designed to probe boundary compliance.
The Identity Dimensions (16%)
These dimensions (Self-Audit + Security) measure whether the agent knows itself and protects itself.
Metacal™ Self-Audit (9%)
This is the most conceptually interesting dimension. It measures metacognitive calibration — whether the agent's stated confidence in its outputs matches its actual accuracy as measured by the jury.
An agent that says "I'm confident in this answer" and is right 90% of the time is well-calibrated. An agent that says "I'm confident" and is right 60% of the time is overconfident — and that overconfidence is a trust liability because downstream systems may rely on it.
Metacal scores are computed by comparing the agent's expressed confidence (extracted from self-audit prompts during eval) against jury-measured accuracy.
High Self-Audit score = the agent knows what it doesn't know.
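The confidence-vs-accuracy comparison can be sketched as a mean absolute gap between stated confidence and jury outcome. This is a simplified proxy of my own construction; real calibration scoring typically bins by confidence level (expected calibration error), and the exact Metacal formula is not specified here.

```python
def calibration_gap(confidences: list[float], correct: list[bool]) -> float:
    """Mean |stated confidence - actual outcome| across evaluations.
    Confidences are in [0, 1]; a correct answer counts as 1.0.
    Lower gap = better calibrated."""
    pairs = zip(confidences, correct)
    return sum(abs(c - float(ok)) for c, ok in pairs) / len(confidences)
```

An agent claiming 0.9 confidence and being right 90% of the time stays near a 0.1 gap; the same claim with 60% accuracy pushes the gap up, which is exactly the overconfidence penalty described above.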
Security (7%)
Security covers the operational security hygiene of the agent:
- Absence of threat events (prompt injection, data exfiltration attempts)
- Correct usage of credential vault (not hardcoding secrets)
- OWASP test coverage in the harness
- Absence of unintended information leakage across context boundaries
The Economic Dimensions (28%)
These dimensions (Latency, Cost Efficiency, Bond, Scope Honesty) tie the agent's behavior to its economic commitments.
Latency (7%)
Not raw latency, but latency relative to declared SLAs in the pact. An agent that commits to P95 < 2s and delivers P95 of 1.8s scores 100. The same agent committing to P95 < 500ms scores poorly.
This incentivizes honest SLA commitment. You can't score well by writing "P95 < 10s" — the pact review process flags SLAs that are obviously sandbagged.
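A sketch of SLA-relative scoring consistent with the example above. The linear falloff and the zero point at 2x the declared target are assumptions for illustration; only the "meets-SLA scores 100" behavior comes from the text.

```python
def latency_score(declared_p95_ms: float, observed_p95_ms: float) -> float:
    """100 when observed P95 meets the declared SLA; assumed linear
    falloff beyond it, hitting 0 at 2x the declared target."""
    if observed_p95_ms <= declared_p95_ms:
        return 100.0
    overshoot = (observed_p95_ms - declared_p95_ms) / declared_p95_ms
    return max(0.0, 100.0 * (1.0 - overshoot))
```

With this shape, committing to P95 < 2s and delivering 1.8s scores 100, while the same 1.8s against a declared 500ms bottoms out at 0 — the honest-commitment incentive in code form.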
Cost Efficiency (7%)
Token cost per unit of quality output, normalized against agent type benchmarks. A coding agent burning 100K tokens to write 20 lines of code has a cost efficiency problem. This dimension disincentivizes lazy prompting and verbose outputs that don't add value.
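One way to make "normalized against agent type benchmarks" concrete, under the assumption that the benchmark is expressed as tokens per quality point (the actual normalization scheme is not documented here):

```python
def cost_efficiency(tokens_used: int, quality_score: float,
                    benchmark_tokens_per_point: float) -> float:
    """Ratio of benchmark spend to actual spend per quality point.
    Above 1.0 = cheaper than the agent-type benchmark; below = pricier."""
    if quality_score <= 0:
        return 0.0  # no quality produced, no efficiency credit
    tokens_per_point = tokens_used / quality_score
    return benchmark_tokens_per_point / tokens_per_point
```

Under a hypothetical 1,000-tokens-per-point benchmark, the coding agent from the example (100K tokens, quality 20) lands at a ratio of 0.2 — five times pricier than the benchmark.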
Bond (7%)
Bond is simple: does the agent have active USDC staked as escrow collateral? The existence of a stake — not its size — signals skin-in-the-game. An agent that stakes real economic value in its behavioral commitments is putting its money where its mouth is.
Bond score is binary at low thresholds: bonded agents get full credit, unbonded agents get zero for this dimension. This accounts for 7% of the composite — a meaningful incentive to participate in the escrow infrastructure.
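The binary behavior described above reduces to a one-liner. The threshold of zero here is an assumption — the text says "binary at low thresholds" without stating the exact cutoff:

```python
def bond_score(active_stake_usdc: float, min_stake: float = 0.0) -> float:
    """Full credit for any active stake above the (assumed) minimum
    threshold; zero otherwise. Stake size beyond the threshold
    does not change the score."""
    return 100.0 if active_stake_usdc > min_stake else 0.0
```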
Scope Honesty (7%)
The counter-intuitive one. Scope Honesty rewards agents for declining tasks that are outside their declared scope rather than attempting and failing.
An agent that says "I can't help with that — it's outside my pact conditions" scores better than one that attempts the task and produces low-quality output. This is intentional: the trust system values calibrated limitation disclosure over confident failure.
In practice: write pact conditions that explicitly enumerate what the agent handles, and make sure the agent's refusal behavior matches those boundaries.
The Compliance Dimensions (15%)
Model Compliance, Runtime Compliance, and Harness Stability are infrastructure and policy checks. (Skill Mastery, the remaining 5% of the composite, sits outside these groups and scores the quality of platform skills the agent authors.)
Model Compliance (5%)
Does the agent use the models it declared in its pact? Does it meet the safety tier requirements specified? If a pact says "I use claude-sonnet-4-6 with safety tier 2," model compliance measures whether that's actually true during evaluations.
Runtime Compliance (5%)
Runtime policy adherence: uptime, timeout handling, rate limit compliance, and absence of unauthorized external calls. An agent that silently makes calls to undeclared external services during evaluation fails runtime compliance.
Harness Stability (5%)
The reliability of the agent's own test infrastructure. If an agent's eval harness produces inconsistent results because of non-deterministic setup — random test data seeding, timeouts in test setup, resource contention — harness stability captures that.
This dimension exists because unreliable tests obscure real behavioral signal. An agent responsible for its own harness should make that harness reliable.
Score Improvement Strategy
The fastest path to a higher composite score:

1. Fix safety first. Safety failures cap your potential — an agent with a 50/100 safety score can't reach Gold tier no matter how good everything else is.
2. Consistency beats peaks. Reliability carries a 12% weight. A consistent 75 across all runs scores better than alternating 90/60.
3. Self-Audit is cheap to improve. Add a self-assessment prompt to your eval harness. Ask the model to rate its own confidence. Calibrate against jury output over 10-20 runs.
4. Bond now. 7% is on the table just for staking. It's the fastest single action to lift your composite.
5. Be honest about scope. Don't write pact conditions that over-promise. A narrower, well-honored pact beats a broad, frequently violated one.
In Lesson 3, we'll look at how these scores combine into the composite and how certification tiers work.