Getting Your First TrustMark
How evaluations run, what happens during scoring, and what to do with your result.
You've written your pact. Now what?
This lesson walks through the evaluation pipeline end to end — what runs, in what order, what the output looks like, and how to turn your first result into a continuously improving trust profile.
The Evaluation Pipeline
Evaluations run in three phases:
Phase 1: Deterministic checks (< 30 seconds)
Every condition with a deterministic verification method runs first. These are regex patterns, schema validators, length checks, presence/absence assertions. Fast, cheap, completely objective.
Results are binary: pass or fail per test case, with the failing pattern highlighted in the output.
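To make the shape of this phase concrete, here is a minimal sketch of what deterministic checks look like in practice. The condition names, patterns, and return format are illustrative assumptions, not the platform's actual schema:

```python
import re

# Illustrative deterministic checks: each returns pass/fail for one output.
# Condition names and the regex are hypothetical, not the platform's schema.
def check_no_email(output: str) -> bool:
    """Fail if the output contains something shaped like an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

def check_max_length(output: str, limit: int = 500) -> bool:
    """Fail if the output exceeds a length budget."""
    return len(output) <= limit

def run_deterministic(output: str) -> dict:
    """Run every deterministic condition; each result is a hard pass/fail."""
    return {
        "no_email": check_no_email(output),
        "max_length": check_max_length(output),
    }

print(run_deterministic("Contact me at alice@example.com"))
# → {'no_email': False, 'max_length': True}
```

Because every check is a pure function of the output string, this phase is cheap, repeatable, and completely objective.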
Phase 2: Heuristic checks (< 2 minutes)
Heuristics apply lightweight statistical analysis: hedging phrase density, response length distribution, vocabulary diversity, refusal phrase presence. These are more expensive than pure regex but cheaper than LLM calls.
Phase 3: LLM jury (< 15 minutes)
Conditions with a jury verification method go to a multi-model panel. The jury consists of 3+ models (typically Claude + GPT-4 + another frontier model). Each judge independently evaluates the agent's output against the condition, assigns a score, and provides reasoning.
Outlier trimming removes the highest and lowest scores when there's high variance. The trimmed mean becomes the jury score for that condition.
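The trimming step above can be sketched in a few lines. The variance threshold is an assumption for illustration; the lesson only states that trimming kicks in under high variance:

```python
from statistics import mean, stdev

def jury_score(scores: list[float], variance_threshold: float = 10.0) -> float:
    """Trimmed mean of the panel's scores: when the judges disagree
    strongly, drop the single highest and lowest score before averaging.
    The threshold value here is illustrative, not the platform's."""
    if len(scores) >= 3 and stdev(scores) > variance_threshold:
        scores = sorted(scores)[1:-1]
    return mean(scores)

print(jury_score([90, 85, 40]))  # one outlier judge → trimmed → 85
```

With a 3-judge panel this degenerates to taking the median, which is why larger panels give the trimmed mean more signal to work with.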
Reading Your First Results
Your evaluation results show:
Per-condition verdict:
- ✅ Pass — condition met across test cases
- ⚠️ Partial — met on some test cases, failed on others (score = pass rate %)
- ❌ Fail — condition not met
Dimension breakdown: For each of the 13 dimensions, you see:
- Raw dimension score (0–100)
- Contribution to composite (score × weight)
- Delta from last eval (if not your first)
Failing test cases: For every failed condition, the exact test input + agent output + failure reason is shown. This is your debugging data.
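The dimension breakdown's "score × weight" arithmetic is easy to reproduce yourself. The weights below are hypothetical placeholders (the lesson only states that Bond carries 7%); only the mechanism is the point:

```python
# Hypothetical weights for three of the 13 dimensions; only Bond's 7%
# is stated in this lesson, the rest are placeholder values.
weights = {"reliability": 0.15, "self_audit": 0.10, "bond": 0.07}
scores = {"reliability": 80, "self_audit": 65, "bond": 0}

# Each dimension contributes (raw score × weight) to the composite.
contributions = {d: round(scores[d] * weights[d], 2) for d in weights}
partial_composite = sum(contributions.values())

print(contributions)        # → {'reliability': 12.0, 'self_audit': 6.5, 'bond': 0.0}
print(partial_composite)    # → 18.5
```

Note how a bond of 0 zeroes out that dimension's contribution entirely, which is why staking is called out below as the fastest single improvement.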
A Realistic First Score
If your pact is well-written and your agent is production-grade, your first eval will typically land in Silver (60–75).
Why not higher? A few reasons:
- Reliability requires multiple runs to compute; your first eval supplies only one
- Self-Audit calibration takes a few evals to stabilize
- Harness Stability starts neutral until the eval infrastructure proves consistent
- Bond starts at 0 until you stake
Don't be discouraged by Silver. It means your pact is valid, your agent ran cleanly, and you have a real baseline to improve from.
The Three Fastest Improvements
1. Add USDC bond (up to +7 points)
Bonding is the fastest single action to improve your composite. Navigate to your agent's Wallets tab, connect a Base L2 wallet, and stake any amount. The Bond dimension goes from 0 to 100 immediately. At 7% weight, that's up to +7 composite points.
2. Fix failing deterministic conditions first
Deterministic failures are the easiest to diagnose and fix. The output shows you exactly what pattern matched (or failed to match). Fix the agent behavior, rerun. Deterministic evals complete in under 30 seconds.
3. Add more test cases with diverse inputs
Reliability improves with sample size. If your first eval ran 5 test cases and you passed 4, your reliability score reflects a small sample. Add 20 more diverse inputs. A 90% pass rate over 50 cases is a much stronger signal than 80% over 5.
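One standard way to quantify that intuition (not necessarily the platform's reliability formula) is the lower bound of a Wilson score interval: larger samples tighten the bound, so the same observed pass rate counts as stronger evidence:

```python
from math import sqrt

def wilson_lower(passes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for a pass rate.
    Illustrative only: the lesson does not specify how Reliability
    is actually computed."""
    if n == 0:
        return 0.0
    p = passes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom

print(round(wilson_lower(4, 5), 2))    # 80% over 5 cases  → 0.38
print(round(wilson_lower(45, 50), 2))  # 90% over 50 cases → 0.79
```

The 80%-over-5 agent can only defend a pass rate of roughly 38% at 95% confidence, while the 90%-over-50 agent can defend roughly 79%.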
Publishing Your Trust Profile
Once you have a score, your trust profile is publicly queryable at:
GET https://www.armalo.ai/api/v1/trust/{agentId}
This returns your composite score, tier, last eval date, and dimension breakdown. Other platforms can query this to make agent selection decisions without re-running your evals.
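The endpoint can be queried with nothing but the standard library. A minimal sketch, assuming the response body is JSON with the fields described above and using a placeholder agent ID:

```python
import json
from urllib.request import urlopen

BASE = "https://www.armalo.ai/api/v1/trust"

def trust_url(agent_id: str) -> str:
    """Build the public trust-profile URL for an agent."""
    return f"{BASE}/{agent_id}"

def fetch_trust_profile(agent_id: str) -> dict:
    """Fetch and decode the profile. Assumes a JSON body with the
    fields this lesson lists (composite, tier, last eval date,
    dimension breakdown)."""
    with urlopen(trust_url(agent_id)) as resp:
        return json.load(resp)

print(trust_url("agent-123"))
# → https://www.armalo.ai/api/v1/trust/agent-123
```

A consuming platform would call `fetch_trust_profile` once per candidate agent and gate selection on the returned tier, with no need to re-run your evals.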
You can also display a trust badge. The badge SVG is available at:
https://www.armalo.ai/badge/{agentId}
Drop this in your GitHub README, your API documentation, or your landing page. It's dynamic, updating whenever your score changes.
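One way to embed the badge in a markdown README is below. Keep the `{agentId}` placeholder as-is until you substitute your own ID; wrapping the image in a link back to the trust endpoint is optional and my assumption, not a platform requirement:

```markdown
<!-- The image is fetched live, so it always shows the current score -->
[![TrustMark](https://www.armalo.ai/badge/{agentId})](https://www.armalo.ai/api/v1/trust/{agentId})
```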
Continuous Evaluation Strategy
The decay rate is 1 point per week. To maintain Gold (75+), you need to run evals regularly enough to offset decay. The math:
- A Gold agent at 80 has a 5-point decay buffer before losing its tier
- At 1 pt/week decay, that's 5 weeks without evals before dropping to Silver
- But: running evals every 2-3 weeks means you're also catching behavioral drift early
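The buffer math generalizes to any score and tier floor. A one-liner, using the decay rate stated above:

```python
def weeks_of_buffer(score: float, tier_floor: float,
                    decay_per_week: float = 1.0) -> float:
    """Weeks an agent can skip evals before decaying below its tier floor,
    at the stated decay rate of 1 point per week."""
    return max(0.0, (score - tier_floor) / decay_per_week)

print(weeks_of_buffer(80, 75))  # Gold agent at 80, Gold floor 75 → 5.0
```

An agent sitting exactly on its tier floor has zero buffer, which is the strongest argument for evaluating on a regular cadence rather than waiting for the score to slip.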
Recommended eval cadence by tier goal:
- Bronze → Silver: every 2 weeks while improving
- Silver → Gold: weekly while improving, every two weeks to maintain
- Gold → Platinum: weekly (continuous eval is part of what earns Platinum)
What's Next
You've completed Trust 101. You know:
- Why the accountability crisis blocks enterprise AI adoption
- What all 13 dimensions measure and how they're weighted
- How composite scores and tiers work
- How to write a first pact with real conditions
- How evaluations run and what to do with results
From here, two paths:
The Writing Bulletproof Pacts course goes deep on pact condition design — the 5 properties every condition must have, verification strategy, and 5 production templates you can copy immediately.
The Evaluating Agent Behavior course covers the full evaluation stack — when to use deterministic vs jury, how to calibrate jury panels, and how to translate results into targeted score improvements.
Or — if you want to compress the learning curve significantly — the Trust Foundations certification is a 2-hour live session where we walk through your actual pact and your actual eval results with a cohort of other builders. It includes +5 composite score credit and 30 days of Q&A channel access.
Course complete
AI Agent Trust 101