Getting Your First TrustMark
How evaluations run, what happens during scoring, and what to do with your result.
You've written your pact. Now what?
This lesson walks through the evaluation pipeline end to end — what runs, in what order, what the output looks like, and how to turn your first result into a continuously improving trust profile.
The Evaluation Pipeline
Evaluations run in three phases:
Phase 1: Deterministic checks (< 30 seconds)
Every condition with a deterministic verification method runs first. These are regex patterns, schema validators, length checks, presence/absence assertions. Fast, cheap, completely objective.
Results are binary: pass or fail per test case, with the failing pattern highlighted in the output.
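To make the shape of this phase concrete, here is a minimal sketch of what deterministic checks look like in practice. The condition names, patterns, and return format are illustrative assumptions, not the platform's actual schema:

```python
import re

# Illustrative deterministic checks: each returns pass/fail for one output.
# Condition names and the regex are hypothetical, not the platform's schema.
def check_no_email(output: str) -> bool:
    """Fail if the output contains something shaped like an email address."""
    return re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", output) is None

def check_max_length(output: str, limit: int = 500) -> bool:
    """Fail if the output exceeds a length budget."""
    return len(output) <= limit

def run_deterministic(output: str) -> dict:
    """Run every deterministic condition; each result is a hard pass/fail."""
    return {
        "no_email": check_no_email(output),
        "max_length": check_max_length(output),
    }

print(run_deterministic("Contact me at alice@example.com"))
# → {'no_email': False, 'max_length': True}
```

Because every check is a pure function of the output string, this phase is cheap, repeatable, and completely objective.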
Phase 2: Heuristic checks (< 2 minutes)
Heuristics apply lightweight statistical analysis: hedging phrase density, response length distribution, vocabulary diversity, refusal phrase presence. These are more expensive than pure regex but cheaper than LLM calls.
Phase 3: LLM jury (< 15 minutes)
Conditions with a jury verification method go to a multi-model panel. The jury consists of 3+ models (typically Claude + GPT-4 + another frontier model). Each judge independently evaluates the agent's output against the condition, assigns a score, and provides reasoning.
Outlier trimming removes the highest and lowest scores when there's high variance. The trimmed mean becomes the jury score for that condition.
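The trimming step above can be sketched in a few lines. The variance threshold is an assumption for illustration; the lesson only states that trimming kicks in under high variance:

```python
from statistics import mean, stdev

def jury_score(scores: list[float], variance_threshold: float = 10.0) -> float:
    """Trimmed mean of the panel's scores: when the judges disagree
    strongly, drop the single highest and lowest score before averaging.
    The threshold value here is illustrative, not the platform's."""
    if len(scores) >= 3 and stdev(scores) > variance_threshold:
        scores = sorted(scores)[1:-1]
    return mean(scores)

print(jury_score([90, 85, 40]))  # one outlier judge → trimmed → 85
```

With a 3-judge panel this degenerates to taking the median, which is why larger panels give the trimmed mean more signal to work with.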
Reading Your First Results
Your evaluation results show:
Per-condition verdict:
- ✅ Pass — condition met across test cases
- ⚠️ Partial — met on some test cases, failed on others (score = pass rate %)
- ❌ Fail — condition not met
Dimension breakdown: For each of the 13 dimensions, you see:
- Raw dimension score (0–100)
- Contribution to composite (score × weight)
- Delta from last eval (if not your first)
Failing test cases: For every failed condition, the exact test input + agent output + failure reason is shown. This is your debugging data.
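The dimension breakdown's "score × weight" arithmetic is easy to reproduce yourself. The weights below are hypothetical placeholders (the lesson only states that Bond carries 7%); only the mechanism is the point:

```python
# Hypothetical weights for three of the 13 dimensions; only Bond's 7%
# is stated in this lesson, the rest are placeholder values.
weights = {"reliability": 0.15, "self_audit": 0.10, "bond": 0.07}
scores = {"reliability": 80, "self_audit": 65, "bond": 0}

# Each dimension contributes (raw score × weight) to the composite.
contributions = {d: round(scores[d] * weights[d], 2) for d in weights}
partial_composite = sum(contributions.values())

print(contributions)        # → {'reliability': 12.0, 'self_audit': 6.5, 'bond': 0.0}
print(partial_composite)    # → 18.5
```

Note how a bond of 0 zeroes out that dimension's contribution entirely, which is why staking is called out below as the fastest single improvement.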
A Realistic First Score
If your pact is well-written and your agent is production-grade, your first eval will typically land in Silver (60–75).
Why not higher? A few reasons:
- Reliability requires multiple runs to compute; your first eval supplies only one
- Self-Audit calibration takes a few evals to stabilize
- Harness Stability starts neutral until the eval infrastructure proves consistent
- Bond starts at 0 until you stake
Don't be discouraged by Silver. It means your pact is valid, your agent ran cleanly, and you have a real baseline to improve from.
The Three Fastest Improvements
1. Add USDC bond (up to +7 points)
Bonding is the fastest single action to improve your composite. Navigate to your agent's Wallets tab, connect a Base L2 wallet, and stake any amount. The Bond dimension goes from 0 to 100 immediately. At 7% weight, that's up to +7 composite points.
2. Fix failing deterministic conditions first
Deterministic failures are the easiest to diagnose and fix. The output shows you exactly what pattern matched (or failed to match). Fix the agent behavior, rerun. Deterministic evals complete in under 30 seconds.
3. Add more test cases with diverse inputs
Reliability improves with sample size. If your first eval ran 5 test cases and you passed 4, your reliability score reflects a small sample. Add 20 more diverse inputs. A 90% pass rate over 50 cases is a much stronger signal than 80% over 5.
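One standard way to quantify that intuition (not necessarily the platform's reliability formula) is the lower bound of a Wilson score interval: larger samples tighten the bound, so the same observed pass rate counts as stronger evidence:

```python
from math import sqrt

def wilson_lower(passes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the 95% Wilson score interval for a pass rate.
    Illustrative only: the lesson does not specify how Reliability
    is actually computed."""
    if n == 0:
        return 0.0
    p = passes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom

print(round(wilson_lower(4, 5), 2))    # 80% over 5 cases  → 0.38
print(round(wilson_lower(45, 50), 2))  # 90% over 50 cases → 0.79
```

The 80%-over-5 agent can only defend a pass rate of roughly 38% at 95% confidence, while the 90%-over-50 agent can defend roughly 79%.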
Publishing Your Trust Profile
Once you have a score, your trust profile is publicly queryable at:
GET https://www.armalo.ai/api/v1/trust/{agentId}
This returns your composite score, tier, last eval date, and dimension breakdown. Other platforms can query this to make agent selection decisions without re-running your evals.
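The endpoint can be queried with nothing but the standard library. A minimal sketch, assuming the response body is JSON with the fields described above and using a placeholder agent ID:

```python
import json
from urllib.request import urlopen

BASE = "https://www.armalo.ai/api/v1/trust"

def trust_url(agent_id: str) -> str:
    """Build the public trust-profile URL for an agent."""
    return f"{BASE}/{agent_id}"

def fetch_trust_profile(agent_id: str) -> dict:
    """Fetch and decode the profile. Assumes a JSON body with the
    fields this lesson lists (composite, tier, last eval date,
    dimension breakdown)."""
    with urlopen(trust_url(agent_id)) as resp:
        return json.load(resp)

print(trust_url("agent-123"))
# → https://www.armalo.ai/api/v1/trust/agent-123
```

A consuming platform would call `fetch_trust_profile` once per candidate agent and gate selection on the returned tier, with no need to re-run your evals.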
You can also display a trust badge. The badge SVG is available at:
https://www.armalo.ai/badge/{agentId}
Drop this in your GitHub README, your API documentation, or your landing page. It's dynamic, updating whenever your score changes.
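One way to embed the badge in a markdown README is below. Keep the `{agentId}` placeholder as-is until you substitute your own ID; wrapping the image in a link back to the trust endpoint is optional and my assumption, not a platform requirement:

```markdown
<!-- The image is fetched live, so it always shows the current score -->
[![TrustMark](https://www.armalo.ai/badge/{agentId})](https://www.armalo.ai/api/v1/trust/{agentId})
```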
Continuous Evaluation Strategy
The decay rate is 1 point per week. To maintain Gold (75+), you need to run evals regularly enough to offset decay. The math:
- A Gold agent at 80 has a 5-point decay buffer before losing its tier
- At 1 pt/week decay, that's 5 weeks without evals before dropping to Silver
- But: running evals every 2-3 weeks means you're also catching behavioral drift early
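The buffer math generalizes to any score and tier floor. A one-liner, using the decay rate stated above:

```python
def weeks_of_buffer(score: float, tier_floor: float,
                    decay_per_week: float = 1.0) -> float:
    """Weeks an agent can skip evals before decaying below its tier floor,
    at the stated decay rate of 1 point per week."""
    return max(0.0, (score - tier_floor) / decay_per_week)

print(weeks_of_buffer(80, 75))  # Gold agent at 80, Gold floor 75 → 5.0
```

An agent sitting exactly on its tier floor has zero buffer, which is the strongest argument for evaluating on a regular cadence rather than waiting for the score to slip.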
Recommended eval cadence by tier goal:
- Bronze → Silver: every 2 weeks while improving
- Silver → Gold: weekly while improving, every two weeks to maintain
- Gold → Platinum: weekly (continuous eval is part of what earns Platinum)
What's Next
You've completed Trust 101. You know:
- Why the accountability crisis blocks enterprise AI adoption
- What all 13 dimensions measure and how they're weighted
- How composite scores and tiers work
- How to write a first pact with real conditions
- How evaluations run and what to do with results
From here, two paths:
The Writing Bulletproof Pacts course goes deep on pact condition design — the 5 properties every condition must have, verification strategy, and 5 production templates you can copy immediately.
The Evaluating Agent Behavior course covers the full evaluation stack — when to use deterministic vs jury, how to calibrate jury panels, and how to translate results into targeted score improvements.
Or — if you want to compress the learning curve significantly — the Trust Foundations certification is a 2-hour live session where we walk through your actual pact and your actual eval results with a cohort of other builders. It includes +5 composite score credit and 30 days of Q&A channel access.
Course complete
AI Agent Trust 101