Iteration and Score Improvement
Translating evaluation results into targeted score gains across dimensions.
Running evaluations is easy. Turning evaluation results into a systematically improving score is hard. This lesson is about the iteration loop — the process of reading results, identifying the highest-leverage fix, making a targeted change, and verifying improvement.
The Improvement Flywheel
Run eval → Read results → Find highest-leverage gap → Fix → Rerun → Measure delta
Each revolution of this loop should produce a measurable score gain. If you're running evals without score improvement, one of three things is wrong:
- You're fixing the wrong things
- Your fixes aren't actually being tested by the eval
- The eval conditions don't reflect the real failure modes
Reading Your Dimension Breakdown
The first step after any eval: look at the dimension breakdown, not the composite.
The composite is a summary. The dimension breakdown tells you where to act.
Identify your bottom three dimensions; improvement there usually has the most room to move the composite. But "lowest score" and "highest leverage" aren't the same thing:
Leverage = (gap from max) × (dimension weight)
Example:
| Dimension | Your Score | Weight | Gap × Weight (leverage) |
|---|---|---|---|
| Accuracy | 65 | 0.13 | 35 × 0.13 = 4.55 |
| Reliability | 80 | 0.12 | 20 × 0.12 = 2.40 |
| Self-Audit | 45 | 0.09 | 55 × 0.09 = 4.95 |
| Bond | 0 | 0.07 | 100 × 0.07 = 7.00 |
In this example, Bond has the highest leverage despite not being the lowest score — because it's at zero and carries 7% weight. Bonding now adds 7 composite points immediately.
Self-Audit (45) is the next highest leverage. Bringing it from 45 to 75 would add 30 × 0.09 = 2.7 points.
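The leverage ranking above can be sketched in a few lines. This is a minimal illustration, not an API of the platform; the `Dimension` shape and the example weights are taken from the table:

```typescript
interface Dimension {
  name: string;
  score: number;  // 0-100 on this dimension
  weight: number; // fraction of the composite
}

// Leverage = (gap from max score) × (dimension weight), highest first.
function rankByLeverage(dims: Dimension[]): Dimension[] {
  return [...dims].sort(
    (a, b) => (100 - b.score) * b.weight - (100 - a.score) * a.weight
  );
}

const dims: Dimension[] = [
  { name: "Accuracy", score: 65, weight: 0.13 },
  { name: "Reliability", score: 80, weight: 0.12 },
  { name: "Self-Audit", score: 45, weight: 0.09 },
  { name: "Bond", score: 0, weight: 0.07 },
];

// Bond (7.00) ranks first, then Self-Audit (4.95), Accuracy (4.55), Reliability (2.40).
console.log(rankByLeverage(dims).map((d) => d.name));
```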
The Fast Wins
Bond immediately. If bond is at zero, staking any USDC amount moves it from 0 to 100 on that dimension. 7% of your composite score is on the table with a single action.
Fix deterministic failures first. Deterministic failures are the cheapest to diagnose (exact failing pattern shown) and fastest to verify (rerun in seconds). Look at your failing condition list and find any with deterministic verification.
Add more test cases. Reliability is measured across runs. If you have 5 test cases, your reliability score is based on a small sample with high variance. Adding 20-30 diverse test cases stabilizes the reliability measurement and often reveals that your agent is actually more reliable than the small sample suggested.
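The sample-size effect follows the standard error of the mean: run-to-run noise shrinks with the square root of the number of test cases. A quick sketch (the per-case standard deviation of 15 points is illustrative, not a platform constant):

```typescript
// Standard error of a mean score: per-case stdDev / sqrt(sample size).
function standardError(stdDev: number, n: number): number {
  return stdDev / Math.sqrt(n);
}

// With an illustrative per-case score stdDev of 15 points:
// n = 5  → ~6.7 points of run-to-run noise in the dimension score
// n = 25 → ~3.0 points
console.log(standardError(15, 5), standardError(15, 25));
```

This is why a 5-case reliability score can swing by several points between identical runs while a 25-case score barely moves.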
Accuracy Improvement Strategies
Accuracy measures output correctness. If your accuracy score is low:
1. Check your prompt for ambiguity. Often accuracy failures trace to underspecified system prompts. The agent isn't lying — it doesn't know what you expect. Add more specific instructions and reference examples.
2. Add reference outputs to your pact. With reference outputs defined, the jury can compare the agent's responses against them. Without references, judges have no ground truth and rely purely on their own knowledge — which introduces noise.
3. Check jury rubric calibration. If jury judges are consistently scoring 60-65 for responses you think are good, the rubric may be miscalibrated. Run the calibration exercise from Lesson 3: grade 20 test cases manually, compare to jury scores, adjust anchors.
4. Look for consistent failure patterns. If accuracy drops on a specific topic category, that's a data/knowledge gap in the agent. Either provide better context in the system prompt or explicitly scope the pact to exclude that category.
Reliability Improvement Strategies
Reliability measures consistency across repeated runs. If your reliability score is low:
1. Check for non-determinism in your agent. Temperature > 0 introduces natural variation. For reliability-critical conditions, consider running the agent with temperature=0 or using output sampling (ask for multiple candidates, pick the most consistent).
2. Increase test case volume. Small samples (< 10 test cases) produce high-variance reliability scores. Add 20+ test cases before drawing conclusions about reliability.
3. Check for input sensitivity. Run the same semantic input with different phrasings (10 paraphrases of the same question). If scores vary widely, the agent is overfit to input surface form rather than underlying intent.
4. Fix flaky test infrastructure. If your harness has non-deterministic setup (random data seeding, timing-dependent tests), the reliability score captures harness flakiness, not agent flakiness. Fix the harness first.
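The input-sensitivity check from point 3 can be scripted. A minimal sketch, assuming your harness already exposes something like a `runAgent` call and a `scoreResponse` grader (both names hypothetical):

```typescript
// Run the same semantic question under several paraphrases and measure
// how much the graded score varies. runAgent and scoreResponse are
// assumed harness functions, not part of any specific platform API.
async function inputSensitivity(
  paraphrases: string[],
  runAgent: (input: string) => Promise<string>,
  scoreResponse: (output: string) => number
): Promise<{ mean: number; stdDev: number }> {
  const scores: number[] = [];
  for (const p of paraphrases) {
    scores.push(scoreResponse(await runAgent(p)));
  }
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  const variance =
    scores.reduce((s, x) => s + (x - mean) ** 2, 0) / scores.length;
  return { mean, stdDev: Math.sqrt(variance) };
}
// A high stdDev across paraphrases of the same question suggests the agent
// is keyed to surface form rather than underlying intent.
```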
Self-Audit Calibration
The Self-Audit dimension is the most commonly neglected and one of the most improvable. Here's the specific process:
Step 1: Add a self-assessment prompt to your eval harness.
After each agent response, send a second prompt:
Given your response above, rate your confidence that it is accurate and complete.
Use this scale:
- 90-100: I'm highly confident this is correct and complete
- 70-89: I'm fairly confident but there may be minor gaps
- 50-69: I'm uncertain about some aspects
- 30-49: I have significant uncertainty
- 0-29: I'm not confident in this response
Respond with JSON: {"confidence": <integer 0-100>, "uncertainty_about": "<optional: what you're unsure of>"}
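The harness then needs to parse that JSON reply, tolerating models that wrap it in extra text or drift out of range. A minimal sketch, assuming nothing beyond the prompt's response format:

```typescript
interface SelfAssessment {
  confidence: number;           // 0-100 integer per the prompt's scale
  uncertainty_about?: string;
}

// Extract and sanitize the agent's self-assessment from a raw reply.
function parseSelfAssessment(raw: string): SelfAssessment | null {
  const match = raw.match(/\{[\s\S]*\}/); // tolerate prose around the JSON
  if (!match) return null;
  try {
    const parsed = JSON.parse(match[0]);
    if (typeof parsed.confidence !== "number") return null;
    // Clamp to the 0-100 scale in case the model drifts out of range.
    parsed.confidence = Math.min(100, Math.max(0, Math.round(parsed.confidence)));
    return parsed;
  } catch {
    return null;
  }
}
```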
Step 2: Compute calibration error.
function calibrationError(selfAssessedConfidence: number, juryScore: number): number {
  return Math.abs(selfAssessedConfidence - juryScore);
}
// Perfect calibration: error = 0
// Overconfident: high self-confidence, low jury score → error = gap
// Underconfident: low self-confidence, high jury score → error = gap
Step 3: Analyze the pattern.
Over 30+ test cases:
- If average error > 20: agent is miscalibrated, typically overconfident
- If agent consistently scores self-confidence 85–90 regardless of actual accuracy: prompting issue
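The Step 3 analysis can be aggregated directly. A sketch (self-contained, so `calibrationError` is repeated here); the signed bias distinguishes over- from underconfidence:

```typescript
function calibrationError(selfConf: number, juryScore: number): number {
  return Math.abs(selfConf - juryScore);
}

interface CaseResult {
  selfConfidence: number;
  juryScore: number;
}

// Aggregate calibration over a batch of test cases (30+ recommended).
function analyzeCalibration(results: CaseResult[]) {
  const errors = results.map((r) => calibrationError(r.selfConfidence, r.juryScore));
  const meanError = errors.reduce((s, e) => s + e, 0) / errors.length;
  // Signed bias: positive → overconfident, negative → underconfident.
  const bias =
    results.reduce((s, r) => s + (r.selfConfidence - r.juryScore), 0) /
    results.length;
  return { meanError, miscalibrated: meanError > 20, bias };
}
```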
Step 4: Fix with few-shot calibration examples.
Add to the system prompt: examples where the agent was wrong and should have expressed uncertainty. LLMs learn calibration from examples much better than from instructions.
Safety Score Improvement
Safety is the dimension with the most catastrophic failure modes. Don't try to optimize safety scores by making the refusal behavior more aggressive (over-refusal hurts usability). Optimize by making refusals more targeted.
Understand what's failing: Look at the test cases where safety failed. Are these jailbreak failures (agent was convinced to do something it shouldn't) or false positives (agent refused something legitimate)?
- Jailbreak failures → Harden the system prompt with explicit boundary statements; add refusal examples to the few-shot section
- False positive refusals → Clarify what IS in scope with positive examples; the agent is being too conservative
The target: high safety score with minimum refusal rate on legitimate requests.
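Tracking that target means measuring both failure modes separately. A minimal sketch over labeled safety test cases (the `SafetyCase` shape is an assumption for illustration):

```typescript
interface SafetyCase {
  shouldRefuse: boolean; // ground-truth label for the test input
  refused: boolean;      // what the agent actually did
}

// Split safety failures into the two modes described above.
function safetyBreakdown(cases: SafetyCase[]) {
  const harmful = cases.filter((c) => c.shouldRefuse);
  const legit = cases.filter((c) => !c.shouldRefuse);
  return {
    // Jailbreak rate: harmful requests the agent complied with.
    jailbreakRate: harmful.filter((c) => !c.refused).length / harmful.length,
    // False-positive rate: legitimate requests the agent refused.
    falsePositiveRate: legit.filter((c) => c.refused).length / legit.length,
  };
}
```

Drive both numbers down together; a falling jailbreak rate paired with a rising false-positive rate means the refusals got blunter, not more targeted.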
Building Your Iteration Cadence
Week 1 (baseline): Submit pact, run first eval, identify dimension breakdown. Fix deterministic failures and bond immediately.
Week 2 (quick wins): Address the two highest-leverage dimension gaps. Rerun and measure delta. Add 20 more test cases.
Week 3+ (systematic improvement): One dimension per week. Pick the highest-leverage remaining gap. Make a targeted fix. Verify with a partial eval (just the affected conditions). Full eval at end of week.
Monthly: Adversarial eval run for safety conditions. Score trend review. Pact condition review (are your conditions still testing the right things?).
When Scores Plateau
If you've been improving steadily and score gains stop:
- Check your test distribution. If all test cases are similar, the eval is measuring consistency on a narrow slice of behavior. Add edge cases, adversarial inputs, and out-of-distribution queries.
- Add new conditions. Behavioral properties that aren't in your pact can't improve your score, and gaps in them stay invisible. Add conditions for properties you know matter but haven't specified yet.
- Increase jury panel size. If you're at 3 judges, move to 5. More judges reduce noise and may reveal that scores you thought had plateaued were actually noisy.
- Do a condition audit. Read each condition with fresh eyes: is this condition still the right specification? Are the test cases still representative? Are the pass thresholds appropriate?
You've completed Evaluating Agent Behavior.
You now have the full picture: the evaluation stack, how to implement deterministic checks, how to configure and calibrate an LLM jury, and how to turn results into systematic score improvement.
The fastest path to Gold tier: Bond now, fix deterministic failures, add 30+ test cases, and calibrate self-audit. Those four actions compound.
For deeper work — advanced eval strategies, PactSwarm orchestration, multi-agent compliance architecture, and escrow design — the Agent Architecture Bootcamp is a 4-week live cohort, capped at 12 participants, starting May 5.