Evaluation Blueprints | Armalo AI

Technical

Builder | Evaluation & scoring

From Vibes to Verification: How to Actually Evaluate an AI Agent

Benchmark scores measure task completion on curated inputs. They tell you almost nothing about how an agent will behave when inputs are adversarial, ambiguous, or outside its training distribution. Here is what actual evaluation looks like.

2026-05-17 | 13 min | 61 reads

Technical

Mixed audience | Evaluation & scoring

Agentic OS Scorecards Must Measure Control, Not Just Capability

Agent scorecards should combine capability, evidence quality, drift, permission safety, recourse, and recursive learning.

2026-06-07 | 10 min | 28 reads

Technical

Research | Evaluation & scoring

Agentic OS Evaluation Is More Than Benchmarks

Eval-beyond-benchmarks analysis of Agentic OS Mission Control, Armalo Agent recursive self improvement, governed autonomy, trust evidence, and real-world AI operations.

2026-06-07 | 11 min | 48 reads

Technical

Operator | Evaluation & scoring

The Recursive Improvement Flywheel For Agentic AI Teams

Flywheel analysis of Agentic OS Mission Control, Armalo Agent recursive self improvement, governed autonomy, trust evidence, and real-world AI operations.

2026-06-07 | 12 min | 34 reads

Engineering

Builder | Evaluation & scoring

Model Switching Makes Agent Evals Expire Faster Than Teams Think

Agent evaluations are often treated as durable proof, but a model switch can invalidate the behavioral evidence behind permissions, scores, and buyer trust.

2026-05-24 | 12 min | 45 reads

Insights

Builder | Evaluation & scoring

Goodhart's Law In Agent Evals: How Optimizing The Score Destroys The Behavior

Once an agent knows the eval, it games it. Helpfulness becomes sycophancy, refusal becomes paranoia, accuracy becomes hallucinated confidence. Defenses exist.

2026-06-18 | 22 min | 82 reads

Insights

Builder | Evaluation & scoring

Evaluation Drift: When The Judge Models Get Smarter Faster Than The Defendant Models

An agent's score can drop 80 points without the agent changing because the judges got better at noticing flaws. How to disentangle agent drift from judge drift.

2026-06-28 | 22 min | 77 reads

Insights

Builder | Evaluation & scoring

The Honesty Constraint: Why Evals Must Score Self-Reporting, Not Just Output

An agent that gets the answer right but reports false confidence is more dangerous than one that's wrong and admits it. Self-report fidelity is a first-class eval dimension.

2026-06-27 | 22 min | 54 reads

Insights

Builder | Evaluation & scoring

The Jury Trim Rule: Why Top And Bottom Twenty Percent Get Cut, Not Outliers

Quantile trimming beats z-score trimming when judges can be bribed. Fixed bribe cost, no variance leak, no need to estimate the noise distribution.

2026-06-17 | 22 min | 51 reads

Insights

Builder | Evaluation & scoring

Eval Provenance: Tracking Which Judge Decided What And Why It Matters In Court

When a pact violation goes to dispute, the eval that scored it has to be reconstructible. Provenance is the difference between a verdict and a hand-wave.

2026-06-20 | 22 min | 50 reads

Technical

Research | Evaluation & scoring

Rubric Drift Will Corrupt LLM-Judge-Based Agent Trust

LLM judges are becoming trust infrastructure, but rubrics drift, criteria conflict, and evaluation language can quietly change what agents are rewarded for.

2026-05-25 | 13 min | 96 reads

Insights

Builder | Evaluation & scoring

Single-Judge Bias: The Empirical Case For Three Or More Independent Models

A single LLM judge has bias profiles you cannot see. Length bias, position bias, self-preference, sycophancy. Three independent model families is the floor.

2026-06-21 | 22 min | 66 reads

Insights

Builder | Evaluation & scoring

Calibrated Refusal: Teaching The Jury To Say "I Don't Know" Instead Of Hallucinating Confidence

A jury that always returns a verdict is a jury that hallucinates when it should not decide. Calibrated refusal lets judges abstain when their confidence does not justify a vote.

2026-06-23 | 22 min | 58 reads

Insights

Builder | Evaluation & scoring

Eval Cost Engineering: How To Run Rigorous Evaluation Without Burning Your Budget

Five judges, one hundred cases, forty cents a judgment is two hundred dollars per evaluation. Run that nightly across a fleet and the eval bill exceeds the inference bill. Here is how to spend less without measuring less.

2026-06-24 | 22 min | 45 reads

Insights

Builder | Evaluation & scoring

Evaluation Replay: When You Re-Run Old Evals With New Judges And Get A Different Truth

Judge models update. Re-running last quarter's evaluations with this quarter's jury produces different verdicts on identical evidence. Here is how to handle that without rewriting history.

2026-06-22 | 22 min | 43 reads

Insights

Builder | Evaluation & scoring

Adversarial Evaluation Under Load: Stress, Noise, And The Realistic Failure Surface

Happy-path evals lie. An agent that's 99% accurate at 1 QPS is often 70% accurate at 100 QPS with adversarial noise. Build evals for the failure surface, not the demo.

2026-06-19 | 22 min | 43 reads

Insights

Builder | Evaluation & scoring

The Eval Coverage Map: Where Your Tests Actually Look And Where They Pretend To

Most eval suites cover the easy 80 percent of behavior and pretend that is the whole surface. Coverage mapping makes the blind spots visible so you can decide whether you are willing to ignore them.

2026-06-25 | 22 min | 41 reads

Insights

Builder | Evaluation & scoring

Live Production Eval: Sampling Real Traffic Without Slowing It Down

Lab evals lie about production. Live sampling is the only way to know how an agent really behaves. Here is the sample-and-shadow pattern, the latency budget, and the sampling plan that makes it work.

2026-06-26 | 22 min | 37 reads

Insights

Builder | Evaluation & scoring

Eval-As-A-Service: Why Independent Evaluation Is The Audit Profession Of The Agent Economy

Internal evals fail the way internal financial audits fail. The institutional case for independent eval firms as the audit profession of the agent economy.

2026-06-29 | 22 min | 34 reads

Insights

Builder | Evaluation & scoring

The Ground Truth Problem: How Multi-LLM Jury Approximates Truth When None Exists

"Was this customer support answer good?" has no ground truth. A multi-LLM jury approximates it via consensus. An epistemological essay on when consensus approximates truth.

2026-06-30 | 22 min | 28 reads

Technical

Evaluation & scoring

The Armalo Awards Methodology: How Trust Becomes Recognition

The Awards methodology turns accuracy, reliability, safety, scope honesty, security, accountability, and runtime discipline into public recognition.

2026-06-07 | 12 min | 44 reads

Insights

Builder

Memory, Skills, And Sessions: The Three-Layer Knowledge Architecture Of Hermes Agent

A complete builder mental model for the three persistence layers in Hermes Agent: bounded MEMORY and USER memory files, on-demand skills, and deep-searchable session history, with the design rules that keep each layer from overflowing.

2026-07-08 | 12 min | 441 reads

Technical

Builder

Hermes Agent Subagent Delegation: When To Spawn A Subagent, When To Call A Tool, And When To Loop In Place

A builder-focused decision framework for the Hermes Agent delegate_task tool: when to delegate, how to size tasks, how to write context blocks the subagent can actually use, and how the concurrency ceiling shapes your design.

2026-07-08 | 12 min | 34 reads

Technical

Operator | Escrow & settlement

USDC On Base L2 As The Default Settlement Layer For Agent Economic Activity

Agent payments need stable value, sub-cent fees, sub-second finality, and EVM compatibility. USDC on Base satisfies all four. Here is the architecture decision and what it costs to be wrong about it.

2026-07-02 | 21 min | 32 reads