Technical

Test Harnesses for AI Agents: Moving Beyond 'Does It Run?' to 'Does It Behave?'

2026-02-0414 minArmalo Team

Unit tests check code correctness. Harness tests check behavioral correctness. For AI agents, the difference is the entire quality problem — here's the methodology for building behavioral harnesses that actually work.

Continue the reading path

Topic hub

Agent Evaluation

This page is routed through Armalo's metadata-defined agent evaluation hub rather than a loose category bucket.

Strategic Guide

Agent Evaluation Framework

Curated Collection

Research-Backed

Pro checkout

Turn this trust model into a scored agent.

Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.

Start Pro on Whop Compare plans

The first question a developer asks about new software is "does it run?" The correct answer is "yes, it runs" — the code compiles, the tests pass, the service starts. This is a necessary but wildly insufficient criterion for production deployment.

For traditional software, the next question is "does it work?" — does it produce correct outputs for valid inputs, handle edge cases appropriately, and fail gracefully under invalid inputs? Unit tests, integration tests, and end-to-end tests answer this question.

For AI agents, there's a third question that doesn't have a direct analog in traditional software testing: "does it behave?" An agent can run, and it can work on its standard test cases, and still exhibit systematic behavioral problems in production — hallucinating on edge cases, drifting in its scope compliance, responding to adversarial inputs in ways that violate its pact conditions.

Harness testing is the methodology for answering the "does it behave?" question. It's systematically different from unit testing and requires a different toolchain, a different test design philosophy, and a different definition of what "passing" means.

TL;DR

Behavioral harnesses test what agents do, not just what they output: A harness covers the full behavioral envelope — how the agent handles uncertainty, adversarial inputs, edge cases, and multi-step reasoning chains.
Three verification modes serve different test types: Deterministic (reference matching), heuristic (rubric evaluation), and jury-based (multi-LLM consensus) verification are each appropriate for different aspects of behavioral testing.
Test case design is the hardest part: Writing behavioral test cases that are specific enough to be meaningful but general enough not to overfit to known agent behavior is a distinct discipline.
Harness stability requires out-of-distribution testing: Harnesses that only cover cases similar to the agent's training data don't test the behavioral boundaries that matter in production.
Regression measurement requires baselines: Behavioral regression testing only works if you have a baseline behavioral record to regress against.

Want a free trust score on your own agent? Armalo runs the same 12-dimension audit you just read about.

Run a free trust check →

The Methodology Difference: Code Testing vs. Behavioral Testing

Code testing and behavioral testing share the same goal — ensuring a system works correctly — but they differ in almost every implementation dimension.

Code testing is fundamentally deterministic. Given input X, function F should produce output Y. The test either passes or fails, and the outcome is unambiguous. The test suite can achieve high coverage by enumerating the distinct code paths that need to be exercised. Coverage is measurable as a percentage of code lines, branches, or functions exercised.

Behavioral testing is fundamentally stochastic and multi-dimensional. Given input X, agent A should exhibit behavioral property P — where P might be "express appropriate uncertainty," "decline this request," or "produce an accurate summary." The pass/fail criterion requires interpretation. Coverage is not a line-coverage percentage but a behavioral envelope coverage — how much of the behavioral space the agent might encounter in production has been tested?

This difference shapes everything else about harness design:

Test input design. Code tests use inputs selected for branch coverage. Behavioral tests use inputs selected for behavioral scenario coverage — which is a qualitatively different design process.

Pass/fail criteria. Code tests use equality or type checking. Behavioral tests use reference matching (for deterministic behaviors), rubric evaluation (for heuristic behaviors), or jury consensus (for subjective quality behaviors).

Failure interpretation. A failing code test identifies a specific incorrect code path. A failing behavioral test identifies that the agent's behavior on a specific behavioral scenario doesn't meet the standard — but the root cause might be in the prompt, the context, the model, or the task design.

Flakiness and variability. Code tests are deterministic (or should be). Behavioral tests will exhibit some variability because the underlying model is stochastic. Handling this variability — distinguishing real failures from noise — requires statistical approaches that code testing doesn't need.

Code Testing vs. Behavioral Harness Testing

Dimension	Code Unit/Integration Testing	Behavioral Harness Testing
Correctness definition	Deterministic — output equals expected value	Probabilistic — behavioral properties are met at defined confidence level
Pass/fail criteria	Equality, type matching, exception handling	Reference matching, rubric evaluation, jury consensus
Coverage metric	Line/branch/function coverage %	Behavioral scenario coverage (qualitative)
Test input design	Branch coverage enumeration	Behavioral scenario enumeration + adversarial generation
Failure interpretation	Identifies specific incorrect code path	Identifies behavioral property violation (root cause investigation needed)
Variability handling	Tests should be deterministic	Statistical thresholding required for stochastic outputs
Maintenance burden	Low for stable code	High — behavioral scenarios evolve with production exposure
Primary toolchain	pytest, jest, JUnit, etc.	Custom evaluation framework + LLM jury + adversarial agent
What it catches	Logic errors, edge case handling, integration failures	Hallucination, scope violations, calibration failures, adversarial vulnerabilities

Three Verification Modes

The most important design decision in harness testing is choosing the right verification mode for each test case. Applying the wrong verification mode either produces false positives (failing cases that are actually correct) or false negatives (passing cases that are actually wrong).

Deterministic verification applies when there is a single correct answer that can be compared against the agent's output using exact matching, structured comparison, or a defined normalization function. Examples: JSON schema validation (is the output a valid JSON object with required fields?), entity extraction accuracy (does the output contain the correct named entities from the source document?), mathematical computation verification (is the numerical result within acceptable tolerance?).

Deterministic verification is the cheapest and most reliable mode. It should be applied wherever possible. The limitation is that most interesting agent behaviors are not deterministic — the agent has discretion in how it answers, and the correct answer has multiple valid formulations.

Heuristic verification applies when the correct behavior can be characterized by a rubric without a single correct answer. Examples: does the output contain a summary of the key points (without requiring a specific phrasing)? Does the output express uncertainty on the ambiguous input? Does the output follow the structural requirements (sections, length, formatting) defined in the pact?

Heuristic verification requires an automated evaluation mechanism — typically a lightweight LLM judge with a well-defined rubric, or a set of programmatic checks that test specific behavioral properties. The rubric must be specific enough to reliably distinguish correct from incorrect behavior, and it should include positive and negative examples.

Jury-based verification applies when the behavioral property requires subjective quality judgment that no single rubric can reliably capture. Examples: is the analysis output at professional quality? Does the agent's reasoning demonstrate appropriate depth and nuance? Is the output well-calibrated to the complexity of the input?

Jury-based verification uses the full four-provider jury system with outlier trimming. It's the most expensive verification mode but the most reliable for high-stakes, subjective quality assessments. It should be reserved for the test cases where the cost of a wrong verdict — passing a substandard output or failing a correct one — is highest.

Test Case Design: The Hard Part

The quality of a behavioral harness is almost entirely determined by the quality of its test cases. Good harness test cases are:

Specific about the behavioral property being tested. Each test case should target a single behavioral property, not multiple properties simultaneously. "The agent should accurately extract the contract term and express appropriate uncertainty when the term is ambiguous" is two test cases, not one.

Representative of the production behavioral space. Test cases should reflect the actual distribution of inputs the agent will encounter in production — which means looking at production logs (if available), domain expert input about edge cases, and adversarial generation targeting boundary conditions.

Out-of-distribution in controlled ways. The most valuable test cases are those that probe behavioral boundaries the agent hasn't been explicitly trained for. This requires intentional OOD test case design — inputs that are plausible variations on training distribution cases but push the agent toward its boundaries.

Non-trivial to game. Test cases that can be passed by a simple heuristic — "always express uncertainty on any question containing the word 'approximately'" — are not testing the behavioral property they claim to test. Behavioral test cases should require genuine demonstration of the target property.

Maintained over time. Test cases that become predictable to the agent (because the agent has been optimized against them) stop measuring the behavioral property. New test cases should be added as production exposure reveals new behavioral scenarios, and old test cases should be rotated or replaced periodically.

Harness Stability and Out-of-Distribution Testing

Harness stability — the 5% composite trust score dimension — measures how well an agent generalizes from its declared harness to genuinely novel test cases. An agent that achieves 95% on its declared harness but 60% on novel cases from the same distribution has severely overfit its harness.

This is the behavioral analog of the overfit model problem in machine learning: if the test set is too similar to the training set, test accuracy overestimates generalization performance. In harness testing, if the test cases are too similar to cases the agent has been optimized for, harness scores overestimate behavioral reliability.

Armalo's red-team evaluation generates novel test cases from the same behavioral distributions as declared harness cases, without using the specific cases the agent has been exposed to. The evaluation then measures the gap between harness performance and novel-case performance. A large gap indicates harness gaming or overfitting.

Designing for harness stability means: writing test cases that test the underlying behavioral property rather than the surface form of specific inputs, maintaining diversity in test case phrasings and contexts, and regularly introducing genuinely novel test cases that the agent hasn't been optimized against.

Regression Testing: Catching Behavioral Changes

Behavioral regression testing is the application of harness testing to detect changes in agent behavior over time. It requires two things that code regression testing doesn't: a behavioral baseline and a statistical regression detection method.

Behavioral baselines. Unlike code testing, where the expected output is deterministic, behavioral baselines are statistical distributions — the expected pass rate, average confidence scores, and behavioral property rates across the test suite. The baseline is established from historical performance and updated as the agent is deliberately improved.

Statistical regression detection. A single test run failing a few cases isn't necessarily a regression — behavioral tests have inherent variability. Regression is detected as a statistically significant degradation in performance across a test category, relative to the baseline. This requires enough test cases per category to detect meaningful shifts with acceptable false positive rates.

The standard regression testing workflow: run the full harness weekly, compare results to the rolling 4-week baseline, flag any category showing statistically significant degradation (typically >2 standard deviations below baseline average), trigger investigation for flagged categories, and update the baseline when deliberate improvements are made.

The most important regression signal to watch is the gap between standard test performance and adversarial test performance. If standard tests remain stable but adversarial performance degrades, the agent has become more susceptible to adversarial inputs — a security and reliability regression that's easy to miss without explicit adversarial testing.

Building Your First Behavioral Harness

For teams starting from scratch, here's the minimum viable behavioral harness for a production AI agent:

Step 1: Define behavioral invariants. List 5-10 properties your agent must always exhibit: scope refusals, uncertainty expression patterns, output structure requirements, accuracy floors for specific task types. These become your harness test categories.

Step 2: Write 5-10 test cases per invariant. For each behavioral invariant, write test cases covering the direct case, indirect case, boundary case, and 2-3 adversarial variants. Start with 50-100 cases total.

Step 3: Assign verification modes. Classify each test case: deterministic (clear reference answer), heuristic (rubric evaluable), or jury (requires subjective quality judgment). Design appropriate verification mechanisms for each.

Step 4: Run baseline evaluation. Run the harness 3-5 times to establish a performance baseline. Behavioral tests are stochastic; run multiple times to distinguish signal from noise.

Step 5: Automate and schedule. Integrate the harness into your deployment pipeline (runs on every deploy) and a weekly schedule (runs even with no code changes, to catch model update effects).

Step 6: Add cases over time. After every production incident, add a test case that would have caught the failure. After every model update, add test cases for any behavioral changes you observe. Grow the harness incrementally as production exposure reveals new behavioral scenarios.

Frequently Asked Questions

How do you prevent the harness from becoming the optimization target? This is the Goodhart's Law problem applied to behavioral testing. Countermeasures: keep a portion of the harness private (not shared with agent developers), regularly rotate test cases, include out-of-distribution cases that can't be directly optimized for, and measure harness stability explicitly through novel-case performance comparison.

How many test cases do you need? For statistical regression detection with reasonable sensitivity (detecting a 5% degradation with 95% confidence), you need roughly 100-200 cases per behavioral category. For a minimum viable harness that gives useful signal, start with 50 total cases across your most important behavioral categories.

How do you handle test cases that are genuinely ambiguous? Ambiguous test cases are the most valuable — they test calibration, which is one of the hardest behavioral properties to get right. Design them deliberately, and use multiple verification runs (since stochastic models may answer them differently across runs) to establish whether the agent's response distribution is appropriately calibrated to the ambiguity.

What's the cost of running a full behavioral harness? For a 200-case harness with deterministic and heuristic verification: approximately $2-5 per run. For a harness with significant jury evaluation: approximately $10-30 per run depending on jury composition. Weekly runs of a 200-case harness cost approximately $100-200 per month — entirely manageable for production deployments.

Should harness testing replace human evaluation? No. Harness testing and human evaluation serve complementary purposes. Harness testing provides continuous, automated signal at scale. Human evaluation provides high-quality signal on edge cases, novel scenarios, and situations where automated verification is genuinely uncertain. The goal is to use human evaluation to inform and improve harness design, not to replace one with the other.

How do you handle multi-step agents where the harness needs to evaluate chains of actions, not single outputs? Multi-step harness testing requires evaluation of the full trajectory, not just the final output. This means logging each step in the agent's execution, checking behavioral properties at each step (not just at the end), and defining acceptance criteria that cover both intermediate and final behavior. It's more complex but follows the same verification mode framework.

Key Takeaways

Behavioral harness testing answers a fundamentally different question than unit testing: not "does the agent produce the correct output?" but "does the agent exhibit the correct behavioral properties across its operational envelope?"
Three verification modes serve different test types: deterministic for precise behavioral requirements, heuristic for rubric-evaluable properties, and jury-based for high-stakes subjective quality judgments.
Test case design — not tooling — is the hardest part of building effective behavioral harnesses, and it requires domain expertise about what behavioral properties matter in production.
Harness stability requires out-of-distribution testing; harnesses that only cover familiar cases don't test the behavioral boundaries that matter.
Statistical regression detection is required for behavioral testing because stochastic agents will have test-to-test variation that can't be interpreted as regressions without baseline comparison.
Harnesses must be maintained and grown over time — add cases after incidents, after model updates, and as production exposure reveals new behavioral scenarios.
The harness stability dimension in the composite trust score exists specifically to penalize agents that overfit their declared harness — novel-case performance is the real behavioral reliability signal.

Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.

Explore Armalo

Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:

Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.

Design partnership or integration questions: dev@armalo.ai · Docs · Start free

Free downloadNo credit card · Instant PDF

The Trust Score Readiness Checklist

A 30-point checklist for getting an agent from prototype to a defensible trust score. No fluff.

12-dimension scoring readiness — what you need before evals run
Common reasons agents score under 70 (and how to fix them)
A reusable pact template you can fork
Pre-launch audit sheet you can hand to your security team

Pro checkout

Turn this trust model into a scored agent.

Start with a 14-day Pro trial, register a starter agent, and get a measurable score before you wire a production endpoint.

Start Pro on Whop Compare plans

← Back to Blog

Put the trust layer to work

Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.

Read the docs Start building

Comments

No comments yet. Be the first to share your thoughts.

Loading comments…

Test Harnesses for AI Agents: Moving Beyond 'Does It Run?' to 'Does It Behave?'

Turn this trust model into a scored agent.

TL;DR

The Methodology Difference: Code Testing vs. Behavioral Testing

Code Testing vs. Behavioral Harness Testing

Three Verification Modes

Test Case Design: The Hard Part

Harness Stability and Out-of-Distribution Testing

Regression Testing: Catching Behavioral Changes

Building Your First Behavioral Harness

Frequently Asked Questions

Key Takeaways

Explore Armalo

The Trust Score Readiness Checklist

Turn this trust model into a scored agent.

Put the trust layer to work

Comments

Leave a comment