Designing Testable Behavioral Pacts: From Aspirations to Specifications
If your behavioral contract for an AI agent can't fail a specific test, it's not a contract. It's a wish list. Here is how to write pacts that are actually falsifiable — and why the adversarial framing is the right design tool.
If your behavioral contract for an AI agent cannot fail a specific test, it is not a contract. It is a wish list.
This is not a semantic distinction. A behavioral contract that cannot fail cannot hold an agent accountable. It cannot be measured against. It cannot be the basis for a trust score, an escrow condition, or an independent evaluation. A contract without a failure condition is a document that says "we intend for our agent to behave well."
Most behavioral contracts for AI agents are exactly this.
The Aspiration Trap
Why do teams write aspirational pacts when they know testable specifications are more useful?
The pull is real. Aspirations are easier to write because they do not require knowing your failure modes in advance. "The agent shall be helpful and accurate" can be written in five minutes without understanding how the agent works. "The agent shall achieve ≥ 88% accuracy on the reference test suite as evaluated by the multi-LLM jury, measured over a rolling 30-day window with a maximum of 3 consecutive failures permitted" requires knowing what accuracy means for this agent, what the test suite looks like, who evaluates it, and what an acceptable failure budget is.
Aspirational contracts feel complete because they describe the goal. They are not complete. They describe the goal without defining the measurement.
The second pull is avoidance of commitment. An aspirational contract cannot be violated because it has no specific conditions to violate. An operator who writes "the agent shall be accurate" cannot be held to a specific standard. An operator who writes "≥ 88% accuracy on this test suite" can be — and that is uncomfortable if you are not confident in your agent.
This discomfort is exactly the signal you should pay attention to. If writing down a specific, falsifiable condition makes you nervous, that is the evaluation data you need: you do not know whether your agent meets that condition, which means you should find out before deploying.
Three pairs, aspiration versus specification:
| Aspiration | Specification |
|---|---|
| "The agent shall be accurate" | "Outputs shall score ≥ 4/5 on accuracy as rated by an independent jury, measured monthly" |
| "The agent shall respond quickly" | "P95 latency ≤ 2,000ms, measured over rolling 7-day window" |
| "The agent shall not produce harmful content" | "Zero instances of content scoring < 2/5 on the safety dimension, with a unanimous jury requirement" |
The specification is harder to write. It is also the only one that can be evaluated.
The Three Components of a Testable Pact
Component 1: Falsifiable Condition
A condition is falsifiable if it is possible to construct a test that it fails.
"The agent shall be accurate" cannot fail — there is no standard against which to measure failure. "The agent shall achieve ≥ 92% accuracy on the defined test suite" can fail if the agent scores 87%.
Five condition types and what makes each falsifiable:
Accuracy conditions. Require a reference — a test suite, a benchmark, a specific dataset — against which outputs can be evaluated. "High accuracy" is not a condition. "≥ 92% accuracy on the 50-question reference test suite" is a condition.
Latency conditions. Specify a percentile, not just a mean. "Under 2 seconds" measured at the mean can mask high-latency tail cases. "P95 ≤ 2,000ms" is falsifiable on the distribution that matters. Means hide outliers; percentiles expose them.
Safety conditions. Require a scoring standard with a threshold, not a qualitative description. "Safe outputs" is aspirational. "Zero outputs scoring < 2/5 on the safety rubric as evaluated by unanimous jury verdict" is falsifiable. Note the unanimous requirement — safety conditions should use unanimous aggregation, not majority vote or weighted average, because a single safety violation is not an average problem.
Format compliance conditions. Specify schema, structure, or pattern. "Properly formatted JSON" is aspirational if no schema is defined. "Valid JSON conforming to the attached schema, validated by schema validator" is falsifiable.
Confidence calibration conditions. Require that stated confidence correlates with actual accuracy. "The agent shall not express high confidence on outputs that score below 3/5 on accuracy." Falsifiable if you define "high confidence" and have accuracy scores to compare against.
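To make the distinction concrete, here is a minimal sketch of four of these condition types as boolean checks. Every function name, threshold, and score scale below is an illustrative assumption, not a prescribed implementation; the point is that each check has inputs on which it returns False.

```python
import math

# Minimal sketches of falsifiable pact conditions. Each returns a boolean,
# so each can fail -- the defining property of a testable condition.
# All names, thresholds, and score scales are illustrative assumptions.

def accuracy_condition(scores, threshold=0.92):
    """Pass iff mean accuracy on the reference test suite meets the threshold."""
    return sum(scores) / len(scores) >= threshold

def latency_condition(latencies_ms, p95_limit_ms=2000):
    """Pass iff the nearest-rank P95 latency is within the limit."""
    ordered = sorted(latencies_ms)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return p95 <= p95_limit_ms

def safety_condition(jury_scores, floor=2):
    """Pass iff NO judge scores below the floor -- unanimous, not averaged."""
    return all(score >= floor for score in jury_scores)

def calibration_condition(conf_acc_pairs, high_conf=0.9, acc_floor=3):
    """Pass iff no high-confidence output scored below the accuracy floor."""
    return all(acc >= acc_floor
               for conf, acc in conf_acc_pairs if conf >= high_conf)
```

Each check can be pointed at real telemetry; "the agent shall be accurate" cannot be, because there is nothing to return False.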
Component 2: Measurement Window
A measurement window defines when the condition is evaluated and over what time period. Two choices: rolling window or point-in-time.
Point-in-time measurement evaluates the condition once at a fixed moment. "We will evaluate accuracy monthly." This produces a compliance rate based on one sample per period.
Rolling window measurement evaluates the condition continuously over a moving time range. "Accuracy ≥ 92% measured over the last 30 days, evaluated weekly." This produces a continuous compliance signal that catches intra-period drift.
Rolling windows are harder to game. An agent whose operator can anticipate when evaluations will occur can temporarily improve its behavior during the evaluation window and revert afterward, a form of Goodhart's law applied to evaluation itself. Rolling windows with irregular sampling intervals make this much more difficult.
How to choose window length: Match it to the feedback cycle of the behavior you are measuring. Accuracy on deterministic tasks can be measured over a 7-day window. Compliance rates in live transactions may require 30 days of data for statistical significance. Safety violations, if they occur at all, should be caught in real-time — the window for a safety condition may be "any instance in the last 90 days."
The cadence problem: Who triggers evaluation, and how frequently? For formal evaluations, an automated scheduler should trigger runs on a defined cadence — not a human who remembers to run them. For pact compliance telemetry from live transactions, measurement is continuous by construction.
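The rolling-window mechanics can be sketched in a few lines. The class and method names below are hypothetical, not any particular product's API; point-in-time measurement would instead sample once per period, whereas here every recorded result stays in scope until it ages out of the window.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical sketch of a rolling-window compliance monitor.
# Samples are recorded in chronological order as (timestamp, passed).
class RollingCompliance:
    def __init__(self, window_days=30, threshold=0.92):
        self.window = timedelta(days=window_days)
        self.threshold = threshold
        self.samples = deque()

    def record(self, ts, passed):
        self.samples.append((ts, passed))

    def compliance_rate(self, now):
        # Evict samples that have aged out of the rolling window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()
        if not self.samples:
            return None  # no data in window: a policy decision, not a pass
        return sum(p for _, p in self.samples) / len(self.samples)

    def compliant(self, now):
        rate = self.compliance_rate(now)
        return rate is not None and rate >= self.threshold
```

Because eviction happens on every read, the compliance signal is continuous: a drift that begins mid-month shows up as soon as enough failing samples land in the window, not at the next scheduled point-in-time check.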
Component 3: Failure Budget
Zero-tolerance failure budgets are almost always counterproductive.
An agent that cannot afford to fail at all will be optimized to minimize detection of failures, not to maximize actual performance. This is Goodhart's law applied to behavioral contracts: when failing the pact creates maximal consequences, the agent's operator is incentivized to minimize measured failures, not actual failure rates. Measurement avoidance, edge case exclusion, and sampling bias are the predictable results.
Good failure budgets look like this:
Rate-based budgets: "No more than 8% of outputs in a rolling 30-day window may fail the accuracy condition." This defines the tolerable failure rate without creating a hair-trigger that fires on individual outliers.
Consecutive violation conditions: "No more than 3 consecutive outputs may fail the accuracy condition." This is actually the most powerful construct — it catches systematic failures (something has changed, and every output is now failing) without triggering on random individual failures. A random 3% failure rate will almost never produce 3 consecutive failures. A systematic failure mode will almost always produce runs of consecutive failures.
Escalating thresholds: "≤ 5% failure rate triggers a review; > 10% failure rate suspends the escrow." Different failure rates trigger different responses, rather than a binary pass/fail.
Why consecutive violation conditions are underused: They require the evaluation system to track ordering, not just aggregate rates. Most evaluation pipelines only compute aggregate statistics. But the consecutive failure signal is often the earliest indicator of a systemic failure mode — catching it early, before the aggregate rate crosses threshold, can prevent significant downstream damage.
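A combined budget check is short to write once ordering is tracked. The sketch below pairs a rate budget with a consecutive-violation condition; the function name and default limits are illustrative assumptions matching the figures above.

```python
# Sketch of a failure-budget check. `results` is an ORDERED sequence of
# booleans (True = the output passed the pact condition); ordering is what
# lets the consecutive-violation condition work at all.

def budget_violated(results, max_failure_rate=0.08, max_consecutive=3):
    failure_rate = results.count(False) / len(results)
    if failure_rate > max_failure_rate:
        return True  # rate budget exhausted

    streak = 0
    for passed in results:
        streak = 0 if passed else streak + 1
        if streak > max_consecutive:
            return True  # systematic failure: too many failures in a row
    return False
```

Four failures scattered through 100 outputs stay inside both budgets; the same four failures back to back trip the consecutive condition even though the aggregate rate is identical.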
The Adversarial Framing
Once you have written a pact condition, apply this question: "How could a bad agent pass this pact while producing maximally bad outputs?"
This is the adversarial framing, and it is the most powerful design tool for pact quality evaluation.
Example 1: The accuracy condition with a stacked test suite.
Pact: "The agent shall achieve ≥ 90% accuracy on the defined test suite."
Adversarial question: "How could an agent pass this while being generally inaccurate?"
Answer: If the test suite contains 40 easy questions and 10 hard ones, an agent that only knows how to answer easy questions will score 90%+. The pact measures accuracy on the test suite, not accuracy on the task distribution the agent will actually face.
Fix: Ensure the test suite is representative of the actual input distribution. Include adversarial and edge-case inputs. Validate the test suite with independent review before finalizing the pact condition.
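One cheap guard against a stacked suite is to assert a minimum share of hard and adversarial items before the pact is finalized. The 40/10 split below mirrors the example above; the 30% floor is an assumed policy choice, not a universal number.

```python
from collections import Counter

# Illustrative representativeness check for a reference test suite.
# Difficulty labels would come from independent review, not the operator.
suite = [{"id": i, "difficulty": d}
         for i, d in enumerate(["easy"] * 40 + ["hard"] * 10)]

mix = Counter(item["difficulty"] for item in suite)
hard_share = mix["hard"] / len(suite)

# A 30% floor on hard items rejects this 40/10 suite (only 20% hard).
suite_is_representative = hard_share >= 0.30
```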
Example 2: The safety condition without unanimity.
Pact: "The agent shall score ≥ 3/5 on safety as rated by a majority of jury judges."
Adversarial question: "How could an agent produce unsafe content while passing this condition?"
Answer: Suppose one judge out of four gives a 1/5 safety score and three give 4/5. Under majority vote, the three passing judges outvote the one who caught the problem, so the condition passes. Averaging does no better: the mean is 3.25, which also clears the threshold. Either way, a safety violation that a single judge catches gets outvoted or averaged away.
Fix: Safety conditions should use unanimous aggregation, not majority vote or weighted average. Any judge returning a safety score below threshold should be disqualifying.
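The three aggregation rules can be compared directly on the jury from the example. This is a sketch; the scores and the 3/5 threshold follow the text above.

```python
# Three aggregation rules over the same jury scores. Only the unanimous
# rule lets a single dissenting judge fail the output.

def majority_passes(scores, threshold=3):
    return sum(s >= threshold for s in scores) > len(scores) / 2

def mean_passes(scores, threshold=3):
    return sum(scores) / len(scores) >= threshold

def unanimous_passes(scores, threshold=3):
    return min(scores) >= threshold

jury = [4, 4, 4, 1]  # one judge catches a safety violation
# majority and mean both pass this output; the unanimous rule fails it
```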
Example 3: The latency condition without percentile specification.
Pact: "The agent shall respond in under 2 seconds."
Adversarial question: "How could an agent fail users on latency while technically passing this condition?"
Answer: Average latency of 1.8 seconds with 15% of responses taking 8+ seconds passes the average condition. Users experiencing the tail are not covered by the pact.
Fix: Specify P95 or P99 latency, not average latency. "P95 response time ≤ 2,000ms" is the condition that protects actual users.
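The numbers from the example make the gap concrete. The distribution below has a sub-2-second mean with a 15% slow tail, as described above; the latency values themselves are assumed for illustration.

```python
import math

# Mean vs P95 on a distribution with a 15% slow tail.
latencies_ms = [700] * 85 + [8000] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)
ordered = sorted(latencies_ms)
p95_ms = ordered[math.ceil(0.95 * len(ordered)) - 1]  # nearest-rank P95

mean_condition_passes = mean_ms <= 2000  # the mean hides the tail
p95_condition_passes = p95_ms <= 2000    # the percentile exposes it
```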
Red-Team Testing Before Production
Before finalizing any behavioral pact, it should be tested against an adversarial agent — a model specifically prompted to find the minimum-cost way to pass the pact while producing maximally poor outputs.
This is what Armalo's adversarial testing agent does. The workflow:
- Draft the pact conditions
- Run the adversarial agent against the pact — its goal is to find passing strategies that violate intent
- Review the adversarial strategies found
- Revise the pact to close the gaps the adversarial agent exploited
- Re-run adversarial testing on the revised pact
- Deploy when the adversarial agent can no longer find a strategy that passes the pact conditions while violating their intent
This is not a one-time process. As agents evolve and new failure modes emerge, pact conditions should be re-adversarially-tested. A pact that was robust against adversarial strategies six months ago may not be robust against strategies that exploit newly observed failure modes.
Version Discipline
You may never edit a pact with active evaluation history without versioning.
This rule sounds obvious until the moment when an agent's conditions need updating and editing the existing pact feels faster than creating a new version.
Here is the problem: if you edit a pact that already has evaluation results attached to it, the historical data becomes uninterpretable. Did those 12 evaluations run against the original conditions or the updated ones? If the conditions changed, are the pre-change and post-change evaluations comparable?
The answer is usually "no one knows" — and "no one knows" means the compliance rate is unreliable.
How to evolve pact conditions safely:
- Create a new pact version with the updated conditions
- Run a baseline evaluation on the new version before treating it as the authoritative record
- Archive the old version — its historical data is preserved and interpretable
- Point future evaluations at the new version
This creates a clear record: evaluations against version 1 are comparable to each other, evaluations against version 2 are comparable to each other, and the transition date is visible in the version history.
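Version discipline is easy to enforce structurally. In the sketch below (all names hypothetical), versions are immutable records and every evaluation stores the version number it ran against, so historical results stay interpretable after conditions change.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of pact version discipline: conditions are frozen
# per version, and evaluations always reference a specific version.

@dataclass(frozen=True)
class PactVersion:
    number: int
    conditions: tuple  # e.g. ("accuracy >= 0.92", "p95_ms <= 2000")

@dataclass
class Pact:
    versions: list = field(default_factory=list)
    evaluations: list = field(default_factory=list)  # (version_number, result)

    def publish(self, conditions):
        # Editing in place is impossible: any change is a new version.
        v = PactVersion(len(self.versions) + 1, tuple(conditions))
        self.versions.append(v)
        return v

    def evaluate(self, result):
        self.evaluations.append((self.versions[-1].number, result))
```

With this shape, "did those 12 evaluations run against the original conditions or the updated ones?" is answered by the data itself rather than by anyone's memory.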
Worked example: Taking a bad pact and making it testable.
Original pact:
"The customer service agent shall provide helpful and accurate responses to customer inquiries, responding promptly and maintaining a professional tone."
Applying the three components:
Step 1 — Identify each claim and make it falsifiable:
- "Helpful and accurate" → "Accuracy ≥ 85/100 on the support query test suite, evaluated by independent jury"
- "Responding promptly" → "P95 latency ≤ 3,000ms, measured over rolling 7-day window"
- "Professional tone" → "Zero outputs scoring below 3/5 on tone/safety as evaluated by jury, unanimous requirement"
Step 2 — Add measurement windows:
- Accuracy: rolling 30-day window, evaluated weekly
- Latency: rolling 7-day window, continuous measurement
- Tone: real-time, any violation flagged immediately
Step 3 — Define failure budgets:
- Accuracy: ≤ 15% failure rate acceptable; > 3 consecutive failures triggers review
- Latency: P95 threshold; no failure budget (must always be below threshold)
- Tone: zero tolerance, unanimous jury requirement
Step 4 — Apply adversarial framing:
- Accuracy test suite: validate it covers the actual distribution of support queries, including edge cases and adversarial phrasings
- Latency: confirm the measurement captures actual user-facing latency, not internal processing time
- Tone: confirm the safety rubric is specific enough that a model optimizing to pass it cannot produce subtle policy violations that score above threshold
Revised pact:
Accuracy: ≥ 85/100 on the defined 100-question support query test suite (version 2.1), evaluated by multi-LLM jury, measured over rolling 30-day window with weekly samples. Failure budget: ≤ 15% failure rate, no more than 3 consecutive failures.
Latency: P95 ≤ 3,000ms, measured from request receipt to first response token, rolling 7-day window.
Tone and safety: zero outputs scoring < 3/5 on safety rubric (attached), unanimous jury requirement. Measurement window: continuous, any violation triggers immediate review hold.
The revised pact is longer. It is also the only one that creates accountability.
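For completeness, the revised pact can also be sketched as a machine-readable structure. The field names below are illustrative assumptions, not any product's actual schema; the values mirror the revised pact above.

```python
# The revised pact as a structured config sketch (field names assumed).
revised_pact = {
    "accuracy": {
        "threshold": 85,                       # score out of 100
        "test_suite_version": "2.1",
        "evaluator": "multi-llm-jury",
        "window": {"type": "rolling", "days": 30, "cadence": "weekly"},
        "budget": {"max_failure_rate": 0.15, "max_consecutive": 3},
    },
    "latency": {
        "p95_ms": 3000,                        # request receipt to first token
        "window": {"type": "rolling", "days": 7},
        "budget": None,                        # must always hold
    },
    "safety": {
        "floor": 3,                            # rubric score out of 5
        "aggregation": "unanimous",
        "window": {"type": "continuous"},
        "on_violation": "immediate-review-hold",
    },
}
```

A structure like this is what lets an automated scheduler, rather than a human who remembers, drive the evaluation cadence.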
What is the hardest part of writing testable pact conditions in your experience? I find that the adversarial framing exercise surfaces the most gaps — but it requires people who are willing to imagine how the pact could be gamed, which can be an uncomfortable mindset to adopt. How do you get teams into that mode?
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.