Designing Testable Behavioral Pacts: From Aspirations to Specifications
If your behavioral contract for an AI agent can't fail a specific test, it's not a contract. It's a wish list. Here is how to write pacts that are actually falsifiable — and why the adversarial framing is the right design tool.
If your behavioral contract for an AI agent cannot fail a specific test, it is not a contract. It is a wish list.
This is not a semantic distinction. A behavioral contract that cannot fail cannot hold an agent accountable. It cannot be measured against. It cannot be the basis for a trust score, an escrow condition, or an independent evaluation. A contract without a failure condition is a document that says "we intend for our agent to behave well."
Most behavioral contracts for AI agents are exactly this.
The Aspiration Trap
Why do teams write aspirational pacts when they know testable specifications are more useful?
The pull is real. Aspirations are easier to write because they do not require knowing your failure modes in advance. "The agent shall be helpful and accurate" can be written in five minutes without understanding how the agent works. "The agent shall achieve ≥ 88% accuracy on the reference test suite as evaluated by the multi-LLM jury, measured over a rolling 30-day window with a maximum of 3 consecutive failures permitted" requires knowing what accuracy means for this agent, what the test suite looks like, who evaluates it, and what an acceptable failure budget is.
Aspirational contracts feel complete because they describe the goal. They are not complete. They describe the goal without defining the measurement.
The second pull is avoidance of commitment. An aspirational contract cannot be violated because it has no specific conditions to violate. An operator who writes "the agent shall be accurate" cannot be held to a specific standard. An operator who writes "≥ 88% accuracy on this test suite" can be — and that is uncomfortable if you are not confident in your agent.
This discomfort is exactly the signal you should pay attention to. If writing down a specific, falsifiable condition makes you nervous, that is the evaluation data you need: you do not know whether your agent meets that condition, which means you should find out before deploying.
Three pairs, aspiration versus specification:
| Aspiration | Specification |
|---|---|
| "The agent shall be accurate" | "Outputs shall score ≥ 4/5 on accuracy as rated by an independent jury, measured monthly" |
| "The agent shall respond quickly" | "P95 latency ≤ 2,000ms, measured over rolling 7-day window" |
| "The agent shall not produce harmful content" | "Zero instances of content scoring < 2/5 on the safety dimension, with a unanimous jury requirement" |
The specification is harder to write. It is also the only one that can be evaluated.
The Three Components of a Testable Pact
Component 1: Falsifiable Condition
A condition is falsifiable if it is possible to construct a test that it fails.
"The agent shall be accurate" cannot fail — there is no standard against which to measure failure. "The agent shall achieve ≥ 92% accuracy on the defined test suite" can fail if the agent scores 87%.
Five condition types and what makes each falsifiable:
Accuracy conditions. Require a reference — a test suite, a benchmark, a specific dataset — against which outputs can be evaluated. "High accuracy" is not a condition. "≥ 92% accuracy on the 50-question reference test suite" is a condition.
Latency conditions. Specify a percentile, not just a mean. "Under 2 seconds" measured at the mean can mask high-latency tail cases. "P95 ≤ 2,000ms" is falsifiable on the distribution that matters. Means hide outliers; percentiles expose them.
Safety conditions. Require a scoring standard with a threshold, not a qualitative description. "Safe outputs" is aspirational. "Zero outputs scoring < 2/5 on the safety rubric as evaluated by unanimous jury verdict" is falsifiable. Note the unanimous requirement — safety conditions should use unanimous aggregation, not majority vote or weighted average, because a single safety violation is not an average problem.
Format compliance conditions. Specify schema, structure, or pattern. "Properly formatted JSON" is aspirational if no schema is defined. "Valid JSON conforming to the attached schema, validated by schema validator" is falsifiable.
Confidence calibration conditions. Require that stated confidence correlates with actual accuracy. "The agent shall not express high confidence on outputs that score below 3/5 on accuracy." Falsifiable if you define "high confidence" and have accuracy scores to compare against.
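To make the distinction concrete, here is a minimal sketch of four of these condition types as boolean checks. Every function name, threshold, and score scale below is an illustrative assumption, not a prescribed implementation; the point is that each check has inputs on which it returns False.

```python
import math

# Minimal sketches of falsifiable pact conditions. Each returns a boolean,
# so each can fail -- the defining property of a testable condition.
# All names, thresholds, and score scales are illustrative assumptions.

def accuracy_condition(scores, threshold=0.92):
    """Pass iff mean accuracy on the reference test suite meets the threshold."""
    return sum(scores) / len(scores) >= threshold

def latency_condition(latencies_ms, p95_limit_ms=2000):
    """Pass iff the nearest-rank P95 latency is within the limit."""
    ordered = sorted(latencies_ms)
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return p95 <= p95_limit_ms

def safety_condition(jury_scores, floor=2):
    """Pass iff NO judge scores below the floor -- unanimous, not averaged."""
    return all(score >= floor for score in jury_scores)

def calibration_condition(conf_acc_pairs, high_conf=0.9, acc_floor=3):
    """Pass iff no high-confidence output scored below the accuracy floor."""
    return all(acc >= acc_floor
               for conf, acc in conf_acc_pairs if conf >= high_conf)
```

Each check can be pointed at real telemetry; "the agent shall be accurate" cannot be, because there is nothing to return False.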
Component 2: Measurement Window
A measurement window defines when the condition is evaluated and over what time period. Two choices: rolling window or point-in-time.
Point-in-time measurement evaluates the condition once at a fixed moment. "We will evaluate accuracy monthly." This produces a compliance rate based on one sample per period.
Rolling window measurement evaluates the condition continuously over a moving time range. "Accuracy ≥ 92% measured over the last 30 days, evaluated weekly." This produces a continuous compliance signal that catches intra-period drift.
Rolling windows are harder to game. An agent whose operator can anticipate when evaluations will occur can temporarily improve its behavior during the evaluation window and revert afterward, a form of Goodhart's law applied to evaluation itself. Rolling windows with irregular sampling intervals make this much more difficult.
How to choose window length: Match it to the feedback cycle of the behavior you are measuring. Accuracy on deterministic tasks can be measured over a 7-day window. Compliance rates in live transactions may require 30 days of data for statistical significance. Safety violations, if they occur at all, should be caught in real-time — the window for a safety condition may be "any instance in the last 90 days."
The cadence problem: Who triggers evaluation, and how frequently? For formal evaluations, an automated scheduler should trigger runs on a defined cadence — not a human who remembers to run them. For pact compliance telemetry from live transactions, measurement is continuous by construction.
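The rolling-window mechanics can be sketched in a few lines. The class and method names below are hypothetical, not any particular product's API; point-in-time measurement would instead sample once per period, whereas here every recorded result stays in scope until it ages out of the window.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical sketch of a rolling-window compliance monitor.
# Samples are recorded in chronological order as (timestamp, passed).
class RollingCompliance:
    def __init__(self, window_days=30, threshold=0.92):
        self.window = timedelta(days=window_days)
        self.threshold = threshold
        self.samples = deque()

    def record(self, ts, passed):
        self.samples.append((ts, passed))

    def compliance_rate(self, now):
        # Evict samples that have aged out of the rolling window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()
        if not self.samples:
            return None  # no data in window: a policy decision, not a pass
        return sum(p for _, p in self.samples) / len(self.samples)

    def compliant(self, now):
        rate = self.compliance_rate(now)
        return rate is not None and rate >= self.threshold
```

Because eviction happens on every read, the compliance signal is continuous: a drift that begins mid-month shows up as soon as enough failing samples land in the window, not at the next scheduled point-in-time check.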
Component 3: Failure Budget
Zero-tolerance failure budgets are almost always counterproductive.
An agent that cannot afford to fail at all will be optimized to minimize detection of failures, not to maximize actual performance. This is Goodhart's law applied to behavioral contracts: when failing the pact creates maximal consequences, the agent's operator is incentivized to minimize measured failures, not actual failure rates. Measurement avoidance, edge case exclusion, and sampling bias are the predictable results.
Good failure budgets look like this:
Rate-based budgets: "No more than 8% of outputs in a rolling 30-day window may fail the accuracy condition." This defines the tolerable failure rate without creating a hair-trigger that fires on individual outliers.
Consecutive violation conditions: "No more than 3 consecutive outputs may fail the accuracy condition." This is actually the most powerful construct — it catches systematic failures (something has changed, and every output is now failing) without triggering on random individual failures. A random 3% failure rate will almost never produce 3 consecutive failures. A systematic failure mode will almost always produce runs of consecutive failures.
Escalating thresholds: "≤ 5% failure rate triggers a review; > 10% failure rate suspends the escrow." Different failure rates trigger different responses, rather than a binary pass/fail.
Why consecutive violation conditions are underused: They require the evaluation system to track ordering, not just aggregate rates. Most evaluation pipelines only compute aggregate statistics. But the consecutive failure signal is often the earliest indicator of a systemic failure mode — catching it early, before the aggregate rate crosses threshold, can prevent significant downstream damage.
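A combined budget check is short to write once ordering is tracked. The sketch below pairs a rate budget with a consecutive-violation condition; the function name and default limits are illustrative assumptions matching the figures above.

```python
# Sketch of a failure-budget check. `results` is an ORDERED sequence of
# booleans (True = the output passed the pact condition); ordering is what
# lets the consecutive-violation condition work at all.

def budget_violated(results, max_failure_rate=0.08, max_consecutive=3):
    failure_rate = results.count(False) / len(results)
    if failure_rate > max_failure_rate:
        return True  # rate budget exhausted

    streak = 0
    for passed in results:
        streak = 0 if passed else streak + 1
        if streak > max_consecutive:
            return True  # systematic failure: too many failures in a row
    return False
```

Four failures scattered through 100 outputs stay inside both budgets; the same four failures back to back trip the consecutive condition even though the aggregate rate is identical.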
The Adversarial Framing
Once you have written a pact condition, apply this question: "How could a bad agent pass this pact while producing maximally bad outputs?"
This is the adversarial framing, and it is the most powerful design tool for pact quality evaluation.
Example 1: The accuracy condition with a stacked test suite.
Pact: "The agent shall achieve ≥ 90% accuracy on the defined test suite."
Adversarial question: "How could an agent pass this while being generally inaccurate?"
Answer: If the test suite contains 40 easy questions and 10 hard ones, an agent that only knows how to answer easy questions will score 90%+. The pact measures accuracy on the test suite, not accuracy on the task distribution the agent will actually face.
Fix: Ensure the test suite is representative of the actual input distribution. Include adversarial and edge-case inputs. Validate the test suite with independent review before finalizing the pact condition.
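One cheap guard against a stacked suite is to assert a minimum share of hard and adversarial items before the pact is finalized. The 40/10 split below mirrors the example above; the 30% floor is an assumed policy choice, not a universal number.

```python
from collections import Counter

# Illustrative representativeness check for a reference test suite.
# Difficulty labels would come from independent review, not the operator.
suite = [{"id": i, "difficulty": d}
         for i, d in enumerate(["easy"] * 40 + ["hard"] * 10)]

mix = Counter(item["difficulty"] for item in suite)
hard_share = mix["hard"] / len(suite)

# A 30% floor on hard items rejects this 40/10 suite (only 20% hard).
suite_is_representative = hard_share >= 0.30
```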
Example 2: The safety condition without unanimity.
Pact: "The agent shall score ≥ 3/5 on safety as rated by a majority of jury judges."
Adversarial question: "How could an agent produce unsafe content while passing this condition?"
Answer: Suppose one judge out of four gives a 1/5 safety score and three give 4/5. Under majority vote, the three passing judges outvote the one who caught the problem, so the condition passes. Averaging does no better: the mean is 3.25, which also clears the threshold. Either way, a safety violation that a single judge catches gets outvoted or averaged away.
Fix: Safety conditions should use unanimous aggregation, not majority vote or weighted average. Any judge returning a safety score below threshold should be disqualifying.
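The three aggregation rules can be compared directly on the jury from the example. This is a sketch; the scores and the 3/5 threshold follow the text above.

```python
# Three aggregation rules over the same jury scores. Only the unanimous
# rule lets a single dissenting judge fail the output.

def majority_passes(scores, threshold=3):
    return sum(s >= threshold for s in scores) > len(scores) / 2

def mean_passes(scores, threshold=3):
    return sum(scores) / len(scores) >= threshold

def unanimous_passes(scores, threshold=3):
    return min(scores) >= threshold

jury = [4, 4, 4, 1]  # one judge catches a safety violation
# majority and mean both pass this output; the unanimous rule fails it
```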
Example 3: The latency condition without percentile specification.
Pact: "The agent shall respond in under 2 seconds."
Adversarial question: "How could an agent fail users on latency while technically passing this condition?"
Answer: Average latency of 1.8 seconds with 15% of responses taking 8+ seconds passes the average condition. Users experiencing the tail are not covered by the pact.
Fix: Specify P95 or P99 latency, not average latency. "P95 response time ≤ 2,000ms" is the condition that protects actual users.
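The numbers from the example make the gap concrete. The distribution below has a sub-2-second mean with a 15% slow tail, as described above; the latency values themselves are assumed for illustration.

```python
import math

# Mean vs P95 on a distribution with a 15% slow tail.
latencies_ms = [700] * 85 + [8000] * 15

mean_ms = sum(latencies_ms) / len(latencies_ms)
ordered = sorted(latencies_ms)
p95_ms = ordered[math.ceil(0.95 * len(ordered)) - 1]  # nearest-rank P95

mean_condition_passes = mean_ms <= 2000  # the mean hides the tail
p95_condition_passes = p95_ms <= 2000    # the percentile exposes it
```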
Red-Team Testing Before Production
Before finalizing any behavioral pact, it should be tested against an adversarial agent — a model specifically prompted to find the minimum-cost way to pass the pact while producing maximally poor outputs.
This is what Armalo's adversarial testing agent does. The workflow:
- Draft the pact conditions
- Run the adversarial agent against the pact — its goal is to find passing strategies that violate intent
- Review the adversarial strategies found
- Revise the pact to close the gaps the adversarial agent exploited
- Re-run adversarial testing on the revised pact
- Deploy when the adversarial agent can no longer find a strategy that passes the pact conditions while violating their intent
This is not a one-time process. As agents evolve and new failure modes emerge, pact conditions should be re-adversarially-tested. A pact that was robust against adversarial strategies six months ago may not be robust against strategies that exploit newly observed failure modes.
Version Discipline
You may never edit a pact with active evaluation history without versioning.
This rule sounds obvious until the moment when an agent's conditions need updating and editing the existing pact feels faster than creating a new version.
Here is the problem: if you edit a pact that already has evaluation results attached to it, the historical data becomes uninterpretable. Did those 12 evaluations run against the original conditions or the updated ones? If the conditions changed, are the pre-change and post-change evaluations comparable?
The answer is usually "no one knows" — and "no one knows" means the compliance rate is unreliable.
How to evolve pact conditions safely:
- Create a new pact version with the updated conditions
- Run a baseline evaluation on the new version before treating it as the authoritative record
- Archive the old version — its historical data is preserved and interpretable
- Point future evaluations at the new version
This creates a clear record: evaluations against version 1 are comparable to each other, evaluations against version 2 are comparable to each other, and the transition date is visible in the version history.
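Version discipline is easy to enforce structurally. In the sketch below (all names hypothetical), versions are immutable records and every evaluation stores the version number it ran against, so historical results stay interpretable after conditions change.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of pact version discipline: conditions are frozen
# per version, and evaluations always reference a specific version.

@dataclass(frozen=True)
class PactVersion:
    number: int
    conditions: tuple  # e.g. ("accuracy >= 0.92", "p95_ms <= 2000")

@dataclass
class Pact:
    versions: list = field(default_factory=list)
    evaluations: list = field(default_factory=list)  # (version_number, result)

    def publish(self, conditions):
        # Editing in place is impossible: any change is a new version.
        v = PactVersion(len(self.versions) + 1, tuple(conditions))
        self.versions.append(v)
        return v

    def evaluate(self, result):
        self.evaluations.append((self.versions[-1].number, result))
```

With this shape, "did those 12 evaluations run against the original conditions or the updated ones?" is answered by the data itself rather than by anyone's memory.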
Worked example: Taking a bad pact and making it testable.
Original pact:
"The customer service agent shall provide helpful and accurate responses to customer inquiries, responding promptly and maintaining a professional tone."
Applying the three components:
Step 1 — Identify each claim and make it falsifiable:
- "Helpful and accurate" → "Accuracy ≥ 85/100 on the support query test suite, evaluated by independent jury"
- "Responding promptly" → "P95 latency ≤ 3,000ms, measured over rolling 7-day window"
- "Professional tone" → "Zero outputs scoring below 3/5 on tone/safety as evaluated by jury, unanimous requirement"
Step 2 — Add measurement windows:
- Accuracy: rolling 30-day window, evaluated weekly
- Latency: rolling 7-day window, continuous measurement
- Tone: real-time, any violation flagged immediately
Step 3 — Define failure budgets:
- Accuracy: ≤ 15% failure rate acceptable; > 3 consecutive failures triggers review
- Latency: P95 threshold; no failure budget (must always be below threshold)
- Tone: zero tolerance, unanimous jury requirement
Step 4 — Apply adversarial framing:
- Accuracy test suite: validate it covers the actual distribution of support queries, including edge cases and adversarial phrasings
- Latency: confirm the measurement captures actual user-facing latency, not internal processing time
- Tone: confirm the safety rubric is specific enough that a model optimizing to pass it cannot produce subtle policy violations that score above threshold
Revised pact:
Accuracy: ≥ 85/100 on the defined 100-question support query test suite (version 2.1), evaluated by multi-LLM jury, measured over rolling 30-day window with weekly samples. Failure budget: ≤ 15% failure rate, no more than 3 consecutive failures.
Latency: P95 ≤ 3,000ms, measured from request receipt to first response token, rolling 7-day window.
Tone and safety: zero outputs scoring < 3/5 on safety rubric (attached), unanimous jury requirement. Measurement window: continuous, any violation triggers immediate review hold.
The revised pact is longer. It is also the only one that creates accountability.
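For completeness, the revised pact can also be sketched as a machine-readable structure. The field names below are illustrative assumptions, not any product's actual schema; the values mirror the revised pact above.

```python
# The revised pact as a structured config sketch (field names assumed).
revised_pact = {
    "accuracy": {
        "threshold": 85,                       # score out of 100
        "test_suite_version": "2.1",
        "evaluator": "multi-llm-jury",
        "window": {"type": "rolling", "days": 30, "cadence": "weekly"},
        "budget": {"max_failure_rate": 0.15, "max_consecutive": 3},
    },
    "latency": {
        "p95_ms": 3000,                        # request receipt to first token
        "window": {"type": "rolling", "days": 7},
        "budget": None,                        # must always hold
    },
    "safety": {
        "floor": 3,                            # rubric score out of 5
        "aggregation": "unanimous",
        "window": {"type": "continuous"},
        "on_violation": "immediate-review-hold",
    },
}
```

A structure like this is what lets an automated scheduler, rather than a human who remembers, drive the evaluation cadence.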
What is the hardest part of writing testable pact conditions in your experience? I find that the adversarial framing exercise surfaces the most gaps — but it requires people who are willing to imagine how the pact could be gamed, which can be an uncomfortable mindset to adopt. How do you get teams into that mode?
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.