How to Build a Behavioral Pact: A Technical Guide for AI Agent Developers
A step-by-step technical guide to building behavioral pacts for AI agents. What makes a good pact condition, how to choose verification methods, and example pacts for 5 common agent types.
A behavioral pact is a machine-readable, automatically-verifiable contract that defines what an AI agent does, under what conditions, with what success criteria. It's the technical instrument that makes agent accountability possible — not as a policy document, but as a live contract that's evaluated continuously against actual agent performance.
Most developers encounter pacts when they're trying to deploy an agent for enterprise use and their procurement contact asks for "documentation of behavioral guarantees." At that point, they write something vague in a Confluence page and call it done. This is the wrong approach, and it doesn't serve the developer's interests — a vague pact is unverifiable, which means it provides no credibility value and no accountability protection.
A well-constructed pact does several things simultaneously. It creates the evaluation criteria that your continuous monitoring will use to detect drift. It provides the evidentiary basis for trust claims you make to buyers. It defines the ground truth for dispute resolution if a buyer claims your agent underperformed. And it forces the engineering discipline of specifying exactly what your agent should do, which usually reveals design gaps.
This guide walks through the complete process of building a behavioral pact — from identifying what conditions to declare, through selecting verification methods, to writing test cases. We'll cover the API and show example pacts for five common agent types.
TL;DR
- Pact conditions must be specific, measurable, and verifiable: "High quality outputs" is not a pact condition. "Accuracy above 90% on the test case set for domain X, measured by deterministic comparison to reference outputs" is.
- Verification method selection determines evaluation cost and confidence: Deterministic verification is cheapest and most reliable; LLM jury is most powerful for subjective tasks; heuristic checks are a practical middle ground.
- Anti-patterns in pact design cause disputes and false positives: the most common are happy-path-only conditions, success criteria that don't match the buyer's actual requirements, and test cases too narrow to generalize.
- The Armalo API handles the full pact lifecycle: Registration, versioning, evaluation triggering, result recording, and dispute resolution are all supported.
- Example pacts for customer service, data analysis, code generation, research, and financial agents show the full range of design patterns.
What Makes a Good Pact Condition
Before discussing how to write pact conditions, it's worth being precise about what you're doing: you're specifying a commitment that will be automatically evaluated. Every word in your pact condition becomes a parameter in an evaluation function. Vague language produces unmeasurable conditions that can't be evaluated and can't be disputed.
The four required attributes of a valid pact condition:
1. Specificity: The condition must describe a specific, concrete agent behavior — not a capability category. "Handles customer inquiries about product returns" is not specific. "Provides accurate information about the return policy for the SKU in the customer's order, with accuracy defined as matching the authoritative policy document for that SKU category, within 2 sentences of relevant policy text" is specific.
2. Measurability: The condition must produce a measurable outcome. Measurable outcomes are either deterministic (the output matches a reference or satisfies a predicate) or ratable (a jury can score the output on a defined scale). "Responds professionally" is not measurable. "Responses rated 4+ by a multi-LLM jury on the professionalism rubric, with confidence > 0.7" is measurable.
3. Verifiability: The measurement method must be achievable with available infrastructure. A condition that requires manual human review of every output isn't operationally verifiable at scale. Conditions should be designed for automated verification with human escalation for edge cases.
4. Completeness: The condition should cover the failure modes you care about, not just the success path. A data transformation pact that only specifies accuracy on valid inputs without specifying behavior on invalid inputs will fail silently when the agent receives malformed input.
Common Anti-Patterns to Avoid
The vague qualifier: "High accuracy," "good performance," "appropriate responses." These terms have no operational definition. Replace with specific thresholds and measurement methods.
The happy-path-only condition: Specifying only what the agent does when everything goes right, without specifying behavior under failure conditions. Agents spend a lot of their real-world operating time handling edge cases. If your pact doesn't cover them, you have no recourse when they fail.
The unmeasurable outcome: "Users find responses helpful." Helpfulness requires a rated measurement scale and a defined measurement population. "Rated 4+ on helpfulness by LLM jury on a representative sample of real user queries, with 80% of samples meeting threshold" is a measurable version.
The overly narrow test case set: Test cases that only cover a thin slice of your production distribution. An accuracy test with 20 test cases in one subdomain is not representative of an agent deployed across 10 subdomains. Your test cases should be representative or your pact claims are about a different agent than the one you're deploying.
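One way to catch the narrow-test-set anti-pattern before registering a pact is to compare test-case coverage against your production traffic mix. The helper below is a hypothetical sketch, not part of any SDK; the `subdomain` field and the 0.5 coverage ratio are illustrative assumptions:

```python
from collections import Counter

def coverage_gaps(test_cases, production_mix, min_ratio=0.5):
    """Flag subdomains whose share of the test set falls below
    min_ratio times their share of production traffic."""
    counts = Counter(tc["subdomain"] for tc in test_cases)
    total = sum(counts.values())
    gaps = {}
    for subdomain, prod_share in production_mix.items():
        test_share = counts.get(subdomain, 0) / total if total else 0.0
        if test_share < min_ratio * prod_share:
            gaps[subdomain] = {"test": round(test_share, 3), "production": prod_share}
    return gaps

# 18 of 20 test cases sit in one subdomain -- the check flags the rest.
tests = [{"subdomain": "returns"}] * 18 + [{"subdomain": "billing"}] * 2
mix = {"returns": 0.4, "billing": 0.4, "shipping": 0.2}
print(coverage_gaps(tests, mix))
```

Running this against the skewed example flags `billing` (under-covered) and `shipping` (absent entirely), which is exactly the evidence you need before claiming your test set is representative.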
Verification Method Selection
The verification method determines how pact conditions are evaluated. Selection matters because different methods have different cost profiles, different confidence levels, and different applicability ranges.
Deterministic verification compares the agent's output against a reference answer using an exact match, a fuzzy match, or a predicate function. It's the cheapest, fastest, and most reliable method — but it only applies to tasks where a definitive correct answer exists. Use it for: factual lookups, data transformations, format validation, mathematical computations, and any task where ground truth can be specified exactly.
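The three comparison styles (exact match, fuzzy/normalized match, predicate) can each be sketched in a few lines. These helpers are illustrative, not an Armalo API; the normalization rules and the relative tolerance are assumptions:

```python
import math
import re

def exact_match(output: str, reference: str) -> bool:
    # Normalize whitespace and case before comparing, so trivial
    # formatting differences don't fail a factually correct answer.
    norm = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return norm(output) == norm(reference)

def numeric_predicate(output: float, reference: float, rel_tol=1e-4) -> bool:
    # Tolerance-based predicate for computed values, in the spirit of
    # the 0.01% bound used in the financial example later in this guide.
    return math.isclose(output, reference, rel_tol=rel_tol)

assert exact_match("30-day  Return Window", "30-day return window")
assert numeric_predicate(1.000049, 1.0)
assert not numeric_predicate(1.01, 1.0)
```

The design choice worth noting: put the normalization inside the checker, not in the test cases, so every condition that reuses the checker gets the same matching semantics.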
Heuristic verification applies rule-based checks to outputs — length constraints, format compliance, keyword presence/absence, structural validation. It's faster and cheaper than LLM jury but less powerful. Use it for: format compliance, policy violations (checking that outputs don't contain prohibited content), structural requirements (responses must include specific sections), and as a pre-filter before LLM jury evaluation.
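A heuristic checker is typically just a stack of cheap rules evaluated in sequence. The specific rule values below (the length cap, the prohibited patterns, the required section names) are made-up placeholders:

```python
import re

PROHIBITED = [r"\bcompetitorx\b", r"\bguaranteed return\b"]  # placeholder patterns
REQUIRED_SECTIONS = ["Summary", "Next steps"]                # placeholder sections

def heuristic_check(response: str) -> list[str]:
    """Return a list of rule violations; an empty list means the response passes."""
    violations = []
    if len(response) > 2000:
        violations.append("length: exceeds 2000 characters")
    for pattern in PROHIBITED:
        if re.search(pattern, response, re.IGNORECASE):
            violations.append(f"prohibited content: {pattern}")
    for section in REQUIRED_SECTIONS:
        if section not in response:
            violations.append(f"missing section: {section}")
    return violations

resp = "Summary: your refund is on the way.\nNext steps: allow 5 business days."
print(heuristic_check(resp))  # []
```

Because these rules run in milliseconds, a checker like this also works well as the pre-filter mentioned above: only responses that pass the heuristics get forwarded to the (more expensive) LLM jury.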
LLM jury verification submits outputs to multiple independent LLM judges who score them against a defined rubric. It's the most powerful method for subjective tasks — relevance, professionalism, coherence, appropriateness — and for tasks where correct answers are contextually dependent. Use it for: open-ended question answering, writing quality, domain-appropriate reasoning, and any task where "correct" requires judgment. Armalo's jury uses multiple providers (Anthropic, OpenAI, Google) with outlier trimming and a configurable consensus threshold.
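Armalo's exact trimming and consensus logic isn't specified in this guide, so the sketch below shows one plausible aggregation under stated assumptions: drop the judge scores farthest from the mean, then pass the condition if the fraction of remaining judges at or above the pass score meets the consensus threshold. All names and parameter values are hypothetical:

```python
from statistics import mean

def jury_verdict(scores, trim=1, pass_score=4.0, consensus=0.7):
    """Aggregate independent judge scores on a 1-5 scale.

    Drops the `trim` scores farthest from the mean (outlier trimming),
    then passes if the share of remaining judges scoring at or above
    pass_score meets the consensus threshold."""
    center = mean(scores)
    kept = sorted(scores, key=lambda s: abs(s - center))[: len(scores) - trim]
    agree = sum(1 for s in kept if s >= pass_score) / len(kept)
    return {"kept": sorted(kept), "consensus": round(agree, 2), "passed": agree >= consensus}

# One dissenting judge (score 2) is trimmed; the remaining four agree.
print(jury_verdict([5, 4, 4, 5, 2]))
```

The reason to trim before computing consensus is that a single anomalous judge (provider outage, degraded model, prompt injection in the output under review) shouldn't be able to veto or rescue an evaluation on its own.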
Human verification is the most expensive and slowest method and should be reserved for: high-stakes decisions where automated verification is insufficient, calibration runs to establish ground truth for jury rubrics, and dispute resolution where the automated result is contested.
| Verification Method | Best For | Cost Level | Confidence Level | Latency |
|---|---|---|---|---|
| Deterministic | Factual tasks, data transforms, format checks | Very Low | Very High | Milliseconds |
| Heuristic | Format compliance, content policies, structural checks | Low | Medium-High | Milliseconds |
| LLM Jury | Subjective quality, reasoning, appropriateness | Medium | High | Seconds-minutes |
| Human | High-stakes decisions, calibration, dispute resolution | High | Very High | Hours-days |
Example Pacts for 5 Common Agent Types
1. Customer Service Agent
A customer service agent needs conditions covering both factual accuracy (answering policy questions correctly) and interaction quality (responding appropriately to different emotional states).
Condition 1 (deterministic): "Returns accurate policy information for product return queries, defined as: the return window stated matches the authoritative policy for the product category, with no contradictory information. Success threshold: >95% of queries in the test case set."
Condition 2 (LLM jury): "Responses to queries are rated 4+/5 on the customer satisfaction rubric by multi-LLM jury, with rubric dimensions: (1) accuracy of information, (2) clarity of explanation, (3) appropriateness of tone, (4) completeness (addresses all questions asked). Success threshold: >85% of jury-evaluated responses meet 4+."
Condition 3 (heuristic): "Responses do not contain prohibited phrases (competitor names, unauthorized pricing commitments, personal medical or legal advice). Success threshold: 100% compliance."
Condition 4 (failure mode, heuristic + jury): "When the agent cannot answer a query, it explicitly acknowledges the limitation and provides the escalation path. Success threshold: >99% of unresolvable queries include explicit escalation."
2. Financial Analysis Agent
Financial agents require higher accuracy thresholds and specific coverage of calculation methods.
Condition 1 (deterministic): "Financial calculations (DCF, NPV, IRR, ratio analysis) are mathematically correct to within 0.01% of reference calculations using the same inputs and the declared formula methodology. Success threshold: 99.5% of calculations in the test set."
Condition 2 (LLM jury, with reference): "Interpretive narratives describing financial results match the analytical conclusions that a CFA-level analyst would draw from the same data, with no contradictory conclusions or material omissions. Jury rubric includes: factual consistency with provided data, appropriate hedging of uncertain claims, and accuracy of comparative analysis. Success threshold: >90% of outputs rated 4+/5."
Condition 3 (heuristic): "Outputs do not contain forward-looking statements that would constitute financial advice under SEC Rule 206(4)-7 without appropriate disclaimers. Detected by pattern matching against prohibited statement templates. Success threshold: 100% compliance."
3. Code Generation Agent
Code generation pacts require both functional correctness checks and quality standards.
Condition 1 (deterministic, automated test execution): "Generated code passes all unit tests in the test harness with a pass rate of >95% across the test case distribution. Tests are run in the declared sandbox environment."
Condition 2 (heuristic + deterministic): "Generated code contains no security vulnerabilities in the OWASP Top 10 categories, as detected by automated SAST scanning. Success threshold: 0 critical or high severity findings."
Condition 3 (LLM jury): "Generated code meets the readability and maintainability standards in the style guide, evaluated by multi-LLM jury on: naming conventions, comment coverage, function length, and complexity. Success threshold: >80% of outputs rated 3+/5 on maintainability."
4. Research Agent
Research agents require careful handling of source attribution and factual claims.
Condition 1 (LLM jury with source verification): "Research summaries accurately represent the source material, with no material omissions or distortions of the sources' positions. Jury evaluates both factual accuracy (checked against provided sources) and completeness (all major relevant findings addressed). Success threshold: >90% rated 4+/5."
Condition 2 (heuristic + deterministic): "All factual claims include citations to specific sources with publication date and author. Citation format matches the declared style guide. Success threshold: >99% of claims include citations; 0 citation format violations."
Condition 3 (LLM jury): "Research is conducted without confirmation bias — a jury of 5 models assesses whether sources have been selected to support a predetermined conclusion rather than to represent the full evidence base. Success threshold: <10% of research outputs flagged as showing significant confirmation bias."
5. Workflow Orchestration Agent
Orchestration agents need conditions covering both task routing accuracy and end-to-end workflow completion.
Condition 1 (deterministic): "Tasks are routed to the correct downstream agent based on the task classification rubric with >97% accuracy on the test case distribution."
Condition 2 (heuristic): "All task handoffs include the required context fields (priority, deadline, authorization token, relevant history reference). Success threshold: 100% field completion on all handoffs."
Condition 3 (deterministic, end-to-end): "Multi-step workflows complete successfully end-to-end with >92% success rate on the workflow test harness, where success is defined as all required outputs produced with no required steps skipped."
The Armalo Pact API
Creating a pact via the Armalo API:
POST /api/v1/pacts

```json
{
  "agentId": "agt_xyz",
  "name": "Customer Service Response Quality",
  "version": "1.2.0",
  "conditions": [
    {
      "name": "Policy accuracy",
      "description": "Returns accurate return policy information",
      "verificationMethod": "deterministic",
      "successCriteria": "Output matches policy reference for product category with 0 contradictions",
      "measurementWindow": "30d",
      "threshold": 0.95,
      "testCases": [...]
    }
  ]
}
```
The API supports versioning (pact conditions can be updated with version increments), condition-level evaluation triggering (run eval on a specific condition independently), bulk test case upload (CSV or JSON format), and dispute escalation (route a specific evaluation result to human review).
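A minimal client-side sketch of sanity-checking the registration payload above before POSTing it. The field names come from the example payload; the validation helper, the set of method names, and the host in the commented-out request are assumptions, not documented API details:

```python
import json
from urllib.request import Request, urlopen

def validate_pact(pact: dict) -> list[str]:
    """Cheap local checks before sending the registration request."""
    errors = []
    for key in ("agentId", "name", "version", "conditions"):
        if key not in pact:
            errors.append(f"missing field: {key}")
    for i, cond in enumerate(pact.get("conditions", [])):
        if not 0.0 <= cond.get("threshold", -1) <= 1.0:
            errors.append(f"condition {i}: threshold must be in [0, 1]")
        # Assumed method names -- check the API reference for the real enum.
        if cond.get("verificationMethod") not in {"deterministic", "heuristic", "llm_jury", "human"}:
            errors.append(f"condition {i}: unknown verificationMethod")
    return errors

pact = {
    "agentId": "agt_xyz",
    "name": "Customer Service Response Quality",
    "version": "1.2.0",
    "conditions": [{
        "name": "Policy accuracy",
        "verificationMethod": "deterministic",
        "threshold": 0.95,
    }],
}

assert validate_pact(pact) == []
# Hypothetical host; network call omitted in this sketch:
# req = Request("https://api.armalo.ai/api/v1/pacts", data=json.dumps(pact).encode(),
#               headers={"Content-Type": "application/json"}, method="POST")
# urlopen(req)
```

Catching a malformed threshold or a typo'd method name locally is cheaper than discovering it through a rejected registration or, worse, a pact that silently never evaluates.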
Frequently Asked Questions
How many pact conditions is the right number? Between three and eight is practical. Fewer than three and you're probably not covering your main failure modes. More than eight and you're either being redundant or splitting conditions at too fine a granularity. Each condition should cover a genuinely distinct capability or failure mode.
Should pact conditions cover edge cases or only the main path? Both, explicitly. The most expensive agent failures happen on edge cases. Your pact should include at least one condition that specifies agent behavior when the task is out of scope, the data is malformed, or the downstream tool is unavailable. These conditions protect you as the operator by documenting expected behavior under failure conditions.
How do you handle pact conditions that conflict with each other? Add a precedence specification to the conflicting conditions. "In cases where conditions X and Y specify different behaviors, condition X takes precedence." This is common when accuracy and speed conditions conflict — specifying explicitly that accuracy takes precedence over latency in cases of conflict is better than leaving it undefined.
Can pact conditions be retroactively modified? Conditions should be versioned, not retroactively modified. If a condition needs to change, create a new version of the pact with the updated condition and specify the effective date. Retroactive modification of pact conditions undermines the evidentiary value of historical evaluation results.
How do you write test cases for LLM jury verification? LLM jury test cases should include: the input, an explicit description of what characteristics the jury should evaluate (the rubric), reference examples of high-scoring and low-scoring outputs (for calibration), and the minimum consensus threshold required for the condition to pass. Without explicit rubrics and reference examples, jury evaluations have high variance.
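Those elements can be encoded as a structured test case and linted before upload. The schema and helper below are illustrative, not an Armalo format:

```python
def check_jury_test_case(tc: dict) -> list[str]:
    """Verify a jury test case carries input, rubric, calibration
    examples, and a consensus threshold; return any problems found."""
    problems = []
    if not tc.get("input"):
        problems.append("missing input")
    if len(tc.get("rubric", [])) == 0:
        problems.append("missing rubric dimensions")
    refs = tc.get("calibration", {})
    if not (refs.get("high_example") and refs.get("low_example")):
        problems.append("missing high/low calibration examples")
    if not 0.0 < tc.get("consensus_threshold", 0) <= 1.0:
        problems.append("missing or invalid consensus threshold")
    return problems

tc = {
    "input": "Customer asks whether opened items can be returned.",
    "rubric": ["accuracy", "tone", "completeness"],
    "calibration": {
        "high_example": "Cites the opened-item policy and offers next steps.",
        "low_example": "Vague answer with no policy reference.",
    },
    "consensus_threshold": 0.7,
}
assert check_jury_test_case(tc) == []
```

Linting for these four elements mechanically enforces the point above: a jury test case without an explicit rubric and calibration examples will produce high-variance scores no matter how good the judges are.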
What happens if an agent's pact conditions can't be met at launch? Launch with the conditions you can meet and a version roadmap for the conditions you're working toward. Misrepresenting your capabilities in a pact is worse than launching with lower initial coverage — it creates false expectations, generates false-positive disputes, and damages your reputation score when the evaluation inevitably reveals the gap.
Key Takeaways
- Pact conditions must be specific, measurable, verifiable, and complete (covering failure modes, not just the happy path) — vague conditions are unverifiable and provide no accountability value.
- Verification method selection has a major cost-confidence tradeoff: deterministic is cheapest and most reliable for factual tasks, LLM jury is most powerful for subjective quality assessment, heuristic checks are a practical middle ground.
- Anti-patterns to avoid: vague qualifiers, happy-path-only conditions, unmeasurable outcomes, and overly narrow test case sets that don't represent production distribution.
- Good pacts for every agent type should include: accuracy/quality conditions for the primary task, failure mode conditions for edge cases, and compliance conditions for any domain-specific behavioral requirements.
- Test case design for LLM jury conditions requires explicit rubrics and reference calibration examples — without these, jury evaluations have unacceptably high variance.
- Pact conditions should be versioned when updated, never retroactively modified — the historical evaluation record needs to be consistent with the conditions in effect at the time.
- The discipline of writing precise pact conditions usually reveals design gaps in the agent — if you can't specify what success looks like, the agent's goal hasn't been defined clearly enough to build reliably.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.