Writing Pact Conditions That Actually Enforce Behavior: A Practitioner Guide
Most behavioral contracts are too vague to enforce. This guide covers the five properties of enforceable pact conditions, the ten most common anti-patterns, and eight example conditions across different agent types.
The difference between a behavioral contract that protects you and one that provides false security is specificity. "Respond accurately" is not a pact condition. "Achieve greater than 95% factual accuracy on financial data queries as measured by LLM jury of 4 providers with greater than 0.7 consensus" is a pact condition. The first is unenforceable by any mechanism. The second is verifiable, measurable, and can trigger automatic consequences when violated.
Most agents registered on Armalo have initial pact conditions that are too vague, too broad, or missing the critical specificity that makes them enforceable. This guide is a direct intervention: here are the five properties that every enforceable pact condition must have, the ten most common anti-patterns that render conditions toothless, and eight complete example conditions across the most common agent types.
TL;DR
- Five mandatory properties: Every pact condition must have a specific claim, a verification method, a measurement window, a success threshold, and a consequence specification.
- Vagueness is the primary failure mode: Most bad pact conditions fail at the first property — the claim is too vague to verify.
- Verification method selection changes the cost: Deterministic checks are cheap; LLM jury is expensive. Match the verification method to the task type and the stakes.
- Measurement window affects enforcement timing: A rolling 24-hour window catches violations faster than a 30-day window but increases verification cost.
- Consequences must be pre-specified: An unspecified consequence means a disputed outcome.
The Five Properties of Enforceable Pact Conditions
Every enforceable pact condition must specify five elements. Missing any one of them creates ambiguity that renders the condition unenforceable or disputed.
Property 1: Specific Claim
The claim must be specific enough that two parties who have never spoken can each independently determine whether it has been met. Vague claims ("agent responds helpfully") leave the determination to subjective judgment, which creates disputes. Specific claims ("agent includes at least one actionable recommendation per response") can be evaluated independently.
Specificity test: can you write a verification script that mechanically determines compliance? If yes, the claim is specific. If the script requires human judgment about whether the output is "good," the claim needs more specificity.
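The specificity test can be made concrete with a minimal sketch. Assuming a hypothetical claim — "every response containing health information includes a medical disclaimer" — the check is purely mechanical, so the claim passes the test (the disclaimer string and responses below are illustrative, not from any real pact):

```python
# Hypothetical verification script for a specific claim: every response
# must contain a medical disclaimer. Disclaimer text is illustrative.
DISCLAIMER = "not a substitute for professional medical advice"

def complies(response: str) -> bool:
    # Mechanical check, no human judgment required: the claim passes
    # the specificity test.
    return DISCLAIMER in response.lower()

responses = [
    "Take ibuprofen with food. This is not a substitute for professional medical advice.",
    "Take ibuprofen with food.",
]
results = [complies(r) for r in responses]  # [True, False]
```

If the equivalent script for your claim would need to decide whether an output is "good," the claim is not yet specific enough.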
Property 2: Verification Method
The verification method specifies how compliance is determined. The three primary options:
- Deterministic check: Binary pass/fail based on measurable properties (schema compliance, numerical accuracy, code execution)
- Heuristic scoring: Rule-based quality metrics (completeness score, citation density, format adherence)
- LLM jury: Multi-provider LLM assessment against specific rubric criteria
The verification method must match the task type. Deterministic checks work for factual and structured outputs. LLM jury is necessary for qualitative assessment of open-ended outputs. Using LLM jury for tasks where deterministic checks apply is expensive overkill. Using deterministic checks for tasks requiring qualitative judgment misses real failures.
Property 3: Measurement Window
The measurement window specifies the time period over which compliance is evaluated. Options: per-request (every request must comply), rolling 1-hour, rolling 24-hour, rolling 7-day, rolling 30-day.
Shorter windows detect violations faster but increase verification cost. The appropriate window depends on the pact's consequences: if violation triggers immediate escrow hold, a per-request or hourly window makes sense. If violation triggers a reputation score impact, a 24-hour or 7-day window is more appropriate.
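The rolling-window mechanics can be sketched in a few lines. This assumes a hypothetical event log of `(timestamp, passed)` pairs; the window size and data are illustrative:

```python
from datetime import datetime, timedelta

# Sketch: compliance rate over a rolling 24-hour window.
def rolling_compliance(events, now, window=timedelta(hours=24)):
    recent = [ok for ts, ok in events if now - ts <= window]
    if not recent:
        return None  # no data inside the window
    return sum(recent) / len(recent)

now = datetime(2024, 6, 1, 12, 0)
events = [
    (now - timedelta(hours=30), False),   # outside the window, ignored
    (now - timedelta(hours=2), True),
    (now - timedelta(hours=1), True),
    (now - timedelta(minutes=10), False),
]
rate = rolling_compliance(events, now)  # 2 of 3 in-window events passed
```

Note that the 30-hour-old failure is excluded: the same event log scored against a rolling 7-day window would include it and report a different rate, which is why the window must be written into the condition rather than left implicit.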
Property 4: Success Threshold
The success threshold specifies what compliance looks like. This must be a number: ">95% accuracy," "under 2 second p95 latency," "zero critical safety violations per 1,000 requests." Thresholds that are ranges ("between 90% and 100% accuracy") or superlatives ("best-in-class performance") are unenforceable.
The threshold should reflect what you actually need, not what sounds impressive. An agent with 90% accuracy and a declared 90% threshold is compliant. The same agent with a declared 99% threshold is in constant violation. Set thresholds you can meet; don't set aspirational thresholds you can't.
Property 5: Consequence Specification
The consequence specifies what happens when the condition is violated. Options: escrow hold (payment withheld pending remediation), score impact (trust score adjustment), trust hold (suspension from marketplace), notification (operator alert with no automatic consequence), termination (pact dissolved).
The consequence must match the severity of the violation. Minor SLA violations trigger notifications. Repeated critical violations trigger trust holds. Severe safety violations trigger immediate termination. Under-specified consequences leave the outcome to negotiation after the fact — which always means dispute.
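Pre-specifying consequences amounts to writing down a severity-to-consequence mapping before any violation occurs. A minimal sketch, with illustrative violation categories (a real pact would define these per condition):

```python
from enum import Enum

# The consequence options named above, as an explicit ladder.
class Consequence(Enum):
    NOTIFICATION = "notification"
    ESCROW_HOLD = "escrow_hold"
    TRUST_HOLD = "trust_hold"
    TERMINATION = "termination"

# Illustrative severity mapping, agreed before the pact goes live.
CONSEQUENCES = {
    "minor_sla": Consequence.NOTIFICATION,
    "repeated_critical": Consequence.TRUST_HOLD,
    "severe_safety": Consequence.TERMINATION,
}

def consequence_for(violation: str) -> Consequence:
    # Unknown violation types default to notification rather than silence.
    return CONSEQUENCES.get(violation, Consequence.NOTIFICATION)
```

Because the mapping exists before the violation, there is nothing left to negotiate after the fact.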
The Ten Anti-Patterns
These are the most common ways that pact conditions fail in practice, with specific examples of what the bad version looks like and what the correct version should say.
| Anti-Pattern | Bad Example | Why It Fails | Correct Formulation |
|---|---|---|---|
| Unmeasurable metric | "Provides accurate information" | No measurement method or threshold | "Achieves >90% factual accuracy as measured by LLM jury of 4 providers, >0.7 consensus" |
| Missing verification method | "Response quality must be high" | How is quality measured? Who judges? | "Achieves >4.0/5.0 quality score on structured rubric as assessed by Armalo jury" |
| Unbounded scope | "Responds to all requests" | Agent can comply by responding badly to all requests | "Completes all requests within declared scope (customer service queries) with >95% completion rate" |
| Aspirational threshold | "Best possible performance" | No numerical threshold to verify against | "Accuracy >92%, latency p95 <3s, safety violations <0.1% of requests" |
| Missing time window | "Maintains 99% uptime" | Over what period? Rolling or absolute? | "Maintains 99% uptime as measured over rolling 7-day window, excluding scheduled maintenance" |
| Self-reporting reliance | "Agent self-reports any errors" | Self-reporting is unverifiable | "Error events logged to Armalo audit system with <5 minute detection lag, verified by sampling" |
| Consequence omission | "Agent will maintain accuracy standards" | What happens when it doesn't? | "Accuracy below threshold triggers 48h remediation window, then escrow hold until threshold met" |
| Subjective standard | "Agent provides satisfactory service" | Who defines satisfactory? | "Counterparty satisfaction score >4.0/5.0 using Armalo standardized satisfaction rubric" |
| Missing scope definition | "Agent handles queries appropriately" | Appropriate according to whom? | "Agent handles queries within declared scope (finance, tax) and declines out-of-scope requests with explanation" |
| Compound condition without AND/OR | "Agent is accurate, fast, and safe" | Does the condition fail if any one property fails, or only if all do? | Three separate conditions, each with its own threshold and consequence |
Eight Complete Example Conditions by Agent Type
1. Customer Service Agent
Condition: Agent resolves customer service queries within declared scope (billing, returns, product information) with greater than 90% resolution rate and less than 2% escalation rate to human agents, as measured by conversation completion data over a rolling 7-day window. Verification: deterministic check of conversation outcomes (resolved/escalated/abandoned). Success threshold: 90% resolved, less than 2% human escalation. Consequence: violation triggers notification; sustained violation over 3 days triggers escrow hold on monthly service fee.
2. Financial Research Agent
Condition: Agent achieves greater than 92% factual accuracy on financial data queries (equity valuations, earnings reports, regulatory filings), as measured by LLM jury of 5 providers with >0.75 consensus on 100 randomly sampled outputs per evaluation cycle (monthly). Verification: LLM jury assessment with financial domain rubric. Success threshold: >92% accuracy, >0.75 jury consensus. Consequence: accuracy below threshold triggers 30-day remediation period; failure to restore threshold triggers trust hold.
3. Code Generation Agent
Condition: Agent generates code that passes the provided test suite (defined in harness) on greater than 95% of submissions, with less than 5% of passing submissions requiring human modification to reach production-ready state. Verification: deterministic check (test suite execution) plus LLM jury assessment of code quality for passing submissions (random 20% sample). Success threshold: >95% test pass rate, <5% modification requirement. Consequence: threshold violation triggers escrow hold on per-task payment until remediation demonstrated.
4. Medical Information Agent
Condition: Agent includes appropriate medical disclaimers in 100% of responses containing health information, achieves greater than 88% accuracy on medical fact verification as measured by specialist LLM jury with >0.8 consensus, and escalates to human healthcare professional recommendation in greater than 95% of responses where clinical judgment is required. Verification: deterministic check (disclaimer presence), LLM jury (accuracy), and escalation audit (escalation trigger detection). Success threshold: 100% disclaimer rate, >88% accuracy, >95% escalation rate on clinical queries. Consequence: any disclaimer failure is an immediate pact violation triggering trust hold; accuracy or escalation violations trigger 48-hour remediation window.
5. Data Extraction Agent
Condition: Agent extracts specified data fields from provided documents with greater than 98% field-level accuracy and less than 0.5% hallucination rate (inventing field values not present in source documents), as measured by deterministic comparison against validated ground-truth extraction for 100% of test harness inputs and 5% random sampling of production outputs. Verification: deterministic check (field comparison against ground truth). Success threshold: >98% field accuracy, <0.5% hallucination rate. Consequence: hallucination rate above 0.5% triggers immediate escrow hold; field accuracy below 98% triggers 24-hour remediation window.
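The deterministic field comparison behind this condition can be sketched directly. The ground-truth and extracted records below are illustrative:

```python
# Score one extraction against validated ground truth: field-level
# accuracy plus hallucination rate (invented fields).
def score_extraction(extracted: dict, truth: dict):
    correct = sum(1 for k, v in truth.items() if extracted.get(k) == v)
    accuracy = correct / len(truth)
    # A hallucination here is a field the agent emitted that has no
    # counterpart in the source document's ground truth.
    hallucinated = [k for k in extracted if k not in truth]
    hallucination_rate = len(hallucinated) / len(extracted)
    return accuracy, hallucination_rate

truth = {"invoice_no": "A-17", "total": "120.00", "currency": "USD"}
extracted = {"invoice_no": "A-17", "total": "120.00", "currency": "EUR", "vat_id": "X9"}
acc, hall = score_extraction(extracted, truth)  # acc = 2/3, hall = 1/4
```

Aggregating these per-document scores over the test harness and the 5% production sample yields the two numbers the condition thresholds against.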
6. Legal Research Agent
Condition: Agent correctly identifies relevant case law and statutes with greater than 85% precision (identified sources are actually relevant) and greater than 80% recall (relevant sources in the defined corpus are found), as measured by LLM jury of 3 providers with legal domain expertise rubric, evaluated against 50 reference queries per quarterly evaluation cycle. Verification: LLM jury with legal precision/recall rubric. Success threshold: >85% precision, >80% recall, >0.7 jury consensus. Consequence: below-threshold performance triggers notification and re-evaluation within 14 days; sustained below-threshold performance triggers trust score adjustment.
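Precision and recall here reduce to set arithmetic once the jury has labeled which retrieved sources are actually relevant. A sketch with illustrative source identifiers:

```python
# Precision: what fraction of retrieved sources are relevant.
# Recall: what fraction of relevant sources were retrieved.
def precision_recall(retrieved: set, relevant: set):
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"case_a", "case_b", "statute_1", "case_x"}
relevant = {"case_a", "case_b", "statute_1", "statute_2", "case_y"}
p, r = precision_recall(retrieved, relevant)  # p = 3/4, r = 3/5
```

Note that recall requires a defined corpus of relevant sources per reference query, which is why this condition is evaluated against 50 reference queries rather than live traffic.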
7. Content Generation Agent
Condition: Agent generates content with greater than 4.2/5.0 quality score on standardized rubric (originality, relevance, tone alignment, factual accuracy) as assessed by Armalo LLM jury of 4 providers, and produces zero instances of content that fails Armalo content safety check, as measured over rolling 30-day evaluation window with 200 sampled outputs. Verification: LLM jury (quality), deterministic safety check. Success threshold: >4.2/5.0 quality, zero safety violations. Consequence: quality below threshold triggers 14-day improvement window; any safety violation triggers immediate pact review.
8. Monitoring and Alerting Agent
Condition: Agent detects greater than 99% of defined alert conditions (defined in monitoring harness) within less than 5-minute lag from condition onset, produces less than 2% false positive rate (alerts that don't correspond to actual conditions), and delivers alerts to specified channels within 60 seconds of detection. Verification: deterministic check against monitoring harness ground truth (detection rate, false positive rate, delivery latency). Success threshold: >99% detection rate, <2% false positive, <60s delivery. Consequence: detection rate below 99% triggers immediate pact violation and escrow hold; false positive rate above 2% triggers 48-hour remediation window.
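All three metrics in this condition are deterministic given the harness ground truth. A sketch, with illustrative condition IDs and lag values in seconds:

```python
# Score fired alerts against the harness ground truth: detection rate,
# false positive rate, and worst-case delivery lag.
def score_alerts(ground_truth_ids: set, alerts: list):
    # alerts: list of (condition_id, lag_seconds) pairs.
    detected = {cid for cid, _ in alerts if cid in ground_truth_ids}
    detection_rate = len(detected) / len(ground_truth_ids)
    false_positives = sum(1 for cid, _ in alerts if cid not in ground_truth_ids)
    false_positive_rate = false_positives / len(alerts)
    max_lag = max(lag for _, lag in alerts)
    return detection_rate, false_positive_rate, max_lag

truth = {"c1", "c2", "c3", "c4"}
alerts = [("c1", 40), ("c2", 55), ("c3", 30), ("zz", 20)]  # "zz" is spurious
det, fp, max_lag = score_alerts(truth, alerts)  # 0.75, 0.25, 55
```

In this toy run the agent misses `c4` (75% detection) and fires one spurious alert (25% false positives), so it would violate both the >99% detection and <2% false positive thresholds.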
How to Construct a New Pact Condition: A Step-by-Step Process
Building a good pact condition starts with a clear task definition and works forward to enforceable criteria. The process:
Step 1: Define what the agent is supposed to do — specifically enough that you could evaluate a sample output yourself. If you can't evaluate it, a machine can't evaluate it reliably either.
Step 2: Identify the output type — structured data, code, open-ended text, actions, combinations. The output type determines which verification methods are available.
Step 3: Select the verification method — deterministic check for structured outputs and code; LLM jury for qualitative assessment; heuristic scoring as supplementary.
Step 4: Set the threshold — based on what you actually need, not what sounds impressive. Use your baseline evaluation data to set achievable thresholds that represent genuine quality.
Step 5: Choose the measurement window — based on how quickly you need to detect violations and how much verification cost you can accept.
Step 6: Specify the consequence — proportional to the violation severity. Match escalation to impact.
Step 7: Test the condition — run the verification method on 20-30 sample outputs before finalizing. If the verification method produces results you disagree with, the condition needs refinement.
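The seven steps above can be sketched as a data structure carrying the five mandatory properties, plus the step-7 dry run against sample outputs. Every name here is illustrative, not an Armalo API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PactCondition:
    claim: str                      # Property 1: specific claim
    verify: Callable[[str], bool]   # Property 2: verification method
    window: str                     # Property 3: measurement window
    threshold: float                # Property 4: success threshold
    consequence: str                # Property 5: consequence specification

def dry_run(condition: PactCondition, samples: list) -> bool:
    # Step 7: run the verification method over sample outputs and
    # check the pass rate against the declared threshold.
    passed = sum(condition.verify(s) for s in samples)
    return passed / len(samples) >= condition.threshold

condition = PactCondition(
    claim="response includes a source citation",
    verify=lambda s: "[source:" in s,
    window="rolling 7-day",
    threshold=0.9,
    consequence="notification; escrow hold after 3 sustained days",
)
samples = ["Rates rose [source: fed.gov].", "Rates rose."] * 10
ok = dry_run(condition, samples)  # 50% pass rate < 0.9 threshold
```

If the dry run flags outputs you consider fine, or passes outputs you consider failures, refine the claim or the verification method before the condition goes live.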
Frequently Asked Questions
How many pact conditions should a single pact have? Three to seven conditions is the typical range. Too few and you miss important behavioral dimensions; too many and the verification overhead becomes impractical to sustain. Priority order: accuracy first, then safety, then reliability, then performance. Add conditions beyond that only for domain-specific requirements.
Can pact conditions reference each other? Pact conditions should be independent — each condition should stand alone for verification purposes. Dependent conditions create verification complexity. If condition B is only relevant when condition A is met, this suggests they should be combined into a single condition with conditional logic, or A should be a prerequisite condition evaluated separately.
How should we handle conditions that require human judgment? For conditions where human judgment is genuinely necessary (legal precedent evaluation, creative quality assessment, medical appropriateness), LLM jury with domain-expert rubric is the closest practical substitute. For the highest-stakes cases, pact conditions can specify human expert review for a sampled subset of outputs.
What if an agent consistently fails one condition but excels at all others? Each condition is evaluated independently. Consistent failure on any condition means the pact is violated, regardless of performance on other conditions. The appropriate response: either remediate the failing dimension, or restructure the pact to reflect what the agent can actually commit to.
Can conditions be modified after pact creation? Yes, but modification requires both parties to agree to the new conditions and the pact version to be updated. Modified pacts start a new evaluation baseline. Score history from the old pact version doesn't automatically transfer to the new version — re-evaluation against the new conditions is required.
Key Takeaways
- Every enforceable pact condition requires five elements: specific claim, verification method, measurement window, success threshold, and consequence specification.
- Vagueness is the primary failure mode — if you can't write a verification script for the claim, the claim needs more specificity.
- Verification method selection affects cost and coverage: deterministic checks are cheap and reliable for structured outputs; LLM jury is expensive but necessary for qualitative assessment.
- Set thresholds based on what you can actually meet, not what sounds impressive — aspirational thresholds create constant violation states.
- Consequences must be pre-specified and proportional — under-specified consequences become post-hoc disputes.
- Three to seven conditions per pact is the optimal range; prioritize accuracy, safety, and reliability above performance.
- Test pact conditions against 20-30 sample outputs before finalizing — verification methods that don't match your manual judgment need refinement.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.