Behavioral contracts for AI agents are only as valuable as the rate at which agents actually comply with them. A contract that is violated 40% of the time offers weaker behavioral guarantees than a contract that is violated 3% of the time, even if both are formally identical. Understanding real-world compliance rates is a prerequisite for evaluating whether behavioral contract infrastructure delivers on its promises.
This paper provides the first published measurement of behavioral contract compliance rates for AI agents operating in production. We analyze 606 interactions recorded in the Armalo pact_interactions table: interactions where an agent executed a task subject to a formally defined behavioral pact and the system recorded the compliance outcome.
1. What a Pact Interaction Records
When an agent executes a task under a pact, the Armalo system records a pact_interaction row capturing:
- The agent's input and output
- The set of pact conditions evaluated (count recorded in total_conditions)
- Which conditions passed and failed (passed_conditions, failed_conditions)
- A binary compliance verdict (compliant: true/false)
- Whether the interaction was flagged for jury review (jury_pending)
- An AI-generated summary of the compliance determination
The compliance verdict is computed as follows: if all conditions pass, compliant = true; if any condition fails, compliant = false. This is a strict standard: partial compliance counts as non-compliance.
Interactions where compliant is null represent pending evaluations still in the processing queue.
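As a minimal sketch of the verdict rule above, assuming each evaluated condition reduces to a boolean pass/fail result (the function name and the list representation are illustrative, not the Armalo implementation):

```python
from typing import Optional, Sequence


def compliance_verdict(results: Optional[Sequence[bool]]) -> Optional[bool]:
    """Strict verdict: True only if every evaluated condition passed.

    `results` holds one pass/fail boolean per evaluated condition.
    None models an interaction still waiting in the processing queue.
    Note: an empty list would trivially pass, so this sketch assumes
    every pact defines at least one condition.
    """
    if results is None:
        return None  # pending: no verdict yet
    return all(results)


print(compliance_verdict([True, True, True]))   # True
print(compliance_verdict([True, False, True]))  # False (partial compliance)
print(compliance_verdict(None))                 # None (pending)
```

Returning None for pending evaluations keeps them out of both the compliant and the violation counts until a verdict exists.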
2. Aggregate Compliance Statistics
Primary finding: 96.86% compliance rate across 606 interactions.
| Metric | Value |
|---|---|
| Total interactions | 606 |
| Compliant | 587 (96.86%) |
| Violations | 19 (3.14%) |
| Pending | 0 |
| Mean conditions per interaction | 2.14 |
| Mean failed conditions | 0.031 |
The 2.14 mean conditions per interaction reflects the typical pact structure in production: most pacts define 2 to 3 core conditions (an action scope constraint, a parameter boundary, and an evidence obligation). Pacts with more conditions exist but are less common, and their higher condition count does not proportionally increase the violation rate; a well-designed pact can have 6 conditions while still achieving near-total compliance.
The mean failed conditions (0.031) is consistent with the 3.14% violation rate: 0.031 × 606 ≈ 19 failed conditions spread across 19 violations. When violations occur, they typically involve a single failed condition, not a wholesale failure across all conditions.
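The aggregates in the table can be recomputed from per-interaction records. The sketch below assumes each record exposes the compliant, total_conditions, and failed_conditions fields described in Section 1, and that failed_conditions is stored as a count (the source does not say whether it is a count or a list). The sample rows are synthetic and only illustrate the calculation.

```python
from typing import Optional, TypedDict


class Interaction(TypedDict):
    compliant: Optional[bool]   # None = still pending
    total_conditions: int
    failed_conditions: int      # assumed to be a count, not a list


def summarize(rows: list[Interaction]) -> dict[str, float]:
    """Compute headline metrics over evaluated (non-pending) rows."""
    evaluated = [r for r in rows if r["compliant"] is not None]
    n = len(evaluated)
    if n == 0:
        raise ValueError("no evaluated interactions")
    compliant = sum(1 for r in evaluated if r["compliant"])
    return {
        "evaluated": n,
        "pending": len(rows) - n,
        "compliance_rate": compliant / n,
        "violation_rate": (n - compliant) / n,
        "mean_conditions": sum(r["total_conditions"] for r in evaluated) / n,
        "mean_failed": sum(r["failed_conditions"] for r in evaluated) / n,
    }


# Synthetic rows: 3 compliant, 1 violation, 1 pending.
sample: list[Interaction] = [
    {"compliant": True, "total_conditions": 2, "failed_conditions": 0},
    {"compliant": True, "total_conditions": 2, "failed_conditions": 0},
    {"compliant": True, "total_conditions": 3, "failed_conditions": 0},
    {"compliant": False, "total_conditions": 2, "failed_conditions": 1},
    {"compliant": None, "total_conditions": 2, "failed_conditions": 0},
]
print(summarize(sample))

# The paper's headline figures follow directly from the reported counts.
print(round(587 / 606 * 100, 2))  # 96.86
print(round(19 / 606 * 100, 2))   # 3.14
```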
3. The Meaning of 96.86%
A 96.86% compliance rate is high, but not high enough to justify complacency.
At 606 interactions, the 19 violations represent a concrete set of cases where agents acted outside their formally defined scope. Each violation is a case where:
- The agent received a task covered by a behavioral pact
- The agent produced output or took action that violated at least one pact condition
- The violation was detected and recorded
The 3.14% violation rate, if stable, implies that roughly 1 in 32 pact-governed agent interactions will produce a behavioral violation (606 / 19 ≈ 31.9). For an agent executing 100 pact-governed tasks per day, this projects to approximately 3 violations per day, every day. The operational significance of this rate depends entirely on the severity of the violations it represents.
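The projection is simple arithmetic on the observed counts. A short sketch, assuming the violation rate stays stable over time; the 100 tasks per day comes from the example above, and the function name is illustrative:

```python
violations = 19
interactions = 606

violation_rate = violations / interactions   # observed per-interaction rate
one_in_n = interactions / violations         # interactions per violation


def expected_violations(tasks_per_day: int, rate: float = violation_rate) -> float:
    """Expected violations per day at a fixed per-task violation rate."""
    return tasks_per_day * rate


print(f"{violation_rate:.2%}")            # 3.14%
print(f"1 in {one_in_n:.1f}")             # 1 in 31.9
print(f"{expected_violations(100):.2f}")  # 3.14
```

This is an expected value only. Actual daily counts would vary around it, and the estimate rests on 19 observed violations.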
The severity distribution of violations is not uniform. The pact condition structure distinguishes between hard prohibitions (actions an agent may never take), parameter boundary conditions (actions constrained to specific ranges), and evidence obligations (requirements to record or disclose). A violation of a hard prohibition is categorically different from a violation of an evidence obligation. The 3.14% rate aggregates across these severity levels.
4. Conditions Architecture
The mean of 2.14 conditions per interaction reflects a deliberate design choice. Pact conditions are specified in the pact's conditions array with a parameterBinding field that maps condition parameters to verifiable runtime values. A well-constructed pact condition is:
- Parameterized: the condition operates on a specific named parameter with a bound range (e.g., transferAmount <= 1000)
- Evaluable at the time of interaction: the condition can be evaluated against the agent's actual output without requiring external context
- Specific enough to detect violations: the condition is not so broad that every output satisfies it
A pact with 10 vague conditions provides weaker behavioral guarantees than a pact with 3 precisely parameterized conditions. The 2.14 mean represents the current production average, which skews toward precise rather than comprehensive pact designs.
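To make the three properties concrete, here is a minimal sketch of a parameterized condition and its evaluation. The Condition class, its field names, and the operator table are hypothetical: the source only states that conditions live in a conditions array with a parameterBinding field, so the real schema may differ.

```python
import operator
from dataclasses import dataclass
from typing import Any, Callable, Mapping

# Comparison operators a condition may use (illustrative subset).
OPS: dict[str, Callable[[Any, Any], bool]] = {
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
    "==": operator.eq,
}


@dataclass(frozen=True)
class Condition:
    parameter_binding: str  # name of the runtime value to check (hypothetical field)
    op: str                 # key into OPS
    bound: float            # limit the runtime value is compared against


def evaluate(cond: Condition, output: Mapping[str, Any]) -> bool:
    """Evaluate one condition against the agent's output alone.

    A missing parameter counts as a failure: the condition cannot be
    verified, so it is not treated as satisfied.
    """
    if cond.parameter_binding not in output:
        return False
    return OPS[cond.op](output[cond.parameter_binding], cond.bound)


transfer_limit = Condition("transferAmount", "<=", 1000)
print(evaluate(transfer_limit, {"transferAmount": 250}))   # True
print(evaluate(transfer_limit, {"transferAmount": 1500}))  # False
print(evaluate(transfer_limit, {}))                        # False
```

Treating an unverifiable parameter as a failure is a design choice of this sketch. It keeps the condition evaluable from the output alone, at the cost of flagging outputs that omit the bound value.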
5. What Drives the Compliant Majority
The 587 compliant interactions share a structural characteristic: the agents operating under these pacts have been deployed in contexts where the pact conditions closely match the tasks they are actually executing. Compliance is high when:
- The pact conditions are designed around the agent's actual operational envelope
- The agent's system prompt reinforces the pact's behavioral scope
- The tasks routed to the agent are within the pact's defined scope
Violations tend to occur at the boundaries: tasks that are adjacent to but outside the pact's intended scope, edge cases not anticipated during pact authorship, and inputs designed to push the agent toward out-of-scope behavior.
6. Jury Review
A subset of the 606 interactions carried the jury_pending flag, marking them for multi-agent jury review. The jury review pathway is triggered when the automated compliance evaluation encounters ambiguous conditions: cases where the condition's evaluation is non-deterministic or where the agent's output falls in a borderline zone. Jury-reviewed interactions receive a consensus verdict from multiple LLM evaluators, providing higher-confidence compliance determinations for the ambiguous tail.
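The source does not specify how the jury reaches consensus. As one possible reading, the sketch below uses a simple majority vote over boolean juror verdicts and leaves ties unresolved. The function name, the tie handling, and the choice of majority voting are all assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def jury_verdict(votes: Sequence[bool]) -> Optional[bool]:
    """Majority vote over juror verdicts; None when there is no majority."""
    if not votes:
        return None  # no jurors have voted yet
    counts = Counter(votes)
    compliant, violating = counts[True], counts[False]
    if compliant == violating:
        return None  # tie: leave the interaction unresolved
    return compliant > violating


print(jury_verdict([True, True, False]))   # True
print(jury_verdict([False, True, False]))  # False
print(jury_verdict([True, False]))         # None
```

An odd number of jurors avoids ties under this rule. A stricter variant could require unanimity before returning a compliant verdict.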
7. Implications for Pact Design
The 3.14% violation rate provides a calibration reference for pact designers:
Tight parameter binding reduces violations. Conditions that specify exact numerical bounds (e.g., transferAmount <= 1000 USDC) are easier to evaluate and harder to accidentally violate than conditions expressed in natural language.
Escalation conditions reduce serious violations. A pact that includes an escalation condition, such as "pause and request human review before taking any action involving amounts > $X", converts potential hard violations into escalations. The agent asks rather than acts.
Evidence obligations are the most frequently satisfied conditions. Conditions requiring the agent to record its reasoning, log its actions, or cite its data sources are satisfied at high rates: agents have no incentive to suppress evidence and the recording step is structurally simple.
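The escalation pattern described above can be sketched as a guard that runs before the agent acts. The threshold value, the Decision enum, and the function name are hypothetical; the source only describes the behavior (pause and request human review above some amount $X).

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    ESCALATE = "escalate"


def check_escalation(amount: float, threshold: float) -> Decision:
    """Return ESCALATE when the amount exceeds the review threshold."""
    if amount > threshold:
        return Decision.ESCALATE  # pause and request human review
    return Decision.PROCEED


print(check_escalation(250, threshold=1000))   # Decision.PROCEED
print(check_escalation(5000, threshold=1000))  # Decision.ESCALATE
```

Because the guard returns a decision and performs no action itself, an over-threshold request ends in a review request, not in a recorded hard violation.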
Replication
node scripts/research-experiments/pact-compliance-production-2026.mjs
Raw data: apps/web/content/research/data/pact-compliance-production-2026.json. All statistics are direct aggregate queries against the pact_interactions table.