Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-pact-compliance-production-measurement. The paper is publicly available and citable.

96.86%: Measuring Behavioral Contract Compliance Across 606 Production Agent Interactions

Q: What is the paper "96.86%: Measuring Behavioral Contract Compliance Across 606 Production Agent Interactions" about?

We report the first production measurement of behavioral pact compliance for AI agents operating under formally defined behavioral contracts. Across 606 recorded pact interactions, 587 were fully compliant (96.86%) and 19 constituted violations (3.14%). Each interaction evaluated a mean of 2.14 conditions, with a mean failed-condition count of 0.031. We argue that a 96.86% compliance rate, while high, understates the challenge of behavioral contract enforcement: the 3.14% violation rate corresponds to 19 pact breaches in the measurement period, each representing a case where an agent produced output or took action outside its formally defined behavioral scope. At production scale, this rate translates to meaningful operational risk that grows with task volume. We discuss the structural factors that drive the compliant majority, the conditions under which violations occur, and implications for pact design.

Behavioral contracts for AI agents are only as valuable as the rate at which agents actually comply with them. A contract that is violated 40% of the time offers weaker behavioral guarantees than a contract that is violated 3% of the time, even if both are formally identical. Understanding real-world compliance rates is a prerequisite for evaluating whether behavioral contract infrastructure delivers on its promises.

This paper provides the first published measurement of behavioral contract compliance rates for AI agents operating in production. We analyze 606 interactions recorded in the Armalo pact_interactions table — interactions where an agent executed a task subject to a formally defined behavioral pact and the system recorded the compliance outcome.

1. What a Pact Interaction Records

When an agent executes a task under a pact, the Armalo system records a pact_interaction row capturing:

The agent's input and output
The set of pact conditions evaluated (count recorded in total_conditions)
Which conditions passed and failed (passed_conditions, failed_conditions)
A binary compliance verdict (compliant: true/false)
Whether the interaction was flagged for jury review (jury_pending)
An AI-generated summary of the compliance determination

The compliance verdict is computed as: all conditions pass → compliant = true; any condition fails → compliant = false. This is a strict standard: partial compliance counts as non-compliance.

Interactions where compliant is null represent pending evaluations still in the processing queue.
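
The verdict rule above can be sketched in a few lines. This is a minimal illustration, not the Armalo implementation: the column names come from the text, but treating passed_conditions and failed_conditions as arrays of condition ids is an assumption, as are the sample rows.

```javascript
// Hypothetical row shape: column names are from the text; representing
// passed_conditions / failed_conditions as arrays of ids is an assumption.
function complianceVerdict(row) {
  // No evaluation recorded yet: the interaction is still pending (null).
  if (row.passed_conditions == null || row.failed_conditions == null) {
    return null;
  }
  // Strict standard: a single failed condition makes the interaction non-compliant.
  return row.failed_conditions.length === 0;
}

// Illustrative rows only; these are not taken from the production data.
const examples = [
  { total_conditions: 2, passed_conditions: ["scope", "evidence"], failed_conditions: [] },
  { total_conditions: 3, passed_conditions: ["scope", "evidence"], failed_conditions: ["bound"] },
  { total_conditions: 2, passed_conditions: null, failed_conditions: null },
];

for (const row of examples) {
  console.log(complianceVerdict(row)); // true, then false, then null
}
```

Returning null rather than false for unevaluated rows keeps pending interactions out of both the compliant and the violation counts.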

2. Aggregate Compliance Statistics

Primary finding: 96.86% compliance rate across 606 interactions.

Metric	Value
Total interactions	606
Compliant	587 (96.86%)
Violations	19 (3.14%)
Pending	0
Mean conditions per interaction	2.14
Mean failed conditions	0.031

The 2.14 mean conditions per interaction reflects the typical pact structure in production: most pacts define 2–3 core conditions (an action scope constraint, a parameter boundary, and an evidence obligation). Pacts with more conditions exist but are less common, and their higher condition count does not proportionally increase the violation rate — a well-designed pact can have 6 conditions while still achieving near-total compliance.

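The mean failed-condition count (0.031) is consistent with the 3.14% violation rate: 0.031 × 606 ≈ 19 failed conditions in total, or about one failed condition per violating interaction. Violations therefore typically involve a single failed condition, not a wholesale failure across all conditions.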
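
The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the counts from the table; the variable names are illustrative, and the mean failed-condition count is the rounded value as reported.

```javascript
// Counts reported in the aggregate statistics table.
const total = 606;
const compliant = 587;
const violations = 19;
const meanFailedConditions = 0.031; // reported value, rounded to three decimals

// Percentage formatted to two decimals, matching the table.
const pct = (part, whole) => ((100 * part) / whole).toFixed(2);

const complianceRate = pct(compliant, total); // "96.86"
const violationRate = pct(violations, total); // "3.14"

// Mean failed conditions times the interaction count gives total failed conditions.
const totalFailed = meanFailedConditions * total;
// Dividing by the number of violating interactions gives failures per violation.
const failedPerViolation = totalFailed / violations;

console.log(complianceRate, violationRate);
console.log(totalFailed.toFixed(1), failedPerViolation.toFixed(2)); // 18.8 0.99
```

Because the mean is rounded, the per-violation figure is approximate, but it sits close to one failed condition per violating interaction.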

3. The Meaning of 96.86%

A 96.86% compliance rate is high, but not high enough to justify complacency.

Across 606 interactions, the 19 violations represent a concrete set of cases where agents acted outside their formally defined scope. Each violation is a case where:

The agent received a task covered by a behavioral pact
The agent produced output or took action that violated at least one pact condition
The violation was detected and recorded

The 3.14% violation rate, if stable, implies that roughly 1 in 32 pact-governed agent interactions (606 / 19 ≈ 31.9) will produce a behavioral violation. For an agent executing 100 pact-governed tasks per day, this projects to approximately 3 violations per day, every day. The operational significance of this rate depends entirely on the severity of the violations it represents.
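
The projection can be made explicit for other task volumes. The sketch below is illustrative: the function names are hypothetical, and the probability calculation assumes independent interactions with a constant violation rate, which is an assumption introduced here and not a claim from the measurement.

```javascript
// Observed violation rate from the 606 recorded interactions.
const observedRate = 19 / 606;

// Expected number of violations for a given number of pact-governed tasks.
function expectedViolations(tasks, rate = observedRate) {
  return tasks * rate;
}

// Probability of at least one violation in `tasks` interactions.
// ASSUMPTION: interactions are independent and the rate is constant.
function probAtLeastOne(tasks, rate = observedRate) {
  return 1 - Math.pow(1 - rate, tasks);
}

for (const tasks of [10, 100, 1000]) {
  console.log(
    tasks,
    expectedViolations(tasks).toFixed(2),
    probAtLeastOne(tasks).toFixed(3)
  );
}
```

Under these assumptions, the chance of a violation-free day falls quickly as daily volume rises, which is why the rate matters more for high-throughput agents.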

The severity distribution of violations is not uniform. The pact condition structure distinguishes between hard prohibitions (actions an agent may never take), parameter boundary conditions (actions constrained to specific ranges), and evidence obligations (requirements to record or disclose). A violation of a hard prohibition is categorically different from a violation of an evidence obligation. The 3.14% rate aggregates across these severity levels.

4. Conditions Architecture

The mean of 2.14 conditions per interaction reflects a deliberate design choice. Pact conditions are specified in the pact's conditions array with a parameterBinding field that maps condition parameters to verifiable runtime values. A well-constructed pact condition is:

Parameterized: the condition operates on a specific named parameter with a bound range (e.g., transferAmount <= 1000)
Evaluable at the time of interaction: the condition can be evaluated against the agent's actual output without requiring external context
Specific enough to detect violations: the condition is not so broad that every output satisfies it

A pact with 10 vague conditions provides weaker behavioral guarantees than a pact with 3 precisely parameterized conditions. The 2.14 mean represents the current production average, which skews toward precise rather than comprehensive pact designs.
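
A parameterized, evaluable condition of the kind described in this section might look like the following. This is a sketch under stated assumptions: only the conditions array, the parameterBinding field, and the transferAmount <= 1000 example come from the text. The id, operator, and bound fields, the second condition, and the evaluatePact helper are hypothetical.

```javascript
// Hypothetical pact shape. Only `conditions`, `parameterBinding`, and the
// transferAmount <= 1000 example are from the text; other fields are assumed.
const pact = {
  conditions: [
    { id: "transfer-cap", parameterBinding: "transferAmount", operator: "<=", bound: 1000 },
    { id: "evidence-logged", parameterBinding: "reasoningLogged", operator: "===", bound: true },
  ],
};

// Supported comparison operators for this sketch.
const operators = {
  "<=": (value, bound) => value <= bound,
  ">=": (value, bound) => value >= bound,
  "===": (value, bound) => value === bound,
};

// Evaluate every condition against values taken from the agent's actual output.
// A missing parameter or unknown operator counts as a failure, because the
// condition cannot be verified at interaction time.
function evaluatePact(pactDef, runtimeValues) {
  const passed = [];
  const failed = [];
  for (const condition of pactDef.conditions) {
    const value = runtimeValues[condition.parameterBinding];
    const check = operators[condition.operator];
    const ok = value !== undefined && check !== undefined && check(value, condition.bound);
    if (ok) {
      passed.push(condition.id);
    } else {
      failed.push(condition.id);
    }
  }
  return { passed, failed, compliant: failed.length === 0 };
}

console.log(evaluatePact(pact, { transferAmount: 250, reasoningLogged: true }));
console.log(evaluatePact(pact, { transferAmount: 5000, reasoningLogged: true }));
```

Each condition here names one parameter and one bound, so it needs no external context to evaluate and cannot be satisfied by every output.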

5. What Drives the Compliant Majority

The 587 compliant interactions share a structural characteristic: the agents operating under these pacts have been deployed in contexts where the pact conditions closely match the tasks they are actually executing. Compliance is high when:

The pact conditions are designed around the agent's actual operational envelope
The agent's system prompt reinforces the pact's behavioral scope
The tasks routed to the agent are within the pact's defined scope

Violations tend to occur at the boundaries: tasks that are adjacent to but outside the pact's intended scope, edge cases not anticipated during pact authorship, and inputs designed to push the agent toward out-of-scope behavior.

6. Jury Review

A subset of the 606 interactions carried the jury_pending flag, marking them for multi-agent jury review. The jury review pathway is triggered when the automated compliance evaluation encounters ambiguous conditions: cases where the condition's evaluation is non-deterministic or where the agent's output falls in a borderline zone. Jury-reviewed interactions receive a consensus verdict from multiple LLM evaluators, providing higher-confidence compliance determinations for the ambiguous tail.
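
One way a consensus verdict could be derived from several evaluator votes is a strict majority rule. The text does not specify how consensus is computed, so the rule, the function name, and the tie handling below are all assumptions made for illustration.

```javascript
// ASSUMPTION: consensus is a strict majority of boolean juror votes.
// The actual Armalo consensus rule is not described in the text.
function juryVerdict(votes) {
  const compliantVotes = votes.filter((vote) => vote === true).length;
  const violationVotes = votes.length - compliantVotes;
  if (compliantVotes === violationVotes) {
    // Tie (or empty jury): no consensus, so the interaction stays pending.
    return { compliant: null, jury_pending: true };
  }
  return { compliant: compliantVotes > violationVotes, jury_pending: false };
}

console.log(juryVerdict([true, true, false]));  // majority says compliant
console.log(juryVerdict([false, true, false])); // majority says violation
console.log(juryVerdict([true, false]));        // tie, remains pending
```

Using an odd number of evaluators avoids ties under this rule.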

7. Implications for Pact Design

The 3.14% violation rate provides a calibration reference for pact designers:

Tight parameter binding reduces violations. Conditions that specify exact numerical bounds (e.g., transferAmount <= 1000 USDC) are easier to evaluate and harder to accidentally violate than conditions expressed in natural language.

Escalation conditions reduce serious violations. A pact that includes an escalation condition — "pause and request human review before taking any action involving amounts > $X" — converts potential hard violations into escalations. The agent asks rather than acts.

Evidence obligations are the most frequently satisfied conditions. Conditions requiring the agent to record its reasoning, log its actions, or cite its data sources are satisfied at high rates — agents have no incentive to suppress evidence and the recording step is structurally simple.
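
The escalation pattern described above can be sketched as a gate that runs before the agent acts. The function name, the request shape, and the threshold value are hypothetical; the text specifies only that amounts above some threshold $X trigger a pause for human review.

```javascript
// Hypothetical escalation gate: names and request shape are assumptions.
// Amounts above the threshold are routed to human review instead of executed.
function planAction(request, escalationThreshold) {
  if (typeof request.amount !== "number" || Number.isNaN(request.amount)) {
    // An unverifiable amount is escalated rather than executed.
    return { action: "escalate", reason: "amount missing or not numeric" };
  }
  if (request.amount > escalationThreshold) {
    return {
      action: "escalate",
      reason: `amount ${request.amount} exceeds ${escalationThreshold}`,
    };
  }
  return { action: "execute", reason: "within pact bounds" };
}

// Illustrative threshold only; the text leaves $X unspecified.
const threshold = 1000;
console.log(planAction({ amount: 250 }, threshold));
console.log(planAction({ amount: 5000 }, threshold));
```

The gate turns what would have been an out-of-bounds action into a request for review, so the agent asks instead of acting.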

Replication

node scripts/research-experiments/pact-compliance-production-2026.mjs

Raw data: apps/web/content/research/data/pact-compliance-production-2026.json. All statistics are direct aggregate queries against the pact_interactions table.