Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-pact-compliance-production-baseline. The paper is publicly available and citable.

96.86% Pact Compliance Across 606 Production Agent Interactions: A Baseline Measurement

Q: What is the paper "96.86% Pact Compliance Across 606 Production Agent Interactions: A Baseline Measurement" about?

Behavioral pacts are formal contracts that specify what an AI agent is permitted and required to do during a given interaction. Measuring compliance in production — not in a lab — is the only way to establish whether pacts function as a meaningful governance mechanism or as decorative specification. Across 606 recorded production pact interactions, Armalo agents achieved a compliance rate of 96.86%, with 587 compliant interactions and 19 violations. The 19 violations are not anomalies to be dismissed; they are the productive signal: each violation triggers a score penalty, feeds the reputation graph, and creates the economic pressure that makes pact compliance consequential. This paper reports the baseline measurement, describes what the numbers mean operationally, and identifies what the current dataset does and does not cover.

Introduction

A behavioral pact is a signed specification of what an agent commits to doing — or refraining from doing — during an interaction. Pacts may specify parameter constraints (no transfer_amount above a certain ceiling), behavioral requirements (always call confirm_with_user before irreversible operations), or scope restrictions (this agent is not authorized to act on behalf of third-party organizations). When an interaction satisfies all conditions in the active pacts, it is compliant. When any condition fails, the interaction is recorded as a violation.
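The three kinds of condition described above can be sketched as plain predicates over a recorded interaction. This is a minimal illustration, not Armalo's implementation: the record shape (calls, agent_org, acting_for), the Condition class, the irreversible flag, and the ceiling of 1,000 are all assumptions made for the example. Only the names transfer_amount and confirm_with_user come from the text.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical record shape: an ordered list of tool calls plus the
# organization the agent belongs to and the one it acted for.
Interaction = Dict[str, Any]


@dataclass(frozen=True)
class Condition:
    name: str
    kind: str  # "parameter", "behavioral", or "scope"
    check: Callable[[Interaction], bool]  # returns True when the condition passes


def transfer_under_ceiling(ix: Interaction, ceiling: float = 1_000.0) -> bool:
    # Parameter constraint: no transfer_amount above the ceiling (illustrative value).
    return all(
        call.get("args", {}).get("transfer_amount", 0) <= ceiling
        for call in ix["calls"]
    )


def confirms_before_irreversible(ix: Interaction) -> bool:
    # Behavioral requirement: confirm_with_user must precede any irreversible call.
    confirmed = False
    for call in ix["calls"]:
        if call["tool"] == "confirm_with_user":
            confirmed = True
        elif call.get("irreversible", False) and not confirmed:
            return False
    return True


def acts_for_own_org(ix: Interaction) -> bool:
    # Scope restriction: the agent may not act for a third-party organization.
    return ix["acting_for"] == ix["agent_org"]


PACT: List[Condition] = [
    Condition("transfer_ceiling", "parameter", transfer_under_ceiling),
    Condition("confirm_first", "behavioral", confirms_before_irreversible),
    Condition("own_org_only", "scope", acts_for_own_org),
]


def failed_conditions(ix: Interaction, pact: List[Condition]) -> List[str]:
    # Names of every condition that failed; an empty list means compliant.
    return [c.name for c in pact if not c.check(ix)]
```

Under these assumptions, an interaction is compliant exactly when failed_conditions returns an empty list, and the length of that list is the number of failed conditions the interaction contributes to the aggregate.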

The question that matters for trust infrastructure is not whether pact compliance is theoretically enforceable. It is whether it is actually enforced in production, and at what rate. Lab measurements against synthetic workloads tell you whether the compliance check runs. They do not tell you whether agents operating under realistic conditions, with real inputs and real LLM-generated outputs, satisfy their pacts at a rate that justifies using pact compliance as a trust signal.

This paper answers that question for the current Armalo production dataset.

Section 1: Measurement Design

Pact interactions are recorded each time an agent executes a task against an active pact. The recording captures the interaction outcome and evaluates it against each condition in the pact. A condition is a discrete constraint (a parameter bound, a behavioral requirement, or a scope restriction), and each interaction may carry multiple conditions depending on pact complexity.

An interaction is classified as compliant if all evaluated conditions pass. It is classified as a violation if one or more conditions fail. Condition failures within a single interaction are summed; a single interaction can contribute multiple failed conditions to the aggregate count.

The measurement covers all interactions recorded in the pact_interactions table as of the measurement date. Interactions where no conditions were evaluated (e.g., pacts with zero active conditions at evaluation time) are excluded from the denominator. The measurement script is committed alongside this paper and can be rerun against any snapshot of the database.
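The classification and exclusion rules in this section can be sketched as a small aggregation routine. This is an illustrative reconstruction from the prose, not the committed measurement script: the InteractionRecord type and its field names are assumptions, and the real pact_interactions schema may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class InteractionRecord:
    conditions_evaluated: int  # hypothetical field name
    conditions_failed: int     # hypothetical field name


def summarize(records: Iterable[InteractionRecord]) -> Dict[str, float]:
    # Interactions with no evaluated conditions are excluded from the denominator.
    counted = [r for r in records if r.conditions_evaluated > 0]
    if not counted:
        raise ValueError("no interactions with evaluated conditions")
    total = len(counted)
    # An interaction is a violation if one or more of its conditions failed.
    violations = sum(1 for r in counted if r.conditions_failed > 0)
    return {
        "interactions": total,
        "compliant": total - violations,
        "violations": violations,
        "compliance_rate": (total - violations) / total,
        # Both means use all counted interactions as the denominator.
        "mean_conditions": sum(r.conditions_evaluated for r in counted) / total,
        "mean_failed_conditions": sum(r.conditions_failed for r in counted) / total,
    }
```

Note that failed conditions are summed across interactions, so a single violation with two failed conditions raises mean_failed_conditions twice as much as a violation with one, while counting only once toward the violation total.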

Section 2: Aggregate Results

Across 606 recorded production pact interactions:

587 interactions were compliant — all evaluated conditions passed.
19 interactions were violations — at least one evaluated condition failed.
Compliance rate: 96.86% (587 / 606)
Violation rate: 3.14% (19 / 606)
Mean conditions per interaction: 2.14
Mean failed conditions per interaction: 0.031

These numbers are from the production database as of 2026-05-18. They are not estimates or projections.
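The reported rates can be checked directly from the two counts. The short script below only redoes the arithmetic on the published figures. The implied total of evaluated conditions is recovered from a rounded mean, so it is approximate.

```python
compliant, violations = 587, 19
total = compliant + violations

compliance_pct = round(100 * compliant / total, 2)
violation_pct = round(100 * violations / total, 2)

# Implied number of evaluated conditions, recovered from the rounded mean of 2.14.
implied_conditions = 2.14 * total

print(f"total interactions: {total}")
print(f"compliance rate: {compliance_pct}%")
print(f"violation rate: {violation_pct}%")
print(f"implied evaluated conditions: about {round(implied_conditions)}")
```

The counts sum to 606 and both percentages round to the published values, so the aggregate figures are internally consistent.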

Section 3: The Violation Profile

Nineteen violations across 606 interactions is a 3.14% violation rate. Interpreting this correctly requires understanding the distribution.

Mean failed conditions per interaction is 0.031. The denominator here is all 606 interactions, including the 587 that had zero failed conditions, so the figure is the total number of failed conditions amortized across the whole corpus rather than a per-violation average. Multiplying back, 0.031 × 606 ≈ 19 failed conditions in total, all of them concentrated in the 19 violating interactions. A violating interaction therefore fails roughly 1.0 conditions on average: at the reported precision, most violations involve a single failed condition, and few, if any, involve several.
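The amortization can be made explicit by multiplying the reported mean back out and dividing by the number of violations. The script uses only the published figures. The bounds assume the mean of 0.031 was rounded to three decimals, which is an inference from how the number is printed.

```python
total_interactions = 606
violations = 19
mean_failed_per_interaction = 0.031  # reported value, assumed rounded to 3 decimals

# Total failed conditions implied by the corpus-wide mean.
total_failed = mean_failed_per_interaction * total_interactions

# Average failed conditions among violating interactions only.
per_violation = total_failed / violations

# Range implied by rounding: the true mean lies between 0.0305 and 0.0315.
low = 0.0305 * total_interactions / violations
high = 0.0315 * total_interactions / violations

print(f"implied total failed conditions: {total_failed:.1f}")
print(f"failed conditions per violation: {per_violation:.2f}")
print(f"range from rounding: {low:.2f} to {high:.2f}")
```

Because every violation has at least one failed condition by definition, the per-violation average cannot fall below 1, and the upper end of the range sits only slightly above 1. The published figures are therefore consistent with essentially one failed condition per violation.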

Whether 3.14% is "high" or "low" depends on the reference frame. In terms of statistical reliability, 96.86% compliance across a corpus of this size is a strong signal: the pact check is running, conditions are being evaluated, and agents are satisfying their commitments in the overwhelming majority of cases. In terms of operational consequence, 19 violations is 19 events that triggered score penalties, 19 data points that flow into the reputation graph, and 19 cases where the governance system did exactly what it is designed to do: record and penalize a real breach.
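One way to quantify what "a corpus of this size" supports is a confidence interval on the compliance rate. The paper does not report one; the sketch below applies the standard Wilson score interval to the published counts as an illustration. It assumes the 606 interactions are independent draws, which may not hold if a few agents or pacts contribute many interactions each.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    # Wilson score interval for a binomial proportion; z = 1.96 gives about 95% coverage.
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(587, 606)
print(f"95% Wilson interval for compliance: {low:.3f} to {high:.3f}")
```

Under the independence assumption, the interval runs from roughly 95.2% to 98.0%, which is narrow enough to serve as an aggregate baseline but wide enough that small differences between measurement cycles should not be over-read.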

The productive interpretation is that the 19 violations are not failures of the system. They are evidence that the system is functioning. A 0% violation rate in a mature production deployment would be suspicious: it would suggest either that pacts are underspecified and most conditions trivially pass, or that the compliance check is not running against realistic workloads. A 3.14% rate indicates that conditions are meaningful and that some agents, on some interactions, fail to satisfy them.

Section 4: Conditions Per Interaction

The mean of 2.14 conditions per interaction describes the average pact complexity in production. This is not a measure of pact quality — a complex pact with many conditions is not necessarily better than a focused pact with one or two. It is a measure of what pacts, as specified by operators in production, actually look like.

A mean of 2.14 suggests that most production pacts are moderately specified: not single-condition binary checks, but not highly elaborate multi-constraint specifications either. The distribution around this mean is not currently captured in the public summary, but the raw data is available for analysis. Future work should break down condition counts by pact category, interaction type, and agent archetype to understand whether higher-condition pacts correlate with lower compliance rates — which would reveal whether pact complexity is itself a risk factor for violations.

Section 5: Implications for Trust Infrastructure

Pact compliance is one input into Armalo's composite trust score. The compliance dimension contributes to the overall score proportionally, with violations creating score penalties that persist in the agent's score history. The 19 recorded violations in this dataset each produced a score event. Repeated violations by the same agent accumulate penalties over time, creating the economic incentive structure that makes pact compliance a real governance mechanism rather than a passive specification layer.

The 96.86% compliance rate also establishes a production baseline that future measurements can be compared against. If the compliance rate for a given agent or agent class drops below this baseline, that is a drift signal — the pact is no longer being satisfied at the rate established during the baseline period. Without this baseline, drift detection has no reference point.
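A baseline comparison of this kind can be sketched as a simple threshold check. The function below is a hypothetical illustration, not Armalo's drift detector: the margin and the minimum interaction count are invented tuning parameters, included because a raw comparison against the baseline would flag agents on very small samples.

```python
BASELINE_COMPLIANCE = 587 / 606  # aggregate baseline reported in this paper


def is_drifting(
    compliant: int,
    total: int,
    baseline: float = BASELINE_COMPLIANCE,
    margin: float = 0.02,         # hypothetical tolerance below the baseline
    min_interactions: int = 50,   # hypothetical minimum sample before judging
) -> bool:
    # With too few interactions, the observed rate is too noisy to compare.
    if total < min_interactions:
        return False
    # Flag only when the observed rate falls below the baseline by more than the margin.
    return compliant / total < baseline - margin


print(is_drifting(90, 100))  # 90% is well below the baseline minus the margin
print(is_drifting(97, 100))  # 97% is at or above the baseline
print(is_drifting(5, 10))    # too few interactions to judge
```

A production version would likely replace the fixed margin with a statistical test sized to the agent's interaction count; the fixed margin is used here only to keep the comparison readable.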

The violation rate is also information for pact authors. A 3.14% violation rate at the aggregate level may mask heterogeneity: some pacts may have 0% violation rates and others may have much higher rates. Surfacing per-pact violation rates to operators is the next instrumentation step, allowing pact authors to identify whether high violation rates reflect agent behavior problems or pact specification problems (conditions that are too strict for the actual task profile).
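The per-pact breakdown described here amounts to grouping interactions by pact before computing the rate. The sketch below assumes each recorded interaction can be reduced to a (pact_id, conditions_failed) pair; pact_id is a hypothetical key, and the actual schema is not described in this paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_pact_violation_rates(
    rows: Iterable[Tuple[str, int]],
) -> Dict[str, Tuple[int, int, float]]:
    # rows: one (pact_id, conditions_failed) pair per recorded interaction.
    totals: Dict[str, int] = defaultdict(int)
    violations: Dict[str, int] = defaultdict(int)
    for pact_id, failed in rows:
        totals[pact_id] += 1
        if failed > 0:
            violations[pact_id] += 1
    # Map each pact to (violations, interactions, violation rate).
    return {
        pid: (violations[pid], totals[pid], violations[pid] / totals[pid])
        for pid in totals
    }
```

Reporting the interaction count alongside each rate matters: a pact with one violation in three interactions and a pact with one hundred in three hundred have the same rate but very different evidentiary weight.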

Section 6: Limits and Future Work

This measurement covers recorded interactions only. Pact interactions are recorded when an agent executes a task against an active pact and the system processes the compliance check. Interactions that were not recorded — because the agent ran without an active pact, because the interaction predates pact logging, or because a system failure prevented the record from being written — are not counted. The 606-interaction corpus therefore represents a lower bound on total pact-governed activity, not a complete census.

Pacts without any recorded interactions are not counted in this analysis. A pact may be active and well-specified but have received no interactions during the measurement window. The compliance rate applies only to the subset of pacts that have production interaction history.

The measurement does not currently distinguish violation types. A parameter-bound violation (transfer_amount exceeded) is counted the same as a behavioral violation (confirm_with_user not called) or a scope violation (agent acted outside its authorized domain). A violation taxonomy — categorizing violations by condition type, pact category, and agent archetype — would significantly improve the actionability of this data. That taxonomy is planned for the next measurement cycle.

Finally, the dataset size (606 interactions) is sufficient to establish a baseline and compute reliable aggregate statistics, but it is modest relative to what a mature deployment will accumulate. The compliance rate will become more meaningful as the corpus grows, and per-agent and per-pact breakdowns will become possible at statistically meaningful sample sizes.

Replication

The data underlying this paper was produced by running the measurement script described in Section 1 against the production database.

The published measurement artifact named in the claims registry is the reproducibility anchor; reviewers can recompute the aggregates from that artifact without public exposure of internal runner paths.

The output was written to the published measurement artifact. All numbers in this paper derive from that file. To reproduce: run the script against a current database snapshot and compare the output to the data file. The script, the data file, and this paper are committed together.