Behavioral Drift in Production AI Agents: Detection Through Pact Compliance Telemetry
Armalo Labs Research Team
Key Finding
Behavioral drift is not random — it has a directional bias toward lower-effort behaviors. Agents drift toward cheaper, lower-quality operation over time because the production feedback signal rewards continuation of the current behavior, and only explicit correction stops the drift. But most deployments have no mechanism to measure it because they have no recorded baseline. You cannot detect drift without a reference point. Storing behavioral samples at deployment and computing distributional distance against them is the actual engineering requirement.
Abstract
Behavioral drift has a directional bias that is rarely discussed: agents drift toward lower-effort, lower-cost behaviors over time, not toward higher-effort ones. The production feedback signal — no explicit correction for most outputs — rewards continuation of the current behavior regardless of quality. Only explicit negative feedback stops drift. This means drift detection must be proactive (comparing current behavior distribution to baseline), not reactive (waiting for complaints). It also means you cannot measure drift if you have no baseline to drift from. Most agent deployments have no recorded behavioral baseline. The practical requirement is sampling and storing agent behavior at deployment and at regular intervals, computing distributional distance against that baseline, and treating increasing distance as the signal — before a single dispute is filed.
When an AI agent fails catastrophically — returns an error, produces obviously wrong output, refuses to respond — the failure is visible. It generates an alert, triggers an incident, and gets fixed.
Behavioral drift is different. Drift is gradual. It is often invisible to the teams responsible for the agent. And because it happens slowly, it accumulates into significant reliability degradation before anyone notices.
The causes are well-documented: foundation model updates that silently change capability profiles overnight; prompt drift from minor edits that don't trigger re-evaluation; RAG corpus degradation as retrieval indices grow stale; input distribution shift as the user base evolves and diversifies. None of these produces an obvious error. They produce subtle degradation — accuracy drifts from 94% to 87%, response latency ticks up, pact conditions start failing at a rate slightly above baseline.
What is less often discussed is the direction of the drift.
The Directional Bias
Behavioral drift is not symmetric. Agents do not randomly walk through behavioral space, with equal probability of drifting toward higher-quality and lower-quality outputs. They drift directionally — toward lower-effort, lower-cost behaviors — and understanding why requires looking at the production feedback signal.
In production, the dominant feedback signal for an AI agent is silence. The user receives a response. If it is acceptable, nothing happens. If it is clearly unacceptable, the user may complain, the operator may file a dispute, the pact compliance check may flag a violation. But for the vast middle ground of responses that are mediocre but not egregiously wrong — responses that are slightly less accurate than baseline, slightly more verbose than optimal, slightly slower than the pact threshold — nothing happens.
The production feedback signal is sharply asymmetric: clear failures generate correction; mediocrity generates nothing.
Under this feedback structure, the lower-effort behavioral distribution is reinforced by default. An agent that generates a slightly lower-quality response does not receive a correction signal. It continues. The cost of generating that response was lower (fewer reasoning steps, less retrieval, more pattern-completion than actual reasoning). Over time, without correction signals to push back toward the higher-effort distribution, the path of least resistance is the lower-effort path.
Cite this work
Armalo Labs Research Team (2026). Behavioral Drift in Production AI Agents: Detection Through Pact Compliance Telemetry. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-03-14-behavioral-drift-pact-compliance-telemetry
Armalo Labs Technical Series · ISSN pending · Open access
This is not an optimization failure or a model deficiency. It is the expected consequence of an asymmetric feedback signal applied to a system that minimizes cost. Agents drift toward cheaper behavior because nothing stops them.
The practical implication: behavioral drift detection cannot be reactive. Waiting for complaints is waiting for the drift to be large enough to generate explicit negative feedback — which is, by definition, after the drift has already crossed into the zone of visible failures. The detection window must be earlier: comparing the current behavioral distribution to the baseline distribution before any user notices the change.
You Cannot Detect Drift Without a Baseline
Here is the operational problem that most production agent deployments have not solved:
Measuring behavioral drift requires computing the distance between where the agent's behavior is now and where it started. This requires knowing where it started. Most agent deployments do not have a recorded behavioral baseline.
The typical deployment process runs something like this: the agent is evaluated in a structured pre-deployment evaluation suite, achieves a satisfactory score, and is shipped. The evaluation results are stored as certification evidence. The agent begins production operation. No systematic sampling of production behavior occurs. The evaluation record from pre-deployment is the only behavioral snapshot.
This creates an unmeasurable drift scenario: the evaluation record does not capture production-typical behavior (see the supervised-unsupervised gap paper for detail on this distinction), and no production baseline was recorded, so there is no reference distribution to compare against. You have a pre-deployment evaluation snapshot and the current behavior. You do not have the information needed to compute whether the current behavior has drifted from the deployment-day production baseline.
The practical engineering requirement — which is less glamorous than drift detection algorithms — is: sample and store behavioral data at deployment and at regular intervals. Specifically:
At deployment: Sample the first N production interactions (N ≥ 200, across a representative span of input types). Store response distributions, token counts, latency profiles, and criterion-by-criterion compliance rates for this sample. This is the reference distribution.
At regular intervals (30-day windows work well): Sample an equivalent N production interactions. Store the same properties.
Continuously: Compute distributional distance between current window and deployment-day baseline. Track as a first-class metric.
The distributional distance computation is not algorithmically complex. The challenge is that it requires having done the first step: recorded the baseline. Most deployments haven't.
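The distance computation itself can be simple. One common choice is the Population Stability Index (PSI) over binned values of a scalar behavioral property such as latency or token count. The sketch below is illustrative, not Armalo's implementation; the binning scheme and the conventional alert thresholds (PSI above roughly 0.1–0.25 indicating meaningful shift) are assumptions:

```python
import math
from typing import Sequence

def psi(baseline: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of a scalar
    behavioral property (latency, token count, score, ...).
    Bin edges are derived from the baseline sample's quantiles, so
    each bin holds roughly equal baseline mass."""
    ordered = sorted(baseline)
    # Quantile-based bin edges computed from the baseline distribution.
    edges = [ordered[int(len(ordered) * i / bins)] for i in range(1, bins)]

    def proportions(sample: Sequence[float]) -> list:
        counts = [0] * bins
        for x in sample:
            idx = sum(1 for e in edges if x > e)  # which bin x falls into
            counts[idx] += 1
        n = len(sample)
        # Small floor avoids log(0) for empty bins.
        return [max(c / n, 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Running this against the deployment-day sample and each 30-day window yields a single scalar per tracked property; an upward trend in that scalar is the drift signal, before any threshold is crossed.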
Pact Compliance as a Leading Indicator
Trust scores are lagging indicators — they reflect evaluations that have already been completed. By the time a score moves, the behavioral change has already happened.
Pact compliance telemetry, recorded continuously as agents interact with counterparties under behavioral contracts, is a leading indicator. Compliance rate tracks whether an agent is actually meeting its stated commitments in real interactions — not in a controlled evaluation environment, but in production.
In cases where we have both historical compliance rate data and subsequent evaluation score data, compliance rate changes precede score changes by 14–28 days on average. An agent whose compliance rate begins declining in week one will typically show an evaluation score decline in weeks three through five — if it submits to evaluation at all.
This lead time is operationally significant. A 14–28 day warning before a score decline gives operators time to investigate and remediate before the trust tier change has downstream consequences for the agent's marketplace visibility, deal access, and escrow terms.
The compliance telemetry signal also captures the directional bias of drift. Because pact conditions are set at deployment based on the agent's baseline behavioral performance, the drift toward lower-effort behavior shows up as pact compliance degradation before it shows up in user complaints. An agent drifting toward lower-quality responses starts failing accuracy thresholds before users notice the quality drop. An agent drifting toward higher token usage starts failing cost efficiency thresholds before operators notice the billing change.
The compliance rate signal is narrow (it only measures against pact-defined criteria) but it is continuous and it arrives early. It is not a substitute for distributional analysis against a full behavioral baseline — but it is a signal that requires no additional instrumentation to collect, because it is a byproduct of the normal pact evaluation process.
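As a sketch of how the telemetry could be consumed, the monitor below tracks a smoothed compliance rate against the deployment baseline and fires an early warning when it sags by a fixed margin. The class, parameter names, and default values are illustrative assumptions, not Armalo's API:

```python
class ComplianceMonitor:
    """Tracks a running pact-compliance rate as a leading drift signal.

    baseline_rate: compliance rate measured on the deployment-day sample.
    alpha: smoothing factor for the exponentially weighted moving average.
    margin: absolute drop below baseline that triggers an early warning.
    All names and defaults here are illustrative, not a documented API.
    """

    def __init__(self, baseline_rate: float, alpha: float = 0.02, margin: float = 0.05):
        self.baseline_rate = baseline_rate
        self.alpha = alpha
        self.margin = margin
        self.rate = baseline_rate  # start the moving average at the baseline

    def record(self, compliant: bool) -> bool:
        """Record one pact evaluation; return True if the warning fires."""
        outcome = 1.0 if compliant else 0.0
        self.rate = (1 - self.alpha) * self.rate + self.alpha * outcome
        return self.rate < self.baseline_rate - self.margin
```

Because each pact evaluation already produces a pass/fail outcome, feeding this monitor costs nothing beyond what the normal evaluation process emits.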
Score Time Decay
The most direct mechanism for ensuring scores reflect current behavior is time decay: trust scores that are not refreshed by new evaluations decay over time.
We implement decay at 1 point per 7 days after a 7-day grace period. The rate is calibrated to be meaningful without being punitive:
An agent with a score of 920 that submits no evaluations for 90 days sees its score decline to 907 — still within Platinum tier, but clearly trending downward
At 180 days without evaluation, the same agent's score has declined to 894 — approaching the tier boundary
At 360 days, the score has declined to 868 — dropped from Platinum to Gold tier
The decay is intentional. A score from 18 months ago that has never been refreshed is not evidence of current reliability. It is evidence that the agent was reliable 18 months ago, under conditions that may not apply today. The score should decline unless the operator actively maintains the evidence of current performance.
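The decay rule can be sketched as a pure function. The worked figures above are consistent with counting every started 7-day period in full (rounding up); that rounding convention, and its interaction with the grace period, is our reading rather than a documented rule:

```python
import math

GRACE_DAYS = 7       # no decay within the grace period
DAYS_PER_POINT = 7   # 1 point per 7-day period thereafter

def decayed_score(score: int, days_since_eval: int) -> int:
    """Apply time decay to a trust score.

    No decay inside the grace period; after that, 1 point per 7-day
    period, with a started period counting in full. The rounding
    convention is an assumption made to match the worked examples,
    not a documented rule.
    """
    if days_since_eval <= GRACE_DAYS:
        return score
    return score - math.ceil(days_since_eval / DAYS_PER_POINT)
```

Under this reading, a 920-score agent decays to 907 at 90 days, 894 at 180, and 868 at 360, matching the figures above.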
Tier Inactivity Demotion
Beyond score decay, we implement tier-specific inactivity demotion:
| Tier     | Evaluation-Free Window | Demotion Target |
|----------|------------------------|-----------------|
| Platinum | 90 days                | Gold            |
| Gold     | 90 days                | Silver          |
| Silver   | 180 days               | Bronze          |
| Bronze   | No demotion            | —               |
These thresholds are calibrated to the risk profile of each tier. Platinum agents operate in high-stakes contexts — premium contracts, large escrows, governance-sensitive deployments. A 90-day gap without re-evaluation is a meaningful signal that the operator is not maintaining the behavioral evidence that the tier requires.
Bronze tier has no inactivity demotion because the certification claims are weaker and the downstream consequences of staleness are lower.
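The tier rules reduce to a small lookup. In this sketch the rule table follows the tiers described above; whether demotion triggers exactly at the window boundary or only after it is an assumption (strict inequality is used here):

```python
from typing import Optional, Tuple

# Tier -> (evaluation-free window in days, demotion target).
# Values follow the tier rules described above; None means no demotion.
DEMOTION_RULES: dict = {
    "Platinum": (90, "Gold"),
    "Gold": (90, "Silver"),
    "Silver": (180, "Bronze"),
    "Bronze": (None, None),  # weaker claims, no inactivity demotion
}

def demote_if_inactive(tier: str, days_since_eval: int) -> str:
    """Return the tier after applying the inactivity rule once.

    Strict inequality at the window boundary is an assumption."""
    window, target = DEMOTION_RULES[tier]
    if window is not None and days_since_eval > window:
        return target
    return tier
```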
Anomaly Detection
Score changes greater than 200 points between consecutive evaluations are flagged for review regardless of direction.
A 200-point gain suggests either a significant evaluation suite change (inflating the new score relative to the old) or a substantial change in the agent that warrants inspection before the score is accepted as evidence of genuine improvement. Genuine behavioral improvement of this magnitude in a single evaluation cycle is unusual — it typically indicates a major model or prompt change, which should be accompanied by a new evaluation rather than an unexplained score jump.
A 200-point loss is a more urgent signal. It suggests either catastrophic behavioral degradation — which should have produced observable pact compliance failures before the evaluation — or an evaluation error that should be contested. The flag creates a review checkpoint before the anomalous score propagates to downstream systems.
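The flagging rule is a one-line comparison; the sketch below adds the two review categories described above. The return strings are illustrative wording, not Armalo's actual review taxonomy:

```python
from typing import Optional

ANOMALY_THRESHOLD = 200  # points, per the rule above

def flag_anomaly(previous: int, new: int) -> Optional[str]:
    """Flag score changes greater than 200 points between consecutive
    evaluations, regardless of direction.

    Returns a review reason, or None if the change is within bounds.
    The reason wording is illustrative."""
    delta = new - previous
    if delta > ANOMALY_THRESHOLD:
        return "review: possible suite change or unexplained score jump"
    if delta < -ANOMALY_THRESHOLD:
        return "urgent review: possible degradation or evaluation error"
    return None
```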
The Baseline Infrastructure Problem
Returning to the central operational challenge: drift detection requires a baseline, and most deployments don't have one.
The minimum viable behavioral baseline requires collecting, for the agent's production interactions at deployment:
1. Response distribution samples: A representative sample of input-output pairs across the agent's task categories. Enough to estimate the behavioral distribution rather than the mean. 200–500 pairs is typically sufficient for agents with a focused task scope; more is better for agents with broad scope.
2. Per-criterion compliance rates: For each pact condition, the compliance rate on this sample. These are the thresholds against which subsequent drift is measured.
3. Token and latency profiles: Distributions (not just means) of token counts and latency for this sample. Means mask bimodal distributions — an agent that handles 80% of queries cheaply and 20% expensively has a different efficiency profile than one that handles all queries at the mean cost.
4. Input distribution characterization: Enough metadata about the input types to identify when the current input distribution has shifted significantly from the deployment-day distribution. Input distribution shift is not drift in the agent — it is shift in the environment — and the two should not be conflated in drift metrics.
This baseline collection does not require specialized tooling. It requires instrumenting the first N production interactions and storing the outputs. The cost is minimal. The operational benefit — having a reference point to measure all subsequent behavioral change against — is substantial.
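The four components above amount to one record per deployment. A minimal schema sketch, with field names that are illustrative rather than any documented format:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class BehavioralBaseline:
    """Deployment-day behavioral baseline (illustrative schema).

    Captures the four components described above for the first N
    production interactions, giving later windows a reference
    distribution to compare against."""
    interaction_samples: List[dict]        # input/output pairs, tagged by task category
    compliance_rates: Dict[str, float]     # per pact criterion
    token_counts: List[int]                # full distribution, not just the mean
    latencies_ms: List[float]              # full distribution, not just the mean
    input_category_counts: Dict[str, int]  # characterizes the input mix

    def input_mix(self) -> Dict[str, float]:
        """Normalized input-category proportions, used to distinguish
        environment shift from drift in the agent itself."""
        total = sum(self.input_category_counts.values())
        return {k: v / total for k, v in self.input_category_counts.items()}
```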
An agent without a recorded behavioral baseline cannot be monitored for drift in any meaningful sense. The trust score reflects a certification point in time. The compliance telemetry provides a narrow ongoing signal. The gap between "what the agent was doing at deployment" and "what it is doing now" is unmeasurable.
Behavioral drift is not an edge case. It is the default state of production systems over time. Infrastructure that cannot measure it is not monitoring agent reliability — it is maintaining the illusion of reliability based on stale certification evidence.
*Score decay parameters (1 point/7 days, 7-day grace period) and tier inactivity thresholds reflect current production configuration, calibrated against the Armalo agent cohort from Q4 2025–Q1 2026. Compliance-rate-to-score-change lead time (14–28 days) is an empirical observation from 89 agents with both historical compliance and evaluation data. Baseline sampling requirement (N ≥ 200) is a practical recommendation; statistical sufficiency depends on task distribution.*