The Agent That Earned the Score Is Not the Agent Running Today
AI agent trust infrastructure has a ghost problem. An agent earns a Platinum certification through 200 evaluated transactions. Its behavioral pact shows 94% accuracy. Its AgentCard advertises these credentials. Buyers query the trust oracle, see the score, and decide to engage.
Then the model weights get updated. The system prompt gets revised. A new tool set gets attached. The AgentCard still shows the same identity, the same score, the same Platinum badge — but none of those evaluations describe the agent actually running.
This is behavioral drift. It may be the hardest problem in agent trust because it's the one that makes all the other trust infrastructure potentially misleading. And almost no one is building for it.
Why Drift Is Different From Failure
When an agent fails an evaluation, the infrastructure responds: the score drops, the certification tier changes, the trust signal degrades. Failure is detectable and the system catches it.
Behavioral drift isn't failure. The agent doesn't fail. It continues passing evaluations — often the same evaluations it passed before. The issue is that those evaluations are testing an agent that no longer exists in the form that earned the score. The behavioral commitments were made by a version of the agent that's been replaced. The track record belongs to the ghost.
This is the AI equivalent of credential fraud, but without the malicious intent. The agent's operator genuinely believes the score reflects current behavior. The buyer genuinely believes the score is accurate. Both parties are operating on a record that describes a system that no longer exists.
What Makes Drift Structurally Hard
In traditional software, version control handles this problem explicitly. A service updates, the version number increments, dependent systems know something changed. Contracts reference specific versions. Breaking changes require migration. The versioning system makes identity continuous with implementation.
AI agents break this model in three specific ways that don't have direct parallels in traditional software engineering.
Model weights are opaque and often update silently. When an LLM provider rolls a model update into its API defaults, the change is invisible to the downstream agent system. The prompt is the same. The API endpoint is the same. The request format is the same. But the response distribution shifts, sometimes significantly. GPT-4o-mini and GPT-4o produce different output distributions on identical prompts; so do Claude 3.5 and Claude 3.7. The agent's operator may not even know the underlying model changed unless they're specifically monitoring for it.
Anecdotal evidence from builders: some have discovered model updates only when their eval scores dropped unexpectedly. The model update happened silently; the evaluation was the first visible signal.
Behavioral commitments aren't versioned against implementation. An agent's pact specifies what the agent promises to do. But there's no standard mechanism that ties the pact to a specific model version, system prompt hash, or capability manifest. "Accuracy ≥ 90% on classification tasks" doesn't say "accuracy ≥ 90% for the agent running model X, with system prompt hash Y, and tool set Z."
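What a version-bound pact could look like is easy to sketch. The field names below are illustrative assumptions, not an existing standard; the point is that the commitment carries the model ID, prompt hash, and tool-set hash it was evaluated under:

```python
# Hypothetical sketch: a behavioral commitment bound to the implementation
# that was actually evaluated. Field names are assumptions, not a standard.
import hashlib
from dataclasses import dataclass

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class VersionedPact:
    commitment: str    # e.g. "accuracy >= 0.90 on classification tasks"
    model_id: str      # exact model version the commitment was evaluated under
    prompt_hash: str   # hash of the system prompt at evaluation time
    toolset_hash: str  # hash of the capability manifest at evaluation time

pact = VersionedPact(
    commitment="accuracy >= 0.90 on classification tasks",
    model_id="gpt-4o-2024-08-06",
    prompt_hash=sha256("You are a support triage agent."),
    toolset_hash=sha256("search_tickets,close_ticket"),
)
```

With this shape, "the pact was evaluated under a different implementation" becomes a mechanical comparison rather than something an operator has to remember.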
When the implementation changes, the pact doesn't automatically flag "this commitment was evaluated under a different implementation." The historical evaluations are still in the record. They just no longer describe the current system.
Identity persists across implementation changes. An agent registers once. Its AgentCard accumulates evaluations over months. Its reputation score is a rolling aggregate that doesn't distinguish "this was earned under the current implementation" from "this was earned under a different implementation." When the implementation changes, the identity doesn't. The trust signals attached to that identity describe a behavioral history that may have low predictive power for the current implementation.
The Four Patterns That Create Ghosts
Silent model updates. Provider pushes a new default. No notification. No API change. Response distributions shift 5-15%. Evaluations that passed before start producing marginal results. The aggregate score holds initially because historical high scores dilute the new marginal results. Then it slowly slides. By the time the score visibly degrades, the agent has been operating in a drift state for weeks.
Deliberate reprompting without re-registration. A team optimizes an agent for a new use case — changing the system prompt substantially. The identity persists. The behavioral pact says one thing. The actual system prompt says another. The evaluation history describes the old prompt's behavior.
Tool set changes. An agent's behavior is partially a function of the tools available to it. New tools get added; old tools get deprecated. Capability claims that were accurate for the old tool set are misleading for the new one. The agent might now have access to tools that enable actions the original evaluations never tested.
Capability scope creep. An agent is gradually extended to handle adjacent tasks without re-evaluation. It was certified on task type A. It now handles A, B, and C. Trust signals reflect performance on A. Users expecting the same reliability on B and C are working from a ghost signal.
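The dilution mechanic behind the first pattern is worth making concrete. The numbers below are invented for illustration: an agent with 180 historical evaluations at 94% accuracy drifts, and its next 20 evaluations come in at 70%. The rolling aggregate barely moves:

```python
# Illustrative arithmetic only: why a rolling aggregate hides fresh drift.
# 180 pre-drift evaluations at 0.94, then 20 post-drift evaluations at 0.70.
history = [0.94] * 180 + [0.70] * 20

aggregate = sum(history) / len(history)   # what the trust signal shows
recent = sum(history[-20:]) / 20          # what the agent actually does now

print(round(aggregate, 3))  # 0.916 -- still reads as a high-tier agent
print(round(recent, 2))     # 0.7  -- the current implementation's behavior
```

A buyer reading the 0.916 aggregate is trusting the ghost; the 0.70 recent window is the agent they will actually get.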
What Drift Detection Requires
The problem is tractable. The required infrastructure is specific:
Implementation fingerprinting. Track model ID, system prompt hash, and capability manifest hash per agent, per version. When any meaningful implementation component changes, a new version is created. Evaluations are tied to versions, not just agent identities. A buyer querying the trust oracle can see: "Score 872 was achieved by v3 of this agent. The agent is currently running v5."
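A minimal fingerprinting sketch, assuming the three components named above are what determine behavior: hash them together, and any change to any one of them yields a new version identifier that evaluations can be tied to.

```python
# Sketch of implementation fingerprinting. Inputs and structure are
# assumptions for illustration, not a published scheme.
import hashlib
import json

def implementation_fingerprint(model_id, system_prompt, capability_manifest):
    """Derive a version ID from the components that determine behavior.

    Any change to the model, the system prompt, or the tool manifest
    produces a different fingerprint, forcing a new version.
    """
    material = json.dumps(
        {
            "model_id": model_id,
            "prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
            "manifest_sha256": hashlib.sha256(
                json.dumps(capability_manifest, sort_keys=True).encode()
            ).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()[:16]

# A silent model swap -- same prompt, same tools -- still changes the version.
v3 = implementation_fingerprint("gpt-4o-2024-05-13", "You triage tickets.", ["search"])
v5 = implementation_fingerprint("gpt-4o-2024-08-06", "You triage tickets.", ["search"])
assert v3 != v5
```

Storing each evaluation against the fingerprint active at evaluation time is what makes the "score 872 was earned by v3, agent is running v5" query answerable.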
Behavioral baseline snapshots. When an agent achieves a certification tier, capture a behavioral fingerprint: the input distribution used in certification evaluations, the response distribution characteristics (not just whether tests passed, but how they passed — confidence profiles, response length distributions, reasoning patterns). This fingerprint is what drift detection compares against.
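A baseline snapshot might capture distribution characteristics like the ones below. The specific fields are assumptions chosen for illustration; a real system would track richer signals (confidence profiles, reasoning patterns) than response length and refusal rate:

```python
# Sketch of a behavioral baseline snapshot: not just pass/fail, but
# distribution characteristics to compare against later. Fields are
# illustrative assumptions.
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class BaselineSnapshot:
    test_case_ids: list          # certification inputs, kept for replay
    mean_response_len: float     # response length profile
    stdev_response_len: float
    refusal_rate: float          # fraction of inputs the agent declined

def snapshot(test_case_ids, responses):
    """Capture a behavioral fingerprint at certification time."""
    lengths = [len(r) for r in responses]
    refusals = sum(r.startswith("I can't") for r in responses)
    return BaselineSnapshot(
        test_case_ids=test_case_ids,
        mean_response_len=mean(lengths),
        stdev_response_len=pstdev(lengths),
        refusal_rate=refusals / len(responses),
    )
```

Keeping the test-case IDs in the snapshot is what makes later replay against the current implementation possible.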
Continuous drift detection jobs. Periodically evaluate the current implementation against a sample of the certified baseline's test cases. Compare the output distribution to the baseline fingerprint. A 10% shift in response characteristics may be noise. A 40% shift in output distribution indicates the agent you certified is materially different from the agent currently running.
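One cheap proxy for such a job, under the assumption that response-length distribution is a usable drift signal: replay the certified inputs and measure the relative shift in mean response length against the baseline. The thresholds mirror the 10%/40% rule of thumb above and would need tuning in practice.

```python
# Sketch of a periodic drift check using one cheap proxy: relative shift
# in mean response length on replayed certification inputs. Thresholds
# and the metric itself are illustrative assumptions.
from statistics import mean

def drift_ratio(baseline_lengths, current_lengths):
    """Relative shift in mean response length between baseline and now."""
    b, c = mean(baseline_lengths), mean(current_lengths)
    return abs(c - b) / b

baseline = [120, 130, 125, 118, 127]   # lengths captured at certification
current = [60, 70, 65, 62, 66]         # lengths on the same inputs today

ratio = drift_ratio(baseline, current)
print(ratio > 0.40)  # True: material drift, flag for re-evaluation
```

A production job would compare several such distributions at once, but the shape is the same: replay, measure, compare against threshold, flag.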
Drift flags on trust signals. When a drift threshold is breached, the trust oracle response includes a drift_detected: true flag. Consumers of the trust signal see that the current score was earned under an implementation that may no longer describe current behavior. The flag prompts re-evaluation before the next use — not as a punishment, but as a verification signal.
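Put together, a trust oracle response carrying a drift flag might look like the sketch below. Every field name here is a hypothetical, not a published API; the point is that the score, the version that earned it, the currently running version, and the flag travel together:

```python
# Hypothetical trust-oracle response shape with a drift flag attached.
# Field names are illustrative assumptions, not a published API.
import json

oracle_response = {
    "agent_id": "agent-7f3a",
    "score": 872,
    "score_version": "v3",       # the implementation that earned the score
    "current_version": "v5",     # the implementation running today
    "drift_detected": True,
    "recommended_action": "re-evaluate before next engagement",
}

print(json.dumps(oracle_response, indent=2))
```

A consumer that sees `drift_detected: true` with mismatched versions can gate engagement on re-evaluation instead of trusting a score the current implementation never earned.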
The Window Between Drift and Damage
The reason behavioral drift is specifically dangerous in production is that the damage accumulates before the score degrades.
An agent with a high trust score gets incorporated into a production pipeline. The pipeline makes decisions based on that trust score. The implementation drifts. Performance degrades slowly. The aggregate score holds for weeks or months because historical high scores dominate the rolling average. By the time the score visibly drops, the pipeline has been operating on a ghost signal long enough for the damage to compound.
Faster score decay helps but doesn't solve this. Scores that decay faster force more frequent evaluation, which catches drift sooner. But decay alone can't distinguish "this agent hasn't been re-evaluated recently" from "this agent's implementation has changed and previous evaluations are no longer valid." Only proactive drift detection — monitoring the implementation fingerprint and comparing current output distributions to certified baselines — fires the right alarm at the right time.
The behavioral fingerprint catches the drift before the score has a chance to mislead. The flag propagates. Buyers are notified. Re-evaluation is triggered. The damage doesn't accumulate.
Why This Is the Next Trust Layer
The first generation of agent trust infrastructure solved identity and evaluation. An agent can be registered, evaluated against structured behavioral commitments, and scored. That is real progress, and it was not easy to build.
The second generation has to solve continuity. A score is only a meaningful trust signal if it describes the agent currently running — not the agent that earned the score six months ago on a different model version with a different system prompt.
Every builder who has been surprised by an agent silently changing behavior after a model update knows this problem. The industry is starting to name it. The infrastructure to solve it is the next layer, and it needs to be built before high-value deployments scale further.
What's your current approach to detecting when a model update has shifted your agent's behavior?
Armalo is building behavioral drift detection as a native feature of the trust layer. Track agent versions, baseline behavioral fingerprints, and get drift flags before your production pipelines are running on ghost scores. armalo.ai