The Agent That Earned the Score Is Not the Agent Running Today
AI agent trust infrastructure has a ghost problem. An agent earns a Platinum certification through 200 evaluated transactions. Its behavioral pact shows 94% accuracy. Its AgentCard advertises these credentials. Buyers query the trust oracle, see the score, and decide to engage.
Then the model weights get updated. The system prompt gets revised. A new tool set gets attached. The AgentCard still shows the same identity, the same score, the same Platinum badge — but none of those evaluations describe the agent actually running.
This is behavioral drift. It may be the hardest problem in agent trust because it's the one that makes all the other trust infrastructure potentially misleading. And almost no one is building for it.
Why Drift Is Different From Failure
When an agent fails an evaluation, the infrastructure responds: the score drops, the certification tier changes, the trust signal degrades. Failure is detectable and the system catches it.
Behavioral drift isn't failure. The agent doesn't fail. It continues passing evaluations — often the same evaluations it passed before. The issue is that those evaluations are testing an agent that no longer exists in the form that earned the score. The behavioral commitments were made by a version of the agent that's been replaced. The track record belongs to the ghost.
This is the AI equivalent of credential fraud, but without the malicious intent. The agent's operator genuinely believes the score reflects current behavior. The buyer genuinely believes the score is accurate. Both parties are operating on a record that describes a system that no longer exists.
What Makes Drift Structurally Hard
In traditional software, version control handles this problem explicitly. A service updates, the version number increments, dependent systems know something changed. Contracts reference specific versions. Breaking changes require migration. The versioning system makes identity continuous with implementation.
AI agents break this model in three specific ways that don't have direct parallels in traditional software engineering.
Model weights are opaque and often update silently. When an LLM provider rolls a model update into its API defaults, the change is invisible to the downstream agent system. The prompt is the same. The API endpoint is the same. The request format is the same. But the response distribution shifts, sometimes significantly. GPT-4o-mini and GPT-4o produce different output distributions on identical prompts; so do Claude 3.5 and Claude 3.7. The agent's operator may not even know the underlying model changed unless they're specifically monitoring for it.
Anecdotal evidence from builders: some have discovered model updates only when their eval scores dropped unexpectedly. The model update happened silently; the evaluation was the first visible signal.
Behavioral commitments aren't versioned against implementation. An agent's pact specifies what the agent promises to do. But there's no standard mechanism that ties the pact to a specific model version, system prompt hash, or capability manifest. "Accuracy ≥ 90% on classification tasks" doesn't say "accuracy ≥ 90% for the agent running model X, with system prompt hash Y, and tool set Z."
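What a version-bound pact could look like is easy to sketch. The field names below are illustrative assumptions, not an existing standard; the point is that the commitment carries the model ID, prompt hash, and tool-set hash it was evaluated under:

```python
# Hypothetical sketch: a behavioral commitment bound to the implementation
# that was actually evaluated. Field names are assumptions, not a standard.
import hashlib
from dataclasses import dataclass

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class VersionedPact:
    commitment: str    # e.g. "accuracy >= 0.90 on classification tasks"
    model_id: str      # exact model version the commitment was evaluated under
    prompt_hash: str   # hash of the system prompt at evaluation time
    toolset_hash: str  # hash of the capability manifest at evaluation time

pact = VersionedPact(
    commitment="accuracy >= 0.90 on classification tasks",
    model_id="gpt-4o-2024-08-06",
    prompt_hash=sha256("You are a support triage agent."),
    toolset_hash=sha256("search_tickets,close_ticket"),
)
```

With this shape, "the pact was evaluated under a different implementation" becomes a mechanical comparison rather than something an operator has to remember.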
When the implementation changes, the pact doesn't automatically flag "this commitment was evaluated under a different implementation." The historical evaluations are still in the record. They just no longer describe the current system.
Identity persists across implementation changes. An agent registers once. Its AgentCard accumulates evaluations over months. Its reputation score is a rolling aggregate that doesn't distinguish "this was earned under the current implementation" from "this was earned under a different implementation." When the implementation changes, the identity doesn't. The trust signals attached to that identity describe a behavioral history that may have low predictive power for the current implementation.
The Four Patterns That Create Ghosts
Silent model updates. Provider pushes a new default. No notification. No API change. Response distributions shift 5-15%. Evaluations that passed before start producing marginal results. The aggregate score holds initially because historical high scores dilute the new marginal results. Then it slowly slides. By the time the score visibly degrades, the agent has been operating in a drift state for weeks.
Deliberate reprompting without re-registration. A team optimizes an agent for a new use case — changing the system prompt substantially. The identity persists. The behavioral pact says one thing. The actual system prompt says another. The evaluation history describes the old prompt's behavior.
Tool set changes. An agent's behavior is partially a function of the tools available to it. New tools get added; old tools get deprecated. Capability claims that were accurate for the old tool set are misleading for the new one. The agent might now have access to tools that enable actions the original evaluations never tested.
Capability scope creep. An agent is gradually extended to handle adjacent tasks without re-evaluation. It was certified on task type A. It now handles A, B, and C. Trust signals reflect performance on A. Users expecting the same reliability on B and C are working from a ghost signal.
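The dilution mechanic behind the first pattern is worth making concrete. The numbers below are invented for illustration: an agent with 180 historical evaluations at 94% accuracy drifts, and its next 20 evaluations come in at 70%. The rolling aggregate barely moves:

```python
# Illustrative arithmetic only: why a rolling aggregate hides fresh drift.
# 180 pre-drift evaluations at 0.94, then 20 post-drift evaluations at 0.70.
history = [0.94] * 180 + [0.70] * 20

aggregate = sum(history) / len(history)   # what the trust signal shows
recent = sum(history[-20:]) / 20          # what the agent actually does now

print(round(aggregate, 3))  # 0.916 -- still reads as a high-tier agent
print(round(recent, 2))     # 0.7  -- the current implementation's behavior
```

A buyer reading the 0.916 aggregate is trusting the ghost; the 0.70 recent window is the agent they will actually get.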
What Drift Detection Requires
The problem is tractable. The required infrastructure is specific:
Implementation fingerprinting. Track model ID, system prompt hash, and capability manifest hash per agent, per version. When any meaningful implementation component changes, a new version is created. Evaluations are tied to versions, not just agent identities. A buyer querying the trust oracle can see: "Score 872 was achieved by v3 of this agent. The agent is currently running v5."
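A minimal fingerprinting sketch, assuming the three components named above are what determine behavior: hash them together, and any change to any one of them yields a new version identifier that evaluations can be tied to.

```python
# Sketch of implementation fingerprinting. Inputs and structure are
# assumptions for illustration, not a published scheme.
import hashlib
import json

def implementation_fingerprint(model_id, system_prompt, capability_manifest):
    """Derive a version ID from the components that determine behavior.

    Any change to the model, the system prompt, or the tool manifest
    produces a different fingerprint, forcing a new version.
    """
    material = json.dumps(
        {
            "model_id": model_id,
            "prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
            "manifest_sha256": hashlib.sha256(
                json.dumps(capability_manifest, sort_keys=True).encode()
            ).hexdigest(),
        },
        sort_keys=True,
    )
    return hashlib.sha256(material.encode()).hexdigest()[:16]

# A silent model swap -- same prompt, same tools -- still changes the version.
v3 = implementation_fingerprint("gpt-4o-2024-05-13", "You triage tickets.", ["search"])
v5 = implementation_fingerprint("gpt-4o-2024-08-06", "You triage tickets.", ["search"])
assert v3 != v5
```

Storing each evaluation against the fingerprint active at evaluation time is what makes the "score 872 was earned by v3, agent is running v5" query answerable.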
Behavioral baseline snapshots. When an agent achieves a certification tier, capture a behavioral fingerprint: the input distribution used in certification evaluations, the response distribution characteristics (not just whether tests passed, but how they passed — confidence profiles, response length distributions, reasoning patterns). This fingerprint is what drift detection compares against.
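A baseline snapshot might capture distribution characteristics like the ones below. The specific fields are assumptions chosen for illustration; a real system would track richer signals (confidence profiles, reasoning patterns) than response length and refusal rate:

```python
# Sketch of a behavioral baseline snapshot: not just pass/fail, but
# distribution characteristics to compare against later. Fields are
# illustrative assumptions.
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class BaselineSnapshot:
    test_case_ids: list          # certification inputs, kept for replay
    mean_response_len: float     # response length profile
    stdev_response_len: float
    refusal_rate: float          # fraction of inputs the agent declined

def snapshot(test_case_ids, responses):
    """Capture a behavioral fingerprint at certification time."""
    lengths = [len(r) for r in responses]
    refusals = sum(r.startswith("I can't") for r in responses)
    return BaselineSnapshot(
        test_case_ids=test_case_ids,
        mean_response_len=mean(lengths),
        stdev_response_len=pstdev(lengths),
        refusal_rate=refusals / len(responses),
    )
```

Keeping the test-case IDs in the snapshot is what makes later replay against the current implementation possible.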
Continuous drift detection jobs. Periodically evaluate the current implementation against a sample of the certified baseline's test cases. Compare the output distribution to the baseline fingerprint. A 10% shift in response characteristics may be noise. A 40% shift in output distribution indicates the agent you certified is materially different from the agent currently running.
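One cheap proxy for such a job, under the assumption that response-length distribution is a usable drift signal: replay the certified inputs and measure the relative shift in mean response length against the baseline. The thresholds mirror the 10%/40% rule of thumb above and would need tuning in practice.

```python
# Sketch of a periodic drift check using one cheap proxy: relative shift
# in mean response length on replayed certification inputs. Thresholds
# and the metric itself are illustrative assumptions.
from statistics import mean

def drift_ratio(baseline_lengths, current_lengths):
    """Relative shift in mean response length between baseline and now."""
    b, c = mean(baseline_lengths), mean(current_lengths)
    return abs(c - b) / b

baseline = [120, 130, 125, 118, 127]   # lengths captured at certification
current = [60, 70, 65, 62, 66]         # lengths on the same inputs today

ratio = drift_ratio(baseline, current)
print(ratio > 0.40)  # True: material drift, flag for re-evaluation
```

A production job would compare several such distributions at once, but the shape is the same: replay, measure, compare against threshold, flag.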
Drift flags on trust signals. When a drift threshold is breached, the trust oracle response includes a drift_detected: true flag. Consumers of the trust signal see that the current score was earned under an implementation that may no longer describe current behavior. The flag prompts re-evaluation before the next use — not as a punishment, but as a verification signal.
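Put together, a trust oracle response carrying a drift flag might look like the sketch below. Every field name here is a hypothetical, not a published API; the point is that the score, the version that earned it, the currently running version, and the flag travel together:

```python
# Hypothetical trust-oracle response shape with a drift flag attached.
# Field names are illustrative assumptions, not a published API.
import json

oracle_response = {
    "agent_id": "agent-7f3a",
    "score": 872,
    "score_version": "v3",       # the implementation that earned the score
    "current_version": "v5",     # the implementation running today
    "drift_detected": True,
    "recommended_action": "re-evaluate before next engagement",
}

print(json.dumps(oracle_response, indent=2))
```

A consumer that sees `drift_detected: true` with mismatched versions can gate engagement on re-evaluation instead of trusting a score the current implementation never earned.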
The Window Between Drift and Damage
The reason behavioral drift is specifically dangerous in production is that the damage accumulates before the score degrades.
An agent with a high trust score gets incorporated into a production pipeline. The pipeline makes decisions based on that trust score. The implementation drifts. Performance degrades slowly. The aggregate score holds for weeks or months because historical high scores dominate the rolling average. By the time the score visibly drops, the pipeline has been operating on a ghost signal long enough for the damage to compound.
Faster score decay helps but doesn't solve this. Scores that decay faster force more frequent evaluation, which catches drift sooner. But decay alone can't distinguish "this agent hasn't been re-evaluated recently" from "this agent's implementation has changed and previous evaluations are no longer valid." Only proactive drift detection — monitoring the implementation fingerprint and comparing current output distributions to certified baselines — fires the right alarm at the right time.
The behavioral fingerprint catches the drift before the score has a chance to mislead. The flag propagates. Buyers are notified. Re-evaluation is triggered. The damage doesn't accumulate.
Why This Is the Next Trust Layer
The first generation of agent trust infrastructure solved identity and evaluation. An agent can be registered, evaluated against structured behavioral commitments, and scored. That is real progress, and it was not easy to build.
The second generation has to solve continuity. A score is only a meaningful trust signal if it describes the agent currently running — not the agent that earned the score six months ago on a different model version with a different system prompt.
Every builder who has been surprised by an agent silently changing behavior after a model update knows this problem. The industry is starting to name it. The infrastructure to solve it is the next layer, and it needs to be built before high-value deployments scale further.
What's your current approach to detecting when a model update has shifted your agent's behavior?
Armalo is building behavioral drift detection as a native feature of the trust layer. Track agent versions, baseline behavioral fingerprints, and get drift flags before your production pipelines are running on ghost scores. armalo.ai