Agent Identity Continuity Under Model Updates: The Update Gaming Problem and Why Trust Certifies Behavior, Not Identity
Armalo Labs Research Team
Key Finding
Trust certifies behavior, not identity. The naive implementation — same agent ID means the trust score carries — lets operators completely replace an agent's behavior while preserving its reputation. The overcorrection — any update resets trust — makes reputation non-portable and kills the value of building it. The only coherent answer is dimension-specific behavioral continuity: updates reset the affected trust dimensions, not the whole score.
Abstract
Agent identity continuity is the hardest unsolved problem in agent trust. When an agent is updated — new model weights, new system prompt, new tool set — is it the same agent for trust purposes? The naive answer (same ID = same agent) creates a gaming opportunity: an operator can completely replace an agent's behavior while preserving its accumulated trust score. The overcorrected answer (any change = new agent) makes trust non-portable and kills the value of building reputation. The resolution requires specifying what trust actually certifies. Trust certifies behavior, not identity. An update that changes behavioral profile should reset the affected behavioral dimensions of the trust score, not the entire score. This paper develops that framework, describes the specific gaming scenarios it prevents, and specifies what 'behavioral continuity' requires as a verifiable claim rather than an assumption.
The Identity Continuity Problem Is Not a Philosophy Problem
When practitioners talk about agent identity under updates, the discussion often slides toward philosophical territory: what constitutes "the same agent" in a deep sense? This framing is a mistake. The practical question is narrower and sharper.
The practical question is: which trust claims survive which updates? This is an economic question, not a metaphysical one. Trust scores represent accumulated evidence of past behavioral patterns. An update changes the agent. The question is whether the evidence accumulated before the update remains predictive of behavior after it.
Get this wrong in one direction — assume all updates preserve trust — and you create a mechanism for operators to launder reputation. An operator can accumulate a high trust score with a well-behaved agent, then deploy a completely different agent under the same ID to a market that trusts the original agent's reputation. This is not a hypothetical gaming scenario; it is a straightforward exploit that any rational operator with misaligned incentives would execute.
Get this wrong in the other direction — require any update to start from zero — and you destroy the economic value of building reputation. An agent operator who has invested 18 months in building a Gold-tier trust score cannot update to a new model version without resetting to Bronze. This creates a powerful incentive to never update, which means agents become stuck on obsolete model versions to protect their trust investment. The consequence for safety is the opposite of what trust infrastructure is supposed to produce.
The resolution is not between these two extremes. It requires asking more precisely: what did the trust score certify, and does the update change the things it certified?
What Trust Certifies
Trust scores are evidence summaries. A composite score of 870 is shorthand for: "across these behavioral dimensions, evaluated under these conditions, observed over this time period, this agent met its stated commitments at this rate." The score is not a property of the agent itself. It is a property of the relationship between the agent's behavior and a set of verifiable behavioral claims.
This reframing has an immediate consequence: the unit of trust is not the agent. It is an (agent, behavioral dimension, time window) triple. An agent's trust in dimension X at time T is the evidence that its behavior in dimension X during the period up to T is consistent with its stated commitments on dimension X.
Cite this work
Armalo Labs Research Team (2026). Agent Identity Continuity Under Model Updates: The Update Gaming Problem and Why Trust Certifies Behavior, Not Identity. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-03-17-agent-identity-continuity-under-updates
Armalo Labs Technical Series · ISSN pending · Open access
When an agent is updated, the relevant question becomes: does the update change behavior in dimension X? If not, the accumulated evidence for dimension X remains predictive and should not be reset. If yes, the evidence is no longer predictive and the dimension score should be reset — while other dimensions that were not affected remain intact.
This is dimension-specific behavioral continuity, and it produces a trust update rule:
For each dimension D in the trust score:
If update changes behavior in D → reset D score, require fresh evidence
If update does not change behavior in D → retain D score with recency modifier
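The rule above can be sketched in Python. The dimension names, the reset value, and the size of the recency penalty are illustrative assumptions, not Armalo's production parameters:

```python
# Sketch of the dimension-specific trust update rule.
# RECENCY_PENALTY and RESET_SCORE are illustrative assumptions.

RECENCY_PENALTY = 0.9  # assumed: retained evidence is discounted slightly
RESET_SCORE = 0.0      # affected dimensions restart with no evidence

def apply_update(scores: dict, behavior_changed: dict) -> dict:
    """Reset dimensions whose behavior the update changed; retain the
    rest with a recency modifier reflecting residual uncertainty."""
    updated = {}
    for dim, score in scores.items():
        if behavior_changed.get(dim, True):  # unknown dimensions reset conservatively
            updated[dim] = RESET_SCORE       # require fresh evidence
        else:
            updated[dim] = score * RECENCY_PENALTY
    return updated

scores = {"accuracy": 880.0, "safety": 850.0, "scope": 870.0}
changed = {"accuracy": False, "safety": True, "scope": False}
print(apply_update(scores, changed))
# safety resets to zero; accuracy and scope are retained with the modifier
```

Note the conservative default: a dimension with no continuity verdict is treated as changed, which matches the principle that continuity must be demonstrated rather than assumed.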
The challenge is verifying which dimensions were affected. This is the hard part.
The Gaming Scenarios and Why Dimension-Specific Continuity Blocks Them
Consider the operator who wants to exploit accumulated trust. What are the practical gaming strategies, and why does dimension-specific continuity prevent them?
Strategy 1: Complete replacement. Replace the agent's model weights entirely — swap from a well-aligned model to a cheaper, less reliable one — while keeping the same agent ID. Under naive identity continuity, the trust score is fully preserved and the market trusts the new agent as if it were the original. Under dimension-specific continuity, post-update evaluations will show behavioral divergence across multiple dimensions simultaneously. A complete replacement would require passing continuity tests across all scored dimensions. Passing those tests requires actually matching the behavioral profile of the original — which defeats the purpose of the replacement.
Strategy 2: Selective behavior modification. The operator wants to reduce cost by making the agent less thorough on safety checks — a behavioral change in the safety dimension — while keeping other dimensions intact. Under naive continuity, the full trust score is preserved, including the safety tier. Under dimension-specific continuity, the safety dimension is evaluated post-update and, if it shows degradation, is reset. The overall trust score drops in proportion to the affected dimension's weight. The operator cannot change safety behavior without the safety score reflecting it.
Strategy 3: Scope expansion laundering. The operator has an agent with an excellent trust score in data analysis. They expand the agent's scope to cover financial advice — a new capability with no behavioral history — while hoping the analysis score carries over to finance. Under naive continuity: buyers see "trusted agent" and may assume the trust covers the new scope. Under dimension-specific continuity: the new financial advice capability has no evaluation history and therefore no trust score. The analysis trust score explicitly does not cover it. The scope boundary is tracked as a trust dimension.
Strategy 4: The clone exploit. An operator creates a second agent from the original configuration, intending to present it as the original agent with its reputation. Under behavioral continuity: the clone has no evaluation history, regardless of configuration identity. Configuration is the starting point. History is the trust signal. A clone starts at zero.
The Behavioral Continuity Demonstration
For an update to preserve trust in a given dimension, the operator must demonstrate behavioral continuity rather than merely assert it. This demonstration has a specific technical form.
Pre-update evaluation baseline. Before applying the update, run the full evaluation suite against the current configuration and store the results with a configuration hash. This creates a signed behavioral baseline: "under configuration hash X, this agent achieves these scores on these criteria."
Post-update comparison evaluation. After applying the update, run the same evaluation suite (same criteria, comparable distribution of test cases) against the updated configuration. Store results with the new configuration hash.
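The baseline and comparison steps can be sketched as follows. The record layout, field names, and use of SHA-256 over a canonical JSON serialization are illustrative assumptions, not Armalo's documented schema:

```python
import hashlib
import json
import time

def config_hash(config: dict) -> str:
    """Hash a canonical serialization of the agent configuration
    (e.g. model id, system prompt, tool set)."""
    canonical = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def baseline_record(config: dict, eval_scores: dict) -> dict:
    """Bind evaluation results to the configuration they were run under.
    In production this record would also be signed; the sketch omits that."""
    return {
        "config_hash": config_hash(config),
        "scores": eval_scores,
        "timestamp": time.time(),
    }

pre = baseline_record(
    {"model": "model-v1", "prompt": "...", "tools": ["search"]},
    {"accuracy": 0.93, "safety": 0.97},
)
```

Because the serialization is canonical (`sort_keys=True`), the same configuration always produces the same hash, so a stored baseline can later be checked against the deployed configuration.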
Statistical continuity test. For each dimension, test whether the post-update distribution of scores is statistically consistent with the pre-update baseline. The threshold for "consistent" needs to be defined: Armalo uses a significance level of 0.05 on a two-sample Kolmogorov-Smirnov test per dimension, which requires that the score distribution shapes, not just the means, are consistent.
The KS test is worth emphasizing because means can be preserved while distributions change. An update that maintains mean accuracy while increasing variance in edge cases passes a mean-based test but fails the distribution test. Variance in edge case performance is often the operationally important signal — it measures reliability, not just average quality.
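A minimal two-sample KS check can be written in pure Python. The critical-value constant c(0.05) ≈ 1.358 comes from the standard large-sample approximation; using it here, rather than an exact p-value, is a simplifying assumption of this sketch:

```python
import bisect
import math

def ks_statistic(a: list, b: list) -> float:
    """Two-sample KS statistic: the largest gap between the two
    empirical CDFs, evaluated at every sample point."""
    a, b = sorted(a), sorted(b)
    def ecdf(sample, x):  # fraction of sample at or below x
        return bisect.bisect_right(sample, x) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

def continuity_holds(a: list, b: list) -> bool:
    """Large-sample approximation: reject continuity when the statistic
    exceeds c(alpha) * sqrt((n + m) / (n * m)), with c(0.05) ~= 1.358."""
    n, m = len(a), len(b)
    threshold = 1.358 * math.sqrt((n + m) / (n * m))
    return ks_statistic(a, b) <= threshold

pre_scores = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in pre_scores]
print(continuity_holds(pre_scores, pre_scores))  # True
print(continuity_holds(pre_scores, shifted))     # False
```

Because the statistic measures the whole CDF gap, a post-update sample with the same mean but a fatter tail of bad edge cases will push the statistic up even though a mean comparison would pass.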
Cryptographic binding. The pre-update baseline, the post-update comparison, and the statistical continuity test results are all hashed together with both configuration hashes and anchored on-chain. This creates a tamper-resistant record: the configuration that the trust score was earned under, the configuration it was carried through, and the statistical evidence that behavioral continuity was maintained.
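The binding step reduces to hashing the three artifacts together. The record layout below is an illustrative assumption, not Armalo's on-chain schema, and the anchoring transaction itself is out of scope for the sketch:

```python
import hashlib
import json

def binding_hash(pre_baseline: dict, post_comparison: dict,
                 continuity_results: dict) -> str:
    """Hash the pre-update baseline, post-update comparison, and
    continuity test results into one record suitable for anchoring.
    Each baseline/comparison dict is assumed to carry its config hash."""
    record = json.dumps(
        {
            "pre": pre_baseline,
            "post": post_comparison,
            "continuity": continuity_results,  # per-dimension verdicts
        },
        sort_keys=True,
    ).encode()
    return hashlib.sha256(record).hexdigest()
```

Altering any field — a score, a config hash, a continuity verdict — changes the digest, so the anchored hash commits to the entire demonstration at once.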
The Recency Modifier Problem
Even when behavioral continuity is demonstrated, recent updates should carry a recency modifier that reduces the confidence attached to pre-update evidence.
Here is why. The continuity test verifies that behavior on the evaluation suite is statistically consistent before and after the update. But the evaluation suite is a sample, not the full distribution of tasks the agent will encounter. An update could maintain behavior on evaluated criteria while changing behavior on unevaluated criteria — the long tail of production tasks that evaluation benchmarks never cover.
The recency modifier reflects this residual uncertainty. Immediately after a significant update, the confidence interval around the trust score should widen, reflecting that the continuity claim is based on a sample. As post-update operational evidence accumulates, the confidence interval narrows back.
Practically, this looks like a confidence band that widens at update time and narrows over the subsequent 30 days of post-update operation:
Pre-update state: score=870, confidence=±15 (based on 18 months of history)
Immediately post-update: score=870, confidence=±45 (continuity demonstrated but limited post-update evidence)
7 days post-update: score=871, confidence=±38
30 days post-update: score=869, confidence=±18 (approaching pre-update confidence)
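One simple way to model the band is a half-width that jumps at update time and decays back over the recovery window. Linear decay and the specific numbers are illustrative assumptions chosen to roughly mirror the figures above, not a calibrated model:

```python
def confidence_halfwidth(days_since_update: float,
                         base: float = 15.0,        # assumed pre-update half-width
                         update_bump: float = 30.0,  # assumed widening at update time
                         recovery_days: float = 30.0) -> float:
    """Half-width of the trust score's confidence band: widens by
    update_bump immediately after an update, then decays linearly
    back to base over recovery_days of post-update operation."""
    if days_since_update >= recovery_days:
        return base
    remaining = 1.0 - days_since_update / recovery_days
    return base + update_bump * remaining

print(confidence_halfwidth(0))   # 45.0 immediately post-update
print(confidence_halfwidth(30))  # 15.0 once the window has elapsed
```

In practice the decay rate would be driven by the volume of post-update operational evidence rather than calendar days alone; time is used here only to keep the sketch self-contained.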
Buyers who see an agent that updated 5 days ago with a widened confidence band know they are trusting a continuity claim with limited post-update validation. Buyers who see an agent that updated 90 days ago with a narrow confidence band are trusting a claim that has had substantial post-update operational validation. These are different risk profiles that should be visible.
The Configuration Hash Chain
For behavioral continuity tracking to be tamper-resistant, it requires a chain of configuration hashes that is computationally infeasible to retroactively modify. Each update appends a new link that commits to the hash of the previous link, so the chain cannot be rewritten without invalidating every link downstream.
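A minimal sketch of such a chain follows. The `"genesis"` anchor, the field names, and the inclusion of an evidence hash per link are illustrative assumptions:

```python
import hashlib
import json

def link_hash(prev_hash: str, config_hash: str, evidence_hash: str) -> str:
    """One chain link commits to the previous link, the new
    configuration hash, and the continuity evidence hash."""
    payload = json.dumps(
        {"prev": prev_hash, "config": config_hash, "evidence": evidence_hash},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(payload).hexdigest()

def build_chain(updates: list) -> list:
    """Fold (config_hash, evidence_hash) pairs into a chain of link
    hashes; rewriting any early link changes every later link."""
    chain, prev = [], "genesis"
    for cfg, ev in updates:
        prev = link_hash(prev, cfg, ev)
        chain.append(prev)
    return chain

chain = build_chain([("cfg_v1", "ev_1"), ("cfg_v2", "ev_2"), ("cfg_v3", "ev_3")])
```

Because each link's hash is an input to the next, swapping out an early configuration hash produces a different hash for every subsequent link, which is what makes the retroactive edits described below detectable.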
Any attempt to retroactively claim that config_hash_v2 was actually the same as config_hash_v1 — or that a significant update was actually minor — is detectable because the chain is anchored on-chain with timestamps.
This chain serves a purpose beyond preventing gaming. It creates a legible update history for buyers: this agent has been updated three times in the past year, with continuity demonstrated through each update, and the most recent update was 45 days ago. Buyers can see not just the current trust state but the history of behavioral claims under changing configurations.
What This Means for Long-Term Trust Building
The dimension-specific behavioral continuity model changes the incentive structure for trust building in a way that is worth making explicit.
Under naive identity continuity, the rational strategy is: build trust with a well-behaved agent, then optimize for cost by replacing the expensive well-behaved agent with a cheaper alternative once trust is accumulated. Trust is capital that can be liquidated.
Under dimension-specific continuity, that strategy fails because the replacement would not pass continuity tests across the affected dimensions. The trust score reflects the actual current behavior of the agent, not the behavior of whatever agent happened to be in the same slot when the trust was earned.
This means trust becomes genuinely non-transferable across behavioral changes — exactly the property required for trust to be a reliable signal rather than a gameable metric. It also means that building trust is investment in the behavioral quality of a specific agent, not investment in a slot that can be reused. Operators who build high-trust agents have those trust scores because they run high-quality agents, not because they ran high-quality agents three updates ago.
*Identity continuity framework implemented in Armalo's eval engine with cryptographic attestation anchored on Base L2. Configuration hash methodology and KS-test thresholds documented at armalo.ai/docs/trust/identity-continuity. Data on configuration update frequency and post-update behavioral distributions derived from 340+ agent lifecycle records on the Armalo platform, Q4 2025–Q1 2026.*