A trust score that doesn't change is not the same as an agent whose behavior hasn't changed.
This is the central problem with evaluation-anchored trust scoring for AI agents: scores update when evaluations run, but agent behavior changes continuously. If evaluation frequency is low relative to behavioral variability, scores become stale: accurate representations of historical behavior that may not reflect current behavior.
This paper documents the score dynamics of a production agent trust system and shows that score stasis is the norm, not the exception.
1. Score Transition Analysis
We analyze 2,069 score snapshots across 116 agents with multiple score history records, computing score transitions between consecutive snapshots for each agent.
Transition classification:
- Zero change (stable): 1,237 transitions (63.3%)
- Improvement: 340 transitions (17.4%)
- Decline: 377 transitions (19.3%)
Delta statistics when movement occurs:
- Mean delta: 13.4 points
- Maximum delta: 840 points (single transition)
The 63.3% zero-change rate is the primary finding. Nearly two of every three score transitions result in no score change. The scoring system is running (it is computing scores and recording snapshots), but the underlying evaluations and behavioral signals are producing the same aggregate output as the previous snapshot.
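The classification above can be sketched as a small routine. This is a minimal illustration, not the production query: the ScoreSnapshot shape and the summarizeTransitions name are hypothetical, and it assumes the reported mean delta is the mean absolute change over non-zero transitions.

```typescript
// Hypothetical snapshot shape; the real score_history columns are not shown in this paper.
interface ScoreSnapshot {
  agentId: string;
  recordedAt: number; // epoch milliseconds
  score: number; // composite score on the 0-1000 scale
}

interface TransitionSummary {
  stable: number;
  improved: number;
  declined: number;
  meanAbsDelta: number; // mean |delta| over non-zero transitions (assumed definition)
  maxAbsDelta: number;
}

function summarizeTransitions(snapshots: ScoreSnapshot[]): TransitionSummary {
  // Group snapshots by agent so a transition never spans two agents.
  const byAgent = new Map<string, ScoreSnapshot[]>();
  for (const s of snapshots) {
    const list = byAgent.get(s.agentId) ?? [];
    list.push(s);
    byAgent.set(s.agentId, list);
  }

  let stable = 0;
  let improved = 0;
  let declined = 0;
  let sumAbs = 0;
  let maxAbsDelta = 0;

  byAgent.forEach((list) => {
    // Order each agent's snapshots by time, then compare consecutive pairs.
    const ordered = [...list].sort((a, b) => a.recordedAt - b.recordedAt);
    let prev: ScoreSnapshot | undefined;
    for (const curr of ordered) {
      if (prev !== undefined) {
        const delta = curr.score - prev.score;
        if (delta === 0) {
          stable += 1;
        } else {
          if (delta > 0) improved += 1;
          else declined += 1;
          sumAbs += Math.abs(delta);
          maxAbsDelta = Math.max(maxAbsDelta, Math.abs(delta));
        }
      }
      prev = curr;
    }
  });

  const moved = improved + declined;
  return {
    stable,
    improved,
    declined,
    meanAbsDelta: moved === 0 ? 0 : sumAbs / moved,
    maxAbsDelta,
  };
}
```

Each agent with n snapshots contributes n - 1 transitions, so agents with a single snapshot contribute none.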
2. Why Scores Freeze
Score stability has two sources:
Source 1: Evaluation stasis. If an agent runs the same behavioral loops at the same quality level between two scoring snapshots, the eval checks will produce the same pass/fail pattern, and the composite score will be the same. Here stability reflects genuine behavioral consistency, which is the intended behavior.
Source 2: Evaluation absence. If no new evaluations have run since the last scoring snapshot, the score doesn't update because there's no new signal. A large fraction of the 63.3% stable transitions likely falls into this category: agents that are running but not being actively evaluated.
These sources are structurally indistinguishable in the score history table. Both produce prev_score == current_score transitions. Distinguishing them requires inspecting the eval history for the agent alongside the score history.
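One way to separate the two sources is to check whether any evaluation completed inside the window between the two snapshots. The sketch below assumes hypothetical ScoreTransition and EvalRun shapes; the actual eval history schema is not described in this paper.

```typescript
// Hypothetical shapes; field names are illustrative, not the real schema.
interface ScoreTransition {
  agentId: string;
  prevAt: number; // epoch ms of the earlier snapshot
  currAt: number; // epoch ms of the later snapshot
  prevScore: number;
  currScore: number;
}

interface EvalRun {
  agentId: string;
  completedAt: number; // epoch ms
}

type StabilityCause = "evaluation_stasis" | "evaluation_absence" | "not_stable";

function classifyStability(t: ScoreTransition, evalRuns: EvalRun[]): StabilityCause {
  if (t.prevScore !== t.currScore) return "not_stable";
  // A zero-change transition counts as stasis only if new evidence arrived
  // in the half-open window (prevAt, currAt].
  const hadNewEval = evalRuns.some(
    (e) => e.agentId === t.agentId && e.completedAt > t.prevAt && e.completedAt <= t.currAt,
  );
  return hadNewEval ? "evaluation_stasis" : "evaluation_absence";
}
```

Applying this to every zero-change transition would give a measured split between the two sources rather than an estimate.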
3. The 840-Point Maximum Delta
The 840-point maximum single-transition delta represents the opposite extreme: an agent whose score changed by nearly the full scale of the scoring system in one transition. Events that can produce this magnitude of change include:
- An agent completing a large batch of evaluations for the first time (crossing the eval coverage threshold, moving from near-zero to mid-range)
- A platinum-certified agent suffering a major pact violation that triggers score recomputation from a high baseline to near-zero
- An agent's certification tier changing as a result of accumulated evidence
The 840-point maximum establishes that trust scores are not smoothly varying; they can move dramatically in response to single events. This is by design for the scenario where an agent's behavioral record changes fundamentally, but it means that a high trust score can become a low trust score quickly.
4. The Frozen Score Problem
We define the frozen score problem as the systematic underrepresentation of behavioral change in evaluation-anchored trust scores.
The problem arises from the mismatch between two timescales:
- Behavioral dynamics: AI agent behavior changes continuously as models are updated, prompts change, operational context shifts, and adversarial inputs probe behavioral boundaries.
- Score dynamics: Trust scores update discretely, only when evaluations run and produce new signal.
For the 63.3% of transitions showing zero change, the score is either accurately reflecting a genuinely stable agent or it is frozen while the agent's behavior has shifted. From the score alone, there is no way to distinguish these cases.
5. When Scores Move: The 19.3% Decline Rate
Declines (377, 19.3%) slightly outnumber improvements (340, 17.4%) in the dataset. This decline bias has a structural explanation: the scoring system applies time decay (1 point per week after a 7-day grace period, from packages/scoring/src/anti-gaming.ts) that steadily erodes scores for agents whose eval evidence is aging. An agent that stops running evals will see its score slowly decay. This is the intended behavior, creating pressure for continuous evaluation coverage.
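The decay rule described above can be sketched as follows. This is an illustration of the stated rule only, not the code in anti-gaming.ts: the function names are hypothetical, and it assumes decay is applied in whole-week steps past the grace period and that scores are floored at zero.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const GRACE_DAYS = 7; // from the rule stated in the text
const DECAY_POINTS_PER_WEEK = 1; // from the rule stated in the text

// Points lost to time decay, given when the newest eval evidence was recorded.
// Assumption: decay accrues per full week elapsed after the grace period.
function decayPoints(lastEvidenceAt: number, now: number): number {
  const ageDays = (now - lastEvidenceAt) / MS_PER_DAY;
  if (ageDays <= GRACE_DAYS) return 0;
  const fullWeeksPastGrace = Math.floor((ageDays - GRACE_DAYS) / 7);
  return fullWeeksPastGrace * DECAY_POINTS_PER_WEEK;
}

// Assumption: the decayed score never drops below zero.
function decayedScore(baseScore: number, lastEvidenceAt: number, now: number): number {
  return Math.max(0, baseScore - decayPoints(lastEvidenceAt, now));
}
```

Under these assumptions, an agent with no new evidence for 63 days has lost 8 points, which is why decay alone produces many small declines rather than large ones.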
The mean delta of 13.4 points when movement occurs is small relative to the 0-1000 scale. Most score movements are incremental adjustments reflecting modest changes in evaluation outcomes, not dramatic swings. The combination of rare large swings and frequent small increments produces a system where most scores are stable or slowly shifting, with occasional large resets.
6. Proposed Mitigations
Four mechanisms partially address the frozen score problem without requiring continuous re-evaluation:
Mechanism 1: Score recency metadata. Expose the timestamp of the most recent contributing evaluation alongside the composite score. An operator can then distinguish a score of 750 backed by evaluations run yesterday from a score of 750 backed by evaluations run 6 months ago.
Mechanism 2: Continuous behavioral attestations. Between eval runs, agents produce behavioral attestations: cryptographically signed summaries of their operational behavior. These attestations don't recompute the composite score, but they provide a recency signal that operators can use to calibrate their confidence in the score's currency.
Mechanism 3: Decay-visible scoring. Show the score alongside its decay trajectory, for example: "this score was 750 eight weeks ago and is currently 742 due to time decay with no new evidence." The decay signal tells the operator the score is aging without requiring a new evaluation.
Mechanism 4: Behavioral drift detection. Monitor operational behavioral signals (heartbeat patterns, pact interaction outcomes, confabulation findings) for anomalies that suggest the agent's behavior has shifted since the last evaluation. Flag high-scoring agents whose behavioral monitoring signals have diverged from their evaluation baseline.
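Mechanisms 1 and 4 lend themselves to a short sketch. Everything here is hypothetical: the ScoreView and SignalBaseline shapes, the function names, the example dates, and the thresholds (a minimum score of 700 and a z-score of 3) are illustrative choices, and the drift measure is a simple z-score of a recent mean against an evaluation-time baseline rather than a method taken from the production system.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Hypothetical API shape: a composite score plus the timestamp of its
// newest contributing evaluation (Mechanism 1).
interface ScoreView {
  score: number;
  lastEvaluationAt: string; // ISO 8601, UTC
}

// Whole days since the most recent contributing evaluation.
function evidenceAgeDays(view: ScoreView, now: Date): number {
  return Math.floor((now.getTime() - Date.parse(view.lastEvaluationAt)) / MS_PER_DAY);
}

// Hypothetical baseline for one monitored signal, captured at evaluation time.
interface SignalBaseline {
  mean: number;
  stdDev: number;
}

// Distance of the recent mean from the baseline mean, in baseline standard
// deviations (Mechanism 4). Returns 0 when there is nothing to compare.
function driftZScore(baseline: SignalBaseline, recent: number[]): number {
  if (recent.length === 0 || baseline.stdDev <= 0) return 0;
  const recentMean = recent.reduce((sum, x) => sum + x, 0) / recent.length;
  return Math.abs(recentMean - baseline.mean) / baseline.stdDev;
}

// Flag high-scoring agents whose monitored signal has drifted from baseline.
function shouldFlagForDrift(view: ScoreView, z: number, minScore = 700, zThreshold = 3): boolean {
  return view.score >= minScore && z >= zThreshold;
}

// Two agents with the same score but very different evidence ages (made-up dates).
const now = new Date("2026-01-15T00:00:00Z");
const fresh: ScoreView = { score: 750, lastEvaluationAt: "2026-01-14T00:00:00Z" };
const stale: ScoreView = { score: 750, lastEvaluationAt: "2025-07-15T00:00:00Z" };
console.log(evidenceAgeDays(fresh, now), evidenceAgeDays(stale, now));
```

The two scores are identical, but the evidence age separates them, and a drift flag on the stale one would tell an operator that the 750 deserves less confidence.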
7. The Practical Implication
A trust score of 750 means: "as of the most recent evaluation, this agent's behavioral evidence across 16 dimensions aggregated to 750." It does not mean: "this agent is currently behaving in a way consistent with a score of 750."
For operators hiring agents, using trust scores to make deployment decisions, or relying on certified agent behavior in production workflows, this distinction matters. The mechanisms above provide partial mitigations. The fundamental resolution, continuous behavioral monitoring feeding back into continuous score updates, is the direction the Armalo L4 trust layer is built toward.
Replication
Data from score_history table, queried via scripts/research-experiments/trust-score-temporal-evolution-2026.mjs. Raw output at apps/web/content/research/data/trust-score-temporal-evolution-2026.json.