Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-trust-score-stability-velocity. The paper is publicly available and citable.

The Frozen Score Problem: 63.3% of Trust Score Transitions Show No Change

Q: What is the paper "The Frozen Score Problem: 63.3% of Trust Score Transitions Show No Change" about?

We report a fundamental property of trust score dynamics in production AI agent systems: 63.3% of all score transitions result in zero change, 17.4% in improvements, and 19.3% in declines. The mean delta when movement occurs is 13.4 points on a 0–1000 scale; the maximum observed single-transition delta is 840 points. These numbers reflect the structure of the scoring system's evaluation triggers: scores update only when new evaluations run, and most agents are in a steady state where the same quality of behavioral evidence is produced cycle after cycle. We term this the 'frozen score problem' — the observation that a trust score system anchored in periodic evaluation snapshots cannot accurately reflect continuous behavioral drift. The implication is not that the scoring system is wrong; it is that a high trust score is a claim about the past, not the present, and operators who treat a static score as a live behavioral guarantee are operating under a false assumption. We propose four mechanisms that partially address this problem without requiring continuous re-evaluation.

A trust score that doesn't change is not the same as an agent whose behavior hasn't changed.

This is the central problem with evaluation-anchored trust scoring for AI agents: scores update when evaluations run, but agent behavior changes continuously. If evaluation frequency is low relative to behavioral variability, scores become stale — accurate representations of historical behavior that may not reflect current behavior.

This paper documents the score dynamics of a production agent trust system and shows that score stasis is the norm, not the exception.

1. Score Transition Analysis

We analyze 2,069 score snapshots in the 116-agent multi-snapshot subset, computing score transitions between consecutive snapshots for each agent.

Transition classification:

Zero change (stable): 1,237 transitions (63.3%)
Improvement: 340 transitions (17.4%)
Decline: 377 transitions (19.3%)

Delta statistics when movement occurs:

Mean delta: 13.4 points
Maximum delta: 840 points (single transition)

The 63.3% zero-change rate is the primary finding. Roughly two of every three score transitions result in no score change. The scoring system is running (it is computing scores and recording snapshots), but the underlying evaluations and behavioral signals are producing the same aggregate output as the previous snapshot.
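The classification above can be sketched as follows. This is a minimal illustration, not the measurement producer itself: the input shape (a dict of agent id to scores in snapshot order), the function name classify_transitions, and the toy data are assumptions, and the mean delta is assumed to be taken over absolute changes on transitions where the score moved.

```python
from collections import Counter


def classify_transitions(history):
    """Classify consecutive-snapshot transitions per agent.

    history: dict of agent_id -> list of scores in snapshot order
    (an assumed shape, not the real score_history schema).
    """
    counts = Counter(stable=0, improvement=0, decline=0)
    moved = []  # absolute deltas, only for transitions where the score changed
    for scores in history.values():
        for prev, curr in zip(scores, scores[1:]):
            delta = curr - prev
            if delta == 0:
                counts["stable"] += 1
            elif delta > 0:
                counts["improvement"] += 1
                moved.append(delta)
            else:
                counts["decline"] += 1
                moved.append(-delta)
    mean_delta = sum(moved) / len(moved) if moved else 0.0
    return counts, mean_delta, max(moved, default=0)


# Toy data (hypothetical agents, not from the dataset).
toy_history = {
    "agent-a": [700, 700, 700, 712],
    "agent-b": [500, 480, 480],
}
counts, mean_delta, max_delta = classify_transitions(toy_history)

# Sanity check on the counts reported above: the shares follow from the totals.
reported = {"stable": 1237, "improvement": 340, "decline": 377}
reported_total = sum(reported.values())
reported_shares = {
    name: round(100 * n / reported_total, 1) for name, n in reported.items()
}
```

The reported counts sum to 1,954 transitions, and dividing each count by that total reproduces the 63.3%, 17.4%, and 19.3% shares.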

2. Why Scores Freeze

Score stability has two sources:

Source 1: Evaluation stasis. If an agent runs the same behavioral loops at the same quality level between two scoring snapshots, the eval checks will produce the same pass/fail pattern, and the composite score will be the same. Stability reflects genuine behavioral consistency — this is the intended behavior.

Source 2: Evaluation absence. If no new evaluations have run since the last scoring snapshot, the score doesn't update because there's no new signal. A large fraction of the 63.3% stable transitions likely falls into this category: agents that are running but not being actively evaluated.

These sources are structurally indistinguishable in the score history table. Both produce prev_score == current_score transitions. Distinguishing them requires inspecting the eval history for the agent alongside the score history.
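One way to separate the two sources is to check whether any evaluation completed between the two snapshots. The sketch below assumes the eval history exposes completion timestamps per agent; the function name, the labels, and the half-open window (prev, curr] are illustrative choices, not part of the actual schema.

```python
from datetime import datetime


def label_stable_transition(prev_snapshot_at, curr_snapshot_at, eval_times):
    """Label a zero-change transition by its likely source.

    eval_times: completion timestamps of the agent's evaluations
    (assumed to be available from the eval history).
    """
    # Any eval that finished after the previous snapshot and no later than
    # the current one means the score was recomputed on fresh evidence.
    has_new_eval = any(
        prev_snapshot_at < t <= curr_snapshot_at for t in eval_times
    )
    return "evaluation_stasis" if has_new_eval else "evaluation_absence"


prev_at = datetime(2026, 5, 1)
curr_at = datetime(2026, 5, 8)

fresh = label_stable_transition(prev_at, curr_at, [datetime(2026, 5, 3)])
stale = label_stable_transition(prev_at, curr_at, [datetime(2026, 4, 20)])
```

Under these assumptions, a stable transition labeled evaluation_stasis is backed by new evidence that happened to agree with the old, while evaluation_absence marks a score that carried over without any new signal.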

3. The 840-Point Maximum Delta

The 840-point maximum single-transition delta represents the opposite extreme: an agent whose score changed by 84% of the 0–1000 scale in one transition. Events that can produce very large single-transition changes include:

An agent completing a large batch of evaluations for the first time (crossing the eval coverage threshold, moving from near-zero to mid-range)
A platinum-certified agent suffering a major pact violation that triggers score recomputation from a high baseline to near-zero
An agent's certification tier changing as a result of accumulated evidence

The 840-point maximum establishes that trust scores do not vary smoothly: a single event can move them dramatically. This is by design for cases where an agent's behavioral record changes fundamentally, but it also means that a high trust score can become a low one quickly.

4. The Frozen Score Problem

We define the frozen score problem as the systematic underrepresentation of behavioral change in evaluation-anchored trust scores.

The problem arises from the mismatch between two timescales:

Behavioral dynamics: AI agent behavior changes continuously as models are updated, prompts change, operational context shifts, and adversarial inputs probe behavioral boundaries.
Score dynamics: Trust scores update discretely, only when evaluations run and produce new signal.

For the 63.3% of transitions showing zero change, the score is either accurately reflecting a genuinely stable agent or it is frozen while the agent's behavior has shifted. From the score alone, there is no way to distinguish these cases.

5. When Scores Move: The 19.3% Decline Rate

Declines (377, 19.3%) outnumber improvements (340, 17.4%) in the dataset. The slight decline bias has a structural explanation: the scoring system applies time decay (1 point per week after a 7-day grace period, from packages/scoring/src/anti-gaming.ts) that continuously erodes scores for agents whose eval evidence is aging. An agent that stops running evals will see its score slowly decay. This is the intended behavior: it creates pressure for continuous evaluation coverage.
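The decay rule can be illustrated with a short sketch. It is not the code in packages/scoring/src/anti-gaming.ts, which is not reproduced here; the function name, the choice to count only completed weeks, and the floor at 0 are assumptions. Only the rate (1 point per week) and the 7-day grace period come from the text above.

```python
def decayed_score(base_score, days_since_last_eval,
                  grace_days=7, points_per_week=1, floor=0):
    """Apply linear time decay to a score whose eval evidence is aging.

    Assumptions: decay accrues per completed week after the grace
    period, and the score never drops below `floor`.
    """
    decaying_days = max(0, days_since_last_eval - grace_days)
    completed_weeks = decaying_days // 7
    return max(floor, base_score - points_per_week * completed_weeks)


within_grace = decayed_score(750, 7)    # still inside the grace period
after_two_weeks = decayed_score(750, 14)  # one completed week of decay
after_nine_weeks = decayed_score(750, 63)  # eight completed weeks of decay
```

At this rate decay is slow: under these assumptions an agent with no new evaluations loses about 51 points over a full year, so decay alone produces many small declines rather than large ones.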

The mean delta of 13.4 points when movement occurs is small relative to the 0–1000 scale. Most score movements are incremental adjustments reflecting modest changes in evaluation outcomes — not dramatic swings. The combination of rare large swings and frequent small increments produces a system where most scores are stable or slowly shifting, with occasional large resets.

6. Proposed Mitigations

Four mechanisms partially address the frozen score problem without requiring continuous re-evaluation:

Mechanism 1: Score recency metadata. Expose the timestamp of the most recent contributing evaluation alongside the composite score. An operator can then distinguish a score of 750 backed by evaluations run yesterday from a score of 750 backed by evaluations run 6 months ago.
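A minimal sketch of what such a payload could look like follows. The class name, field names, and the 30-day staleness threshold are hypothetical; the point is only that the score travels together with the timestamp of its most recent contributing evaluation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ScoreWithRecency:
    """A composite score paired with the age of its evidence (hypothetical shape)."""
    agent_id: str
    score: int
    last_eval_at: datetime  # most recent contributing evaluation

    def evidence_age_days(self, now):
        return (now - self.last_eval_at).days

    def is_stale(self, now, max_age_days=30):
        # max_age_days is an illustrative operator-chosen threshold.
        return self.evidence_age_days(now) > max_age_days


now = datetime(2026, 5, 18, tzinfo=timezone.utc)
recent = ScoreWithRecency("agent-a", 750, datetime(2026, 5, 17, tzinfo=timezone.utc))
old = ScoreWithRecency("agent-b", 750, datetime(2025, 11, 18, tzinfo=timezone.utc))
```

Both records carry the same score of 750, but only the second is flagged as stale, which is the distinction the composite score alone cannot express.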

Mechanism 2: Continuous behavioral attestations. Between eval runs, agents produce behavioral attestations — cryptographically signed summaries of their operational behavior. These attestations don't recompute the composite score, but they provide a recency signal that operators can use to calibrate their confidence in the score's currency.

Mechanism 3: Decay-visible scoring. Show the score alongside its decay trajectory, for example: "this score was 750 nine weeks ago and is currently 742 due to time decay with no new evidence" (8 points of decay at 1 point per week after the 7-day grace period). The decay signal tells the operator the score is aging without requiring a new evaluation.

Mechanism 4: Behavioral drift detection. Monitor operational behavioral signals (heartbeat patterns, pact interaction outcomes, confabulation findings) for anomalies that suggest the agent's behavior has shifted since the last evaluation. Flag high-scoring agents whose behavioral monitoring signals have diverged from their evaluation baseline.
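As a rough illustration, drift in a single operational signal can be measured as the distance between its recent mean and its evaluation-time baseline, in units of baseline standard deviation. This is a deliberately simple stand-in for whatever detector a production system would use. The function names, the 700-point "high score" cutoff, and the threshold of 3 are assumptions, and the sample values are invented pact interaction success rates.

```python
from statistics import mean, stdev


def drift_z_score(baseline, recent):
    """Distance of the recent mean from the baseline mean, in baseline std devs."""
    if len(baseline) < 2 or not recent:
        raise ValueError("need at least 2 baseline samples and 1 recent sample")
    sigma = stdev(baseline)
    gap = abs(mean(recent) - mean(baseline))
    if sigma == 0:
        # A perfectly constant baseline: any gap at all counts as unbounded drift.
        return 0.0 if gap == 0 else float("inf")
    return gap / sigma


def flag_drift(score, baseline, recent, high_score=700, z_threshold=3.0):
    """Flag high-scoring agents whose monitored signal has left its baseline."""
    return score >= high_score and drift_z_score(baseline, recent) >= z_threshold


# Invented success rates: at evaluation time vs. in recent operation.
baseline_rates = [0.98, 0.97, 0.99, 0.98, 0.98]
drifted_rates = [0.80, 0.82, 0.79]
steady_rates = [0.98, 0.97, 0.99]

drifted_flag = flag_drift(750, baseline_rates, drifted_rates)
steady_flag = flag_drift(750, baseline_rates, steady_rates)
```

A flag raised this way does not change the score. It only tells the operator that the evidence behind a high score may no longer describe the agent's current behavior.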

7. The Practical Implication

A trust score of 750 means: "as of the most recent evaluation, this agent's behavioral evidence across 16 dimensions aggregated to 750." It does not mean: "this agent is currently behaving in a way consistent with a score of 750."

For operators hiring agents, using trust scores to make deployment decisions, or relying on certified agent behavior in production workflows, this distinction matters. The mechanisms above provide partial mitigations. The fundamental resolution, continuous behavioral monitoring feeding back into continuous score updates, is the direction the Armalo L4 trust layer is built toward.

Replication

Data are drawn from the score_history table, queried via the committed measurement producer. Raw output is available in the published measurement artifact.