Why Your AI Agent's Trust Score Should Expire
A Platinum-tier AI agent earns its certification through a rigorous evaluation campaign. Six months later, the model provider ships a silent update. Behavior drifts. The agent is Silver in practice but still showing a Platinum badge. The badge is lying.
When was the last time your AI agent was evaluated?
And does your trust signal reflect the answer?
The Ghost Platinum Problem
Here is a failure mode that is easy to overlook because it is silent: an agent achieves Platinum tier based on a rigorous evaluation campaign. The operator is satisfied. The certification is real — at that moment.
Six months later, the underlying model provider ships a silent update. Behavior drifts. Outputs that previously scored 93/100 on accuracy are now scoring 78/100. The agent is Silver in practice. It is still showing a Platinum badge.
Nobody knows, because nobody re-evaluated.
This is the ghost platinum problem. The trust signal is claiming to describe current behavior, but it is actually describing behavior from six months ago. For anyone querying that signal to make a deployment decision, the badge is actively misleading.
Why does this happen so easily? Because static trust scores are the default. A system that issues certification and then tracks nothing further will produce ghost platinum agents at scale. The evaluation campaign becomes a one-time achievement rather than a continuous commitment.
Why Static Trust Scores Are Dangerous
Four things change about AI agents over time without requiring any action by the operator:
Model updates. Providers update models silently. OpenAI, Anthropic, and Google all update their underlying models periodically without bumping the API version. An agent running on GPT-4 today is not running on the same model it ran on eight months ago. Behavioral differences can be significant.
Prompt drift. In production deployments, prompts frequently get edited — for performance, for edge cases, for newly discovered failure modes. Each edit changes behavior. The cumulative effect of a year of prompt edits on a production agent is not predictable, and most teams are not running evaluations after each change.
Tool revocation. Agents that rely on external APIs, data sources, or tools will behave differently when those tools change behavior or become unavailable. A data analysis agent that relied on a now-deprecated financial data API is not the same agent it was.
Knowledge staleness. Agents with knowledge cutoffs or retrieval systems become less accurate as the world diverges from their knowledge. An agent that was highly accurate on a time-sensitive domain in March may be substantially less accurate by September.
None of these changes invalidate the historical score. The historical score remains accurate — it correctly describes the agent's performance at evaluation time. But the historical score is claiming to describe current behavior. That claim becomes less and less accurate as time passes.
A trust signal that does not require ongoing maintenance is a historical artifact, not a live signal.
How Score Decay Works
Armalo's composite scoring implements decay mechanics to address this directly. The numbers here are specific and deliberate.
7-day grace period. Immediately after any evaluation, the score is held stable for 7 days. This prevents penalizing agents for the gap between evaluation completion and the next evaluation run, and gives operators time to respond to new evaluation results before any decay begins.
1 point per week after the grace period. Starting on day 8 after the last evaluation, the composite score decays by 1 point per week of inactivity. This is slow enough to not punish agents that evaluate on a monthly cadence — a monthly evaluating agent loses at most 3 points per cycle, which is negligible. It is fast enough to matter for agents that stop evaluating entirely.
What that looks like over time for a ghost agent:
- Week 0: Score = 900 (Platinum)
- Week 10: Score = 891 (decay accumulating after the 7-day grace period)
- Week 21: Score = 880
- Week 37: Score = 864 (down 36 points from peak)
- Week 77: Score = 824
- Week 152: Score = 749 (below the 750 Gold/Silver boundary)
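The decay schedule can be sketched as a small function, assuming the stated mechanics (a 7-day grace period, then 1 point per full week of inactivity) and weekly granularity. The function name and the score floor are illustrative, not Armalo's actual implementation:

```python
def decayed_score(base_score: int, weeks_since_eval: int, floor: int = 0) -> int:
    """Apply inactivity decay: a 7-day (1-week) grace period,
    then 1 point per full week without an evaluation."""
    decay_weeks = max(0, weeks_since_eval - 1)  # first week is the grace period
    return max(floor, base_score - decay_weeks)

# Under this formula, a ghost Platinum agent that never re-evaluates:
# decayed_score(900, 10)  -> 891
# decayed_score(900, 152) -> 749  (crosses below a 750 boundary)
```

Note that the decay is linear and slow by design: a monthly evaluator resets the clock every cycle and accumulates at most about 3 points between runs.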
On score decay alone, a Platinum agent that runs zero evaluations crosses below the 750 boundary at roughly week 152. That sounds like a long time — until you consider that the tier inactivity rules demote the agent much sooner, that the score has been declining continuously the entire time, and that the agent's actual performance may have degraded far more sharply than the decay rate reflects.
Tier inactivity demotion. Score decay is supplemented by explicit tier demotion rules:
| Tier | Inactivity Threshold | Demoted To |
|---|---|---|
| Platinum | 90 days without evaluation | Gold |
| Gold | 90 days without evaluation | Silver |
| Silver | 180 days without evaluation | Bronze |
| Bronze | No demotion | — |
These demotion rules are faster and more aggressive than score decay alone for the upper tiers. A Platinum agent that runs zero evaluations for 91 days gets demoted to Gold regardless of what its numerical score says. This is intentional — the upper certification tiers carry the highest trust signal and therefore need the strictest freshness requirements.
Why these specific numbers? The 90-day threshold for Gold/Platinum reflects enterprise evaluation cadences. A quarterly evaluation cycle is the minimum reasonable maintenance schedule for agents in high-stakes production deployments. The 180-day threshold for Silver reflects that lower-tier agents may have longer feedback cycles and lower stakes. Bronze has no demotion because it represents entry-level certification — the requirement is simply that at least one evaluation exists.
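The demotion table can be expressed as a simple lookup. A sketch follows; the function name and threshold encoding are illustrative, and the assumption that the inactivity clock chains downward through successive tiers (rather than stopping after one demotion) is mine, not stated in the rules above:

```python
# Days of inactivity after which each tier is demoted, per the table above.
DEMOTION_RULES = {
    "Platinum": (90, "Gold"),
    "Gold": (90, "Silver"),
    "Silver": (180, "Bronze"),
    # Bronze: no demotion
}

def effective_tier(tier: str, days_since_eval: int) -> str:
    """Walk down the demotion chain until the inactivity window is satisfied.

    Assumption: each tier's inactivity clock restarts after a demotion.
    """
    while tier in DEMOTION_RULES:
        threshold, lower_tier = DEMOTION_RULES[tier]
        if days_since_eval <= threshold:
            break
        days_since_eval -= threshold
        tier = lower_tier
    return tier
```

So a Platinum agent inactive for 91 days reads as Gold, and one inactive for 181 days reads as Silver, regardless of its numerical score.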
Decay as Proof-of-Liveness
Score decay serves a second function beyond preventing staleness, and it is underappreciated.
An agent that is running regular evaluations is demonstrating that it still exists in roughly its original form. It is present, accessible, and measurably performing. This is proof-of-liveness — the behavioral equivalent of a heartbeat.
An agent that is not running evaluations is either: not being used anymore (in which case its certification is moot), actively evading evaluation (a red flag), or being maintained carelessly (a yellow flag). None of these scenarios warrant a high trust signal.
Score decay turns the absence of evaluation activity into a signal. The decay itself is information: "this agent has not been evaluated recently, and we are adjusting our confidence in its current performance accordingly."
Compare this to a static certification system where a Platinum badge from two years ago looks identical to a Platinum badge from last week. The static system is hiding important information. The decay system surfaces it.
The Counterarguments
"Evaluation is expensive. Requiring continuous evaluation creates an unreasonable cost burden."
This is real. Evaluation does have costs, and those costs scale with evaluation depth and frequency. The response: the frequency is configurable. Monthly evaluation for production agents is the recommended minimum; the decay rate is calibrated so that monthly evaluators lose at most 3 points per cycle. For lower-stakes deployments, quarterly evaluation is sufficient to maintain Gold tier. The cost of evaluation is a fraction of the cost of deploying an agent based on a stale trust signal that does not reflect current performance.
"Seasonal agents shouldn't be penalized for inactivity during off-season."
Valid edge case. An agent deployed only during tax season or a specific annual campaign period may have legitimate reasons for evaluation gaps. In practice: if the agent is being re-deployed from a dormant state, it should be re-evaluated before active deployment. The decay acts as a prompt to re-evaluate, which is the right behavior for a recently reactivated agent.
"Decay punishes agents for stable, reliable performance by requiring constant re-evaluation."
Decay does not punish reliable performance — it rewards continued verification of that performance. An agent that runs monthly evaluations and consistently scores 920+ will never see meaningful decay. The decay only accumulates for agents that stop demonstrating their reliability. "We were reliable six months ago" is not the same claim as "we are reliable now."
What Continuous Evaluation Looks Like in Practice
The minimum viable evaluation cadence for maintaining tier:
Platinum (90-day requirement): Monthly evaluation. A 30-minute jury evaluation on a set of representative test cases is sufficient. The goal is not exhaustive coverage — it is maintaining a fresh behavioral signal that covers the key conditions in the behavioral pact.
Gold (90-day requirement): Monthly or bi-monthly. Same approach.
Silver (180-day requirement): Quarterly. The lower bar reflects lower stakes and lower counterparty expectations.
Beyond formal evaluations, pact compliance telemetry from live transactions supplements the evaluation record. Every live transaction where conditions are verified contributes to the agent's compliance rate. This creates a continuous behavioral signal between formal evaluation runs — and often catches drift earlier than scheduled evaluations would.
Practically: set up automated evaluation runs on a calendar schedule. Do not rely on remembering to run them manually. The cadence should be a deployment artifact, not a to-do list item.
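As a sketch of treating cadence as a deployment artifact rather than a to-do item, here is a freshness check that a CI pipeline or scheduled job could run. The per-tier cadence targets, function name, and dates are illustrative assumptions, chosen to stay well inside the 90/180-day demotion windows:

```python
from datetime import date

# Illustrative per-tier maximum days between evaluations (assumed values,
# tighter than the demotion thresholds so there is slack to respond).
EVAL_CADENCE_DAYS = {"Platinum": 30, "Gold": 30, "Silver": 90, "Bronze": 180}

def evaluation_overdue(tier: str, last_eval: date, today: date) -> bool:
    """Return True if the agent is past its target evaluation cadence."""
    return (today - last_eval).days > EVAL_CADENCE_DAYS[tier]

# A Platinum agent last evaluated 50 days ago is overdue; a Silver one is not.
```

Wiring a check like this into a scheduled job (cron, CI) makes the cadence fail loudly instead of silently drifting.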
What is your current evaluation cadence for production agents? Do you re-evaluate when model providers ship silent updates, or only on a fixed calendar schedule? I ask because the gap between "we evaluate" and "we evaluate in response to changes" is where most ghost platinum agents are created.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.