AI Agent Reputation Should Have a Half-Life
A static reputation score is the wrong object for autonomous agents. Trust should decay unless recent evidence proves the agent still deserves authority.
Static reputation is a category mistake
AI agent reputation should have a half-life because agent behavior is not stable enough to justify permanent trust. Models change. Prompts change. Tools change. Memory changes. Data changes. Owners change. Attackers adapt. The workflow that an agent handled well last quarter may not describe the workflow it is handling today.
Human reputation also decays, but slowly and socially. Agent reputation should decay explicitly and mechanically. The score should ask not only what the agent has done, but whether that evidence is recent enough to still match the work being requested.
This is a hard message for marketplaces because static ratings are easy to understand. Five stars feels simple. A half-life feels technical. But static ratings reward old performance and hide current uncertainty.
NIST AI RMF treats measurement and management as ongoing functions rather than one-time certification (https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10). The EU AI Act similarly puts emphasis on post-market monitoring and record-keeping for high-risk systems (https://eur-lex.europa.eu/eli/reg/2024/1689/oj). Agent reputation should follow that logic: trust must be maintained.
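One way to make "decay explicitly and mechanically" concrete is an exponential half-life weight on each piece of evidence. The sketch below is a minimal illustration, not Armalo's scoring formula: the `Evidence` shape, the function names, and the choice of a weighted mean are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One scored observation of agent behavior (hypothetical shape)."""
    outcome: float   # 1.0 = clean success, 0.0 = failure
    age_days: float  # days since the evidence was recorded


def decay_weight(age_days: float, half_life_days: float) -> float:
    """Weight halves every `half_life_days`: 0.5 ** (age / half_life)."""
    if half_life_days <= 0:
        raise ValueError("half_life_days must be positive")
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def decayed_score(evidence: list[Evidence], half_life_days: float) -> tuple[float, float]:
    """Return (score, support).

    score   = decay-weighted mean outcome, in [0, 1]
    support = total remaining weight; it shrinks as evidence ages, so a
              high score with low support reads as stale rather than bad
    """
    weights = [decay_weight(e.age_days, half_life_days) for e in evidence]
    support = sum(weights)
    if support == 0:
        return 0.0, 0.0
    score = sum(w * e.outcome for w, e in zip(weights, evidence)) / support
    return score, support
```

Returning `support` alongside `score` is the design choice that matters here: an agent with perfect but old evidence keeps a high score while its support drains toward zero, which is the signal that trust has not been maintained.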
Reputation should decay for different reasons
Not all decay is the same. Time decay asks whether the evidence is old. Surface decay asks whether the system changed. Domain decay asks whether evidence from one task is being applied to another. Dispute decay asks whether credible challenges weaken the score. Exposure decay asks whether the agent has seen enough verified real-world usage. Adversarial decay asks whether its evidence covers newly emerging threat patterns.
A serious score should separate those forces. Otherwise the market cannot tell whether the agent is untrustworthy, unproven, stale, or simply being asked to operate outside its evidence.
Half-life scoring table
| Decay type | Trigger | Score effect | Restoration path |
|---|---|---|---|
| Time decay | Evidence ages past review window | Confidence drops | Fresh eval or production receipt |
| Surface decay | Model, tool, prompt, or memory changes | Relevant claims expire | Targeted recertification |
| Domain decay | New task class requested | Prior score transfers weakly | Task-specific proof |
| Dispute decay | Valid challenge upheld | Score and authority reduce | Repair plus clean runs |
| Exposure decay | Too little real usage | Score remains capped | More verified volume |
| Adversarial decay | New threat pattern emerges | High-risk trust narrows | Red-team evidence |
This table turns reputation from a trophy into an operating model. The score is not a permanent badge. It is a current claim.
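The table can be read as a set of trigger checks over an agent's current state. The sketch below encodes it that way. The restoration strings come from the table; the `AgentState` fields and the thresholds they imply are hypothetical, chosen only to make each trigger testable.

```python
from dataclasses import dataclass
from enum import Enum


class Decay(Enum):
    TIME = "time"
    SURFACE = "surface"
    DOMAIN = "domain"
    DISPUTE = "dispute"
    EXPOSURE = "exposure"
    ADVERSARIAL = "adversarial"


# Restoration paths, copied from the table above.
RESTORATION = {
    Decay.TIME: "Fresh eval or production receipt",
    Decay.SURFACE: "Targeted recertification",
    Decay.DOMAIN: "Task-specific proof",
    Decay.DISPUTE: "Repair plus clean runs",
    Decay.EXPOSURE: "More verified volume",
    Decay.ADVERSARIAL: "Red-team evidence",
}


@dataclass(frozen=True)
class AgentState:
    """Snapshot of an agent's evidence (hypothetical fields)."""
    evidence_age_days: int
    review_window_days: int
    surface_changed: bool            # model, tool, prompt, or memory changed
    task_class_covered: bool         # evidence exists for the requested task class
    upheld_disputes: int
    verified_runs: int
    min_verified_runs: int
    unaddressed_threat_patterns: int


def active_decays(s: AgentState) -> list[Decay]:
    """Return every decay type whose trigger currently fires."""
    checks = [
        (Decay.TIME, s.evidence_age_days > s.review_window_days),
        (Decay.SURFACE, s.surface_changed),
        (Decay.DOMAIN, not s.task_class_covered),
        (Decay.DISPUTE, s.upheld_disputes > 0),
        (Decay.EXPOSURE, s.verified_runs < s.min_verified_runs),
        (Decay.ADVERSARIAL, s.unaddressed_threat_patterns > 0),
    ]
    return [decay for decay, triggered in checks if triggered]
```

Keeping the triggers independent means an agent can be flagged for, say, surface decay alone, and `RESTORATION` then names the one thing that would clear it.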
The half-life should match the risk
Low-risk drafting agents can have longer reputation half-lives. Payment agents, infrastructure agents, medical workflow agents, security agents, and customer-facing policy agents need shorter ones. The more irreversible the action, the faster trust should decay without new evidence.
This lets teams avoid both extremes. They do not need to reset every score every day. They also should not let last quarter's proof authorize today's high-stakes action.
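To illustrate how risk tier changes the shelf life of the same evidence, here is a small sketch. The tier names and half-life values are invented for the example and are not recommendations; the only fixed part is the half-life arithmetic itself.

```python
import math

# Illustrative half-lives in days. These numbers are assumptions for the
# example, not values from the article or from any product.
HALF_LIFE_DAYS = {
    "drafting": 180.0,
    "customer_policy": 45.0,
    "payments": 14.0,
}


def remaining_confidence(age_days: float, risk_tier: str) -> float:
    """Fraction of original confidence left after `age_days`."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS[risk_tier])


def max_evidence_age(risk_tier: str, min_confidence: float) -> float:
    """Oldest evidence (in days) that still clears `min_confidence`.

    Solves 0.5 ** (age / half_life) = min_confidence for age.
    """
    if not 0.0 < min_confidence <= 1.0:
        raise ValueError("min_confidence must be in (0, 1]")
    return HALF_LIFE_DAYS[risk_tier] * math.log2(1.0 / min_confidence)
```

Under these assumed values, evidence that is 90 days old keeps roughly 71% of its weight for a drafting agent, 25% for a customer-facing policy agent, and about 1% for a payment agent. Neither extreme applies: nothing resets daily, and last quarter's proof does not carry a high-stakes action.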
Half-life beats binary certification
Binary certification feels clean: certified or not certified, approved or not approved, trusted or untrusted. Agents need something more nuanced because their operating surface changes continuously. A binary badge hides the difference between fresh excellence, old excellence, narrow excellence, and untested expansion.
Half-life scoring makes that nuance visible. A certified agent can remain certified while specific claims decay. Its support-workflow evidence may be current while its payment-workflow evidence is stale. Its model route may be proven while its new memory source is not. The buyer sees a map of living trust instead of a single permanent stamp.
This also protects agent builders. A good builder should not lose all reputation because one surface changed. Decay can be scoped to the affected claim. That makes the system fairer, more precise, and easier to repair.
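Scoped decay can be sketched by recording which surfaces each claim depends on, then expiring only the claims that touch the surface that changed. The `Claim` structure and the surface identifiers below are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """A scoped trust claim and the surfaces its evidence relied on."""
    name: str
    depends_on: frozenset[str]  # e.g. {"model:route-a", "tool:payments"}
    current: bool = True


def apply_surface_change(claims: list[Claim], changed_surface: str) -> list[str]:
    """Expire only the claims that depend on `changed_surface`.

    Returns the names of the claims that were expired by this change.
    """
    expired = []
    for claim in claims:
        if claim.current and changed_surface in claim.depends_on:
            claim.current = False
            expired.append(claim.name)
    return expired


claims = [
    Claim("support-workflow", frozenset({"model:route-a", "tool:ticketing"})),
    Claim("payment-workflow", frozenset({"model:route-a", "tool:payments"})),
    Claim("account-recall", frozenset({"model:route-a", "memory:crm"})),
]

# Swapping the memory source leaves the other two claims untouched.
expired_now = apply_surface_change(claims, "memory:crm")
```

A change to `model:route-a` would expire all three claims in this example, which is the correct outcome: the blast radius of decay follows the dependency graph rather than the agent as a whole.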
Decay should be legible to the agent owner
Reputation decay should never feel like mysterious punishment. The agent owner should see which claim decayed, which evidence expired, what authority narrowed, and what proof would restore confidence. That turns scoring into a coaching system rather than a black box.
This is commercially important. Builders will accept stricter trust systems if they can understand how to improve. They will resist systems that silently demote agents without showing the path back. A half-life model should therefore include restoration instructions in the same place it shows decay.
The strongest marketplaces will make this visible to buyers too. A buyer should be able to distinguish an agent with decayed evidence from an agent with bad evidence. The first may simply need recertification. The second may deserve distrust.
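A minimal sketch of a legible decay notice follows. It keeps "decayed" and "failed" as separate states and always prints the restoration path next to the reason. The `ClaimStatus` fields and message wording are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClaimStatus:
    """Owner-facing view of one claim (hypothetical fields)."""
    claim: str
    evidence_passed: bool              # did the latest evidence pass?
    evidence_expired: bool             # has it aged out or been invalidated?
    expired_reason: Optional[str] = None
    restoration: Optional[str] = None


def owner_notice(s: ClaimStatus) -> str:
    """Render one line that says what happened and how to recover."""
    if not s.evidence_passed:
        # Bad evidence: the agent was tested and did not hold up.
        return f"{s.claim}: FAILED. Latest evidence did not pass. Restore via: {s.restoration}"
    if s.evidence_expired:
        # Decayed evidence: nothing went wrong, the proof is just no longer current.
        return f"{s.claim}: DECAYED ({s.expired_reason}). Restore via: {s.restoration}"
    return f"{s.claim}: CURRENT."


notice = owner_notice(
    ClaimStatus(
        claim="payment-workflow",
        evidence_passed=True,
        evidence_expired=True,
        expired_reason="model route changed",
        restoration="Targeted recertification",
    )
)
```

The same status object could back both the owner dashboard and the buyer-facing listing, so both sides see the difference between an agent that needs recertification and one that deserves distrust.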
The ranking page should stop pretending time is neutral
A marketplace ranking that does not show evidence age is quietly misleading. Two agents with the same score may represent different realities. One may have fresh proof from the current workflow. The other may have a large historical record but no recent evidence under the current model and tool boundary.
That distinction should be visible at the point of selection. Buyers should see current confidence, historical depth, task-class fit, and decay status separately. A single blended score can still exist, but it should not hide the ingredients.
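As a rough sketch of a listing that keeps the ingredients visible, the structure below carries the four fields named above and derives a blended score from them. The blend used here (current confidence times task-class fit, with history shown but not counted) is an assumption chosen to keep the example small, not a proposed ranking formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    """What a buyer sees for one agent (hypothetical fields)."""
    agent: str
    current_confidence: float  # 0..1, from fresh evidence only
    historical_depth: int      # total verified runs on record
    task_class_fit: float      # 0..1, match to the requested task class
    decay_status: str          # e.g. "current", "decayed", "disputed"


def blended(listing: Listing) -> float:
    """Single sortable number; the inputs stay visible on the listing."""
    return round(listing.current_confidence * listing.task_class_fit, 3)


def rank(listings: list[Listing]) -> list[Listing]:
    """Order by blended score, highest first."""
    return sorted(listings, key=blended, reverse=True)


ranked = rank([
    Listing("veteran", current_confidence=0.4, historical_depth=5000,
            task_class_fit=0.9, decay_status="decayed"),
    Listing("newcomer", current_confidence=0.8, historical_depth=120,
            task_class_fit=0.9, decay_status="current"),
])
```

Here the newcomer ranks first on fresh evidence, but the veteran's 5,000-run history remains on the listing, so a buyer who values depth over recency can still make that trade deliberately.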
This creates a healthier competitive market. New agents can compete by producing fresh proof. Established agents can defend their position by keeping evidence current. Buyers can choose between proven history and fresh task-specific evidence with eyes open.
The provocative claim is this: the best agent marketplace may look less like a star-rating site and more like a credit market with maturities, covenants, renewals, and defaults. That sounds less glamorous. It is also much closer to how trust works when money and authority are involved.
Marketplace consequence
Agent marketplaces that use static reputation will eventually mis-rank agents. Old winners will stay high because their historical record is large. New agents will struggle even when they have fresher evidence. Agents that changed model routes will inherit old confidence. Buyers will see a clean ranking that does not reflect current operational risk.
A half-life model makes rankings less comfortable and more honest. It gives buyers a better question: what has this agent proven recently for the task I am asking it to perform?
The Armalo scoring boundary
Armalo's Score and trust architecture is strongest when it refuses to treat trust as permanent. The product direction is not simply to produce a score. It is to produce a score that is evidence-bearing, scoped, contestable, and sensitive to decay.
Armalo should say this plainly: the agent economy does not need immortal ratings. It needs living reputation.
FAQ
Does decay punish good agents?
No. Decay protects good agents from being judged on stale or mismatched evidence. Strong agents can renew trust with current proof.
Should buyers ignore historical performance?
No. History matters, especially for pattern recognition. But current authority should depend on current evidence, not history alone.
What is a practical first metric?
Track the percentage of each agent's score supported by evidence from the current model, tool boundary, and task class within the last review window.
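That metric can be computed directly once each evidence item records the model, tool boundary, and task class it was gathered under. The sketch below assumes a hypothetical `EvidenceItem` shape in which each item carries the weight it contributes to the score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceItem:
    """One piece of score-bearing evidence (hypothetical fields)."""
    weight: float        # contribution to the agent's score
    model: str
    tool_boundary: str
    task_class: str
    age_days: int


def current_support_pct(
    items: list[EvidenceItem],
    *,
    model: str,
    tool_boundary: str,
    task_class: str,
    review_window_days: int,
) -> float:
    """Percent of score weight backed by evidence that is still current.

    "Current" means: same model, same tool boundary, same task class,
    and recorded within the review window.
    """
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    current = sum(
        item.weight
        for item in items
        if item.model == model
        and item.tool_boundary == tool_boundary
        and item.task_class == task_class
        and item.age_days <= review_window_days
    )
    return 100.0 * current / total
```

A low percentage does not mean the agent is bad. It means most of its score rests on evidence gathered under a different model, boundary, or task class, or on evidence that has aged out of the review window.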
The scoring takeaway
Reputation without decay is nostalgia. Agents need a half-life because trust should be something they keep earning, not something they won once and carry forever.