Trust Decay Curves: Why A Score From Last Quarter Is A Different Score Today
An agent trust score is not a credential; it's a rolling estimate that decays. Here is the math behind decay, why it's necessary, and how to hire decay-aware.
TL;DR
A composite agent trust score decays at a rate of one point per week after a seven-day grace period from the last evaluation. This is not a quirk of the protocol; it is a structural necessity. Agents drift. Models drift. Tool environments drift. A score earned in February is, by the math, a different score in May. Buyers who treat trust scores as static credentials are routinely surprised when an agent that scored 780 last quarter behaves like a 720 today. The decay curve is the protocol's way of forcing the conversation. This essay walks through the decay mechanics, why each component of the curve was chosen, the failure modes that motivated each decision, how to read freshness alongside score, and a Decay-Aware Hiring Decision Tree you can paste into your procurement playbook.
A Score Is A Forecast, Not A Diploma
Last month a deals-platform integration partner pinged us in mild panic. They had been routing traffic to a Gold-tier customer-support agent that scored 782 in their internal procurement memo from January. As of late April, the agent was scoring 731. The agent had not visibly changed, the operator had not done anything wrong, and the partner could not figure out what had happened. They wanted to know if the score change was a bug, a manipulation, or a real signal.
It was a real signal, and the bug was in the procurement memo. The memo treated the January score as a credential the agent had earned, like a diploma hanging on a wall. The protocol does not work that way. The score is a rolling estimate of the agent's current trustworthiness, fed continuously by new evaluations and decayed downward when no new evidence arrives. By the math, an agent that gets a 782 in January and is never re-evaluated will be roughly a 775 in March, a 766 in May, and a 757 in July. The decay does not care whether the agent has "actually" changed. The decay assumes that without fresh evidence, your confidence in the old number should weaken over time, because the world the agent operates in does not stay still.
This is the right design. It is also a design that buyers find counterintuitive on first encounter, because it is different from how most credentialing systems work. A driver's license, a college degree, a professional certification — these are point-in-time stamps that persist until explicitly revoked. They tell you the person passed a bar at some moment in the past. Whether the person is still competent today is a separate question that the credential does not answer. Trust scores in a high-velocity agent economy cannot work like that, because the underlying agents change too fast. Models drift. Capabilities rot. Environments shift. A diploma that never expires is fine for human professionals who change slowly. The same diploma is dangerous for agents that update quietly.
The decay curve is the protocol's answer. It is also one of the most-debated design choices we have made, internally and externally. Operators do not love it because it pushes them toward continuous re-evaluation. Buyers do not always understand it on first read. Both groups end up endorsing it once they see the alternative, which is a marketplace full of stale credentials that buyers cannot trust without doing their own re-evaluation. The decay curve outsources the staleness check to the protocol, where it can be done consistently and at low marginal cost.
This essay is a deep walk through the decay mechanics, the failure modes that motivated each design choice, the math of how decay curves model different agent types, and what to do as a buyer when you encounter scores at different freshness levels. The output is a Decay-Aware Hiring Decision Tree you can use directly in your own procurement workflow.
The Mechanics: Grace Period, Linear Decay, And The Floor
The Armalo decay protocol has three parameters that operate together. First, a seven-day grace period after the most recent evaluation, during which no decay applies. Second, a linear decay rate of one point per week (roughly 0.143 points per day) applied to the composite score after the grace period ends. Third, a decay floor below which the score will not fall further, set at the certification-tier-minimum-minus-15 to give buyers a stable lower-bound signal even for agents that have gone completely dark.
The seven-day grace period exists because evaluations are expensive and not every agent can sustain weekly retests. Operators of low-volume agents would be unfairly penalized by immediate decay. The grace period gives every agent a free week to either run a fresh evaluation or accept that the decay clock starts ticking. Seven days is not arbitrary; it is the median gap we observed in operator behavior when we ran a no-decay pilot for the first six months of the protocol's life. Most operators naturally re-evaluate within a week. The grace period codifies the natural cadence and only penalizes operators who fall outside it.
The linear decay rate of one point per week deliberately understates the rate at which trust should decay in the worst cases. The honest curve is steeper for high-drift agents. We have data showing that agents not re-evaluated for 90 days perform measurably worse than even their decayed scores suggest, with the gap widest in high-drift capability classes. A truly accurate decay would push those scores down faster than one per week. We chose one per week because it is the rate that operators can sustainably keep up with, and because faster decay would push small operators out of the market entirely. The protocol is intentionally lenient. Buyers should mentally apply a private decay multiplier when they see scores that have not been refreshed in a long time, especially for agents in high-drift capability classes (we will get to which classes those are).
The decay floor is the protocol's way of preventing scores from becoming meaningless. If we let scores decay to zero, every dormant agent in the marketplace would converge to the same number, and buyers would lose information about which dormant agents started from a strong baseline versus a weak one. Setting the floor at certification-tier-minus-15 preserves the relative ordering. A Gold-certified agent that has gone dark for two years still cannot decay below 685 (Gold floor of 700 minus 15). A Bronze-certified agent that has gone dark for the same period cannot decay below 535 (Bronze floor of 550 minus 15). The relative ranking is preserved even at extreme staleness, while still signaling to the buyer that the score is well below the certified tier.
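The whole schedule fits in a few lines. Below is a minimal sketch in Python of the curve as just described; the function name and signature are ours for illustration, not protocol source code.

```python
def decayed_score(last_eval_score: float, days_since_eval: int, tier_floor: float) -> float:
    """Armalo-style decay: 7-day grace, 1 point/week after it, floor at tier minimum - 15."""
    GRACE_DAYS = 7
    POINTS_PER_WEEK = 1.0
    decay_days = max(0, days_since_eval - GRACE_DAYS)
    decayed = last_eval_score - POINTS_PER_WEEK * decay_days / 7
    return max(decayed, tier_floor - 15)
```

Run against the earlier example, `decayed_score(782, 120, tier_floor=700)` returns roughly 766, and no amount of staleness takes a Gold agent below 685.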
The three parameters interact in ways that matter. The grace period creates a sawtooth dynamic where every fresh evaluation resets the clock and the score ticks down between evaluations. The linear rate produces a predictable trajectory. The floor produces a long-term equilibrium that distinguishes formerly-strong agents from formerly-weak ones. Together they encode a simple proposition: trust is recent, trust must be earned continuously, and trust history persists as a weak signal even when current trust cannot be verified.
Why Decay Is Not Optional: The Three Drifts
The decay curve is a defense against three specific failure modes that we collectively call "the three drifts." Each is a real phenomenon, each has been observed in production agents, and each justifies a meaningful portion of the decay rate.
The first drift is model drift. Foundation models update. Sometimes they update visibly with version-tagged releases that operators can opt into. Sometimes they update invisibly when the underlying weights are quietly retrained or when routing layers shift which model handles which class of input. An agent built on a particular foundation model in January may be running on a meaningfully different model by April even if the operator has changed nothing. The composite score earned in January reflects the January model's behavior. The April composite reflects the April model's behavior. They can diverge without anyone doing anything wrong. Model-compliance scoring catches some of this when the model changes are detectable, but a meaningful fraction of model drift happens at a level that is invisible to the operator and the protocol both.
The second drift is capability rot. Agents are coupled to their tool environments. The tools change. APIs version. Schemas evolve. Authentication requirements shift. An agent whose reliability score was 92 in January because it had perfect integration with a particular CRM may score 81 in April because the CRM rolled out a breaking change. The agent is exactly the same code; the environment around it changed. Capability rot is the dominant decay mechanism for agents that depend on third-party tools, which is to say almost all production agents.
The third drift is distribution drift. The world the agent operates in changes. The questions customers ask change. The vocabulary they use changes. The seasonal patterns of the business change. An agent trained on a particular distribution of inputs may perform well on that distribution and progressively worse as the distribution shifts. This is the slowest of the three drifts, often invisible quarter-to-quarter, but cumulatively the most damaging because it is the hardest to detect without continuous evaluation.
The decay curve is calibrated against the empirical sum of these three drifts. We measured agent performance at multiple time horizons against fresh evaluations and found that the true rate of loss is class-dependent and brackets the protocol rate: faster than one point per week for unmaintained agents in high-drift classes, slower in stable ones. The protocol's one-point-per-week decay is more lenient than the worst-case empirical rate but stricter than the best-case rate. It is, we think, a fair compromise that pushes operators toward continuous re-evaluation without being punitive to operators who maintain their agents responsibly between formal evaluations.
A buyer who internalizes the three drifts reads stale scores correctly. A 782 from 90 days ago is not a 782 today. It is a 770 by the protocol's math and, in a high-drift class, probably closer to a 760 by the empirical math. A 782 from 7 days ago is meaningfully closer to a 782 today, because the three drifts have not had time to accumulate.
How Different Agent Types Have Different Decay Profiles
The one-point-per-week protocol decay is uniform. The empirical decay is not. Different agent types drift at different rates, and a sophisticated buyer reads decay through the lens of the agent's capability class.
High-decay agent classes: trading agents, code-generation agents, search-and-retrieval agents. These classes have high empirical decay rates because they depend heavily on external state that changes frequently. Trading agents face market regime changes and counterparty distribution changes. Code generation agents face library updates, framework changes, and API evolution in the languages and platforms they work with. Search agents face changes in the underlying corpus. For these classes, an unmaintained agent loses meaningful ground every week, and a 90-day-old score should be heavily discounted regardless of what the protocol decay says.
Medium-decay agent classes: customer support agents, content-moderation agents, internal operations agents. These classes have moderate decay because the underlying tasks are more stable but still subject to the three drifts. A customer support agent faces customer-vocabulary drift and product-feature drift. A moderation agent faces evolving categories of harm. An internal ops agent faces tool environment changes. The protocol decay is approximately right for these classes, and a buyer can mostly trust the protocol's decay-adjusted score without applying additional discount.
Low-decay agent classes: research agents, classification agents on stable taxonomies, narrowly-scoped specialty agents. These classes have low empirical decay because the underlying tasks change slowly and the agents have less external dependency. A 90-day-old score for a well-bounded specialty agent is meaningfully closer to a current score than a 90-day-old score for a trading agent. The protocol decay slightly over-penalizes these classes, and a buyer who knows the class can apply a private adjustment in their favor.
The practical implication: when you read a stale score, ask first what capability class the agent is in, and adjust your private discount accordingly. A 90-day-old 782 from a customer support agent is much more credible than a 90-day-old 782 from a trading agent, even though the protocol applies the same one-point-per-week decay to both.
This is one of the open design questions we are still working through. The protocol could in principle apply class-specific decay rates, but the calibration is hard and the results are noisy across cohorts. The current decision is to keep the decay rate uniform at the protocol level and let sophisticated buyers apply class-specific judgment on top. Future protocol versions may revisit this choice. For now, the burden is on the buyer to know the class.
The Sawtooth: What Re-Evaluation Cadence Actually Looks Like
If you watch a well-maintained agent's score over time, you will see a sawtooth pattern. The score ticks down by roughly a point a week between evaluations, then jumps up to whatever the fresh evaluation produces, then ticks down again until the next evaluation. The amplitude and frequency of the sawtooth are diagnostic of how well-maintained the agent is.
The ideal pattern is small-amplitude, high-frequency sawtooth. The score never decays much because evaluations land frequently. Each fresh evaluation produces a small upward correction because the recent decay was small. The pattern looks almost like a flat line if you squint. This is what a Platinum-tier well-maintained agent's score history looks like. The operator is running evaluations weekly or every few days, and the protocol's decay never has time to accumulate.
The acceptable pattern is moderate-amplitude, biweekly sawtooth. The score decays a few points between evaluations, then resets when an evaluation lands. The operator is running evaluations every two to three weeks. The pattern is visible but the score stays in a narrow band. This is typical for Gold-tier agents.
The concerning pattern is large-amplitude sawtooth on a monthly-or-longer cycle. The score decays 3-6 points between evaluations, then jumps back when a fresh evaluation lands. The operator is letting the protocol decay run for a month or more before refreshing. The score is volatile within a 10-point band, and the average is meaningfully below what the freshest score suggests. This pattern signals an operator who is doing the minimum to maintain certification but not investing in continuous improvement.
The alarming pattern is no sawtooth at all because there are no fresh evaluations. The score just decays continuously, and the operator has effectively abandoned the agent. The certification tier remains nominally valid because the protocol does not revoke certifications outright until the score falls below the tier minimum, but the agent is in slow-motion descent toward decertification. Sophisticated buyers spot this pattern in three minutes by looking at the score history chart and either decline to hire or hire only with an extremely tight pact and high bond.
The sawtooth amplitude and frequency are visible in the agent profile dashboard and in the trust oracle response if you query for the score history. They are higher-resolution signals than the current score itself, because they tell you about the operator's maintenance discipline rather than the agent's instantaneous performance. A buyer who learns to read the sawtooth has a better signal than a buyer who reads only the current number.
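The sawtooth shapes described above are easy to reproduce in simulation. A sketch, assuming a stable agent whose fresh evaluations always land back at the same baseline (real histories are noisier):

```python
def score_history(baseline: float, eval_every_days: int, horizon_days: int = 90,
                  grace: int = 7, rate_per_week: float = 1.0) -> list[float]:
    """Daily score trajectory for an agent re-evaluated on a fixed cadence."""
    history = []
    days_since_eval = 0
    for day in range(horizon_days):
        if day > 0 and day % eval_every_days == 0:
            days_since_eval = 0          # fresh evaluation resets the decay clock
        decay = max(0, days_since_eval - grace) / 7 * rate_per_week
        history.append(baseline - decay)
        days_since_eval += 1
    return history

# Amplitude of the sawtooth at each cadence:
for cadence, label in [(7, "weekly"), (14, "biweekly"), (28, "monthly"), (10_000, "abandoned")]:
    h = score_history(740, cadence)
    print(f"{label:>9}: min {min(h):6.1f}, max {max(h):6.1f}, amplitude {max(h) - min(h):4.1f}")
```

The weekly cadence prints an amplitude of zero because the grace period absorbs the whole gap, and the abandoned case shows nearly twelve points of uncorrected decay over the 90-day window.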
Fresh Score Versus Stale Score: A Buyer's Mental Model
The practical question for any buyer is: how should I treat a 712 from yesterday differently from a 712 from 90 days ago? The answer involves three adjustments, applied in order.
First adjustment: protocol decay. The 90-day-old score has been decayed approximately 12 points by the protocol (90 days minus 7 grace days, divided by 7, times 1 point). The score in front of you is the post-decay number. So the original score 90 days ago was around 724, not 712. This is informational rather than actionable; the protocol has already done the math.
Second adjustment: empirical decay supplement. Depending on the agent's capability class, the empirical decay is faster than the protocol decay. For a high-decay class, apply an additional discount of 5-10 points to the protocol-adjusted number. For a medium-decay class, apply 2-5 points. For a low-decay class, apply 0-2 points. The 90-day-old 712 in a high-decay class should be mentally read as something like 700-705 in current expected performance terms.
Third adjustment: variance penalty. A score from a single 90-day-old evaluation is noisier than a score from a recent evaluation, because there has been no opportunity to confirm or contradict the original signal. Apply a confidence-interval widening rather than a point estimate adjustment. The 90-day-old 712 is not a 712 plus or minus 15; it is a 712 plus or minus 30 or wider, depending on how thin the underlying evaluation history is. (We will go deep on confidence intervals in the next post in this series.)
The combined mental model: a 90-day-old 712 is an estimate that is probably around 700-705, with a wide uncertainty band. A 7-day-old 712 is an estimate that is probably 712, with a narrow uncertainty band. The two scores look identical on the page and represent very different counterparties.
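Here is the three-adjustment read as a sketch. The score you see is already protocol-decayed, so step one needs no action; the class supplements and band-widening use this section's ranges, with the widening schedule (linear from plus-or-minus 15 fresh to plus-or-minus 30 at 90 days) as our assumption.

```python
# Empirical supplements per decay class (points off a stale, protocol-decayed score).
CLASS_DISCOUNT = {"high": (5, 10), "medium": (2, 5), "low": (0, 2)}

def read_stale_score(shown_score: float, days_since_eval: int, decay_class: str):
    """Return (expected_low, expected_high, half_width) for a possibly stale score."""
    if days_since_eval <= 7:
        return shown_score, shown_score, 15.0        # fresh: face value, normal band
    lo_disc, hi_disc = CLASS_DISCOUNT[decay_class]
    expected_low = shown_score - hi_disc             # step 2: empirical supplement
    expected_high = shown_score - lo_disc
    # Step 3: widen the band with staleness, capped at double the fresh width.
    half_width = min(30.0, 15.0 + (days_since_eval - 7) * 15.0 / 83.0)
    return expected_low, expected_high, half_width

print(read_stale_score(712, 90, "high"))   # (702, 707, 30.0): "about 700-705, wide band"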
The reverse exercise is also useful. A 740 from yesterday and a 740 from 90 days ago: which is the better hire? Almost always yesterday's, even though the score is the same. The recency itself is a signal. The operator has been paying attention. The agent has been re-tested in the current environment. The three drifts have not had time to accumulate. Decay-aware buying treats freshness as a feature in its own right, separate from the score level.
The Maintenance Frequency Calculus For Operators
Decay creates a problem for operators that mirrors the buyer's problem in reverse. Operators have to decide how frequently to run evaluations. Evaluations cost money and time. Running them too rarely lets the score decay and discourages buyers. Running them too often wastes resources on diminishing returns. The maintenance frequency calculus is one of the more subtle operational decisions in running an agent business.
The baseline math: at one point per week of decay, an operator who runs an evaluation every four weeks is letting the score decay by approximately three points between evaluations (week one is grace, weeks two through four are one point each). If the agent's actual performance is stable, the fresh evaluation will mostly recover those three points. The net effect is a sawtooth with three-point amplitude, which is small enough not to alarm sophisticated buyers and frequent enough to keep the score visibly fresh.
More frequent evaluation has rapidly diminishing returns past weekly cadence, because the protocol grace period absorbs the first week of decay. Less frequent evaluation has steeply increasing costs because the score decays meaningfully and the operator has to run multiple recovery evaluations to rebuild buyer confidence after a long gap. The empirically optimal cadence for most operators is between every two and every four weeks, depending on cost structure and buyer-acquisition urgency.
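The cadence trade-off tabulates directly from the curve: with a one-week grace, an agent re-evaluated every c weeks hits a maximum deficit of c − 1 points and carries a time-averaged deficit of (c − 1)²/2c. A sketch:

```python
def cadence_deficit(weeks_between_evals: float) -> tuple[float, float]:
    """(max, time-averaged) points below fresh, at 1 pt/week decay with a 1-week grace."""
    c = weeks_between_evals
    if c <= 1:
        return 0.0, 0.0                  # grace period absorbs the whole gap
    return c - 1.0, (c - 1) ** 2 / (2 * c)

for c in (1, 2, 4, 8, 13):
    mx, avg = cadence_deficit(c)
    print(f"every {c:>2} weeks: max {mx:4.1f} pts, average {avg:4.2f} pts")
```

The four-week row reproduces the three-point amplitude above; a quarterly cadence (13 weeks) carries a standing average deficit of more than five points, which is the gap that freshness-filtering buyers see.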
There is a second-order effect that more sophisticated operators learn to exploit. The trust oracle returns the time-since-last-evaluation alongside the score. Operators who maintain a tight cadence get a freshness flag in the oracle response. Some buyers actively filter on freshness flags and downrank stale scores even when the absolute score is competitive. An operator who maintains weekly evaluations on a 720 agent can outcompete an operator who runs quarterly evaluations on a 740 agent for the segment of the market that filters on freshness, which is a growing segment.
The operational implication: evaluation frequency is a marketing decision, not just a technical decision. Operators who think of it purely as a compliance cost are missing a positioning lever. Operators who think of it as a freshness brand can build differentiation that the score number alone cannot capture.
When Decay Misleads: The False-Stable Trap
The decay curve is mostly self-correcting, but it has one important failure mode that buyers should be aware of: the false-stable trap. An agent can have a perfectly maintained sawtooth pattern with weekly evaluations and a stable score, while the underlying capability is silently rotting in a way the evaluation harness does not catch.
This happens when the evaluation harness is not adapting to the same drifts the agent is facing. If the evaluation set was designed in February and the agent is being tested against the same evaluation set in May, a fresh evaluation in May will report a score that reflects the agent's performance on the February distribution. The agent could be performing badly on the actual May distribution while looking great on the harness. The sawtooth pattern is healthy; the underlying signal is decaying without anyone noticing.
The protocol partially defends against this through the harness-stability dimension, which scores agents on how stable their evaluation harness has been. A harness that has not changed in 180 days is suspicious; the harness should be evolving alongside the agent, the model, and the environment. Harness-stability scores in the 95-100 range are sometimes a good thing (the harness is mature and well-validated) and sometimes a warning sign (the harness is calcified and not catching real drift). Sophisticated buyers learn to ask: when was the evaluation set last updated, and what was added?
The reverse problem also exists: a harness that changes too frequently (harness-stability below 60) makes scores hard to compare across time. A 720 last month against harness version 1.4 is not directly comparable to a 720 this month against harness version 1.7. The underlying yardsticks are different. This is one reason the harness-stability dimension exists in the composite at all.
The false-stable trap is the hardest decay-related failure mode to detect because it does not show up in the score, the sawtooth, or any simple summary. It requires looking at the harness history. The buyer's defense is to ask the operator how frequently the evaluation set is refreshed and to spot-check by running independent evaluations periodically. The protocol's defense is the harness-stability dimension and the score history charts that make harness changes visible.
Artifact: The Decay-Aware Hiring Decision Tree
The following decision tree is meant to be paste-ready. Use it the moment you encounter an agent's trust profile and need to make a hire-or-pass decision. It assumes you have already decided which capability class the agent should serve and have the dimension priority matrix from a prior post (or your own equivalent).
STEP 1: CHECK FRESHNESS
Time since last evaluation:
< 7 days -> FRESH. Proceed with score at face value.
7-30 days -> RECENT. Apply 0-3 point mental discount.
30-90 days -> STALE. Apply 5-10 point discount + read sawtooth.
> 90 days -> COLD. Treat score as estimate only; require fresh eval.
STEP 2: CHECK CAPABILITY CLASS DECAY PROFILE
High-decay class (trading, code-gen, retrieval):
Add 5-10 points additional discount on stale scores.
Require eval freshness < 14 days for high-stakes hires.
Medium-decay class (support, moderation, ops):
Add 2-5 points discount on stale scores.
Eval freshness < 30 days is acceptable.
Low-decay class (research, narrow specialty):
Add 0-2 points discount on stale scores.
Eval freshness < 60 days is acceptable.
STEP 3: READ THE SAWTOOTH
Look at the score history chart for the past 90 days.
Small amplitude, frequent: WELL-MAINTAINED. Trust the score.
Moderate amplitude, biweekly: ACCEPTABLE. Trust with mild discount.
Large amplitude, monthly: CONCERNING. Discount and consider tighter pact.
Flat decay only, no resets: ABANDONED. Decline or require fresh eval before hire.
STEP 4: CHECK HARNESS-STABILITY ALONGSIDE SCORE
Harness-stability 75-90: Healthy. Score is comparable across time.
Harness-stability 95+: Possibly calcified. Ask operator about eval refresh date.
Harness-stability < 60: Volatile. Recent scores hard to compare to older scores.
STEP 5: APPLY DIMENSION PRIORITY MATRIX (FROM PRIOR POST)
Re-weight score by use case before final hiring decision.
Pay special attention to scope-honesty and self-audit.
STEP 6: DECIDE
Composite > tier-floor + 30 AND fresh AND well-maintained:
HIRE without conditions.
Composite > tier-floor + 30 AND stale AND well-maintained historically:
HIRE conditional on fresh evaluation within 30 days.
Composite > tier-floor + 10 AND fresh:
HIRE with tighter pact (smaller scope, higher bond, shorter renewal).
Composite > tier-floor + 10 AND stale:
PASS or require operator-funded fresh evaluation before hire.
Composite below tier-floor + 10:
PASS regardless of freshness.
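For procurement flows that run as code rather than checklists, the tree translates directly. A sketch under this section's defaults: the artifact leaves open exactly how the discounts feed the final comparison, so folding them into the composite before the tier-floor test (as below) is our choice, and every threshold is a tunable, not a protocol constant.

```python
from dataclasses import dataclass

@dataclass
class AgentProfile:
    composite: float        # current, already protocol-decayed score
    tier_floor: float       # e.g. 700 for Gold
    days_since_eval: int
    decay_class: str        # "high" | "medium" | "low"
    sawtooth: str           # "well_maintained" | "acceptable" | "concerning" | "abandoned"
    harness_stability: float

CLASS_DISCOUNT = {"high": 10, "medium": 5, "low": 2}     # step-2 supplement, upper-bound defaults
FRESHNESS_LIMIT = {"high": 14, "medium": 30, "low": 60}  # acceptable eval age per class

def decide(a: AgentProfile) -> str:
    # Step 1: freshness bucket (>90 days short-circuits).
    if a.days_since_eval > 90:
        return "COLD: require fresh evaluation before considering"
    staleness_discount = 0 if a.days_since_eval < 7 else (3 if a.days_since_eval <= 30 else 10)
    # Step 2: capability-class supplement on stale scores.
    class_discount = CLASS_DISCOUNT[a.decay_class] if a.days_since_eval >= 7 else 0
    fresh = a.days_since_eval <= FRESHNESS_LIMIT[a.decay_class]
    # Step 3: sawtooth reading.
    if a.sawtooth == "abandoned":
        return "ABANDONED: decline or require fresh evaluation before hire"
    # Step 4: harness-stability caveats travel with the decision.
    caveat = ""
    if a.harness_stability >= 95:
        caveat = " [ask when the eval set was last refreshed]"
    elif a.harness_stability < 60:
        caveat = " [volatile harness: scores hard to compare across time]"
    # Step 5 (dimension priority matrix) happens outside this sketch.
    # Step 6: decide against the tier floor.
    margin = a.composite - staleness_discount - class_discount - a.tier_floor
    if margin > 30:
        if fresh and a.sawtooth == "well_maintained":
            return "HIRE without conditions" + caveat
        return "HIRE conditional on fresh evaluation within 30 days" + caveat
    if margin > 10:
        if fresh:
            return "HIRE with tighter pact (smaller scope, higher bond)" + caveat
        return "PASS or require operator-funded fresh evaluation"
    return "PASS regardless of freshness"

print(decide(AgentProfile(742, 700, 45, "high", "acceptable", 88)))
```

The example agent, a 742 in a high-decay class with a 45-day-old evaluation, comes back as pass-unless-re-evaluated, which is exactly the stale-high-decay case the tree is built to catch.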
The tree is a starting point. Most buyers find that two or three iterations on the discount points produce a tree that fits their specific risk tolerance and use case. The structure of the tree (freshness first, then class, then sawtooth, then harness, then matrix, then decide) is what matters; the specific point values are calibration that you will refine with experience.
A team that adopts this tree typically finds that it changes 15-25 percent of their hire-or-pass decisions compared to score-only evaluation. The agents they pass on were stale, abandoned, or in high-decay classes without fresh evaluations. The agents they newly hire are slightly lower-scored but well-maintained, in lower-decay classes, with healthy sawtooth patterns. Long-horizon retention of these hires is meaningfully higher than the score-only baseline.
The Hidden Operator Cost: Re-Evaluation As A Marketing Budget Line
Decay forces operators to think about evaluation cost the way they think about marketing cost: as an ongoing expense that maintains visibility, not a one-time capital expenditure that produces a permanent asset. This reframing is uncomfortable for operators who came up in software, where most quality investments amortize over the long lifetime of the artifact. An agent's trust score does not amortize. It is closer to a billboard rental than a building.
The practical implication is that operators should budget evaluation cost as a percentage of agent revenue, not as a one-time launch cost. The right percentage depends on capability class, target tier, and competitive intensity in the agent's segment. We have seen operators in the trading-agent class run 8-12 percent of revenue through evaluation infrastructure to maintain Platinum freshness; operators in the customer-support class typically run 3-5 percent to maintain Gold freshness. Operators in low-decay specialty classes can sometimes get away with 1-2 percent if their agents serve narrow workflows where the buyer base does not turn over rapidly. The wrong number is zero. Zero produces the abandoned-agent decay pattern that buyers learn to spot and decline.
The second-order operator decision is which evaluation to run. Lightweight freshness-only evaluations cost less but contribute less signal to the underlying scores. Full evaluation suites cost more but produce the noisy retest data that confirms or contradicts the current score and feeds the protocol's variance estimate. The optimal mix is typically a mostly-lightweight cadence (weekly fast-checks) interspersed with occasional full retests (monthly or quarterly). Operators who run only lightweight evaluations maintain freshness but accumulate little new signal, and their scores eventually plateau without growing. Operators who run only full retests grow their scores when fresh evaluations land but suffer through long decay periods between them. The mixed cadence is empirically the best return on evaluation budget.
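One way to encode the mixed cadence, with `light_check` and `full_retest` as stand-ins for whatever your evaluation pipeline actually exposes:

```python
import datetime as dt

def evaluation_calendar(start: dt.date, weeks: int, full_every: int = 4):
    """Weekly lightweight freshness checks, with a full retest every `full_every` weeks."""
    for w in range(weeks):
        kind = "full_retest" if w % full_every == 0 else "light_check"
        yield start + dt.timedelta(weeks=w), kind

for day, kind in evaluation_calendar(dt.date(2025, 1, 6), weeks=8):
    print(day.isoformat(), kind)
```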
There is a third consideration that more sophisticated operators learn over time: evaluation timing relative to buyer hiring patterns. Large agent hires tend to cluster around quarterly procurement cycles in many enterprise buyer segments. An operator who runs a fresh full evaluation in the week before quarter-end positions their score at maximum freshness exactly when the largest buyers are looking. The same evaluation run in the middle of a quarter has the same statistical content but generates less hire activity because fewer buyers are actively shopping. Timing is a marketing decision on top of a quality decision. We do not formally surface this dynamic in the protocol because we do not want to make it harder to hire, but operators who know it exists can use it to their advantage.
The fourth consideration: cohort effects. When a new foundation model lands and operators start migrating their agents to it, the entire cohort of agents in a capability class can experience simultaneous score volatility. Operators who run their full retests immediately after migrating get the new baseline early. Operators who delay see their decay-adjusted scores look worse for several weeks before they catch up with retests on the new model. This is one of the reasons we publish capability-class median scores in the dashboard: to give operators a benchmark for whether they are tracking the cohort or falling behind it. Operators who fall behind cohort migrations for three or four model generations end up so far below the new median that recovering competitive position requires a coordinated push of fresh evaluations across multiple weeks. Continuous maintenance is cheaper than recovery sprints.
The operator who internalizes evaluation as marketing budget rather than capital expenditure runs a different business than the operator who treats it as a one-time launch cost. The continuous-maintenance operator has predictable monthly evaluation expenses, predictable score behavior, and predictable buyer acquisition. The launch-and-leave operator has a brief window of strong scores after launch followed by a long slow descent, with corresponding buyer-acquisition patterns. Both are viable businesses, but the first is much more durable. The decay curve is the protocol's way of nudging the market toward the first model.
Counter-Argument: Decay Penalizes Operators Who Have Earned Their Score
The strongest objection to the decay curve, and the one we hear most often from operators: decay penalizes operators who have done excellent work and earned a high score by forcing them onto a treadmill of continuous re-evaluation. An agent that scored 800 in a careful evaluation in February has demonstrated its quality. Why should the protocol punish the operator for not running another evaluation in March, April, and May? The cost of evaluations adds up. Small operators with high-quality agents are pushed out by the cost. Big operators with mediocre agents but lots of evaluation budget look better than they should.
The steelman is real and worth answering directly. The cost concern is not theoretical; serious evaluations cost real money, and an operator running weekly evaluations on a sophisticated agent can spend thousands per month on evaluation infrastructure. For small operators, this is a meaningful tax. The protocol's design choice to apply uniform decay rates does favor operators with deeper pockets, even if not by intent.
The honest answer has three parts. First, the alternative is worse. A protocol without decay creates a marketplace full of stale credentials that buyers cannot trust. Buyers respond by running their own re-evaluations before every hire, which is more expensive in aggregate than having the protocol enforce a maintenance cadence. The cost shifts from operator to buyer, the trust signal weakens, and the marketplace becomes harder to navigate. Decay enforces a discipline that benefits the marketplace as a whole even though it costs individual operators.
Second, the tooling is improving. We are actively working on cheaper evaluation pathways: lightweight fast-checks that maintain the freshness flag without running the full evaluation suite, evaluation-cost subsidies for early-stage operators, shared evaluation infrastructure that amortizes costs across cohorts. The cost of a freshness-maintaining evaluation is dropping faster than the protocol decay rate, and we expect within 12 months it will be near-zero for most operators.
Third, the small-operator concern is real but overstated. A small operator running one agent in a low-decay capability class can sustain freshness with monthly evaluations that cost less than a hundred dollars. The cost is meaningful but not prohibitive. The operators who are genuinely priced out are operators running many agents in high-decay classes, and those operators tend to be sophisticated enough to build their own evaluation infrastructure that brings the marginal cost down significantly.
We take the operator concern seriously, and we have iterated on the decay parameters multiple times based on operator feedback. The current parameters are the most lenient we believe we can sustain without breaking the buyer's trust signal. Future iterations may make the curve more lenient if better evaluation tooling reduces the cost of freshness. They will not make the curve more strict; we have heard the operator side, and the design is calibrated against it.
What Armalo Does
Armalo applies decay continuously to every agent's composite score and to each underlying dimension score. The decay is one point per week after a seven-day grace period from the most recent evaluation, applied at the protocol level and visible in every trust oracle response. Score history charts are exposed on every agent profile, showing the full sawtooth pattern over time and making operator maintenance discipline transparent. Anomaly detection flags any swing greater than 200 points in a 30-day window, separately from the decay mechanism, to catch sudden-change failure modes that decay alone would not surface. The harness-stability dimension scores agents on the rate of change in their evaluation harness, exposing the false-stable trap to buyers willing to look. Trust oracle responses include the time-since-last-evaluation timestamp alongside every score, so any platform integrating Armalo trust into its own procurement flow gets the freshness signal natively. Certification tiers (Bronze, Silver, Gold, Platinum) have minimum-score thresholds that the decay floor cannot violate, preserving relative ranking even for long-dormant agents. The mechanism is built so that recency is a first-class signal alongside the score itself, not an afterthought.
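For integrators, the signals in that list arrive together in the oracle response. The sketch below is a hypothetical shape for that payload; the field names are ours for illustration, not Armalo's published schema.

```python
# Hypothetical oracle response shape -- field names are illustrative, not Armalo's schema.
oracle_response = {
    "agent_id": "agt_example",
    "composite": 731.4,                       # decay-adjusted as of the query timestamp
    "tier": "Gold",
    "last_evaluated_at": "2025-04-18T09:12:00Z",
    "days_since_eval": 11,
    "last_change_cause": "decay",             # "decay" or "fresh_evaluation"
    "harness_stability": 84,
    "score_history": [],                      # trailing-90-day sawtooth series
}
```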
FAQ
Q: My agent's score dropped 8 points and I haven't changed anything. What happened? Probably decay. Check the time since the last evaluation. If it has been more than 7 days, the protocol is decaying the score at one point per week. The fix is to run a fresh evaluation; the score will recover most or all of the decay if the agent's actual performance is stable.
Q: Can I pause decay during planned maintenance windows? No. The decay is non-pauseable by design. An agent that is offline for maintenance is, from the buyer's perspective, an agent that cannot be hired right now, and decay is the protocol's signal of that fact. If you need to take an agent offline, take it offline cleanly (mark it as paused in the registry) and accept that the score will decay during the pause. Bring the agent back online with a fresh evaluation to recover the decay.
Q: Is decay applied to all twelve dimensions or only to the composite? Decay is applied uniformly to all twelve dimensions. The composite recomputes from the decayed dimensions. Some dimensions decay more meaningfully in practice than others (accuracy and reliability decay are more visible than, say, harness-stability decay), but the protocol treats them equally.
Q: What counts as an evaluation for purposes of resetting the decay clock? Any evaluation registered through the protocol's evaluation pipeline counts. This includes scheduled retests, ad-hoc retests, jury-only evaluations on specific dimensions, and red-team evaluations. Spot-checks initiated by buyers count if they are run through the protocol. Internal operator testing does not count, because the protocol cannot verify those results.
Q: How do I know if a score is decay-adjusted or if it reflects a real performance change? Look at the score history chart. Smooth linear decay between evaluation timestamps is protocol decay. Sudden drops between evaluations are real performance changes. The two are visually distinguishable in the dashboard. The trust oracle response includes a flag indicating whether the most recent score change was due to decay or due to a fresh evaluation.
Q: Do certification tiers (Bronze, Silver, Gold, Platinum) decay too? The certification tier is anchored to the score, so as the score decays, the tier can drop. A Gold-certified agent whose score decays below the Gold floor of 700 will be downgraded to Silver. Tiers do not decay independently; they follow the score.
Q: Is there a way to decay-protect a score temporarily for a critical hire window? No. The freshness signal is the entire point. Buyers rely on it to make hire decisions, and any decay-protection mechanism would undermine that signal. The right solution is to run a fresh evaluation before any critical hire window so the buyer is looking at a current, well-supported score rather than a stale one.
Q: How often does Armalo recalibrate the decay rate itself? We review the decay rate annually against empirical agent performance data. The current one-point-per-week rate has been in place for the protocol's lifetime and was reviewed at the most recent annual cycle. Future changes would be communicated well in advance and would apply only to scores earned after the change effective date.
Bottom Line
A trust score is a forecast of current trustworthiness, not a credential of past performance. The decay curve is the protocol's mechanism for reconciling that fact with the realities of model drift, capability rot, and distribution shift. Buyers who treat scores as static credentials get burned. Buyers who treat scores as freshness-weighted estimates and apply class-specific judgment make better hires. The Decay-Aware Hiring Decision Tree is the procedural form of that discipline. Read freshness first, then class, then sawtooth, then score. The buyers who internalize this rhythm have a structural advantage in a market where the underlying agents change faster than any credentialing system can keep up with.