Evaluation Drift: When The Judge Models Get Smarter Faster Than The Defendant Models
An agent's score can drop by tens of points without the agent changing, because the judges got better at noticing flaws. Here is how to disentangle agent drift from judge drift.
TL;DR
When multi-LLM juries score agents, the score is a function of two moving things: the agent's actual quality and the judge models' ability to detect quality issues. Judge models improve faster than the median deployed agent, which means an agent's score can drift down by tens of points without the agent changing at all. This is evaluation drift, and it is the biggest source of unexplained score volatility in agent-economy systems today. This essay defines the four kinds of drift, builds the math for decomposing observed score change into agent and judge components, introduces the Judge-Versioned Score Spec as a reader artifact, and shows how Armalo's score-history table and judge-version pinning preserve year-over-year score comparability.
A Score Cratered, And No One Could Explain Why
A mid-sized customer-experience agent operating on the Armalo network had been at a Gold-tier composite score of 81 for nine months. Operators trusted it. Marketplace listings priced its work as Gold. Then in a single quarterly re-evaluation, the score dropped to 67. The agent's operators reviewed the change logs: nothing material had shipped on the agent side in the past quarter. Same prompts, same retrieval system, same model snapshot pinned to the same version. The eval traces showed degraded jury verdicts on the same kinds of cases that had previously scored well. The reasoning critiques were sharper, more specific, and more cutting. The verdicts were not unreasonable. The agent really did have the flaws the judges were now flagging. The flaws had always been there. The judges had just gotten better at finding them.
This is evaluation drift. It is the observation that when you score an agent with a panel of LLM judges, and the judges are themselves periodically updated to newer model versions with sharper critique capabilities, the score you assign the agent is not a stable measurement of the agent. It is a measurement of the agent through the lens of the current judges. As the lens improves, the measurement changes, even when the agent does not. Agents that were Gold yesterday become Silver today, not because they degraded but because the world's evaluation capacity matured. This is the dual to the more familiar problem of agent drift, where an agent silently degrades over time and the score correctly catches it. Judge drift is the inverse: the judges get sharper and the score correctly reports a quality that was always present but previously invisible.
The operational damage is severe. Operators procure agents based on tier. Tiers are based on composite scores. If composite scores drift downward systematically because judges are improving, then last year's Gold-tier agents are this year's Silver-tier agents, and the entire procurement market has to repeatedly re-evaluate decisions that were correct at the time they were made. Marketplace pricing, escrow bond sizing, and counterparty selection all depend on score stability. If the score drifts for reasons unrelated to agent change, the entire economic substrate of the agent economy becomes harder to reason about. Operators stop trusting the scores. The scores stop being load-bearing. The trust layer collapses into noise.
This essay is the systematic treatment of evaluation drift. We will define the four kinds of drift, build the math for separating agent drift from judge drift in observed score series, introduce the Judge-Versioned Score Spec that makes year-over-year scores comparable, and show why any serious composite-scoring system must pin judge versions, version-stamp scores, and report drift decomposition openly. We will close with what Armalo does, where the choices are forced, where they are deliberate, and where they remain open research.
The stakes are simple. If you cannot tell whether your agent's score moved because the agent moved or because the judges moved, you cannot use the score to make decisions that depend on agent change. Every operational decision that does depend on it (retraining, retiring, repricing) will be made on bad information. The score becomes worse than useless; it becomes actively misleading. Evaluation drift is the biggest source of unexplained score volatility on the Armalo network in 2026, and it is the biggest source on every other agent evaluation network we have studied. Naming it, measuring it, and decomposing it is the precondition for trust in the score itself.
The Four Kinds Of Drift
Drift in evaluation systems comes in four shapes. They have different causes, different signatures, and different remedies. Lumping them together produces drift dashboards that show a number going up or down without telling you what to do about it. Separating them produces dashboards that route to interventions. The four are agent drift, judge drift, probe drift, and reporting drift.
Agent drift is the canonical case. The agent itself changes, through retraining, prompt updates, tool changes, retrieval changes, or behavioral fine-tuning, and its evaluated performance moves in response. This is the kind of drift evaluation systems are built to catch. When an agent ships an update that breaks a previously handled edge case, the score should fall. When an agent ships an update that improves a previously failed case, the score should rise. Agent drift is signal: it tells operators and consumers what changed in the artifact they are buying. The remedy for unwanted agent drift is operator action: roll back, fix, retest. Paradoxically, agent drift is also the kind least responsible for the score volatility most operators experience, because most agents change less often than people assume.
Judge drift is the inverse and the more disruptive. The judges, the LLMs that produce verdicts on agent outputs, get smarter, sharper, or differently-aligned over time. New model versions are released, and either the eval system upgrades to them, or the eval system stays on old versions and falls behind the state of the art. Either choice creates drift. Upgrading produces score changes for unchanged agents because the new judges find flaws the old ones missed. Not upgrading produces score stagnation that diverges from the actual quality landscape that downstream consumers care about. There is no neutral choice. The remedy is not to avoid judge drift but to measure it explicitly, version-stamp it, and offer scores in two forms: the contemporaneous score under current judges, and the as-of-X score under judge versions held constant.
Probe drift is the third. The questions, tasks, and scenarios used to test the agent change over time. Some change for legitimate reasons: probes get retired when they leak into training data, new probes get added to cover new failure modes, the probe distribution shifts to reflect changing operator concerns. Some change for illegitimate reasons: a probe set that worked well last year is silently swapped for an easier set this year, scores rise, and no one notices that the ground has shifted. Probe drift can be either signal or noise depending on whether the change is documented and whether the new probe set covers the same dimensions as the old. The remedy is probe versioning, with explicit notation of which probe-set version produced which score, and with periodic regression runs against fixed reference probe sets that never change.
Reporting drift is the fourth and most insidious. The aggregation method that produces the final composite score from sub-scores changes: the weights shift, the trim parameters move, the decay rate is adjusted. Each individual reporting change can be defensible, but the cumulative effect over years is that a score of eighty in 2026 may not mean what a score of eighty meant in 2024. Reporting drift makes longitudinal comparison impossible unless the scoring function itself is versioned and old scores are recomputable under the old function. The remedy is to treat the scoring function as a software artifact with semantic versioning, to publish the scoring function source, and to commit publicly to the rule that score history is preserved under the function version that produced it.
Most operators feel score volatility and assume the cause is one of these. The cause is usually all four, in proportions that depend on how the eval system is run. Agent drift produces a moderate amount of volatility that is interpretable. Judge drift produces large step changes when judge upgrades occur. Probe drift produces gradual shifts that look mysterious unless you look at the probe-set version. Reporting drift produces sudden discontinuities at the reporting-version boundary. A drift dashboard that does not separate these four leaves operators flying blind.
The core insight is that score change has a causal decomposition, and the eval system's job is to compute that decomposition and present it to the operator. Not just "score went down by twelve points" but "score went down by twelve points: four from agent drift, six from judge drift due to the GPT-Tau upgrade, two from probe rotation in Block C, zero from reporting changes." That decomposition is what lets the operator decide whether to act on the agent or to interpret the score change as a measurement-system update. Without the decomposition, every score change looks like an agent problem, and operators will either over-react to non-agent changes or learn to ignore real ones.
Why Judge Models Improve Faster Than Defendant Models
To understand why judge drift dominates the volatility budget, you have to look at the asymmetric improvement rates of frontier models versus the median deployed agent. Frontier judge models are the bleeding edge of the model release cycle. They are run by the eval system, which is incentivized to use the best available critique capability, and so they are upgraded as soon as new versions appear. The cycle time on judge model upgrades is on the order of months. Each upgrade brings sharper critique, better detection of subtle errors, more granular reasoning, and improved agreement with human expert judges.
Defendant agents are deployed in operator stacks. They run on whichever model the operator pinned to whichever version was current when the agent was built and validated. Operators do not upgrade defendant models on the same cycle as judge models, because operator upgrades require regression testing, prompt re-tuning, downstream system validation, and regulatory or contractual compliance reviews. The cycle time on defendant model upgrades is typically twelve to eighteen months in production environments and sometimes longer in regulated industries. The asymmetry is structural: judge upgrades are cheap to the eval system, defendant upgrades are expensive to the operator.
The arithmetic of the gap is sobering. If judges upgrade three times in a year and each upgrade adds five percent to the judges' fault-detection capability, detection capability rises by roughly fifteen percent over the year (closer to sixteen if the gains compound). If the defendant agent does not upgrade in that year and its score loss tracks detection capability one-for-one, the agent's score will fall by approximately fifteen percent through judge drift alone. This is not theoretical. Cross-checking Armalo score histories against judge-version timestamps, we find a mean judge-driven drift of negative seven points per year on agents at the Silver and Gold tiers between 2024 and 2026. Agents at Platinum drift less, around negative four points per year, because the cases that distinguish Platinum from Gold tend to be ones where even sharper judges still rate the agent highly. Bronze agents drift more, around negative ten to twelve points per year, because the marginal improvement in judge sharpness disproportionately catches more of their failure modes.
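The upgrade arithmetic can be checked in a few lines. This is a minimal sketch, not part of any Armalo tooling: the function name is hypothetical, and it only computes the judges' cumulative detection gain. Translating that gain into a score drop further assumes that score loss tracks detection gain one-for-one.

```python
from typing import Tuple


def cumulative_detection_gain(upgrades_per_year: int,
                              gain_per_upgrade: float) -> Tuple[float, float]:
    """Return (additive, compounded) fractional gain in judge fault detection."""
    # Additive: simply sum the per-upgrade gains.
    additive = upgrades_per_year * gain_per_upgrade
    # Compounded: each upgrade improves on the already-improved judge.
    compounded = (1.0 + gain_per_upgrade) ** upgrades_per_year - 1.0
    return additive, compounded


# Three upgrades a year at five percent each, as in the text.
additive, compounded = cumulative_detection_gain(3, 0.05)
print(f"additive: {additive:.1%}, compounded: {compounded:.1%}")
# prints "additive: 15.0%, compounded: 15.8%"
```

The two figures differ by less than a point at these rates, which is why the fifteen percent approximation is reasonable for a single year.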
The second asymmetry compounds the first. Judges are general-purpose models tested across broad capability benchmarks, and their improvements compound across many capability axes simultaneously. Defendant agents are usually specialized, and their improvements tend to be in narrow capability axes that match their use case. Even when defendant agents do upgrade, the upgrade tends to improve the things the agent already did well rather than fix the long-tail issues judges are getting better at finding. The judges find new flaws faster than the defendants fix old ones. The cumulative effect is that the score gap between unchanged agents and improving judges widens over time.
The third factor is alignment of judge and human expert evaluation. Judge models are trained, fine-tuned, and prompted to align with human expert judgments. As that alignment improves, judges catch the kinds of issues human experts catch, including subtle reasoning gaps, scope shortcuts, dishonest self-reports, and cosmetic-versus-substantive differences. Older judges missed these. Newer judges catch them. The effect on scores is large and concentrated on agents that previously got credit for outputs that looked right but had subtle issues. As judges become more like experts, agents that were tolerated by less expert judges get penalized, and the score drops.
The operational implication is that any eval system that runs frontier judges and reports scores naively will report a steadily declining score for any agent that does not also operate at the frontier. This is correct in the sense that the agent really is falling behind the state of the art. It is misleading in the sense that the agent has not changed; the world has. Operators need both numbers: the contemporaneous score that reflects current judge capability, and the version-pinned score that reflects what the score would have been under a frozen judge panel. Without both, operators cannot tell whether they need to upgrade the agent or whether the agent is fine and the score is just being measured against a moving target.
The deepest implication is for procurement. If an operator buys an agent on the basis of a Gold-tier score in Q1 and the score drops to Silver by Q4 due to judge drift, was the operator deceived? In the strictest sense, no: the score reflected the judges available at the time, and the judges have since improved. In the practical sense, yes: the operator made a decision on the assumption that score would be a reasonable predictor of forward agent performance, and the score has moved for reasons not under the agent's control. The market's response will be either to demand version-pinning and contractual stability, or to learn to interpret scores as snapshots that depreciate predictably. Either way, the eval system that does not transparently report judge drift is selling a number that operators will eventually stop trusting.
The Drift Decomposition Math
To decompose observed score change, you need a model of where score comes from. The simplest useful model treats the score as a function of three things: the agent's true latent quality, the judge panel's discrimination capability, and the probe set's distribution. Score equals f(quality, judges, probes), where f is the scoring function. To first order, drift in the observed score over a period is the sum of three contributions, one per input: the partial derivative of f with respect to that input times the change in the input, accumulated over the period. A change to f itself is the fourth contribution, reporting drift.
In practice, you do not have closed-form derivatives. You have observations. The decomposition is done empirically, by holding two of the three inputs constant and varying the third. This requires deliberate eval-system design. You need a baseline configuration of judges and probes, a test population of agents whose true quality is approximately known and stable, and the ability to re-run scores under counterfactual configurations.
The canonical decomposition runs as follows. Start with the observed score S(T1) at time T1 and S(T2) at time T2 for the same agent. You want to attribute S(T2) - S(T1) into agent, judge, probe, and reporting components. Run four counterfactual scores: S(T2|judges=T1, probes=T1, reporting=T1), which is the score the agent would have received at T2 under the T1 measurement system; S(T2|judges=T2, probes=T1, reporting=T1), which adds in the judge change; S(T2|judges=T2, probes=T2, reporting=T1), which adds the probe change; and S(T2|judges=T2, probes=T2, reporting=T2), which is the actual observed S(T2). The deltas between consecutive counterfactuals attribute the score change to each component.
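The counterfactual chain above can be sketched as a telescoping difference. This is an illustrative sketch under stated assumptions: the class and function names are hypothetical, the five scores are assumed to have been computed already by re-running the agent under each configuration, and the numbers are made up to total a twelve-point drop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftDecomposition:
    agent: float
    judge: float
    probe: float
    reporting: float

    @property
    def total(self) -> float:
        # The four deltas telescope, so this equals S(T2) - S(T1).
        return self.agent + self.judge + self.probe + self.reporting


def decompose_drift(s_t1: float,
                    cf_all_t1: float,            # S(T2 | judges=T1, probes=T1, reporting=T1)
                    cf_judges_t2: float,         # S(T2 | judges=T2, probes=T1, reporting=T1)
                    cf_judges_probes_t2: float,  # S(T2 | judges=T2, probes=T2, reporting=T1)
                    s_t2: float) -> DriftDecomposition:
    """Attribute S(T2) - S(T1) to the four components, switching one at a time."""
    return DriftDecomposition(
        agent=cf_all_t1 - s_t1,                    # only the agent differs
        judge=cf_judges_t2 - cf_all_t1,            # judges switched to T2
        probe=cf_judges_probes_t2 - cf_judges_t2,  # probes switched to T2
        reporting=s_t2 - cf_judges_probes_t2,      # scoring function switched to T2
    )


# Made-up scores: observed 81 at T1 and 69 at T2.
d = decompose_drift(81.0, 77.0, 71.0, 69.0, 69.0)
print(d, d.total)
```

Because the components are switched in sequence, the attribution can depend on the order chosen when the components interact. The order here (judges, then probes, then reporting) follows the text.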
This decomposition has three obvious problems. The first is operational weight. It requires running the agent's outputs through old judge versions, which means preserving the judge version artifacts and the prompt scaffolding that produced them. It requires preserving old probe sets and the audit data needed to score them. And it requires the scoring function to be reproducible at old versions. None of these is free, but all are achievable with disciplined version control of the eval system itself.
The second problem is sample variance. Multi-LLM juries are noisy. A single re-score of a single agent under a single judge configuration has variance from the jury sampling, the order of items, the specific prompt phrasing, and the temperature settings. To get a stable decomposition, you need to run each counterfactual at least five times and average. The compute cost compounds. For a single agent, a full quarterly drift decomposition can run to several thousand judge calls. At scale across thousands of agents, the cost is meaningful but not prohibitive: roughly two to four percent of total eval compute spent on drift decomposition, in our measurements.
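A minimal sketch of the averaging step, assuming each counterfactual can be re-scored on demand through a callable. The names stable_score and noisy_jury are hypothetical, and the simulated jury noise is a stand-in for real jury variance, not a measurement.

```python
import random
import statistics
from typing import Callable, Tuple


def stable_score(score_once: Callable[[], float], runs: int = 5) -> Tuple[float, float]:
    """Average repeated jury runs; return (mean, standard error of the mean)."""
    if runs < 2:
        raise ValueError("need at least two runs to estimate variance")
    samples = [score_once() for _ in range(runs)]
    mean = statistics.fmean(samples)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(samples) / runs ** 0.5
    return mean, sem


# Hypothetical stand-in for one jury run of one counterfactual configuration.
rng = random.Random(0)


def noisy_jury() -> float:
    return 71.0 + rng.gauss(0.0, 1.5)


mean, sem = stable_score(noisy_jury, runs=5)
print(f"score {mean:.1f} +/- {sem:.1f}")
```

Reporting the standard error next to each counterfactual makes it visible when a component's delta is smaller than the noise in the scores that produced it.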
The third problem is interpretation. The decomposition gives you four numbers. You then need to communicate them to operators in a way that supports decision-making. The Armalo convention is to present a stacked-bar visualization showing the four contributors, with the operator-actionable component (agent drift) highlighted, and to set thresholds: agent drift exceeding three points in a quarter triggers an investigation review, judge drift exceeding eight points triggers a procurement-side disclosure, probe drift exceeding two points triggers a probe-versioning audit, and reporting drift triggers a release-notes pointer to the scoring-function changelog.
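The threshold convention above can be sketched as a small routing function. The threshold values and action names come from the text; everything else is an assumption for illustration, in particular that thresholds apply to the absolute size of a component, that "exceeding" means strictly greater than, and that any non-zero reporting drift triggers the changelog pointer.

```python
from typing import Dict, List

# Quarterly thresholds in score points, taken from the text.
THRESHOLDS: Dict[str, float] = {"agent": 3.0, "judge": 8.0, "probe": 2.0}

ACTIONS: Dict[str, str] = {
    "agent": "investigation review",
    "judge": "procurement-side disclosure",
    "probe": "probe-versioning audit",
    "reporting": "release-notes pointer to the scoring-function changelog",
}


def route_drift(components: Dict[str, float]) -> List[str]:
    """Return the follow-up actions triggered by one quarter's decomposition."""
    triggered = []
    for name in ("agent", "judge", "probe"):
        # Assumption: compare magnitude, so drops and rises are treated alike.
        if abs(components.get(name, 0.0)) > THRESHOLDS[name]:
            triggered.append(ACTIONS[name])
    # Assumption: any non-zero reporting drift is worth a changelog pointer.
    if components.get("reporting", 0.0) != 0.0:
        triggered.append(ACTIONS["reporting"])
    return triggered


actions = route_drift({"agent": -4.0, "judge": -6.0, "probe": -2.0, "reporting": 0.0})
print(actions)  # prints "['investigation review']"
```

In this made-up quarter only the agent component crosses its threshold, so the twelve-point drop routes to engineering and not to procurement.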
The most important property of the decomposition is that it makes the eval system honest about what it is measuring. Without the decomposition, the eval system reports a single score and lets operators infer the cause of any change, which biases inference toward the agent because the agent is what the operator has visibility into. With the decomposition, the eval system shoulders the burden of explaining the change, and the operator's attention is directed to the components that are actually under the operator's control. This is the right division of labor and the precondition for any high-stakes use of agent scores.
Pinning Judge Versions As An Architectural Choice
Once you accept that judge drift is real and consequential, the next architectural question is what to do about it. There are three live options: rolling-update, version-pinned, and dual-track. Each has consequences for score interpretation, operator behavior, and eval-system cost.
Rolling-update is the default. The eval system runs whatever judge versions are currently the best, upgrading as new versions appear. Scores are always contemporaneous. Operators get the most current possible read on agent quality. The cost is that scores are not directly comparable across time; an eighty-one in Q1 is not directly an eighty-one in Q2 if the judges changed. The benefit is that the score reflects the world as it currently is, which is the world the agent actually has to operate in.
Version-pinned is the conservative choice. The eval system locks the judge versions for some defined period, perhaps one year, and uses them consistently throughout. Scores within the period are comparable. The cost is that the score becomes increasingly out of date relative to the frontier of judge capability, and operators using the score for procurement may be making decisions based on a measurement system that no longer reflects the state of the art. Version-pinning is appropriate for regulated contexts where score stability matters more than score recency.
Dual-track is what serious eval systems converge on. Both scores are computed and reported: a contemporaneous score under current judges, and a pinned score under the version held constant since the last reporting boundary. The two together let operators and consumers make decisions on whichever frame matters for their use case. Procurement decisions for new contracts use the contemporaneous score. Renewal decisions for existing contracts can use the pinned score for fairness and the contemporaneous score for forward-looking adjustment. Insurance and bond-sizing decisions can use the pinned score for predictability. Marketplace listings can show both with clear labeling.
In principle, dual-track doubles the operational lift. In practice it costs less, because the contemporaneous score is the primary product and the pinned score is a periodic side-computation. The Armalo eval system runs dual-track with quarterly judge-version boundaries: judges are upgraded at quarter boundaries, the pinned score is the score under the previous quarter's judges, and the contemporaneous score is the score under the current quarter's judges. Both are published. Both feed into different downstream systems. The composite score that drives certification tier is the contemporaneous one; the procurement-comparison score that lets buyers see year-over-year stability is the pinned one.
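A minimal sketch of quarterly dual-track reporting as described above. It is not Armalo's implementation: the type and function names, the quarter labels, and the judge-version strings are all hypothetical, and score_under stands in for whatever re-scores the agent under a given judge configuration.

```python
from typing import Callable, Dict, List, NamedTuple


class DualTrackScore(NamedTuple):
    quarter: str
    contemporaneous: float       # score under the current quarter's judges
    pinned: float                # score under the previous quarter's judges
    contemporaneous_judges: str
    pinned_judges: str


def dual_track(quarters: List[str],
               judges_by_quarter: Dict[str, str],
               quarter: str,
               score_under: Callable[[str], float]) -> DualTrackScore:
    """Compute both tracks for one quarter, pinning to the previous quarter's judges."""
    idx = quarters.index(quarter)
    if idx == 0:
        raise ValueError("no previous quarter to pin against")
    current = judges_by_quarter[quarter]
    previous = judges_by_quarter[quarters[idx - 1]]
    return DualTrackScore(quarter, score_under(current), score_under(previous),
                          current, previous)


# Hypothetical judge panels and made-up scores for one unchanged agent.
quarters = ["2026Q1", "2026Q2"]
judges = {"2026Q1": "judges-2026Q1", "2026Q2": "judges-2026Q2"}
made_up_scores = {"judges-2026Q1": 74.0, "judges-2026Q2": 69.0}

result = dual_track(quarters, judges, "2026Q2", made_up_scores.__getitem__)
print(result)
```

The gap between the two fields is the judge-driven movement for that quarter, provided the agent, probes, and scoring function are held fixed across both runs.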
The choice of pinning interval is itself a design decision. Annual pinning gives the most stable comparison but the largest gap between pinned and contemporaneous. Quarterly pinning gives moderate stability and a manageable gap. Monthly pinning gives high recency but doesn't really address the comparability problem because most judge upgrades are quarterly anyway. The Armalo choice of quarterly was driven by alignment with the regulatory and procurement quarter cycles that operators already work in. Different ecosystems may make different choices; the principle is that the choice be explicit, documented, and not silently changed.
There is a fourth, more radical option, which is to pin a panel of historical judges as the permanent reference. In this scheme, scores are always computed against, say, the judges available in 2024, and never updated. This produces maximum comparability but maximum staleness, and over time the reference panel becomes useless because those models are no longer maintained. The radical option is mentioned for completeness but is not recommended; it solves comparability at the cost of relevance, which is the wrong trade-off in a fast-moving field.
The meta-lesson is that there is no choice that gives you both comparability and currency for free. Every choice gives up one for the other. Dual-track is the closest thing to a free lunch, and even it comes at a compute cost. The honest eval system makes the choice explicit, documents it, and respects the comparability commitments it has made. The dishonest one quietly upgrades judges and lets operators wonder why their scores moved.
Backfill: Why Old Scores Should Not Be Quietly Restated
When judge versions change, there is a temptation to backfill old scores under the new judges, producing a clean retrospective view of agent quality under a unified measurement system. This is wrong. Backfilling old scores destroys exactly the historical record that operators need to verify the eval system was fair to them at the time decisions were made.
Consider an operator who procured an agent in Q2 2025 based on a Gold-tier score of seventy-eight. In Q2 2026, the eval system upgrades judges and recomputes 2025 scores under the new judges. The 2025 score for the same agent now reads seventy-one, below Gold tier. The historical record now says the agent was never Gold-tier. The operator's procurement decision now looks ill-founded retrospectively. The operator was acting on the score at the time, which was seventy-eight, and the decision was reasonable on that information. Backfilling the score makes the operator's decision look worse than it was, and erases the basis for the operator's recourse if the agent did underperform expectations.
The rule that follows is simple: scores are immutable, version-stamped, and never silently restated. If new judges produce different verdicts, those verdicts are recorded as new scores at the new timestamp, and the old scores remain in the history table at their original values. The score history table is append-only, and every entry carries the judge version, probe-set version, scoring-function version, and the agent version active at the time of scoring. Operators looking at the history can see the score they relied on at procurement time and the score the agent has now under current judges, and can reason about whether the gap between them is judge drift, agent drift, or both.
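The append-only rule can be sketched with an immutable record type and a store that exposes no update or delete path. This is an illustrative in-memory sketch, not Armalo's score-history table: the class names, field names, agent identifier, and version strings are hypothetical, and a production system would enforce the same rule at the storage layer.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ScoreEntry:
    agent_id: str
    score: float
    judge_version: str
    probe_set_version: str
    scoring_function_version: str
    agent_version: str
    scored_at: str  # ISO 8601 timestamp


class ScoreHistory:
    """Append-only: entries can be added and read, never changed or removed."""

    def __init__(self) -> None:
        self._entries: List[ScoreEntry] = []

    def append(self, entry: ScoreEntry) -> None:
        self._entries.append(entry)

    def entries(self, agent_id: str) -> Tuple[ScoreEntry, ...]:
        # Return a tuple so callers cannot mutate the underlying list.
        return tuple(e for e in self._entries if e.agent_id == agent_id)


history = ScoreHistory()
history.append(ScoreEntry("agent-42", 78.0, "judges-1.0.0", "probes-1.0.0",
                          "scoring-1.0.0", "agent-3.1.0", "2025-06-30T00:00:00Z"))
# New judges produce a new entry at a new timestamp; the old one stays as it was.
history.append(ScoreEntry("agent-42", 71.0, "judges-2.0.0", "probes-1.0.0",
                          "scoring-1.0.0", "agent-3.1.0", "2026-06-30T00:00:00Z"))

original = history.entries("agent-42")[0]
try:
    original.score = 71.0  # attempt to restate the past
except FrozenInstanceError:
    print("original score cannot be restated in place")
```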
Backfilling for retrospective analysis is fine, as long as it is clearly labeled as a retrospective recomputation and does not overwrite the original score. The Armalo convention is to publish a parallel "as-of-now" historical view that recomputes 2024 and 2025 scores under current judges, alongside the original "as-of-then" history that preserves contemporaneous scores. The two together let operators see both: what the score was when they made decisions, and what it would be if measured today. Neither view replaces the other; they answer different questions.
The deeper principle is that score history is part of the trust contract between the eval system and its consumers. Operators trust the score because it was produced by a system that is auditable, documented, and stable. Silently restating the past breaks that trust because it makes the past unauditable. An operator who cannot verify what the score was at the time cannot defend the decision they made on it, and cannot hold the eval system to its prior representations. The trust layer depends on the score history being a faithful record of what the system reported, and that depends on the history being immutable.
This principle is also what makes the dual-track approach defensible. The pinned score is the immutable historical record under that quarter's judges. The contemporaneous score is the current read under current judges. Both are true at their respective times. Neither is restated. The score history shows the full series of pinned scores and the full series of contemporaneous scores, and the operator can reason about both. This is heavier than restating, and lighter than ignoring drift. It is the configuration that respects both fidelity to the past and accuracy in the present.
The rule, stated as policy: append-only score history, version-stamped, with immutability as a core invariant. Any eval system that fails this rule is one that operators should not trust for long-horizon decisions, because they cannot verify the conditions under which those decisions were made.
What Agent Operators Should Do With This
The upshot for agent operators is a set of operational disciplines that did not previously have names but should. We will call them the four drift practices: monitor decomposed drift, pin for procurement, refresh for benchmarking, and disclose for trust.
Monitor decomposed drift means looking at the four-component drift breakdown for every score update on every agent in the portfolio. Set thresholds for each component. When agent drift exceeds three points in a quarter, the operator's engineering team should investigate; this is signal that the agent has changed in ways the eval system is catching. When judge drift exceeds eight points, the operator's procurement and customer-success teams should be looped in; this is signal that the agent has not changed but the world has, and downstream consumers may need to be informed. When probe drift exceeds two points, the operator should ask the eval system for the probe-version changelog, because probe drift can be a coverage shift that materially affects which agent capabilities are being tested. When reporting drift is non-zero, the operator should read the scoring-function release notes to understand what changed and whether the change makes sense for their use case.
Pin for procurement means that when entering into a contract that depends on agent quality, both parties should agree on which score they are pricing against. Is it the contemporaneous score at signature time, with all subsequent renewals priced against new contemporaneous scores, accepting the volatility that judge drift will introduce? Is it the pinned score, with renewal predictability? Is it some hybrid where the headline number is contemporaneous but contractual remedies trigger off pinned? These are real decisions and they should be made deliberately rather than by default. The eval system that supports both modes lets the contracting parties choose.
Refresh for benchmarking means that when comparing agents to each other, you must use the contemporaneous score, not the pinned score, because pinned scores from different agents may have been pinned at different times and are not directly comparable across the portfolio. Pinned scores are useful for longitudinal comparison of one agent to itself; contemporaneous scores are useful for cross-sectional comparison of multiple agents at one moment. Mixing them up produces apples-to-oranges comparisons that lead to wrong procurement decisions.
Disclose for trust means that when the operator's downstream consumers (customers, regulators, partners) ask why the agent's score moved, the operator should be able to give them the four-component decomposition and explain it. "The score moved because the judges got smarter, not because the agent changed" is a defensible answer when supported by the decomposition. "I don't know why the score moved" is not a defensible answer in any high-stakes context. The operator's ability to explain drift to consumers depends on the operator having access to drift decomposition from the eval system, which is why eval systems that withhold the decomposition are operationally inadequate for serious deployment.
The four practices together convert evaluation drift from a mystery to a managed risk. They take a phenomenon that previously caused operators to lose trust in their scores and turn it into a phenomenon that operators can monitor, communicate, and respond to. The eval system that supports the four practices is the one operators will keep using. The eval system that does not support them is the one operators will eventually replace.
The Judge-Versioned Score Spec (Reader Artifact)
The Judge-Versioned Score Spec, JVSS, is a documentation and reporting specification that any agent eval system can adopt. It defines the metadata that must accompany every published score, the immutability rules for score history, and the dual-track reporting format for handling judge drift. The spec is designed to be implementable by any eval provider, open-source or commercial, in roughly two engineering weeks, and it produces year-over-year score comparability without requiring agreement on judge models, probe sets, or scoring weights.
The JVSS has six required fields per published score. Score value, on whatever scale the eval system uses. Judge configuration version, a string identifier of the specific panel of judge models, prompts, and trim parameters used. Probe-set version, a string identifier of the specific probes and audit criteria used. Scoring-function version, a string identifier of the aggregation logic and weights used. Agent version, a string identifier of the agent artifact being scored. Score timestamp, the canonical time at which scoring was performed. The six fields together define the measurement context completely. Two scores are directly comparable if and only if their judge, probe, and scoring-function versions match. Otherwise they require a counterfactual recomputation to compare meaningfully.
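The six fields and the comparability rule can be sketched as a record type plus one predicate. The identifiers below are hypothetical: the spec as described names the fields in prose and does not fix field names, types, or a timestamp format.

```python
from typing import NamedTuple


class JVSSScore(NamedTuple):
    score_value: float
    judge_config_version: str
    probe_set_version: str
    scoring_function_version: str
    agent_version: str
    score_timestamp: str  # assumed ISO 8601 here


def directly_comparable(a: JVSSScore, b: JVSSScore) -> bool:
    """True only when judge, probe, and scoring-function versions all match."""
    return (
        (a.judge_config_version, a.probe_set_version, a.scoring_function_version)
        == (b.judge_config_version, b.probe_set_version, b.scoring_function_version)
    )


q1 = JVSSScore(81.0, "judges-1.2.0", "probes-4.0.0", "scoring-2.1.0",
               "agent-3.1.0", "2026-03-31T00:00:00Z")
q2_same_system = q1._replace(score_value=80.0, score_timestamp="2026-06-30T00:00:00Z")
q2_new_judges = q2_same_system._replace(score_value=67.0,
                                        judge_config_version="judges-2.0.0")

print(directly_comparable(q1, q2_same_system))  # prints "True"
print(directly_comparable(q1, q2_new_judges))   # prints "False"
```

Agent version and timestamp are deliberately excluded from the predicate: two scores of different agent versions under the same measurement system are exactly the comparison the score is meant to support.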
The JVSS has three required behavioral rules. First, append-only score history: scores are never overwritten or deleted. If a re-scoring is performed under different versions, it is recorded as a new score with new metadata, not as a restatement of the original. Second, dual publication on version change: when judge, probe, or scoring-function version changes, the eval system must publish both the contemporaneous score under the new versions and the pinned score under the previous versions for at least one reporting cycle, so that consumers have a transition period to update their downstream systems. Third, decomposition reporting: every published score change must be accompanied by a drift decomposition into the four components, with at least the agent and judge components reported as numerical contributions and the probe and reporting components reported as either numerical or zero-confirmed.
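The first two rules can be sketched in a few lines of Python. This is a hedged illustration, not the reference implementation: `ScoreHistory`, `Entry`, `publish`, and the stand-in scorer `fake_scorer` are hypothetical names, and the scorer simply returns fixed numbers in place of a real jury run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Config = Tuple[str, str, str]  # (judge, probe, scoring-function) versions


@dataclass(frozen=True)
class Entry:
    agent_version: str
    config: Config
    track: str  # "contemporaneous" or "pinned"
    score: float


class ScoreHistory:
    """Append-only log: no update or delete methods are exposed."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def entries(self) -> Tuple[Entry, ...]:
        # Return an immutable snapshot so callers cannot edit history in place.
        return tuple(self._entries)


def publish(history: ScoreHistory, agent_version: str, previous: Config,
            current: Config, score_under: Callable[[str, Config], float]) -> None:
    """Record the contemporaneous score, plus a pinned score if the config changed."""
    history.append(Entry(agent_version, current, "contemporaneous",
                         score_under(agent_version, current)))
    if current != previous:
        history.append(Entry(agent_version, previous, "pinned",
                             score_under(agent_version, previous)))


# Hypothetical scorer standing in for a real jury run.
def fake_scorer(agent_version: str, config: Config) -> float:
    return 67.0 if config[0] == "3.0.0" else 81.0


history = ScoreHistory()
old = ("2.1.0", "1.4.0", "1.0.0")
new = ("3.0.0", "1.4.0", "1.0.0")
publish(history, "agent-3.2.0", old, new, fake_scorer)
for e in history.entries():
    print(e.track, e.config[0], e.score)
```

A production system would enforce append-only behavior at the storage layer as well (for example with insert-only permissions), since an in-memory class only guards against accidental overwrites.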
The JVSS has two recommended practices. First, semantic versioning of all components: judges, probes, and scoring functions follow major.minor.patch versioning, where major bumps indicate non-backward-compatible changes that affect score interpretation, minor bumps indicate additive capability changes that do not change interpretation of existing scores, and patch bumps indicate bug fixes that may produce small score adjustments. Second, public changelogs for all version bumps: every version change publishes a changelog explaining what changed and why, so operators can decide whether the change affects their use case.
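To make the versioning convention concrete, the following Python sketch classifies a version change and maps it to the interpretation described above. The function names and the `INTERPRETATION` table are illustrative assumptions; the sketch handles only plain `major.minor.patch` strings, without pre-release or build suffixes.

```python
from typing import Tuple

# Interpretation of each bump kind, paraphrased from the recommended practice.
INTERPRETATION = {
    "major": "non-backward-compatible; score interpretation changes",
    "minor": "additive capability; existing scores read the same way",
    "patch": "bug fix; small score adjustments possible",
    "none": "no version change",
}


def parse(version: str) -> Tuple[int, int, int]:
    """Split 'major.minor.patch' into integers; raises ValueError if malformed."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def bump_kind(old: str, new: str) -> str:
    o, n = parse(old), parse(new)
    if n < o:
        raise ValueError(f"version went backwards: {old} -> {new}")
    if n == o:
        return "none"
    if n[0] != o[0]:
        return "major"
    if n[1] != o[1]:
        return "minor"
    return "patch"


print(bump_kind("2.1.0", "3.0.0"))  # major
print(bump_kind("2.1.0", "2.2.0"))  # minor
print(bump_kind("2.1.0", "2.1.1"))  # patch
```

An eval system could use a check like this to decide when dual publication is mandatory, for instance by treating every major bump to the judge configuration as a comparability break.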
The JVSS includes a reference reporting template, a JSON Schema for the metadata, and a reference implementation of the dual-track scoring logic. The reference implementation is in Python, runs against any eval system that emits raw verdicts plus version metadata, and produces dual-track score reports suitable for direct publication. The reference implementation is open-source and will be linked from the Armalo developer documentation alongside this essay.
The spec is deliberately minimal. It does not prescribe judge models, probe content, scoring weights, certification tiers, or aggregation methods. It prescribes only the metadata and reporting discipline that any eval system needs to be honest about drift. Adopting the spec does not require changing your underlying methodology; it requires only that you instrument your methodology with the version metadata and report the drift decomposition. The market value of adopting the spec is that your scores become comparable across time and interpretable across operators, which is the precondition for scores being load-bearing in agent procurement.
Counter-Argument: Drift Is The Eval System's Problem, Not The Operator's
The strongest counter-argument is that drift decomposition is a service the eval system should perform internally to keep scores stable, and that exposing the decomposition to operators leaks complexity that operators don't need to see. The eval system should normalize away judge drift, compensate internally, and present operators with a stable score that they can use without worrying about the underlying machinery.
The response is that there is no internal compensation that does not introduce its own bias. Any normalization that adjusts for judge drift requires assuming a model of what "true" agent quality is, and using that model to back out the judge effect. The model is itself a measurement, with its own drift. Normalizing one drift by introducing another does not reduce drift; it just hides it. The honest move is to report the components and let operators decide which they care about for which decisions, rather than to construct a normalized number that pretends drift is solved.
The second response is that operators do, in fact, need to see the decomposition, because the operational responses are different. Agent drift means engineering work. Judge drift means customer communication and possibly contract renegotiation. Probe drift means probe-version review. Reporting drift means scoring-function review. An eval system that returns a single normalized number leaves operators unable to route their response correctly, and forces them to over-respond to changes that are not their problem or under-respond to changes that are.
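The routing argument can be shown with a small Python sketch. It picks the component with the largest absolute contribution and returns the matching response from the paragraph above. The function names and the example numbers are hypothetical; the contributions are invented to sum to a 14-point drop and are not real data.

```python
from typing import Dict

# Operational response per drift component, taken from the text above.
RESPONSES = {
    "agent": "engineering work on the agent",
    "judge": "customer communication, possibly contract renegotiation",
    "probe": "probe-version review",
    "reporting": "scoring-function review",
}


def dominant_component(decomposition: Dict[str, float]) -> str:
    """Return the component with the largest absolute contribution to the change."""
    missing = set(RESPONSES) - set(decomposition)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return max(RESPONSES, key=lambda name: abs(decomposition[name]))


def route(decomposition: Dict[str, float]) -> str:
    return RESPONSES[dominant_component(decomposition)]


# Hypothetical decomposition of a 14-point drop (contributions in score points).
example = {"agent": -1.0, "judge": -11.0, "probe": -2.0, "reporting": 0.0}
print(dominant_component(example), "->", route(example))
```

Selecting a single dominant component is a simplification. When two components contribute comparable amounts, an operator would reasonably trigger both responses.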
The third response is that hiding drift damages the trust contract. Operators who eventually discover that the eval system has been silently normalizing drift will not trust the eval system, because they will reasonably conclude that the eval system is making methodological choices behind their backs. Trust in a measurement system requires transparency about what is being measured and how. Drift normalization is a methodological choice, and so is the decision whether to expose it; operators have a right to participate in both.
The fourth response is that drift is, in some sense, the most interesting signal the eval system produces. The agent drift component tells operators which agents are actually changing under the hood. The judge drift component tells operators how the frontier of evaluation capability is moving relative to their portfolio. The probe drift component tells operators where coverage is expanding or contracting. These are all valuable inputs to operator strategy, and an eval system that hides them is one that strips away the most strategic information the operator could be receiving. Drift decomposition is not just a fairness mechanism; it is a strategic intelligence product, and operators who have access to it will outcompete operators who don't.
The counter-argument is not without merit. Surfacing drift complexity does increase the cognitive load on operators, and not all operators have the maturity to use the decomposition well. The mitigation is tiered presentation: the headline score for casual consumption, the drift decomposition for operators who want it, the full version metadata for operators who need to reason about scores at the engineering level. Different audiences see different layers, and the eval system supports all three. This is more work than presenting one number, but it is the correct work, because the alternative is hiding information that materially affects how operators should use the score.
What Armalo Does
Armalo runs dual-track scoring on a quarterly cadence. Every agent has both a contemporaneous composite score under the current judge panel, probe set, and scoring function, and a pinned composite score under the configuration that was current at the start of the quarter. Both are published on the agent profile and exposed via the Trust Oracle. Score history is append-only and version-stamped; recomputations under new versions are stored as new history entries with new metadata, never as overwrites. Drift decomposition runs on every quarterly score update and is published with each new score, attributing change to agent, judge, probe, and reporting components. Judge configuration is versioned with semantic version numbers, with public changelogs at every change. The reference implementation of the Judge-Versioned Score Spec is the same code that powers Armalo's internal drift reporting. Certification tier is computed against the contemporaneous score, but tier transitions across the threshold trigger an automatic decomposition review and a thirty-day customer disclosure window during which the prior tier remains the headline tier and the new tier is presented as pending. This gives operators time to communicate to their downstream consumers about why the score moved, with the four-component decomposition as the explanatory artifact. Decay is one point per week off the contemporaneous score and is suspended during the disclosure window for any tier change driven primarily by judge drift.
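The decay rule with its disclosure-window suspension can be sketched as follows. This is an assumption-laden illustration, not Armalo's implementation: the text states only "one point per week" and a thirty-day window, so the linear pro-rating by day, the floor at zero, and all function and parameter names here are choices made for the example.

```python
from datetime import date, timedelta
from typing import Optional

DECAY_POINTS_PER_WEEK = 1.0              # from the text
DISCLOSURE_WINDOW = timedelta(days=30)   # from the text


def overlap_days(start: date, end: date, win_start: date, win_end: date) -> int:
    """Number of days that [start, end) and [win_start, win_end) share."""
    lo = max(start, win_start)
    hi = min(end, win_end)
    return max((hi - lo).days, 0)


def decayed_score(score: float, scored_at: date, as_of: date,
                  window_start: Optional[date] = None) -> float:
    """Apply decay for elapsed days, skipping days inside a disclosure window.

    Assumption: decay accrues linearly per day (1/7 point) and never goes below zero.
    """
    elapsed = max((as_of - scored_at).days, 0)
    suspended = 0
    if window_start is not None:
        suspended = overlap_days(scored_at, as_of,
                                 window_start, window_start + DISCLOSURE_WINDOW)
    active_days = elapsed - suspended
    return max(score - DECAY_POINTS_PER_WEEK * active_days / 7, 0.0)


# Hypothetical timeline: 58 days since scoring, with a disclosure window
# opening on day 10. Only 28 days count toward decay.
scored = date(2025, 1, 1)
today = scored + timedelta(days=58)
print(decayed_score(81.0, scored, today, window_start=scored + timedelta(days=10)))  # 77.0
```

If the real rule applies decay in whole-week steps rather than pro rata, the division would be replaced with integer division by seven; the suspension logic would stay the same.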
FAQ
Should agents be re-scored every time judges are upgraded? Yes, on the contemporaneous track. The pinned-score track preserves the prior score for comparability. Both serve different purposes and both should be maintained.
How often does judge drift dominate agent drift in observed score change? In Armalo's data across 2024 to 2026, judge drift accounts for roughly forty to sixty percent of total score variance for agents that did not undergo material changes during the period. For agents that did change, agent drift dominates. The two are roughly equal contributors over the population.
Doesn't dual-track scoring confuse operators? Not when presented well. The contemporaneous score is the headline. The pinned score is shown alongside as the comparability anchor. The drift decomposition explains the gap. Operators who use the decomposition adapt quickly; operators who only need the headline can ignore the additional fields.
What if my use case requires score stability above all else? Pin to the most recent boundary and use the pinned score for all decisions, with the understanding that pinned scores depreciate in relevance over time. Renew the pinning at each quarterly boundary or set a longer pinning interval for regulatory contexts.
How does this interact with the time-decay rule of one point per week? Decay applies to both tracks. The pinned score decays at the same rate as the contemporaneous score, reflecting the fact that even a stable measurement of an unchanged agent loses informational value over time as conditions evolve. The decay is suspended only during the thirty-day disclosure window after a material judge upgrade, to give operators time to communicate.
Can the judge panel itself be evaluated for bias? Yes, and it should be. Armalo runs quarterly inter-judge agreement tests, jury-versus-human-expert agreement tests, and consistency-over-time tests on the judge panel itself. The judge panel has a stability score that is published alongside the agents' scores, and judge upgrades that push inter-judge disagreement above threshold are rolled back rather than promoted. This meta-eval layer is what keeps the eval system itself accountable.
Does this matter if I'm not on Armalo? It matters for any eval system you use. The Judge-Versioned Score Spec is open and intended to be adoptable by any eval provider. If your provider does not version-stamp scores or report drift decomposition, you do not have the information you need to make sound long-horizon decisions. Push your provider to adopt JVSS or move to a provider that has.
What happens when judges become smarter than the humans the judges are calibrated against? This is the open research frontier. When judge models exceed human expert performance on the underlying capability being judged, the calibration process becomes ill-defined, and we lose the ground truth against which to measure judge quality. Current best practice is to maintain a panel of expert reviewers and to flag any judge upgrades where the new judges disagree systematically with the expert panel; in those cases, the upgrade is held pending review. This is unstable as a long-term solution and it is one of the major open problems in evaluation methodology for the next decade.
Bottom Line
A score that moves for reasons unrelated to the thing being measured is a score that operators will eventually learn not to trust. Judge drift is the dominant source of unexplained score volatility in agent eval systems today, and it will get worse as judges continue to improve faster than the median deployed agent. The fix is not to avoid judge drift; it is to measure it, decompose it, version-stamp it, and report both the contemporaneous and pinned scores so operators can make sense of what they are seeing. Adopt the Judge-Versioned Score Spec, demand drift decomposition from your eval system, and treat score history as the immutable record of what the system reported at each point in time. The agent economy depends on scores being load-bearing, and scores can only be load-bearing if operators can tell the difference between the agent moving and the world moving around it.