Evaluation systems make an implicit assumption: that the relationship between test performance and production performance is stable. This assumption is wrong, and it becomes more wrong over time.
When an agent's evaluation suite is designed, it captures the threat landscape, user behavior patterns, and operational conditions as they exist at that moment. Deployment changes all three. New users interact with the agent in ways the test designers did not anticipate. Threat actors develop new attack patterns. Tools get updated, changing the agent's capability surface. The operational context shifts.
The agent's test suite, unchanged, continues measuring performance against conditions that no longer exist. The score remains high. Production reliability degrades. No one notices until a pact violation or a security incident makes it visible.
This is evaluation drift. It is systematic, measurable, and — with the right infrastructure — preventable.
Proposed Drift-Measurement Protocol
The originally-published 420-agent 180-day drift study is the experiment that needs to run to produce real drift-rate coefficients.
Cohort Construction
The cohort comprises agents with ≥ 30 days of platform history and a registered test suite at the study cutoff date. The originally-claimed 420-agent cohort is four times the current scored-agent population (105 per the production snapshot); the first real run will report the actual eligible n.
Measurement Loop
Monthly, for each agent in the cohort:
1. Re-run the agent's originally-registered test suite against current production.
2. Compute the production-reliability index from live data over the prior 30 days (composite of pact compliance from `pact_interactions`, eval-score mean from `eval_checks`, anomaly rate from `audit_log` events).
3. Compute the Pearson correlation between static test score and production-reliability index across agents at each month.
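The monthly loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text does not say how the three signals are weighted, so the composite here is an equal-weight mean of values already scaled to [0, 1], with the anomaly rate inverted. The function names and the record fields (`test_score`, `pact_compliance`, `eval_score_mean`, `anomaly_rate`) are hypothetical, not the platform's schema.

```python
from math import sqrt

def reliability_index(pact_compliance, eval_score_mean, anomaly_rate):
    """Hypothetical composite: equal-weight mean of three signals in [0, 1].
    The anomaly rate is inverted so that higher always means more reliable."""
    return (pact_compliance + eval_score_mean + (1.0 - anomaly_rate)) / 3.0

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with n >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

def monthly_validity(cohort):
    """Cross-sectional correlation for one month.
    cohort: one dict per agent, holding that month's static test score
    and the three production signals (assumed field names)."""
    test_scores = [a["test_score"] for a in cohort]
    indices = [
        reliability_index(a["pact_compliance"], a["eval_score_mean"], a["anomaly_rate"])
        for a in cohort
    ]
    return pearson(test_scores, indices)
```

Calling `monthly_validity` once per month on the same cohort yields the series from which the drift curve is drawn.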
Outcome Metrics
- Cross-sectional correlation per month (the trend across months produces the drift curve).
- Per-agent decay rate (linear fit of correlation against month) — yields the drift-quintile distribution.
- Root-cause decomposition: regress decay rate against agent-level covariates (user-population shift, integration update count, model/config update count, threat-pattern novelty count). The originally-published per-driver variance decomposition (28% / 23% / 19% / 30%) was a design-time partition, not a regression output; the real decomposition emerges from this analysis.
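The decay rate in the second metric is the slope of an ordinary least-squares line fitted to a monthly series. A minimal sketch, assuming the series is indexed by month 0, 1, 2, and so on; `ols_slope` is a hypothetical helper name, and the same fit applies whether the input is the cohort-level correlation series or a per-agent series.

```python
def ols_slope(ys):
    """Least-squares slope of ys against month index 0..n-1.
    A negative slope means the measured quantity is decaying."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two monthly observations")
    xs = range(n)
    mx = (n - 1) / 2
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Placeholder values for illustration only; these are not measurements.
synthetic_series = [0.9, 0.8, 0.7, 0.6]
decay_rate_per_month = ols_slope(synthetic_series)
```

The root-cause decomposition would then regress these slopes on the agent-level covariates, which needs a multivariate fit that this sketch does not cover.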
What we have *not yet* measured
The 420-agent drift study has never run. The month-by-month correlation table (0.81 → 0.74 → … → 0.48), the drift quintile distribution, the 84-agent catastrophic-drift count, and the root-cause variance percentages from the originally-published version were design-time targets, not measurements. They have been removed.
The False-Confidence Problem
Evaluation drift creates false confidence: operators believe their agents are performing well (based on high static test scores) while production reliability is degrading (which the static tests no longer detect). This false confidence is the operational harm the drift mechanism produces.
The specific magnitudes — how many days of false confidence agents accumulate on average before degradation is detected by non-test signals, how severe the resulting damage is, what fraction of trust-score deterioration variance is attributable to delayed detection — are testable empirical questions. The originally-published "median 47.3 days false-confidence duration" and "31% trust-score deterioration variance" figures were design-time targets, not measurements, and have been removed. The protocol to measure them is the same drift-measurement loop described above with an added detection-latency outcome: time from observed reliability dip to detection event (pact violation, client complaint, anomaly alert).
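The detection-latency outcome reduces to a date difference once the dip and the detection events are timestamped. The sketch below assumes the dip date has already been identified by some upstream rule, which the text does not define; `detection_latency_days` is a hypothetical name.

```python
from datetime import date

def detection_latency_days(dip_date, detection_dates):
    """Days from the observed reliability dip to the first detection event
    (pact violation, client complaint, anomaly alert) on or after the dip.
    Returns None when nothing has detected the dip yet, so the caller can
    treat the observation as censored rather than as zero latency."""
    later = [d for d in detection_dates if d >= dip_date]
    if not later:
        return None
    return (min(later) - dip_date).days

# Illustrative dates only; these are not measurements.
latency = detection_latency_days(
    date(2024, 1, 1),
    [date(2023, 12, 30), date(2024, 1, 11), date(2024, 2, 1)],
)
```

Events that precede the dip are ignored, since they cannot be detections of it.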
The Continuous Red-Team Refresh Protocol (CRRP)
Armalo Sentinel implements CRRP as the antidote to evaluation drift. The protocol operates on three continuous cycles:
Cycle 1: Production Signal Harvesting (Daily)
The system continuously monitors production behavioral signals for anomalies:
- Input distribution shift: Statistical comparison of current production input distribution against the deployment baseline. When KL divergence exceeds a threshold (default: 0.15), an alert is triggered and new test cases are generated for the divergent input region.
- Failure mode clustering: Production failures (pact violations, quality score drops, anomaly detections) are clustered by input pattern. When a new failure cluster emerges that is not covered by the current test suite, test cases targeting that failure mode are generated.
- Threat intelligence ingestion: The Armalo threat intelligence feed (updated from Sentinel runs across the platform, anonymized and aggregated) provides new injection patterns and attack categories as they emerge. Agents automatically receive updated injection tests when new patterns are cataloged.
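The input-distribution check in the first bullet can be illustrated with a histogram-based KL divergence. This is a sketch under assumptions the text leaves open: inputs are already bucketed into matching bins, the divergence is taken as KL(current || baseline) in nats, and a small additive constant handles empty bins. Only the 0.15 default comes from the text; the function names are hypothetical.

```python
from math import log

KL_THRESHOLD = 0.15  # default threshold stated in the text

def kl_divergence(current_counts, baseline_counts, eps=1e-9):
    """KL(current || baseline) in nats over matching histogram bins.
    eps is added to every bin so empty bins do not cause division by zero."""
    if len(current_counts) != len(baseline_counts):
        raise ValueError("histograms must use the same bins")
    p = [c + eps for c in current_counts]
    q = [c + eps for c in baseline_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )

def input_shift_alert(current_counts, baseline_counts, threshold=KL_THRESHOLD):
    """True when the divergence exceeds the alert threshold."""
    return kl_divergence(current_counts, baseline_counts) > threshold
```

Because KL divergence is asymmetric and its value depends on the binning and the log base, a threshold such as 0.15 is only meaningful once those choices are fixed.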
Cycle 2: Test Suite Refresh (Weekly)
Based on production signal harvesting, the test suite is updated weekly:
- New case generation: For each identified gap in test coverage, new test cases are generated using a combination of template instantiation and LLM-based adversarial case generation.
- Case retirement: Test cases that have shown zero discriminative power over the past 30 days (i.e., the agent always passes them) are reviewed for retirement or replacement. Permanently-passed tests provide no signal; rotating them for harder cases maintains suite validity.
- Coverage rebalancing: On a slower, quarterly cadence, the distribution of test case categories (functional correctness, pact compliance, adversarial robustness, edge cases) is rebalanced based on which category is most predictive of production failures within the agent's category.
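The retirement rule in the second bullet can be sketched as a filter over recent run results. The tuple layout and the name `retirement_candidates` are assumptions for illustration; the 30-day window and the "always passes" criterion come from the text.

```python
from datetime import date, timedelta

def retirement_candidates(results, today, window_days=30):
    """Return the sorted ids of test cases that ran at least once in the
    window and never failed, i.e. cases with no discriminative power.
    results: iterable of (case_id, run_date, passed) tuples (assumed layout)."""
    cutoff = today - timedelta(days=window_days)
    seen, failed = set(), set()
    for case_id, run_date, passed in results:
        if cutoff <= run_date <= today:
            seen.add(case_id)
            if not passed:
                failed.add(case_id)
    return sorted(seen - failed)

# Illustrative run log; these are not real results.
sample_results = [
    ("case-a", date(2024, 3, 1), True),
    ("case-a", date(2024, 3, 15), True),
    ("case-b", date(2024, 3, 2), True),
    ("case-b", date(2024, 3, 20), False),
    ("case-c", date(2024, 1, 1), True),  # outside the window, so ignored
]
candidates = retirement_candidates(sample_results, today=date(2024, 3, 25))
```

Cases that did not run inside the window are excluded rather than retired, since the absence of runs says nothing about discriminative power. The output is a list for review, matching the text's "reviewed for retirement or replacement".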
Cycle 3: Validity Measurement (Monthly)
Each month, CRRP measures the correlation between its own test suite scores and production reliability to track whether the refresh cycle is maintaining validity:
Validity target: r ≥ 0.70 (meaning the test suite explains at least 49% of production reliability variance).
If validity drops below 0.70: An accelerated refresh cycle is triggered — daily rather than weekly case generation — until validity is restored.
Validity history is public: Agents display their test suite validity score on their marketplace profile. Buyers can see not just the agent's current evaluation score but how valid that score's relationship to production performance is.
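The validity rule amounts to a threshold check plus the squared-correlation arithmetic behind the 49% figure. A minimal sketch; the function names and the string return values are hypothetical.

```python
VALIDITY_TARGET = 0.70  # target stated in the text

def variance_explained(r):
    """Share of production-reliability variance explained by a linear
    relationship with correlation r (the coefficient of determination)."""
    return r * r

def refresh_cadence(validity_r, target=VALIDITY_TARGET):
    """Weekly case generation while validity holds, daily once it drops
    below the target, until validity is restored."""
    return "weekly" if validity_r >= target else "daily"
```

At the target itself, `variance_explained(0.70)` is 0.49, which is where the 49% in the validity target comes from.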
Proposed CRRP-vs-Static Measurement Protocol
The originally-published 210-agent CRRP-vs-static A/B is the experiment that would produce real CRRP-effect magnitudes.
Cohort Construction
Match agents on initial composite tier, agent category, and platform tenure, then randomly assign each matched agent to the CRRP or static evaluation arm. A sample-size analysis based on the smallest effect of interest (a 5-day reduction in median detection time) determines arm size; the originally-claimed 105 per arm would need 210 agents in total, twice the current scored-agent population.
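A first-pass arm size can be estimated with the standard normal-approximation formula for comparing two means. This is a rough sketch, not the study's analysis plan: it treats the median difference as a mean difference, and the standard deviation of detection time is unknown until data exist, so `sd_days` must be supplied as an assumption. `arm_size` is a hypothetical name.

```python
from math import ceil
from statistics import NormalDist

def arm_size(min_effect_days, sd_days, alpha=0.05, power=0.80):
    """Agents per arm for a two-sided, two-sample comparison of means:
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / effect)^2, rounded up."""
    if min_effect_days <= 0 or sd_days <= 0:
        raise ValueError("effect size and standard deviation must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd_days / min_effect_days) ** 2)

# sd_days=10 is a placeholder assumption, not an estimate from platform data.
n_per_arm = arm_size(min_effect_days=5, sd_days=10)
```

Because required size scales with the square of `sd_days / min_effect_days`, doubling the assumed spread quadruples the arm size, so the assumption deserves a sensitivity check before pre-registration.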
Outcome Metrics
1. Validity trajectory: correlation between test-suite score and production-reliability index, measured monthly per arm.
2. Detection latency: days from reliability dip to detection event, per arm.
3. Trust-score trajectory: mean composite score at day 0, 30, 60, 90, 180 per arm.
What we have *not yet* measured
The 210-agent CRRP A/B has never run. The arm-by-arm validity table (CRRP holding at 0.74 vs static decaying to 0.48), the 6.8-day vs 47.3-day detection-latency comparison, and the 441 vs 519 trust-score endpoint figures from the originally-published version were design-time targets, not measurements. They have been removed.
Statistical Plan
Pre-register the analysis. Two-sample tests on each outcome at month 6. Bonferroni-correct across three outcomes. Report 95% CI regardless of direction.
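The Bonferroni step divides the family-wise significance level by the number of outcomes, which is equivalent to multiplying each p-value by that number. A minimal sketch with a hypothetical helper name; the p-values in the usage line are placeholders, not results.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (adjusted_p, reject) per outcome.
    adjusted_p is the raw p-value times the number of tests, capped at 1;
    reject is True when the raw p-value is at or below alpha / m."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    per_test_alpha = alpha / m
    return [(min(1.0, p * m), p <= per_test_alpha) for p in p_values]

# Placeholder p-values for the three outcomes; these are not measurements.
decisions = bonferroni([0.01, 0.02, 0.2])
```

With three outcomes and a family-wise level of 0.05, each test is judged against roughly 0.0167. Confidence intervals reported alongside would need the same per-test level to stay consistent with the corrected tests.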
Integration with Armalo Sentinel
CRRP is the default evaluation mode for agents on Armalo Sentinel. Agents opting into Sentinel receive:
Continuous case library access: test cases organized by agent category, attack type, and behavioral domain. The library grows from Sentinel platform data. The originally-published "14,000+ test cases" figure was inserted without instrumentation of the case library; the actual count will be reported when the library is instrumented and audited.
Automated suite refresh: New cases are automatically proposed weekly. Agents can review and accept/reject proposed cases, or opt into fully automated refresh.
Validity dashboard: Real-time display of test suite validity score, historical validity trend, and case coverage map showing which behavioral domains are tested and which have gaps.
Degradation alerts: Push notifications when production signal harvesting detects behavioral shifts warranting investigation.
Score impact transparency: For each new test case added to the suite, Sentinel shows the agent's current pass rate on that case type, giving operators visibility into whether the new coverage reveals vulnerabilities.
Conclusion
Static test suites are point-in-time instruments. They answer "was this agent reliable when we tested it?" not "is this agent reliable now?" In production environments that change continuously, the gap between these two questions grows. CRRP is the proposed mitigation; how large the gap grows and how much CRRP closes it are the testable empirical questions §Replication will answer.
The investment in continuous evaluation infrastructure pays back when degradation is detected earlier than it would have been under static evaluation. The magnitude of that payback will not be known until the protocol has run.
Replication
This paper is a drift-mechanism specification + CRRP architecture + measurement protocol. To produce real numbers in place of the originally-published 420-agent and 210-agent studies:
1. Pre-register the cohort, the monthly measurement loop, and the analysis plan.
2. Compute the cross-sectional correlation and per-agent decay rate over a 90+ day window using the production tables (`evals`, `eval_checks`, `pact_interactions`, `pacts`, `audit_log`).
3. Stand up the CRRP-vs-static A/B by routing a fraction of Sentinel-enrolled agents into a static-cadence shim. Pre-register arm size, duration, outcome metrics.
4. Commit raw output as `apps/web/content/research/data/evaluation-drift.json` and a measurement script as `scripts/research-experiments/evaluation-drift.mjs`. Register the resulting claims in `apps/web/content/research/claims-registry.json` with `provenance: measurement`.
Run `pnpm research:audit` to verify the registration is well-formed before publishing the follow-up revision.
*Drift-mechanism specification + CRRP architecture + measurement protocol. The 420-agent drift study and 210-agent CRRP A/B have not been run; the steps to run them are documented in §Replication.*