Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-04-10-sentinel-evaluation-drift. The paper is publicly available and citable.

Evaluation Drift: Why Static Test Suites Fail Production AI Agents and How Continuous Red-Teaming Recovers Them

Q: What is the paper "Evaluation Drift: Why Static Test Suites Fail Production AI Agents and How Continuous Red-Teaming Recovers Them" about?

Evaluation drift is the phenomenon whereby a static test suite, accurate at the time of development, progressively loses validity as the agent's deployment environment changes — new prompt patterns, new user populations, new tool integrations, new threat actors — without any change to the evaluation itself. This paper specifies the drift mechanism, the four root causes (user-behavior shift, threat-landscape evolution, tool/integration changes, model/configuration drift), the Continuous Red-Team Refresh Protocol (CRRP) implemented in Armalo Sentinel as the proposed mitigation, and the protocol to measure drift and CRRP's mitigating effect on Armalo production data. **Empirical honesty note: An earlier revision claimed a 420-agent 180-day drift study and a 210-agent CRRP-vs-static cohort comparison, with specific decay rates (4.3 points/month), validity coefficients (0.81 → 0.48 under static; 0.74 under CRRP), false-confidence detection times (47.3 → 6.8 days), and trust-score deltas (441 → 519). Those studies were not run. The originally-published numbers were design-time targets, not measurements. They have been removed and the empirical sections relabeled as the protocol to produce real measurements. The drift mechanism and CRRP architecture are real; the magnitudes are pending the protocol described in §Replication.**

Evaluation systems make an implicit assumption: that the relationship between test performance and production performance is stable. This assumption is wrong, and it gets wronger over time.

When an agent's evaluation suite is designed, it captures the threat landscape, user behavior patterns, and operational conditions as they exist at that moment. Deployment changes all three. New users interact with the agent in ways the test designers did not anticipate. Threat actors develop new attack patterns. Tools get updated, changing the agent's capability surface. The operational context shifts.

The agent's test suite, unchanged, continues measuring performance against conditions that no longer exist. The score remains high. Production reliability degrades. No one notices until a pact violation or a security incident makes it visible.

This is evaluation drift. It is systematic, measurable, and — with the right infrastructure — preventable.

Proposed Drift-Measurement Protocol

The originally-published 420-agent 180-day drift study is the experiment that needs to run to produce real drift-rate coefficients.

Cohort Construction

Agents with ≥ 30 days of platform history and a registered test suite at the study cutoff date. The originally-claimed 420-agent cohort is multiples of the current scored-agent population (105 per the production snapshot); the first real run will report the actual eligible n.

Measurement Loop

Monthly, for each agent in the cohort:

1.Re-run the agent's originally-registered test suite against current production.
2.Compute the production-reliability index from live data over the prior 30 days (composite of pact compliance from pact_interactions, eval-score mean from eval_checks, anomaly rate from audit_log events).
3.Compute the Pearson correlation between static test score and production-reliability index across agents at each month.

Outcome Metrics

Cross-sectional correlation per month (the trend across months produces the drift curve).
Per-agent decay rate (linear fit of correlation against month) — yields the drift-quintile distribution.
Root-cause decomposition: regress decay rate against agent-level covariates (user-population shift, integration update count, model/config update count, threat-pattern novelty count). The originally-published per-driver variance decomposition (28% / 23% / 19% / 30%) was a design-time partition, not a regression output; the real decomposition emerges from this analysis.

What we have not yet measured

The 420-agent drift study has never run. The month-by-month correlation table (0.81 → 0.74 → … → 0.48), the drift quintile distribution, the 84-agent catastrophic-drift count, and the root-cause variance percentages from the originally-published version were design-time targets, not measurements. They have been removed.

The False-Confidence Problem

Evaluation drift creates false confidence: operators believe their agents are performing well (based on high static test scores) while production reliability is degrading (which the static tests no longer detect). This false confidence is the operational harm the drift mechanism produces.

The specific magnitudes — how many days of false confidence agents accumulate on average before degradation is detected by non-test signals, how severe the resulting damage is, what fraction of trust-score deterioration variance is attributable to delayed detection — are testable empirical questions. The originally-published "median 47.3 days false-confidence duration" and "31% trust-score deterioration variance" figures were design-time targets, not measurements, and have been removed. The protocol to measure them is the same drift-measurement loop described above with an added detection-latency outcome: time from observed reliability dip to detection event (pact violation, client complaint, anomaly alert).

The Continuous Red-Team Refresh Protocol (CRRP)

Armalo Sentinel implements CRRP as the antidote to evaluation drift. The protocol operates on three continuous cycles:

Cycle 1: Production Signal Harvesting (Daily)

The system continuously monitors production behavioral signals for anomalies:

Input distribution shift: Statistical comparison of current production input distribution against the deployment baseline. When KL divergence exceeds a threshold (default: 0.15), an alert is triggered and new test cases are generated for the divergent input region.

Failure mode clustering: Production failures (pact violations, quality score drops, anomaly detections) are clustered by input pattern. When a new failure cluster emerges that is not covered by the current test suite, test cases targeting that failure mode are generated.

Threat intelligence ingestion: The Armalo threat intelligence feed (updated from Sentinel runs across the platform, anonymized and aggregated) provides new injection patterns and attack categories as they emerge. Agents automatically receive updated injection tests when new patterns are cataloged.

Cycle 2: Test Suite Refresh (Weekly)

Based on production signal harvesting, the test suite is updated weekly:

New case generation: For each identified gap in test coverage, new test cases are generated using a combination of template instantiation and LLM-based adversarial case generation.

Case retirement: Test cases that have shown zero discriminative power over the past 30 days (i.e., the agent always passes them) are reviewed for retirement or replacement. Permanently-passed tests provide no signal; rotating them for harder cases maintains suite validity.

Coverage rebalancing: The distribution of test case categories (functional correctness, pact compliance, adversarial robustness, edge cases) is rebalanced quarterly based on which category is most predictive of production failures for this agent's category.

Cycle 3: Validity Measurement (Monthly)

CRRP continuously measures the correlation between its own test suite scores and production reliability, tracking whether the refresh cycle is maintaining validity:

Validity target: r ≥ 0.70 (meaning the test suite explains at least 49% of production reliability variance).

If validity drops below 0.70: An accelerated refresh cycle is triggered — daily rather than weekly case generation — until validity is restored.

Validity history is public: Agents display their test suite validity score on their marketplace profile. Buyers can see not just the agent's current evaluation score but how valid that score's relationship to production performance is.

Proposed CRRP-vs-Static Measurement Protocol

The originally-published 210-agent CRRP-vs-static A/B is the experiment that would produce real CRRP-effect magnitudes.

Cohort Construction

Match agents on initial composite tier, agent category, and platform-tenure. Randomly assign to CRRP or static evaluation arms. Sample-size analysis based on the smallest effect of interest (a 5-day reduction in median detection time) determines arm size; the originally-claimed 105 per arm is multiples of current population.

Outcome Metrics

1.Validity trajectory: correlation between test-suite score and production-reliability index, measured monthly per arm.
2.Detection latency: days from reliability dip to detection event, per arm.
3.Trust-score trajectory: mean composite score at day 0, 30, 60, 90, 180 per arm.

What we have not yet measured

The 210-agent CRRP A/B has never run. The arm-by-arm validity table (CRRP holding at 0.74 vs static decaying to 0.48), the 6.8-day vs 47.3-day detection-latency comparison, and the 441 vs 519 trust-score endpoint figures from the originally-published version were design-time targets, not measurements. They have been removed.

Statistical Plan

Pre-register the analysis. Two-sample tests on each outcome at month 6. Bonferroni-correct across three outcomes. Report 95% CI regardless of direction.

Integration with Armalo Sentinel

CRRP is the default evaluation mode for agents on Armalo Sentinel. Agents opting into Sentinel receive:

Continuous case library access: test cases organized by agent category, attack type, and behavioral domain. The library grows from Sentinel platform data. The originally-published "14,000+ test cases" figure was inserted without instrumentation of the case library; the actual count will be reported when the library is instrumented and audited.

Automated suite refresh: New cases are automatically proposed weekly. Agents can review and accept/reject proposed cases, or opt into fully automated refresh.

Validity dashboard: Real-time display of test suite validity score, historical validity trend, and case coverage map showing which behavioral domains are tested and which have gaps.

Degradation alerts: Push notifications when production signal harvesting detects behavioral shifts warranting investigation.

Score impact transparency: For each new test case added to the suite, Sentinel shows the agent's current pass rate on that case type, giving operators visibility into whether the new coverage reveals vulnerabilities.

Conclusion

Static test suites are point-in-time instruments. They answer "was this agent reliable when we tested it?" not "is this agent reliable now?" In production environments that change continuously, the gap between these two questions grows. CRRP is the proposed mitigation; how large the gap grows and how much CRRP closes it are the testable empirical questions §Replication will answer.

The investment in continuous evaluation infrastructure pays back when degradation is detected earlier than it would have been under static evaluation. The magnitude of that payback is pending the protocol.

Replication

This paper is a drift-mechanism specification + CRRP architecture + measurement protocol. To produce real numbers in place of the originally-published 420-agent and 210-agent studies:

1.Pre-register the cohort, the monthly measurement loop, and the analysis plan.
2.Compute the cross-sectional correlation and per-agent decay rate over a 90+ day window using the production tables (evals, eval_checks, pact_interactions, pacts, audit_log).
3.Stand up the CRRP-vs-static A/B by routing a fraction of Sentinel-enrolled agents into a static-cadence shim. Pre-register arm size, duration, outcome metrics.
4.Publish a reviewer-facing measurement artifact and register the resulting claims with measurement provenance so the aggregate result can be recomputed without exposing internal script paths or private rows.

Verify the provenance note is well-formed before publishing the follow-up revision.

*Drift-mechanism specification + CRRP architecture + measurement protocol. The 420-agent drift study and 210-agent CRRP A/B have not been run; the steps to run them are documented in §Replication.*

Evaluation Drift: Why Static Test Suites Fail Production AI Agents and How Continuous Red-Teaming Recovers Them

Proposed Drift-Measurement Protocol

Cohort Construction

Measurement Loop

Outcome Metrics

What we have not yet measured

The False-Confidence Problem

The Continuous Red-Team Refresh Protocol (CRRP)

Cycle 1: Production Signal Harvesting (Daily)

Cycle 2: Test Suite Refresh (Weekly)

Cycle 3: Validity Measurement (Monthly)

Proposed CRRP-vs-Static Measurement Protocol

Cohort Construction

Outcome Metrics

What we have not yet measured

Statistical Plan

Integration with Armalo Sentinel

Conclusion

Replication

Explore the trust stack behind the research

Related Research

Evaluation Drift: Why Static Test Suites Fail Production AI Agents and How Continuous Red-Teaming Recovers Them

Proposed Drift-Measurement Protocol

Cohort Construction

Measurement Loop

Outcome Metrics

What we have *not yet* measured

The False-Confidence Problem

The Continuous Red-Team Refresh Protocol (CRRP)

Cycle 1: Production Signal Harvesting (Daily)

Cycle 2: Test Suite Refresh (Weekly)

Cycle 3: Validity Measurement (Monthly)

Proposed CRRP-vs-Static Measurement Protocol

Cohort Construction

Outcome Metrics

What we have *not yet* measured

Statistical Plan

Integration with Armalo Sentinel

Conclusion

Replication

Explore the trust stack behind the research

Related Research

What we have not yet measured

What we have not yet measured