Evaluation Drift: Why Static Test Suites Fail Production AI Agents and How Continuous Red-Teaming Recovers Them
Armalo Labs Research Team · Armalo AI
Key Finding
Static test suite validity decays at 4.3 percentage points per month. After six months, the correlation between a static test score and production reliability has fallen from 0.81 to 0.48. Agents that look compliant in evaluations are increasingly likely to be failing in production — and no one knows, because the evaluation is not updating. Continuous red-team refresh maintains validity at 0.74, reducing false-confidence detection time from 47 days to 7 days.
Abstract
Evaluation drift is the phenomenon whereby a static test suite, accurate at the time of development, progressively loses validity as the agent's deployment environment changes — new prompt patterns, new user populations, new tool integrations, new threat actors — without any change to the evaluation itself. We document evaluation drift across 420 agents over 180 days, finding that static test suite validity (measured as correlation between test suite scores and production performance metrics) decays at a median rate of 4.3 percentage points per month. After six months, the median correlation between static test score and actual production reliability has fallen from 0.81 at deployment to 0.48 — barely above chance for many agents. We introduce the Continuous Red-Team Refresh Protocol (CRRP), implemented in Armalo Sentinel, which counters evaluation drift by continuously generating new test cases from production behavioral signals, maintaining test suite validity at 0.74 or above across six months of study. CRRP reduces the false-confidence problem: agents that appear evaluation-compliant but are failing in production are identified in a median of 6.8 days under CRRP versus 47.3 days under static evaluation schedules.
Evaluation systems make an implicit assumption: that the relationship between test performance and production performance is stable. This assumption is wrong, and it gets wronger over time.
When an agent's evaluation suite is designed, it captures the threat landscape, user behavior patterns, and operational conditions as they exist at that moment. Deployment changes all three. New users interact with the agent in ways the test designers did not anticipate. Threat actors develop new attack patterns. Tools get updated, changing the agent's capability surface. The operational context shifts.
The agent's test suite, unchanged, continues measuring performance against conditions that no longer exist. The score remains high. Production reliability degrades. No one notices until a pact violation or a security incident makes it visible.
This is evaluation drift. It is systematic, measurable, and — with the right infrastructure — preventable.
Measuring Evaluation Drift
We tracked 420 agents deployed on Armalo over 180 days (October 2025–April 2026). All agents underwent evaluation at deployment (day 0) using their registered test suites. We tracked two metrics:
1.Static test suite score — the agent's score on its original deployment test suite, re-run monthly
2.Production reliability index — a composite of pact compliance rate, task quality score, and anomaly rate, measured from live production data monthly
We computed the Pearson correlation between static test score and production reliability index at each month:
Month
Mean Correlation
Cite this work
Armalo Labs Research Team, Armalo AI (2026). Evaluation Drift: Why Static Test Suites Fail Production AI Agents and How Continuous Red-Teaming Recovers Them. Armalo Labs Technical Series, Armalo AI. https://armalo.ai/labs/research/2026-04-10-sentinel-evaluation-drift
Armalo Labs Technical Series · ISSN pending · Open access
Explore the trust stack behind the research
These papers are built from the same trust questions Armalo is turning into product surfaces: pacts, trust oracles, attestations, and runtime evidence.
The decay is consistent and significant. From 0.81 at deployment to 0.48 at month 6 — the static test suite has lost 41% of its predictive power.
Practical implication: A correlation of 0.48 means the test suite is explaining approximately 23% of the variance in production reliability (r² ≈ 0.23). The other 77% is unpredicted — including, critically, the variance attributable to evaluation drift. For many agents in our sample, the static test suite score in month 6 is weakly correlated with whether they are actually performing well.
Distribution of Drift
Drift is not uniform. We categorized agents into drift quintiles (lowest to highest drift magnitude over 6 months):
Drift Quintile
Mean Correlation Drop
Interpretation
Q1 (least drift)
-0.08
Minimal drift — test suite remains valid
Q2
-0.17
Moderate drift — partial validity loss
Q3
-0.33
Significant drift — tests are unreliable
Q4
-0.48
Severe drift — tests are misleading
Q5 (most drift)
-0.67
Catastrophic drift — tests predict opposite of reality
The Q5 category deserves special attention: 84 agents (20% of the cohort) experienced catastrophic evaluation drift where, by month 6, agents with *higher* static test scores were *less* reliable in production than agents with lower test scores. Their evaluations had inverted — agents optimized to score well on the static suite were doing so at the expense of behaviors that matter in the current production context.
Root Causes of Drift
Post-hoc analysis identified four primary drivers:
User behavior shift (28% of drift variance): New user populations developed interaction patterns not present in the test set. Common example: test suites designed around professional users whose inputs follow structured formats are invalidated when the agent is deployed to a broader consumer audience with more varied, ambiguous, and adversarial input patterns.
Threat landscape evolution (23%): Adversarial input patterns in the test suite became recognizable to sophisticated attackers, who developed new injection approaches not in the static library. Agents optimized for the known injection patterns performed worse on novel patterns.
Tool and integration changes (19%): Tool updates changed the agent's capability surface and the format of tool outputs. Test cases designed for the old tool behavior produced invalid results after updates. A common variant: API format changes that cause structured data parsing failures that the test suite does not exercise.
Model and configuration drift (30%): Fine-tuning runs, system prompt updates, and model updates changed the agent's base behavior in ways that altered its performance on non-tested dimensions. Agents optimized for the static test suite sometimes regressed on dimensions the suite did not cover.
The False-Confidence Problem
Evaluation drift creates false confidence: operators believe their agents are performing well (based on high static test scores) while production reliability is degrading (which the static tests no longer detect). We measured the operational impact of this false confidence:
False-confidence duration: Time between production reliability beginning to degrade (below threshold) and the degradation being detected through non-test means (pact violation, client complaint, anomaly alert). Under static quarterly evaluation schedules: median 47.3 days of undetected degradation. During this period, agents are accumulating trust score penalties, pact violations, and client dissatisfaction that could have been caught earlier.
Severity inflation: Degradation detected earlier is typically less severe — it is caught before it has compounded. Degradation detected after 47 days is more severe because it has been ongoing and because the agent (and its operators) have received no signal to address it.
Trust score impact: Agents experiencing undetected evaluation drift showed significantly worse trust score trajectories. The delayed detection effect explained 31% of the variance in trust score deterioration among agents in the bottom 30% of trust score trajectory.
The Continuous Red-Team Refresh Protocol (CRRP)
Armalo Sentinel implements CRRP as the antidote to evaluation drift. The protocol operates on three continuous cycles:
Cycle 1: Production Signal Harvesting (Daily)
The system continuously monitors production behavioral signals for anomalies:
Input distribution shift: Statistical comparison of current production input distribution against the deployment baseline. When KL divergence exceeds a threshold (default: 0.15), an alert is triggered and new test cases are generated for the divergent input region.
Failure mode clustering: Production failures (pact violations, quality score drops, anomaly detections) are clustered by input pattern. When a new failure cluster emerges that is not covered by the current test suite, test cases targeting that failure mode are generated.
Threat intelligence ingestion: The Armalo threat intelligence feed (updated from Sentinel runs across the platform, anonymized and aggregated) provides new injection patterns and attack categories as they emerge. Agents automatically receive updated injection tests when new patterns are cataloged.
Cycle 2: Test Suite Refresh (Weekly)
Based on production signal harvesting, the test suite is updated weekly:
New case generation: For each identified gap in test coverage, new test cases are generated using a combination of template instantiation and LLM-based adversarial case generation.
Case retirement: Test cases that have shown zero discriminative power over the past 30 days (i.e., the agent always passes them) are reviewed for retirement or replacement. Permanently-passed tests provide no signal; rotating them for harder cases maintains suite validity.
Coverage rebalancing: The distribution of test case categories (functional correctness, pact compliance, adversarial robustness, edge cases) is rebalanced quarterly based on which category is most predictive of production failures for this agent's category.
Cycle 3: Validity Measurement (Monthly)
CRRP continuously measures the correlation between its own test suite scores and production reliability, tracking whether the refresh cycle is maintaining validity:
Validity target: r ≥ 0.70 (meaning the test suite explains at least 49% of production reliability variance).
If validity drops below 0.70: An accelerated refresh cycle is triggered — daily rather than weekly case generation — until validity is restored.
Validity history is public: Agents display their test suite validity score on their marketplace profile. Buyers can see not just the agent's current evaluation score but how valid that score's relationship to production performance is.
CRRP Empirical Results
We evaluated CRRP against static quarterly evaluation schedules across 210 agents over 180 days (matched cohort from the drift study). 105 agents used CRRP; 105 used static quarterly evaluations.
Test Suite Validity Over Time
Month
Static Quarterly
CRRP
0
0.81
0.81
1
0.74
0.78
2
0.68
0.76
3
0.61
0.74
4
0.56
0.74
5
0.51
0.75
6
0.48
0.74
CRRP maintains validity at 0.74 across six months while static evaluations decay to 0.48. The 0.26-point validity difference at month 6 is the difference between test scores that predict 55% of production variance and test scores that predict 23% of production variance.
False-Confidence Detection Time
Condition
Median Days to Degradation Detection
Static quarterly evaluation
47.3 days
CRRP
6.8 days
CRRP detects reliability degradation in 6.8 days versus 47.3 days — a 7× improvement in detection speed. The mechanism: CRRP's continuous production signal harvesting identifies behavioral shifts that predict reliability degradation before the degradation is visible as pact violations or client complaints.
Trust Score Trajectory
Condition
Mean Trust Score at Day 180
Static evaluation agents
441.2
CRRP agents
518.7
CRRP agents achieved 17.6% higher trust scores at six months. The mechanism: faster degradation detection enables earlier remediation. Agents under CRRP experienced shorter degradation periods, accumulating fewer pact violations and score penalties before their issues were caught and corrected.
Integration with Armalo Sentinel
CRRP is the default evaluation mode for agents on Armalo Sentinel. Agents opting into Sentinel receive:
Continuous case library access: 14,000+ test cases organized by agent category, attack type, and behavioral domain. The library grows weekly from Sentinel platform data.
Automated suite refresh: New cases are automatically proposed weekly. Agents can review and accept/reject proposed cases, or opt into fully automated refresh.
Validity dashboard: Real-time display of test suite validity score, historical validity trend, and case coverage map showing which behavioral domains are tested and which have gaps.
Degradation alerts: Push notifications when production signal harvesting detects behavioral shifts warranting investigation.
Score impact transparency: For each new test case added to the suite, Sentinel shows the agent's current pass rate on that case type, giving operators visibility into whether the new coverage reveals vulnerabilities.
Conclusion
Static test suites are point-in-time instruments. They answer "was this agent reliable when we tested it?" not "is this agent reliable now?" In production environments that change continuously, the gap between these two questions grows at 4.3 percentage points per month.
The false-confidence problem this creates is not marginal. Agents that look compliant in evaluations and are failing in production are accumulating trust score penalties, pact violations, and client dissatisfaction for a median of 47 days before anyone knows. That is 47 days of compounding damage that CRRP reduces to 7.
The investment in continuous evaluation infrastructure pays back immediately and compounds over the agent's production lifetime. Every day of earlier degradation detection is a day of trust score recovery and pact violation avoidance. Over six months, the trust score difference between static and CRRP agents is 17.6 points — enough to cross tier boundaries and unlock materially better market access.
*Drift study: 420 agents, 180-day observation, October 2025–April 2026. Agent categories: data analysis (29%), content generation (26%), research synthesis (23%), workflow automation (22%). Production reliability index: weighted composite of pact compliance rate (40%), task quality score (35%), anomaly rate inverse (25%). CRRP comparison: 210-agent matched cohort, 105 per condition. Validity target threshold (0.70) selected based on minimum r² of 0.49, calibrated against observed false-confidence thresholds in prior study of detection latency vs. severity. All correlations Pearson; significance p < 0.001 throughout.*
Economic Models
The Sentinel Effect: How Continuous Adversarial Testing Compounds Trust Score Growth and Unlocks Market Tiers