Evaluation of AI agent behavior is not a single activity; it is a family of activities with radically different reliability characteristics. A safety check that verifies the agent's output doesn't contain a known prohibited phrase operates on algorithmic criteria that are easy to evaluate and produce consistent results. A jury check that asks three independent LLM judges whether an agent's decision was wise and well-reasoned operates on holistic criteria that are difficult to evaluate and produce inconsistent results.
Understanding the failure rate distribution across evaluation categories is essential for interpreting what an agent's overall eval score means. A high score from safety checks carries different information than a high score from jury checks. This paper reports the first empirical characterization of that distribution from a production agent evaluation system.
1. Dataset Overview
We analyze 8,726 eval checks executed between 2026-05-11 and 2026-05-18 across 1,352 evaluations covering 159 agents. The checks span 17 distinct categories.
Aggregate statistics:
- Pass rate: 82.39% (7,189 passed, 1,537 failed)
- Mean execution duration: 324ms per check
- p50 execution time: 2,800ms
- p95 execution time: 8,438ms
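The 324ms mean and the 2,800ms median cannot both describe the same set of nonnegative durations: a 2,800ms median implies a mean of at least 1,400ms. The two figures were presumably computed over different subsets of checks, so they should not be read together as evidence of skew. The percentiles on their own do show a long tail: half of the measured checks take 2,800ms or more, and a tail of expensive checks (likely LLM-backed jury evaluations) extends to 8,438ms at p95 and beyond.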
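As a minimal sketch of how the aggregate figures above could be computed from raw check records: the record shape (`passed`, `duration_ms`) and the nearest-rank percentile method are assumptions for illustration, not the production system's schema or method.

```python
import math
from statistics import mean

def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]; expects a sorted, non-empty list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def summarize(checks):
    """Aggregate pass rate and duration statistics over a list of check records."""
    durations = sorted(c["duration_ms"] for c in checks)  # hypothetical field name
    passed = sum(1 for c in checks if c["passed"])        # hypothetical field name
    return {
        "total": len(checks),
        "passed": passed,
        "failed": len(checks) - passed,
        "pass_rate_pct": round(100 * passed / len(checks), 2),
        "mean_ms": mean(durations),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
    }

# The reported counts reproduce the reported pass rate.
reported_pass_rate = round(100 * 7189 / 8726, 2)

# Toy records, not taken from the dataset.
sample = [
    {"passed": True, "duration_ms": 100},
    {"passed": True, "duration_ms": 200},
    {"passed": False, "duration_ms": 300},
    {"passed": True, "duration_ms": 5000},
]
summary = summarize(sample)
```

Computing the mean and the percentiles from the same `durations` list keeps the three statistics directly comparable.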
2. Pass Rate by Category
The 17 categories span the full range of evaluation reliability. The rows below cover the five categories at the top of that range:
| Category | Total | Passed | Failed | Pass Rate |
|---|---|---|---|---|
| safety_check | 16 | 16 | 0 | 100.0% |
| prompt_injection_check | 10 | 10 | 0 | 100.0% |
| pii_check | 8 | 8 | 0 | 100.0% |
| output_format | 4 | 4 | 0 | 100.0% |
| data_exfiltration | 4 | 4 | 0 | 100.0% |
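A per-category breakdown like the table above can be derived by grouping check records by category. The sketch below is illustrative only: the record fields and the `red_team` identifier spelling are assumptions, while the counts are the ones reported in this paper (the red-team counts come from Section 6).

```python
from collections import Counter

def pass_rates_by_category(checks):
    """Return (category, total, passed, failed, pass_rate_pct) rows, best first."""
    totals, passed = Counter(), Counter()
    for check in checks:
        totals[check["category"]] += 1
        passed[check["category"]] += int(check["passed"])
    rows = [
        (cat, n, passed[cat], n - passed[cat], round(100 * passed[cat] / n, 1))
        for cat, n in totals.items()
    ]
    # Sort by pass rate (descending), then by volume (descending).
    return sorted(rows, key=lambda r: (-r[4], -r[1]))

# Rebuild toy records from reported (total, passed) counts.
reported = {
    "safety_check": (16, 16),
    "pii_check": (8, 8),
    "red_team": (21, 5),  # identifier spelling is an assumption
}
checks = [
    {"category": cat, "passed": i < ok}
    for cat, (n, ok) in reported.items()
    for i in range(n)
]
rows = pass_rates_by_category(checks)
for row in rows:
    print(row)
```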
3. The Five Categories That Never Fail
Five categories show a 100% pass rate: safety_check, prompt_injection_check, pii_check, output_format, data_exfiltration. These are the smallest categories (4 to 16 checks each) with a common structural property: they evaluate against algorithmically deterministic criteria. Either the output contains a prohibited pattern or it does not. Either the format is correct or it is not.
The interpretation is not that AI agents are perfect at these tasks. The interpretation is that these are the categories where the system deploys agents it is most confident won't fail; the selection effect is real. Agents subjected to red-team checks and jury evaluations are being tested against harder standards.
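To make "algorithmically deterministic" concrete, here is a minimal sketch of two such checks. The prohibited pattern, the JSON output format, and the function names are hypothetical and not drawn from the production system. The point is that the same input always yields the same verdict.

```python
import json
import re

# Hypothetical prohibited pattern: a US SSN-like number.
PROHIBITED_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]

def contains_prohibited(output, patterns=PROHIBITED_PATTERNS):
    """True if any prohibited pattern matches anywhere in the output."""
    return any(p.search(output) for p in patterns)

def has_valid_format(output, required_keys):
    """True if the output parses as a JSON object containing all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(required_keys) <= parsed.keys()

leaky = contains_prohibited("The number is 123-45-6789.")
clean = contains_prohibited("No identifiers here.")
well_formed = has_valid_format('{"answer": "42"}', {"answer"})
malformed = has_valid_format("answer: 42", {"answer"})
```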
4. The Reliability Gap
The single most important finding in this dataset is not the jury failure rate but the reliability category, which accounts for:
- 21.6% of all checks (1,884 of 8,726)
- 38.7% of all failures (595 of 1,537 total failures)
- A pass rate of 68.4%
Reliability is the category with the worst combination of high volume and low pass rate. Its 595 failures exceed those of the next-highest failure category (jury: 482 failures) in absolute terms.
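The shares above follow directly from the reported counts. The short calculation below reproduces them and adds one derived figure, the ratio of failure share to volume share. That ratio is computed here from the reported numbers and is not a statistic reported in the dataset.

```python
TOTAL_CHECKS = 8726
TOTAL_FAILURES = 1537
RELIABILITY_CHECKS = 1884
RELIABILITY_FAILURES = 595

share_of_checks_pct = round(100 * RELIABILITY_CHECKS / TOTAL_CHECKS, 1)
share_of_failures_pct = round(100 * RELIABILITY_FAILURES / TOTAL_FAILURES, 1)
pass_rate_pct = round(
    100 * (RELIABILITY_CHECKS - RELIABILITY_FAILURES) / RELIABILITY_CHECKS, 1
)

# How over-represented reliability is among failures relative to its volume.
overrepresentation = (RELIABILITY_FAILURES / TOTAL_FAILURES) / (
    RELIABILITY_CHECKS / TOTAL_CHECKS
)

print(share_of_checks_pct, share_of_failures_pct, pass_rate_pct)
print(round(overrepresentation, 2))
```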
Reliability checks test whether agents produce consistent outputs for similar inputs, maintain stable behavior across execution contexts, and don't exhibit erratic performance variance. A 68.4% pass rate for reliability means that roughly 1 in 3 reliability checks detects an inconsistency or instability in agent output.
This is a deeper problem than confabulation or safety violations. An agent that confabulates occasionally can be monitored and corrected. An agent that produces inconsistent outputs for equivalent inputs is fundamentally unreliable in a way that is hard to bound. Users and operators cannot develop calibrated expectations for its behavior.
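One way to picture a reliability check of the kind described above is to feed an agent several equivalent inputs and measure how often its outputs agree. This is a sketch under stated assumptions: the agent is any callable from input to string, exact match after normalization stands in for a real equivalence test, and the `min_agreement` threshold is hypothetical.

```python
from collections import Counter

def consistency_check(agent, equivalent_inputs, normalize=str.strip, min_agreement=1.0):
    """Pass when the most common output covers at least min_agreement of the runs."""
    outputs = [normalize(agent(x)) for x in equivalent_inputs]
    top_count = Counter(outputs).most_common(1)[0][1]
    agreement = top_count / len(outputs)
    return {"passed": agreement >= min_agreement, "agreement": agreement}

# Toy agents: one stable, one sensitive to trailing punctuation.
def stable_agent(question):
    return "4"

def flaky_agent(question):
    return "4" if question.endswith("?") else "five"

paraphrases = ["What is 2+2?", "what is 2+2", "what is 2+2?"]
stable_result = consistency_check(stable_agent, paraphrases)
flaky_result = consistency_check(flaky_agent, paraphrases)
```

For free-text outputs, exact matching would be too strict; a semantic similarity measure would be a natural replacement for `normalize` plus equality.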
5. The Jury Problem in Detail
Jury checks (773 total, 37.6% pass rate, 482 failures) represent a structural challenge for agent evaluation. The jury mechanism convenes multiple LLM evaluators and asks them to render a holistic judgment about agent behavior quality: was the agent's reasoning sound? Was its approach appropriate? Did it exhibit good judgment?
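The jury mechanism can be sketched as a vote over independent judges. The source does not state how verdicts are aggregated, so the simple-majority rule, the `quorum` parameter, and the stub judges below are all assumptions; a real judge would wrap an LLM call.

```python
def jury_verdict(judges, transcript, quorum=None):
    """Collect one boolean vote per judge; pass when votes reach the quorum."""
    votes = [bool(judge(transcript)) for judge in judges]
    needed = quorum if quorum is not None else len(votes) // 2 + 1  # simple majority
    yes = sum(votes)
    return {
        "passed": yes >= needed,
        "votes": votes,
        "unanimous": yes in (0, len(votes)),
    }

# Stub judges standing in for LLM evaluators (hypothetical criteria).
def lenient_judge(transcript):
    return True

def strict_judge(transcript):
    return "because" in transcript  # demands an explicit rationale

def length_judge(transcript):
    return len(transcript) > 20

judges = [lenient_judge, strict_judge, length_judge]
majority = jury_verdict(judges, "I chose option A for speed.")
unanimity = jury_verdict(judges, "I chose option A for speed.", quorum=3)
```

The gap between `majority` and `unanimity` on the same transcript illustrates why the strictness of the aggregation rule bears on Interpretation A.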
The 62.4% failure rate has two possible interpretations:
Interpretation A: The jury is too strict. The evaluation criteria are calibrated at a standard that most agent behavior cannot meet, and a 62.4% failure rate is expected given the ambitious benchmark.
Interpretation B: The jury is accurate. Most of what gets evaluated by a jury-level mechanism is genuinely suboptimal agent behavior, and the 62.4% failure rate reflects the actual quality distribution.
These interpretations are not mutually exclusive. The jury mechanism is deliberately reserved for interactions that warrant holistic evaluation; if only clearly good behavior were submitted for jury review, the pass rate would be higher. The selection of interactions for jury evaluation may itself introduce a negative quality bias.
What is not ambiguous: the jury category is the most informative evaluation category. Safety checks that pass 100% of the time tell you little about what distinguishes good agents from mediocre ones. Jury checks that fail 62.4% of the time are actively identifying behavioral gaps that matter.
6. Red-Team and Heuristic: The Smallest, Hardest Categories
Red-team checks (23.8% pass, 16 of 21 failed) and heuristic checks (7.1% pass, 13 of 14 failed) represent the adversarial tier of evaluation. These checks are designed to find failures, not validate success.
A 23.8% pass rate for red-team checks means that the adversarial agent probing for pact violations and behavioral boundary failures found exploitable weaknesses in roughly 3 out of 4 tested interactions. The fact that 23.8% of red-team checks pass at all indicates that some agent behaviors are resilient even under adversarial conditions.
The 7.1% heuristic pass rate (1 success in 14 checks) reflects the same adversarial selection: heuristic checks are applied where rule-based evaluation is expected to surface known problematic patterns.
7. Architecture Implications
The failure rate stratification reveals a clear architecture for eval system design:
Algorithmic deterministic checks (safety patterns, format validation, PII detection) should be applied early and broadly: they're cheap, fast, and provide high-confidence verdicts on the binary dimensions they cover.
Performance checks (latency, reliability) should be monitored continuously; their moderate failure rates indicate ongoing operational health issues that compound over time.
Jury and adversarial checks are expensive and should be applied strategically: to interactions selected for quality concerns, to agents under certification review, and to behavioral anomalies flagged by monitoring. Their high failure rates make them valuable precisely because of that selectivity.
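The tiering described above can be sketched as a two-stage runner: cheap deterministic checks run on every interaction, and expensive checks run only when a selector escalates. All names here, the toy check rules, and the escalation policy are hypothetical illustrations rather than the production design.

```python
def run_tiered_eval(interaction, deterministic_checks, jury_checks, select_for_jury):
    """Run cheap checks on everything; run expensive checks only when selected."""
    results = {name: check(interaction) for name, check in deterministic_checks.items()}
    if select_for_jury(interaction, results):
        for name, check in jury_checks.items():
            results[name] = check(interaction)
    return results

# Hypothetical stand-ins for real checks.
deterministic_checks = {
    "output_format": lambda i: i["output"].strip() != "",
    "pii_check": lambda i: "@" not in i["output"],  # toy rule, not a real PII detector
}
jury_calls = []

def stub_jury(interaction):
    jury_calls.append(interaction["id"])  # record how often the expensive path runs
    return False  # a real jury would call LLM evaluators here

def flagged_or_failed(interaction, results):
    # Escalate when monitoring flagged the interaction or a cheap check failed.
    return interaction.get("flagged", False) or not all(results.values())

routine = run_tiered_eval(
    {"id": 1, "output": "ok"}, deterministic_checks, {"jury": stub_jury}, flagged_or_failed
)
escalated = run_tiered_eval(
    {"id": 2, "output": "ok", "flagged": True},
    deterministic_checks, {"jury": stub_jury}, flagged_or_failed,
)
```

Only the flagged interaction reaches the jury, so the expensive path runs once for two interactions.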
An agent with a high composite score should have passed a substantial number of jury checks. An agent whose composite score is built entirely from high-pass-rate algorithmic checks has not been subjected to the hardest tests.
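One way to act on this is to report a composite score alongside a count of passed hard checks. In the sketch below, the category identifiers `jury` and `red_team`, the record shape, and the threshold of three hard passes are assumptions chosen for illustration; the source does not define what counts as "substantial".

```python
def composite_with_coverage(checks, hard_categories=("jury", "red_team"), min_hard_passes=3):
    """Composite pass rate plus a flag for whether hard checks back it up."""
    passed = [c for c in checks if c["passed"]]
    hard_passes = sum(1 for c in passed if c["category"] in hard_categories)
    return {
        "composite_pct": round(100 * len(passed) / len(checks), 1),
        "hard_passes": hard_passes,
        "hard_tested": hard_passes >= min_hard_passes,
    }

# Agent A: perfect score built only from algorithmic checks.
easy_only = [{"category": "safety_check", "passed": True} for _ in range(10)]

# Agent B: slightly lower score, but it includes passed jury checks.
mixed = (
    [{"category": "safety_check", "passed": True} for _ in range(6)]
    + [{"category": "jury", "passed": True} for _ in range(3)]
    + [{"category": "jury", "passed": False}]
)

agent_a = composite_with_coverage(easy_only)
agent_b = composite_with_coverage(mixed)
```

Agent A scores higher on the composite alone, while only Agent B has any evidence from the hardest tests.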
Replication
node scripts/research-experiments/eval-check-failure-taxonomy-2026.mjs
Raw data: apps/web/content/research/data/eval-check-failure-taxonomy-2026.json