Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-jury-eval-failure-analysis. The paper is publicly available and citable.

The Jury Problem: LLM-as-Judge Evaluators Fail 62.4% of Checks While Safety Checks Pass at 94.7%

Q: What is the paper "The Jury Problem: LLM-as-Judge Evaluators Fail 62.4% of Checks While Safety Checks Pass at 94.7%" about?

We analyze 8,726 AI agent eval checks across 17 categories and find pass rates that range from 100% (safety_check, prompt_injection_check, pii_check, output_format, data_exfiltration) down to low or zero (red-team: 23.8%, heuristic: 7.1%, hallucination_check: 0%). Most critically, jury checks (evaluations performed by an LLM-as-judge consensus mechanism) fail 62.4% of the time, making them simultaneously the most informative and least reliable evaluation category. Reliability checks (68.4% pass rate) are the highest-volume failing category: they account for 38.7% of all failures while being only the second-largest category by execution count. We argue that the failure rate distribution reveals structural properties of the evaluation landscape: categories with algorithmic evaluation criteria converge on high pass rates, while categories that require holistic judgment (jury, red-team, heuristic) fail frequently and provide the most useful signal about genuine behavioral gaps. All data are reproducible from the committed measurement producer.

Evaluation of AI agent behavior is not a single activity — it is a family of activities with radically different reliability characteristics. A safety check that verifies the agent's output doesn't contain a known prohibited phrase operates on algorithmic criteria that are easy to evaluate and produce consistent results. A jury check that asks three independent LLM judges whether an agent's decision was wise and well-reasoned operates on holistic criteria that are difficult to evaluate and produce inconsistent results.

Understanding the failure rate distribution across evaluation categories is essential for interpreting what an agent's overall eval score means. A high score from safety checks carries different information than a high score from jury checks. This paper reports the first empirical characterization of that distribution from a production agent evaluation system.

1. Dataset Overview

We analyze 8,726 eval checks executed between 2026-05-11 and 2026-05-18 across 1,352 evaluations covering 159 agents. The checks span 17 distinct categories.

Aggregate statistics:

Pass rate: 82.39% (7,189 passed, 1,537 failed)
Mean execution duration: 324ms per check
p50 execution time: 2,800ms
p95 execution time: 8,438ms

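The 324ms mean and the 2,800ms median cannot describe the same set of checks: for non-negative durations the mean is at least half the median, which here would be 1,400ms. The two figures were presumably computed over different populations, for example the mean over all checks and the percentiles over only the slower, LLM-backed ones. What the percentiles do show is a long tail: half of the timed checks take 2.8 seconds or more, and 5% take longer than about 8.4 seconds, most likely LLM-backed jury evaluations.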
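The aggregate figures can be checked with a few lines of arithmetic. The sketch below recomputes the pass rate from the stated counts and applies a general bound (for non-negative values, the mean is at least half the median) to the reported durations. The function names are illustrative and are not taken from the measurement producer.

```python
def pass_rate(passed: int, failed: int) -> float:
    """Pass rate as a percentage of all executed checks."""
    total = passed + failed
    return 100.0 * passed / total


def min_mean_given_median(median: float) -> float:
    """Smallest possible mean of non-negative values with this median.

    At least half of the values are >= the median, so the mean
    cannot be lower than median / 2.
    """
    return median / 2.0


# Counts and durations as reported in the aggregate statistics.
rate = pass_rate(passed=7_189, failed=1_537)
print(f"pass rate: {rate:.2f}%")

reported_mean_ms, reported_p50_ms = 324, 2_800
lower_bound = min_mean_given_median(reported_p50_ms)
consistent = reported_mean_ms >= lower_bound
print(f"mean >= median/2 ({lower_bound:.0f}ms)? {consistent}")
```

For the reported figures the second line prints False, since 324ms is below the 1,400ms bound.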

2. Pass Rate by Category

The 17-category breakdown shows the full range of evaluation reliability:

Category	Total	Passed	Pass Rate
safety_check	16	16	100.0%
prompt_injection_check	10	10	100.0%
pii_check	8	8	100.0%
output_format	4	4	100.0%
data_exfiltration	4	4	100.0%

3. The Five Categories That Never Fail

Five categories show a 100% pass rate: safety_check, prompt_injection_check, pii_check, output_format, data_exfiltration. These are among the smallest categories (4–16 checks each) and share a structural property: they evaluate against algorithmically deterministic criteria. Either the output contains a prohibited pattern or it does not. Either the format is correct or it is not.
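To make "algorithmically deterministic" concrete, here is a minimal sketch of two such checks. The pattern list, the function names, and the choice of JSON as the expected format are all assumptions for illustration; the paper does not describe how the production checks are implemented.

```python
import json
import re

# Hypothetical pattern list; a real deployment would load its own rules.
PROHIBITED_PATTERNS = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-like digits


def passes_pattern_check(output: str) -> bool:
    """Pass only if no prohibited pattern appears in the output."""
    return not any(p.search(output) for p in PROHIBITED_PATTERNS)


def passes_format_check(output: str) -> bool:
    """Pass only if the output parses as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False
```

Both functions return the same verdict every time for the same input, which is the property that separates this tier from jury evaluation.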

The interpretation is not that AI agents are perfect at these tasks. It is that these categories are run against the agents the system is most confident will pass, so a selection effect is at work. Agents subjected to red-team checks and jury evaluations are being tested against harder standards.
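Sample size is a further reason not to read 100% as "perfect". As a rough illustration (this calculation is not part of the paper), the Wilson score interval gives a lower bound on the underlying pass rate that is consistent with the observed counts. The function name and the 95% level are choices made for this sketch.

```python
import math


def wilson_lower_bound(passed: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 ~ 95%)."""
    if total == 0:
        return 0.0
    p = passed / total
    denom = 1 + z * z / total
    center = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (center - margin) / denom


# Counts from the category table: every check passed in each category.
for name, n in [("safety_check", 16), ("prompt_injection_check", 10),
                ("pii_check", 8), ("output_format", 4), ("data_exfiltration", 4)]:
    print(f"{name}: {n}/{n} passed, lower bound {wilson_lower_bound(n, n):.1%}")
```

With only 4 of 4 checks passing, the lower bound is roughly 51%, so the smallest categories say little about the true pass rate.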

4. The Reliability Gap

The single most important finding in this dataset is not the jury failure rate — it is the reliability category, which accounts for:

21.6% of all checks (1,884 of 8,726)
38.7% of all failures (595 of 1,537 total failures)
A pass rate of 68.4%

Reliability is the category with the worst combination of high volume and low pass rate. Its 595 failures exceed those of the next-highest failure category (jury: 482 failures) by 113 in absolute terms.
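The shares quoted above follow directly from the counts in the text. The short sketch below recomputes them for the reliability and jury categories; the function name is illustrative.

```python
TOTAL_CHECKS = 8_726
TOTAL_FAILURES = 1_537


def category_shares(total: int, failed: int) -> tuple:
    """Return (% of all checks, % of all failures, pass rate %) for a category."""
    return (
        round(100 * total / TOTAL_CHECKS, 1),
        round(100 * failed / TOTAL_FAILURES, 1),
        round(100 * (total - failed) / total, 1),
    )


# (total checks, failed checks) as stated in the text.
for name, (total, failed) in {"reliability": (1_884, 595), "jury": (773, 482)}.items():
    print(name, category_shares(total, failed))
```

Taken together, the two categories account for 1,077 of the 1,537 failures.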

Reliability checks test whether agents produce consistent outputs for similar inputs, maintain stable behavior across execution contexts, and don't exhibit erratic performance variance. A 68.4% pass rate for reliability means that roughly 1 in 3 reliability checks detects an inconsistency or instability in agent output.
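One simple way to test for consistent outputs on the same input is to call the agent several times and measure how often the most common answer appears. The sketch below assumes this approach; `agent`, `runs`, and `min_agreement` are hypothetical, and the production reliability checks may work quite differently.

```python
from collections import Counter
from typing import Callable


def consistency_check(agent: Callable[[str], str], prompt: str,
                      runs: int = 5, min_agreement: float = 0.8) -> bool:
    """Pass if the most common output accounts for enough of the runs.

    `agent`, `runs`, and `min_agreement` are hypothetical; exact string
    matching after whitespace/case normalization stands in for whatever
    similarity measure a real reliability check would use.
    """
    outputs = [agent(prompt).strip().lower() for _ in range(runs)]
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / runs >= min_agreement
```

An agent that alternates between two answers across five runs reaches at most 60% agreement and would fail at the default threshold.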

This is a deeper problem than confabulation or safety violations. An agent that confabulates occasionally can be monitored and corrected. An agent that produces inconsistent outputs for equivalent inputs is fundamentally unreliable in a way that is hard to bound. Users and operators cannot develop calibrated expectations for its behavior.

5. The Jury Problem in Detail

Jury checks (773 total, 37.6% pass rate, 482 failures) represent a structural challenge for agent evaluation. The jury mechanism convenes multiple LLM evaluators and asks them to render a holistic judgment about agent behavior quality — was the agent's reasoning sound? Was its approach appropriate? Did it exhibit good judgment?
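The paper does not specify how the jury combines individual verdicts. As a minimal sketch, assuming each judge returns a boolean and the consensus rule is a strict majority, the mechanism could look like this. The `Judge` type is hypothetical, and in practice each judge would wrap an LLM call rather than a plain function.

```python
from typing import Callable, Sequence

Judge = Callable[[str], bool]  # hypothetical: True means "acceptable"


def jury_verdict(judges: Sequence[Judge], transcript: str) -> bool:
    """Pass only if a strict majority of judges accept the transcript."""
    votes = [judge(transcript) for judge in judges]
    return sum(votes) > len(votes) / 2
```

Under this rule a three-judge panel passes on a 2-1 or 3-0 vote, and an even-sized panel fails on a tie.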

The 62.4% failure rate has two possible interpretations:

Interpretation A: The jury is too strict. The evaluation criteria are calibrated at a standard that most agent behavior cannot meet, and a 62.4% failure rate is expected given the ambitious benchmark.

Interpretation B: The jury is accurate. Most of what gets evaluated by a jury-level mechanism is genuinely suboptimal agent behavior, and the 62.4% failure rate reflects the actual quality distribution.

These interpretations are not mutually exclusive. The jury mechanism is deliberately reserved for interactions that warrant holistic evaluation — if only clearly good behavior were submitted for jury review, the pass rate would be higher. The selection of interactions for jury evaluation may itself introduce a negative quality bias.

What is not ambiguous: the jury category is the most informative evaluation category. Safety checks that pass 94.7% of the time tell you little about what distinguishes good agents from mediocre ones. Jury checks that fail 62.4% of the time are actively identifying behavioral gaps that matter.
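One way to make "informative" concrete is the Shannon entropy of the pass/fail outcome: a check whose result is almost always the same carries little information per execution. This framing is an illustration added here, not a measure used in the paper, and it captures only how unpredictable a verdict is, not whether the verdict is correct.

```python
import math


def outcome_entropy_bits(pass_rate: float) -> float:
    """Shannon entropy (bits) of a pass/fail outcome with this pass rate."""
    if pass_rate in (0.0, 1.0):
        return 0.0
    fail_rate = 1.0 - pass_rate
    return -(pass_rate * math.log2(pass_rate) + fail_rate * math.log2(fail_rate))


for label, rate in [("safety (94.7% pass)", 0.947), ("jury (37.6% pass)", 0.376)]:
    print(f"{label}: {outcome_entropy_bits(rate):.2f} bits per check")
```

By this measure a jury check yields close to one bit per execution, roughly three times the figure for a check that passes 94.7% of the time.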

6. Red-Team and Heuristic: The Smallest, Hardest Categories

Red-team checks (23.8% pass, 16 of 21 failed) and heuristic checks (7.1% pass, 13 of 14 failed) represent the adversarial tier of evaluation. These checks are designed to find failures, not validate success.

A 23.8% pass rate for red-team checks means that the adversarial agent probing for pact violations and behavioral boundary failures found exploitable weaknesses in roughly 3 out of 4 tested interactions. The fact that 23.8% of red-team checks pass at all indicates that some agent behaviors are resilient even under adversarial conditions.

The 7.1% heuristic pass rate (1 success in 14 checks) reflects the same adversarial selection: heuristic checks are applied where rule-based evaluation is expected to surface known problematic patterns.

7. Architecture Implications

The failure rate stratification suggests a three-tier structure for eval system design:

Algorithmic deterministic checks (safety patterns, format validation, PII detection) should be applied early and broadly — they're cheap, fast, and provide high-confidence verdicts on the binary dimensions they cover.

Performance checks (latency, reliability) should be monitored continuously — their moderate failure rates indicate ongoing operational health issues that compound over time.

Jury and adversarial checks are expensive and should be applied strategically — to interactions selected for quality concerns, to agents under certification review, and to behavioral anomalies flagged by monitoring. Their high failure rates make them valuable precisely because of that selectivity.
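The first and third tiers described above can be sketched as a small pipeline: deterministic checks run on every output, and expensive checks run only when an interaction has been flagged. The class, its fields, and the `flagged` parameter are hypothetical names for this sketch, and continuous performance monitoring is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Check = Callable[[str], bool]


@dataclass
class TieredEvaluator:
    """Hypothetical pipeline: cheap checks always run, costly ones on demand."""
    deterministic: List[Check] = field(default_factory=list)
    expensive: List[Check] = field(default_factory=list)

    def evaluate(self, output: str, flagged: bool = False) -> dict:
        # Tier 1: always run the cheap, deterministic checks.
        results = {"deterministic": all(c(output) for c in self.deterministic)}
        # Tier 3: run jury/adversarial checks only for flagged interactions.
        if flagged:
            results["expensive"] = all(c(output) for c in self.expensive)
        return results
```

A caller would set `flagged=True` for interactions selected for quality concerns, for agents under certification review, or for anomalies raised by monitoring.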

An agent with a high composite score should have passed a substantial number of jury checks. An agent whose composite score is built entirely from high-pass-rate algorithmic checks has not been subjected to the hardest tests.
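A simple guard against scores built only from easy checks is to report what fraction of an agent's passed checks came from the hard categories. The sketch below assumes the hard set is the three categories the text describes as requiring holistic judgment; the function name and the input shape are hypothetical.

```python
from typing import Dict, Set

# Assumption: the categories the text describes as requiring holistic judgment.
HARD_CATEGORIES: Set[str] = {"jury", "red-team", "heuristic"}


def hard_check_coverage(passed_by_category: Dict[str, int]) -> float:
    """Fraction of an agent's passed checks that came from hard categories."""
    total = sum(passed_by_category.values())
    if total == 0:
        return 0.0
    hard = sum(n for cat, n in passed_by_category.items() if cat in HARD_CATEGORIES)
    return hard / total
```

A coverage of 0.0 alongside a high composite score would indicate an agent that has only been measured on the categories where nearly everything passes.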

Replication

The published measurement artifact named in the claims registry is the reproducibility anchor; reviewers can recompute the aggregates from that artifact without public exposure of internal runner paths.

Raw data: the published measurement artifact.