Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-eval-check-failure-taxonomy. The paper is publicly available and citable.

Failure Taxonomy Across 8,726 Production Eval Checks: Jury Consensus vs. Rule-Based Safety

Q: What is the paper "Failure Taxonomy Across 8,726 Production Eval Checks: Jury Consensus vs. Rule-Based Safety" about?

We measured 8,726 eval checks across 17 categories in the Armalo production evaluation pipeline and found an 82.39% aggregate pass rate with a pronounced structural divide between check types. The most striking finding is an 11.8x failure-rate gap between jury consensus checks (62.4% failure) and safety rule checks (5.3% failure), which we attribute to the difficulty of achieving LLM consensus rather than to any deficiency in agent behavior. Reliability checks present a different problem: their 68.4% pass rate is mediocre rather than extreme, but the category produces 595 of the 1,537 total failures (38.7% of all failures by volume), making reliability calibration the single largest operational bottleneck. Together, these findings indicate that eval check design, not agent capability, explains most of the observed failure distribution and should be the primary lever for improving platform-wide evaluation quality.

Introduction

Eval checks are the atomic unit of the Armalo trust pipeline. Each check evaluates a specific behavioral claim about an agent — that it responded safely, that its output format was correct, that an LLM jury reached consensus on its answer quality — and returns a pass or fail result that feeds into the composite trust score. Pacts define which checks apply to which agents; the evaluation engine runs those checks against real agent outputs and accumulates the results.

Because checks are heterogeneous — spanning automated rule evaluation, multi-model LLM jury consensus, latency measurement, and red-team adversarial probing — their failure modes are equally heterogeneous. An eval pipeline that treats all check failures as equivalent events misunderstands its own signal. A jury check that fails because three LLMs disagreed on a nuanced safety boundary is structurally different from a latency check that fails because the agent's response time exceeded a threshold.

A production failure taxonomy serves two purposes. First, it reveals where the pipeline is well-calibrated versus where calibration is the problem rather than agent behavior. Second, it directs investment: the categories with the highest absolute failure volume deserve infrastructure investment, while categories with structurally high failure rates may need redesign of the check itself rather than pressure on the agents being evaluated.

This paper presents a complete failure taxonomy across 8,726 production eval checks, identifies the key structural divides, and draws out the implications for check design and calibration strategy.

Section 1: Measurement Design

An eval check in Armalo is a discrete assertion against an agent output. It has a category label that describes what type of claim is being evaluated, a pass or fail outcome, and a duration recording how long the check took to complete. The 17 categories in this dataset span the full range of check types deployed in production:

Rule-based deterministic checks: safety_check, prompt_injection_check, pii_check, toxicity_check, output_format, data_exfiltration, hallucination_check, deterministic. These run pattern-matching, schema validation, or classifier inference against fixed rules. Pass or fail is fully determined by the rule.
LLM jury checks: jury. These send agent outputs to multiple LLM judges and require agreement from a threshold number of judges to pass. Failure means the judges did not reach that threshold, either because they disagreed or because the majority ruled against the output.
Performance checks: latency. These measure whether response time falls within a defined bound.
Adversarial checks: red-team. These probe agents with crafted inputs designed to elicit failures.
Behavioral and aggregate checks: safety, reliability, behavioral, accuracy. These composite categories aggregate multiple underlying assertions about agent conduct over a sample of interactions.
Heuristic and unclassified: heuristic, unknown. These represent checks without a structured category assignment.

Pass is defined as the check assertion being satisfied; fail is defined as the assertion not being satisfied. Duration is measured from check invocation to result return.
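The check record and category grouping described above can be sketched as a small data model. This is a minimal illustration, not Armalo's schema: the class name EvalCheck, the field names, the CHECK_GROUPS dictionary, and its group keys are all assumptions introduced here; only the 17 category labels come from this section.

```python
from dataclasses import dataclass

# Grouping of the 17 category labels as listed in this section.
# The dictionary name and group keys are hypothetical, chosen for illustration.
CHECK_GROUPS = {
    "rule_based": [
        "safety_check", "prompt_injection_check", "pii_check", "toxicity_check",
        "output_format", "data_exfiltration", "hallucination_check", "deterministic",
    ],
    "llm_jury": ["jury"],
    "performance": ["latency"],
    "adversarial": ["red-team"],
    "behavioral_aggregate": ["safety", "reliability", "behavioral", "accuracy"],
    "heuristic_unclassified": ["heuristic", "unknown"],
}


@dataclass(frozen=True)
class EvalCheck:
    """One discrete assertion against an agent output (field names assumed)."""
    category: str       # one of the 17 category labels
    passed: bool        # True if the assertion was satisfied
    duration_ms: float  # time from check invocation to result return


def group_of(category: str) -> str:
    """Return the check-type group a category label belongs to."""
    for group, categories in CHECK_GROUPS.items():
        if category in categories:
            return group
    raise KeyError(category)


check = EvalCheck(category="jury", passed=False, duration_ms=2800.0)
print(group_of(check.category))
```

Keeping the group as a lookup rather than a stored field means a category can be reclassified without rewriting historical records.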

Section 2: Aggregate Results

Across 8,726 total checks, 7,189 passed and 1,537 failed, yielding an aggregate pass rate of 82.39%.

The duration statistics warrant investigation: the reported mean is 324.0ms, the p50 is 2,800.0ms, and the p95 is 8,437.6ms. A mean below the median normally indicates many very fast completions pulling the mean down, and the pipeline does mix deterministic rule-based checks (which complete in single-digit milliseconds) with LLM jury checks (which wait for multiple model invocations and consensus evaluation). That mix cannot fully account for the gap, however. For non-negative durations, at least half of the values sit at or above the median, so the mean cannot fall below half the median (1,400ms here). A mean of 324.0ms therefore suggests that the mean and the percentiles were computed over different subsets of checks, which should be verified against the measurement artifact. The mean should not be used as a planning number; the p50 and p95 are the operationally relevant figures for latency budgeting.
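One sanity check applies to any duration summary of this kind. For non-negative samples, at least half of the values are at or above the median, so the mean is bounded below by half the median. The sketch below encodes that bound (the function name is introduced here for illustration) and applies it to the figures reported above, alongside a toy mix of fast and slow checks whose values are invented for the example.

```python
from statistics import mean, median


def mean_is_consistent_with_median(mean_ms: float, p50_ms: float) -> bool:
    """For non-negative samples, mean >= p50 / 2 must hold."""
    return mean_ms >= p50_ms / 2


# Figures reported in this section.
reported_mean_ms, reported_p50_ms = 324.0, 2800.0
print(mean_is_consistent_with_median(reported_mean_ms, reported_p50_ms))

# Toy bimodal mix (invented values): 49 fast rule checks, 51 slow jury checks.
durations = [5.0] * 49 + [2800.0] * 51
print(mean(durations), median(durations))
print(mean_is_consistent_with_median(mean(durations), median(durations)))
```

Even with almost half the population completing in 5ms, the toy mean stays above 1,400ms. The reported mean of 324.0ms falls outside the bound, which points to the mean and the percentiles having been computed over different sets of rows rather than to skew alone.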

Section 3: Failure Taxonomy

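The table below is sorted by failure rate (highest to lowest), which is the order most relevant for calibration work. Only the five highest-failure categories are reproduced here; figures for several of the remaining categories are cited in the sections that follow.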

Category	Total	Passed	Failed	Pass Rate
unknown	14	0	14	0.0%
hallucination_check	2	0	2	0.0%
heuristic	14	1	13	7.1%
red-team	21	5	16	23.8%
jury	773	291	482	37.6%

The taxonomy immediately stratifies into three clusters: categories with effectively zero failures (the five 100%-pass categories), categories with moderate failure rates anchored by reliability and latency, and a high-failure cluster anchored by jury, red-team, heuristic, and the small unclassified categories.
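The pass and failure rates in the table follow directly from the counts. A minimal sketch that recomputes them and reproduces the failure-rate ordering is shown below; the rows are copied from the table above, and the variable and function names are illustrative.

```python
# (category, total, passed) rows copied from the table above.
ROWS = [
    ("unknown", 14, 0),
    ("hallucination_check", 2, 0),
    ("heuristic", 14, 1),
    ("red-team", 21, 5),
    ("jury", 773, 291),
]


def failure_rate(total: int, passed: int) -> float:
    """Fraction of checks in a category that failed."""
    return (total - passed) / total


# Highest failure rate first; sorted() is stable, so ties keep table order.
ranked = sorted(ROWS, key=lambda row: failure_rate(row[1], row[2]), reverse=True)
for category, total, passed in ranked:
    print(f"{category:<20} n={total:<4} fail={failure_rate(total, passed):.1%}")
```

Printing the sample size next to each rate makes it clear which rows rest on very few checks: only jury has a sample in the hundreds.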

Section 4: The Jury Gap

The most structurally important finding is the gap between jury checks and safety rule checks: 37.6% versus 94.7% pass rate, a difference of 11.8x in failure rate (62.4% vs. 5.3%).

This gap does not indicate that agents evaluated by jury checks are 11.8x worse than agents evaluated by safety rules. It indicates that the two check types are measuring fundamentally different things with fundamentally different difficulty profiles.

Safety rule checks pass or fail against a deterministic criterion. If the output does not contain a prompt injection pattern, the prompt injection check passes. There is no ambiguity, no judgment call, and no disagreement possible between evaluators. A well-designed rule that covers the space of violations it was written to detect will produce a high pass rate for compliant agents by construction.

Jury checks require consensus among multiple LLM judges on qualitative claims — whether an answer was accurate enough, whether the reasoning was sound, whether the response was appropriately calibrated. Multi-model consensus on qualitative claims is structurally hard. Different models have different priors about what counts as adequate accuracy or appropriate nuance. Even on cases where most human experts would agree, LLM judges diverge enough that consensus thresholds suppress the pass rate. The 62.4% failure rate on jury checks largely reflects this inherent difficulty rather than a population of agents consistently failing qualitative evaluation.
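The suppressing effect of a consensus threshold can be illustrated with a simple binomial model. The sketch assumes that each judge approves an output independently with the same probability, which is a simplification (real LLM judges are correlated) and not a description of the Armalo jury implementation. The function name and parameters are introduced here for illustration.

```python
from math import comb


def consensus_pass_probability(p_approve: float, n_judges: int, k_required: int) -> float:
    """P(at least k_required of n_judges approve), assuming independent judges."""
    return sum(
        comb(n_judges, k) * p_approve**k * (1 - p_approve) ** (n_judges - k)
        for k in range(k_required, n_judges + 1)
    )


# Per-judge approval probabilities are hypothetical inputs, not measured values.
for p in (0.7, 0.8, 0.9):
    print(p, round(consensus_pass_probability(p, n_judges=5, k_required=4), 3))
```

Under these assumptions, judges who each approve an output 80% of the time reach 4-of-5 agreement only about 74% of the time, so the check-level pass rate sits below the per-judge approval rate even when no judge is systematically wrong.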

This distinction has a direct implication for calibration: the jury failure rate should be interpreted as a signal about jury configuration (consensus threshold, judge selection, rubric specificity) before it is interpreted as a signal about agent quality. A 37.6% pass rate on jury checks in a production pipeline is a calibration finding, not necessarily a verdict on the agents being evaluated.

Section 5: Reliability as the Volume Bottleneck

Jury checks present the most extreme failure rate among categories with meaningful sample sizes, but reliability checks present the most consequential operational problem in absolute terms.

With 1,884 total checks and 595 failures, reliability represents 38.7% of all 1,537 failures in the dataset — more than one in three failures platform-wide. The 68.4% pass rate is neither excellent nor catastrophic, but the volume means that reliability calibration has the highest expected return on investment for reducing aggregate failures.
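The difference between ranking by failure rate and ranking by failure volume can be made concrete with the counts quoted in this paper. The sketch below uses only the reliability and jury figures stated in the text; the function and variable names are illustrative.

```python
TOTAL_FAILURES = 1537

# (total checks, failed checks) as quoted in this paper.
COUNTS = {"reliability": (1884, 595), "jury": (773, 482)}


def failure_share(failed: int, total_failures: int = TOTAL_FAILURES) -> float:
    """Fraction of all platform failures contributed by one category."""
    return failed / total_failures


for name, (total, failed) in COUNTS.items():
    print(f"{name}: failure rate {failed / total:.1%}, "
          f"share of all failures {failure_share(failed):.1%}")
```

Jury has roughly twice the failure rate of reliability, yet reliability contributes the larger share of failures because it runs about 2.4 times as many checks.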

Reliability checks measure whether agents consistently deliver on the behavioral promises encoded in their pacts: response time within bounds, task completion rate, output consistency across similar inputs. A 68.4% pass rate here indicates that a meaningful fraction of agents are failing to meet the reliability standards they have claimed — or, alternatively, that the reliability thresholds are set tighter than the current agent population can consistently achieve. Distinguishing these two explanations requires per-agent reliability trend analysis, but at the platform level the 595 failures establish the scale of the problem.

Section 6: Well-Performing Categories

Five categories achieved 100% pass rates: safety_check, prompt_injection_check, pii_check, output_format, and data_exfiltration. These share a structural property: they are all narrow, deterministic, and binary. Each tests whether a specific type of content is present or absent in the output — a phrasing that triggers a prompt injection pattern, a PII data structure, a format deviation, a data exfiltration signature.

The 100% pass rate is not evidence that the agents being evaluated are perfect; it is evidence that the checks are well-matched to what compliant agents naturally produce. Agents registered on Armalo have already been onboarded through pact definition, which creates a selection effect: agents that cannot satisfy basic safety and format requirements are unlikely to register and execute checks in the first place.

The two broader categories, behavioral (94.8%) and accuracy (95.0%), achieve similarly high pass rates at substantially larger sample sizes (1,740 and 1,060 checks respectively). This is stronger evidence of genuine platform quality on these dimensions than the small-sample 100% categories provide.

Section 7: Implications for Eval Design

The taxonomy points to three distinct intervention tracks.

Recalibrate jury checks. A 62.4% failure rate on jury is almost certainly a configuration issue before it is an agent quality issue. The immediate diagnostic is to examine jury consensus threshold, judge selection diversity, and rubric specificity for the check types generating the most failures. Lowering the consensus threshold from, say, 4-of-5 to 3-of-5 judges, or tightening the rubric to reduce evaluator ambiguity, will lower the failure rate without any change in agent behavior. The goal is a jury configuration that reliably distinguishes genuinely poor outputs from outputs that merely expose judge disagreement on difficult cases.
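The threshold change discussed above can be sketched as a one-line verdict rule. The function name and the vote representation are assumptions made for illustration; the paper does not specify how jury votes are stored.

```python
def jury_passes(votes: list[bool], k_required: int) -> bool:
    """True if at least k_required judges approved the output."""
    return sum(votes) >= k_required


# Hypothetical vote vector: 3 of 5 judges approve.
votes = [True, True, True, False, False]
print(jury_passes(votes, k_required=4))  # False under a 4-of-5 threshold
print(jury_passes(votes, k_required=3))  # True under a 3-of-5 threshold
```

The same votes flip from fail to pass with no change in the agent's output. A lower threshold also admits borderline outputs, so any change of this kind is best validated against examples whose quality is already known.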

Prioritize reliability infrastructure. 595 failures at 68.4% pass means reliability is the largest single lever for moving the aggregate pass rate. This warrants dedicated investigation: which agents are failing reliability checks, on which dimensions, and whether the thresholds are appropriately calibrated to what the platform can currently guarantee. Even a 5-point improvement in reliability pass rate would eliminate roughly 94 failures and move the aggregate from 82.39% toward 83.5%.
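The arithmetic behind the 5-point estimate can be reproduced from the aggregate counts. The sketch below assumes the improvement converts reliability failures into passes and leaves every other category unchanged; the function name is illustrative.

```python
TOTAL_CHECKS, TOTAL_PASSED = 8726, 7189
RELIABILITY_CHECKS = 1884


def aggregate_after_improvement(points: float) -> tuple[int, float]:
    """Return (failures removed, new aggregate pass rate) for a reliability
    pass-rate gain of `points` percentage points."""
    recovered = round(RELIABILITY_CHECKS * points / 100)
    return recovered, (TOTAL_PASSED + recovered) / TOTAL_CHECKS


recovered, new_rate = aggregate_after_improvement(5)
print(recovered, f"{new_rate:.2%}")  # 94 83.46%
```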

Treat heuristic and unknown as debt. The 7.1% pass rate on heuristic and 0.0% on unknown are not primarily failure signals; they are signals that these categories need reclassification or replacement with properly designed checks. Checks without a structured type cannot be calibrated, cannot be compared across agents, and cannot produce actionable signal. Migrating these 28 checks to explicit categories is a low-effort, high-leverage cleanup.

Section 8: Limits and Future Work

Three limits apply. First, this taxonomy reflects a single measurement snapshot and cannot distinguish between a failure that recurs consistently for a given agent and a failure that appeared once in a single evaluation run. Per-agent failure rates across time would be a substantially richer signal. Second, the sample sizes for several categories (notably red-team at 21 checks and the five 100%-pass categories at 4–16 checks each) are too small to draw distributional conclusions; the findings for those categories are provisional until larger samples accumulate. Third, the duration analysis requires a category-level breakdown: mean and percentile durations collapsed across all 17 categories obscure what is likely a bimodal distribution between fast rule checks and slow jury checks. Separating these two populations in the duration analysis is the most important methodological extension for future measurement.
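The small-sample caveat can be quantified with a confidence interval on each pass rate. The sketch below uses the Wilson score interval, a standard choice for binomial proportions that the paper itself does not use; it is applied here to the sample sizes quoted above as an illustration.

```python
from math import sqrt


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Smallest and largest sample sizes quoted for the 100%-pass categories.
for n in (4, 16):
    low, high = wilson_interval(n, n)
    print(f"{n}/{n} passed: interval [{low:.1%}, {high:.1%}]")

# red-team row from the table: 5 of 21 passed.
low, high = wilson_interval(5, 21)
print(f"5/21 passed: interval [{low:.1%}, {high:.1%}]")
```

With all checks passing, the lower bound reduces to n / (n + z^2), which is roughly 51% for 4 of 4 and 81% for 16 of 16. A perfect record on so few checks is therefore compatible with a materially lower true pass rate.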

Future work should also examine the interaction between failure rates and pact coverage: whether agents with more comprehensive pacts (more check types active) exhibit different failure rates than agents with narrow pact coverage, and whether pact specificity predicts calibration quality in the higher-failure categories.

Replication

All measurements in this paper were produced by a single deterministic measurement script against the Armalo production database. Results can be reproduced from the published measurement artifact named in the claims registry, which serves as the reproducibility anchor; reviewers can recompute the aggregates from that artifact without public exposure of internal runner paths.

Raw output is committed as part of the published measurement artifact. The measurement queries the eval_checks table, groups by category, and computes pass rate and duration percentiles. No sampling was applied; the 8,726 check population reflects all checks with a recorded outcome in the database at time of measurement.
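The shape of that aggregation can be sketched against an in-memory SQLite table. This is an illustration under stated assumptions, not the measurement script: the paper names the eval_checks table but not its columns, so the column names (category, passed, duration_ms), the function name, and the toy rows are all hypothetical.

```python
import sqlite3
from statistics import quantiles

# Hypothetical schema and toy rows; none of this is real Armalo data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eval_checks (category TEXT, passed INTEGER, duration_ms REAL)")
conn.executemany(
    "INSERT INTO eval_checks VALUES (?, ?, ?)",
    [
        ("jury", 1, 2900.0), ("jury", 0, 3100.0), ("jury", 0, 8400.0),
        ("output_format", 1, 3.0), ("output_format", 1, 4.0),
    ],
)


def category_summary(conn: sqlite3.Connection) -> dict[str, dict[str, float]]:
    """Pass rate and duration percentiles per category, recorded outcomes only."""
    summary: dict[str, dict[str, float]] = {}
    rows = conn.execute(
        "SELECT category, COUNT(*), SUM(passed) FROM eval_checks "
        "WHERE passed IS NOT NULL GROUP BY category"
    ).fetchall()
    for category, total, passed in rows:
        durations = [d for (d,) in conn.execute(
            "SELECT duration_ms FROM eval_checks "
            "WHERE category = ? AND passed IS NOT NULL AND duration_ms IS NOT NULL",
            (category,),
        )]
        # 99 cut points; index 49 is the 50th percentile, index 94 the 95th.
        cuts = quantiles(durations, n=100, method="inclusive")
        summary[category] = {
            "total": total,
            "pass_rate": passed / total,
            "p50_ms": cuts[49],
            "p95_ms": cuts[94],
        }
    return summary


for category, stats in category_summary(conn).items():
    print(category, stats)
```

Computing percentiles per category rather than over the pooled population keeps fast rule checks and slow jury checks from being collapsed into one distribution. Note that statistics.quantiles needs at least two data points per category on most Python versions, so very small categories would need separate handling.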