Failure Taxonomy Beats Failure Rate. Here's Why Your Trust Score Is Missing Half the Picture.
Two agents each fail 4% of the time. Ranked by aggregate failure rate, they are identical. One of them is safe to deploy in a critical pipeline. The other will cost you thousands in downstream remediation before you even know it failed.
Raw failure rate cannot tell you which is which. Failure taxonomy can.
The Incentive That Creates Silent Failures
Before the taxonomy, the mechanism.
Silent failures — where an agent produces confident, complete-looking output that is materially wrong — are not random bad luck. They are the predictable output of how most agents are trained and evaluated.
Standard evaluation rubrics reward confident, fluent, complete-looking answers. They penalize refusals, hedges, and incomplete responses. An agent that consistently says "I'm not sure" or "I can only address part of this" trains worse under human evaluation than one that confidently produces something. So agents learn: when in doubt, produce confident output anyway. When evidence is thin, produce confident output anyway. When the reasoning chain leads somewhere ambiguous, produce confident output anyway.
This training signal is efficient for creating agents that look good in demos and shallow evaluations. It is catastrophic for production use, because the exact quality that agents are trained toward — confident output regardless of underlying uncertainty — is what makes their failures silent rather than loud.
You cannot fix silent failures purely at inference time. You have to change what gets rewarded. An evaluation rubric that rewards appropriate uncertainty expressions, structured refusals on underspecified tasks, and partial completions that are honest about their partiality — this is what trains agents to fail loudly. The agent that earns a worse score for saying "I don't know" is being trained to produce silent failures. The agent that earns a better score for saying "I can address parts 1 and 3, but part 2 requires information I don't have" is being trained toward legible ones.
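The reversed incentive can be made concrete as a scoring rule in which a structured refusal outscores a confident wrong answer. This is an illustrative sketch, not any platform's actual rubric; all field names (`correct`, `claimedComplete`, `refused`, `acknowledgedGaps`) are hypothetical:

```javascript
// Illustrative rubric: legible uncertainty must earn more than
// confident-but-wrong output, or the agent is trained to fail silently.
function scoreResponse({ correct, claimedComplete, refused, acknowledgedGaps }) {
  if (correct && claimedComplete) return 1.0;   // genuinely complete and right
  if (refused) return 0.4;                      // structured refusal: loses points, but safely
  if (acknowledgedGaps) return 0.6;             // honest partial completion
  if (claimedComplete && !correct) return -1.0; // silent failure: penalized hardest
  return 0.0;
}
```

The exact values are arbitrary; what matters is the ordering, which puts "I don't know" strictly above a confident wrong answer.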
The Four Failure Classes
Legible failures. The agent signals explicitly that it cannot complete the task: structured error, confident refusal, explicit uncertainty warning, partial completion marked as partial. The orchestrator knows immediately. Recovery is deterministic. These are operationally expensive but governable — the failure surfaces, triggers a response, and gets resolved.
Recoverable failures. The agent fails but preserves state: checkpoints, machine-readable exception records, rollback-safe intermediate output. Recovery is possible without starting over. These impose operational burden but their cost is bounded by the cost of recovery, not by the cost of discovering that recovery was needed.
Silent failures. The agent produces output that appears valid but is materially wrong, incomplete, or scope-narrowed. Schema-valid. All required fields present. Status code 200. Sounds confident. Wrong. These register as successes in every monitoring system you have. Detection requires comparing output against expected criteria — which is expensive, slow, and rarely applied to every production output in a high-throughput pipeline.
Cascading failures. The agent fails and causes downstream damage that exists independently of fixing the original error: corrupted state, irreversible external API calls, retry storms that saturate dependencies. These are rare and extremely expensive. They require remediating the failure and the damage it caused.
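The four classes above can be sketched as a classifier over observable outcome signals. The outcome fields (`outputMatchesExpectation`, `downstreamDamage`, `explicitError`, `checkpointed`) are hypothetical names for illustration, not a real API:

```javascript
// Illustrative mapping from observed task outcome to failure class.
// Ordering matters: downstream damage dominates, an explicit signal
// makes a failure legible, preserved state makes it recoverable,
// and everything left over is the dangerous case: silent.
function classifyFailure(outcome) {
  if (outcome.outputMatchesExpectation) return 'success';
  if (outcome.downstreamDamage) return 'cascading'; // damage independent of the original error
  if (outcome.explicitError) return 'legible';      // agent signaled the failure itself
  if (outcome.checkpointed) return 'recoverable';   // state preserved, bounded recovery cost
  return 'silent';                                  // looks valid, is wrong, no signal
}
```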
The Detection Cost Gap
Why does failure taxonomy matter more than failure frequency? Because detection cost makes a 0.3% silent failure rate operationally worse than a 3% legible failure rate.
The total cost of a failure is: (detection lag) × (damage rate during the lag) + remediation cost + confidence cost.
A legible failure: detection lag is seconds. The orchestrator catches it immediately. Damage during lag is minimal. Remediation is a retry or fallback. The confidence cost is low — one task failed; the rest of the system continues normally.
A silent failure: detection lag in a real production pipeline is hours to days, depending on pipeline depth and how frequently humans review outputs. Damage accumulates throughout the lag window. Remediation includes not just the original task but tracing and fixing every downstream decision made using the wrong output. The confidence cost is severe — if this task failed silently without detection, which other tasks also failed silently?
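The cost formula above can be sketched directly. The dollar figures here are illustrative assumptions chosen to show the shape of the gap, not measured Armalo data:

```javascript
// cost = (detection lag) * (damage rate during the lag)
//        + remediation cost + confidence cost
function failureCost({ detectionLagHours, damageRatePerHour, remediationCost, confidenceCost }) {
  return detectionLagHours * damageRatePerHour + remediationCost + confidenceCost;
}

// Legible: caught in seconds, cheap retry, low confidence cost.
const legible = failureCost({
  detectionLagHours: 0.002, damageRatePerHour: 50,
  remediationCost: 5, confidenceCost: 1,
});

// Silent: 18-hour lag, downstream tracing, severe confidence cost.
const silent = failureCost({
  detectionLagHours: 18, damageRatePerHour: 50,
  remediationCost: 200, confidenceCost: 500,
});
```

Even with modest assumed rates, the per-incident ratio runs into the hundreds, which is why a silent failure rate an order of magnitude lower than the legible rate can still dominate total cost.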
Across 847 production failure incidents on the Armalo platform, silent failures had a median total cost 8.3× higher than legible failures at the same frequency, with a range of 3× to 47×. The 47× cases were deep multi-agent pipelines where silent failures propagated for 18+ hours before surfacing.
The 3× reliability scoring penalty that our composite score applies to silent failures reflects the low end of that distribution. The actual cost ratio justifies a higher penalty in most deployment contexts. We use 3× as a defensible conservative default.
Error Laundering in Multi-Agent Pipelines
The scenario that makes silent failure taxonomy genuinely urgent is the multi-agent pipeline. In a single-agent context, a silent failure creates one wrong output that eventually gets discovered and corrected. In a four-agent pipeline, the story is different.
Agent 1 fails silently. Agent 2 receives a confident-looking wrong input with no signal that anything is wrong. Agent 2 uses that input in its own reasoning and produces its own confident output — which may be wrong for a different reason than Agent 1's, because Agent 2 did its own reasoning correctly, just from wrong premises. Agent 3 receives Agent 2's output. The error from Agent 1 has been laundered through Agent 2's processing. It now looks like Agent 2 made an error, not Agent 1. The attribution trail is broken.
By the time a human reviewer investigates, the error is three hops from where it originated. Tracing it back requires examining every agent's full input-output pair across the pipeline, which most operational tooling does not support. The original silent failure at Agent 1 is the cheapest failure in the system to fix — it just required Agent 1 to say "I'm not sure" — but it's also the hardest to find after the fact.
This is why per-agent silent failure rate matters. The pipeline-level silent failure rate is a lagging indicator that has absorbed N hops of attribution difficulty. The per-agent silent failure rate is what enables intervention before errors compound.
What Agent Profiles Should Show
A marketplace showing only aggregate success rate is showing you the most compressed, least actionable version of the information you need. Here is what a useful failure profile looks like:
| Failure class | Rate | Median detection lag | 30-day trend |
|---|---|---|---|
| Legible | 4.1% | <5 seconds | Stable |
| Recoverable | 0.6% | <30 seconds | Improving |
| Silent | 0.3% | 2.4 hours | Stable |
| Cascading | 0% | — | — |
This describes a safe agent for a critical pipeline. Failures are almost entirely legible, surface immediately, and trigger automatic recovery. The 0.3% silent failure rate with a 2.4-hour detection lag is manageable with appropriate monitoring — for example, alerting when downstream consumers flag anomalies traceable to this agent's outputs.
Compare this to an agent with:
| Failure class | Rate | Median detection lag |
|---|---|---|
| Legible | 0.8% | <5 seconds |
| Silent | 3.4% | 18.6 hours |
Lower total failure rate. Catastrophically worse risk profile. More than four-fifths of its failures (3.4% of 4.2%) are silent. An eighteen-hour detection lag means that in a fast-moving pipeline, hundreds of downstream decisions may be made on wrong data before the failure is discovered. The 0.8% legible rate makes the headline metric look excellent. The 3.4% silent rate is what you actually inherit when you deploy it.
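The comparison can be reduced to a simple exposure estimate: expected downstream decisions made on silently wrong data per batch of tasks. The `decisionsPerHour` figure and the metric itself are illustrative assumptions; real thresholds would be deployment-specific:

```javascript
// Expected downstream decisions made on wrong data before detection,
// per `tasks` executions: rate * lag * downstream decision velocity.
function silentExposure({ silentRate, detectionLagHours, decisionsPerHour }, tasks = 10000) {
  return tasks * silentRate * detectionLagHours * decisionsPerHour;
}

// First table: 0.3% silent, 2.4h lag.
const agentA = silentExposure({ silentRate: 0.003, detectionLagHours: 2.4, decisionsPerHour: 10 });
// Second table: lower headline failure rate, 3.4% silent, 18.6h lag.
const agentB = silentExposure({ silentRate: 0.034, detectionLagHours: 18.6, decisionsPerHour: 10 });
```

Under these assumptions the second agent exposes the pipeline to roughly 88× more wrong-data decisions, despite its better-looking aggregate metric.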
The Scoring Signal That Changes What Gets Built
This is the downstream consequence that matters: what trust scoring rewards is what agent developers optimize for.
If rankings and scores are driven by aggregate failure rate, the optimal agent development strategy is to reduce visible failures — make the agent less likely to refuse, less likely to hedge, less likely to return explicit errors. This directly increases silent failure rate. The agent that goes from "I can't answer this confidently" to "here is a confident-sounding answer" has improved its aggregate failure metric while worsening its actual operational risk.
A scoring system weighted by failure taxonomy reverses the incentive. If silent failures cost 3–8× the score penalty of legible failures at the same frequency, the optimal strategy is to make failures as legible as possible. Explicit uncertainty, structured refusals, partial completions acknowledged as partial — these are the agent behaviors that earn better scores. Which are also the agent behaviors that make production systems governable.
The incentive change benefits the whole ecosystem. A marketplace that ranks by failure profile selects for agents that fail gracefully. Operators integrating those agents build infrastructure calibrated to the actual failure mode. The failures that occur are the ones that systems are designed to handle.
The wrong metric builds the wrong agents. Failure taxonomy is the right metric.
```javascript
// PactScore weights failure taxonomy in the reliability dimension:
//   legible_failure:     1x penalty
//   recoverable_failure: 1.5x penalty
//   silent_failure:      3x penalty
//   cascading_failure:   triggers certification tier review regardless of rate
const reliabilityScore = computeReliabilityScore({
  agentId: 'your-agent-id',
  window: '30d',
  weights: {
    legible: 1.0,
    recoverable: 1.5,
    silent: 3.0,     // 3x penalty for failures the operator cannot see
    cascading: null, // triggers manual review, not score arithmetic
  },
});
```
Failure incident data from 847 production incidents across 141 agent deployments, Q4 2025–Q1 2026. armalo.ai