Failure Taxonomy Beats Failure Rate. Here's Why Your Trust Score Is Missing Half the Picture.
Two agents fail 4% of the time. They look identical in a trust score. One of them is safe to run in a critical pipeline. The other will cause cascading silent failures that your monitoring won't catch for hours.
Raw failure rate cannot distinguish between them. It treats all failures as equivalent — a compression that discards the information that actually governs operational risk. Anyone who has run agents in production knows the real question is not "how often does the agent fail?" but "how does it fail when it does?" Failure taxonomy is the missing half of every trust score in production today.
Why Failure Rate Is the Wrong Primary Metric
Failure rate answers "how often does the agent fail?" It does not answer "when the agent fails, what happens next?" — which is the question that determines blast radius.
Consider the failure modes that two agents at 4% failure rate might have:
Agent A fails loudly. Returns a structured error with a status code and error message. The orchestrator catches the exception, marks the task failed, and routes to a fallback. The downstream pipeline pauses cleanly. An alert fires. The failure is visible, traceable, and recoverable in minutes.
Agent B fails silently. Returns a 200 OK with a plausible-looking response containing incorrect data. The orchestrator passes the response downstream. Three more agents process the incorrect data. A report is generated with wrong numbers. A customer sees the output. Nobody triggers an alarm because all the status codes were success.
These are not the same pattern. They are categorically different operational events with categorically different blast radii. A trust score that shows both agents at "96% success rate" has compressed away the information that matters.
The Four Failure Modes That Actually Matter
Silent failures (highest risk). The agent fails but returns a success signal. Confident hallucinations, malformed outputs that pass schema validation, responses that are plausible but factually wrong. Silent failures are the hardest to detect, the most damaging when they propagate, and the least visible in aggregate metrics. An agent with a 2% silent failure rate is operationally more dangerous than an agent with a 10% loud failure rate — because the silent failures bypass all normal recovery mechanisms.
Loud failures (manageable). The agent fails and says so clearly. Error codes, exceptions, empty responses with explicit failure states. Loud failures are the failure mode that infrastructure is designed to handle. Orchestrators route around them. Retry logic handles transient ones. Alerts fire. These failures have defined recovery paths.
Partial failures (pipeline-dangerous). The agent completes the task incompletely: it returns some of the required data, processes part of the input, or finishes phase one of a multi-step task without flagging that phase two never ran. Partial failures are especially dangerous in multi-agent pipelines because downstream agents receive valid-looking partial inputs with no indication that something is missing.
Catastrophic failures (rare but existential). The agent corrupts data, deletes records, sends erroneous API calls to external systems, or triggers irreversible side effects. These are rare almost by definition: agents that fail catastrophically with any regularity get decommissioned. But that low frequency makes them invisible in aggregate metrics. An agent with a 0.1% catastrophic failure rate looks better on paper than one with a 5% loud failure rate, yet the 0.1% agent is far more dangerous in high-stakes deployments, because its failures cannot be recovered from after the fact.
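Taken together, the four modes amount to a small classification routine. A minimal TypeScript sketch (every name here is illustrative, not part of any SDK):

```typescript
type FailureMode = 'silent' | 'loud' | 'partial' | 'catastrophic' | null;

interface AgentResponse {
  status: number; // HTTP-style status code from the agent
  body: Record<string, unknown>;
}

interface CheckResults {
  schemaValid: boolean;       // loud-failure detector: is the output well-formed?
  factuallyAccurate: boolean; // silent-failure detector: is it actually correct?
  complete: boolean;          // partial-failure detector: are all required parts present?
  sideEffectsClean: boolean;  // catastrophic-failure detector: any unintended external calls?
}

// Classify one run into a failure mode (null = genuine success).
// Order matters: catastrophic outranks everything, and a non-2xx status
// is a loud failure regardless of content.
function classifyFailure(res: AgentResponse, checks: CheckResults): FailureMode {
  if (!checks.sideEffectsClean) return 'catastrophic';
  if (res.status < 200 || res.status >= 300) return 'loud';
  if (!checks.schemaValid) return 'loud';         // malformed, but validation flags it
  if (!checks.complete) return 'partial';         // valid-looking, missing pieces
  if (!checks.factuallyAccurate) return 'silent'; // 200 OK, plausible, wrong
  return null;
}
```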
What a Real Failure Taxonomy Looks Like in an Eval System
A trust scoring system that records only pass/fail is not producing trustworthy trust scores. Each evaluation run should produce a verdict that includes the type of failure when one occurs:
```typescript
import { ArmaloClient } from '@armalo/core';

const client = new ArmaloClient({ apiKey: process.env.ARMALO_API_KEY });

const eval_ = await client.createEval({
  agentId: 'your-agent-id',
  pactId: 'your-pact-id',
  type: 'deterministic',
  checks: [
    {
      name: 'output-format-valid',
      type: 'schema_validation',
      // Detects loud failures — malformed outputs
    },
    {
      name: 'output-accuracy',
      type: 'jury',
      juryPrompt: 'Assess factual accuracy for the given input.',
      // Detects silent failures — plausible-but-wrong outputs
    },
    {
      name: 'output-completeness',
      type: 'heuristic',
      // Detects partial failures — required fields present?
    },
    {
      name: 'side-effect-audit',
      type: 'deterministic',
      // Detects catastrophic failures — unintended external calls?
    }
  ]
});

const result = await client.runEval(eval_.id);
console.log(`Failure mode: ${result.failureMode}`);
// 'silent' | 'loud' | 'partial' | 'catastrophic' | null
// Silent failures carry 3x the score penalty of loud failures at equal frequency
```
The `failureMode` field is what changes the semantics. Silent failures carry higher score penalties than loud failures. Catastrophic failures trigger tier review regardless of aggregate score. The eval system doesn't just count failures — it classifies them, and the classification drives different consequences.
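The classification-drives-consequences idea can be made concrete. In this illustrative sketch, the silent and loud weights follow the 3x rule above; the partial-failure weight is an assumption for illustration, not a documented Armalo value:

```typescript
type FailureMode = 'silent' | 'loud' | 'partial' | 'catastrophic' | null;

// Per-mode consequences. The silent:loud ratio of 3:1 comes from the text;
// the partial weight is an illustrative assumption.
const CONSEQUENCES = {
  loud:         { penaltyWeight: 1, tierReview: false },
  partial:      { penaltyWeight: 2, tierReview: false }, // assumed weight
  silent:       { penaltyWeight: 3, tierReview: false }, // 3x the loud penalty
  catastrophic: { penaltyWeight: 3, tierReview: true },  // review regardless of score
} as const;

// Look up what happens to the score after one classified run.
function consequence(mode: FailureMode) {
  return mode === null
    ? { penaltyWeight: 0, tierReview: false } // success: no penalty
    : CONSEQUENCES[mode];
}
```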
What Agent Profiles Should Show
A marketplace that shows only aggregate success rate is showing buyers the least useful information. The buyer evaluating an agent for a critical data pipeline needs to know: when this agent fails, what does the failure look like? Can my orchestrator recover from it? Has the agent ever had a catastrophic failure?
A well-structured agent profile should show failure distribution:
| Failure type | Rate | Last occurrence |
|---|---|---|
| Silent failures | 0.3% | 14 days ago |
| Loud failures | 3.1% | 2 hours ago |
| Partial failures | 0.8% | 6 days ago |
| Catastrophic failures | 0% | Never recorded |
This is categorically more useful than "3.8% failure rate." The agent with this profile is safe for a critical pipeline: loud failures are catchable, silent failures are rare, no catastrophic history. The orchestrator can be configured to handle the loud failure rate. The silent failure rate warrants an alert threshold.
An agent with a 1.5% failure rate that is 100% silent failures is a completely different risk profile — and that profile is invisible in a single-number trust score.
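Producing a profile like the table above is a straightforward aggregation over classified runs. A hypothetical sketch (not the Armalo API; `RunRecord` and `failureDistribution` are names invented here):

```typescript
type FailureMode = 'silent' | 'loud' | 'partial' | 'catastrophic' | null;

interface RunRecord {
  failureMode: FailureMode;
  timestamp: Date;
}

// Aggregate classified runs into the per-mode rows a profile table
// would display: rate plus most recent occurrence.
function failureDistribution(runs: RunRecord[]) {
  const modes = ['silent', 'loud', 'partial', 'catastrophic'] as const;
  return modes.map((mode) => {
    const hits = runs.filter((r) => r.failureMode === mode);
    const lastOccurrence = hits.length
      ? new Date(Math.max(...hits.map((r) => r.timestamp.getTime())))
      : null; // rendered as "Never recorded"
    return { mode, rate: hits.length / runs.length, lastOccurrence };
  });
}
```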
The Incentive Design That Creates Good Failure Behavior
Armalo's scoring weights silent failures at 3x the reliability penalty of loud failures at the same rate. Catastrophic failures trigger certification tier review regardless of aggregate score.
This creates market pressure toward transparent failure behavior. Agents optimizing for Score should make their failures loud, structured, and recoverable, not just reduce failure frequency at any cost. An agent that fails at 8% but fails loudly every time scores better on the reliability dimension than an agent that fails at 5% with 40% silent failures.
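The arithmetic behind that comparison is worth making explicit. This is a toy version of the 3x weighting, not Armalo's actual Score formula; it only reproduces the argument:

```typescript
const LOUD_WEIGHT = 1;
const SILENT_WEIGHT = 3; // silent failures cost 3x loud ones

// Reliability as 1 minus the weighted failure fraction.
function weightedReliability(loudFailures: number, silentFailures: number, totalRuns: number): number {
  const weighted = loudFailures * LOUD_WEIGHT + silentFailures * SILENT_WEIGHT;
  return 1 - weighted / totalRuns;
}

// Agent X: 8% failure rate, all loud. Weighted penalty 8 per 100 runs.
const agentX = weightedReliability(8, 0, 100);
// Agent Y: 5% failure rate, 40% silent (3 loud + 2 silent per 100 runs).
// Weighted penalty 3 + 6 = 9 per 100 runs.
const agentY = weightedReliability(3, 2, 100);
// The louder agent wins on the reliability dimension despite failing more often.
```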
The scoring design rewards the failure mode that operators need, not the one that looks best in a headline number.
The Question for Operators
If two agents each fail 4% of the time, but you can only see one of them in your monitoring when they fail — which one do you deploy in a pipeline where failures have consequences?
That's the failure taxonomy question. Raw failure rate can't answer it.
Armalo's evaluation infrastructure classifies failure modes, not just failure frequency. Score's reliability dimension weights failure types by operational impact. armalo.ai