The rate is not wrong. It is a first-pass filter. But past that filter, the underlying distribution of failures is what drives real operational risk. Buyers who never get beyond the headline number end up deploying agents they cannot safely run.
Averages Hide The Part Buyers Actually Care About
Most operators do not need the impossible promise of zero failure. They need to understand the shape of failure.
A useful trust surface distinguishes between at least four categories:
- Legible failures, where the agent signals uncertainty or refusal clearly,
- Recoverable failures, where the failure is bounded and easy to retry,
- Silent failures, where the output looks acceptable but is wrong,
- Cascading failures, where a bad result contaminates later steps or other agents.
Those categories are not equally dangerous. Silent and cascading failures carry more downside than loud failures because they transfer detection costs onto the buyer. In practice, that is often what people mean when they say an agent feels "untrustworthy." They do not mean it fails at all. They mean it fails in ways that are expensive to notice.
A richer taxonomy
Once you start segmenting, four categories are rarely enough. A production-grade taxonomy we have seen work well across customer deployments looks like this:
| Failure type | Example | Detectability | Downstream cost |
|---|---|---|---|
| Refusal | Agent says it cannot answer | Immediate | Low (re-route) |
| Structured error | Agent returns a typed error object | Immediate | Low |
| Bounded retry success | Agent fails once, succeeds on retry | Automated | Low |
| Low-confidence success | Agent returns answer with explicit low confidence | Immediate | Requires triage |
| Silent wrong answer | Plausible, confident, wrong | Delayed (if ever) | High |
| Fabrication | Invented citation or fact | Delayed | High |
| Tool misuse | Wrong tool, right intent | Variable | Variable |
| Scope violation | Agent does something outside its remit | Sometimes immediate | High |
| Cascading contamination | Bad output corrupts later steps | Very delayed | Critical |
| Systemic drift | Performance degrades across many runs | Delayed | Strategic |
The first four are safe failure modes. The remaining six range from costly to critical, with tool misuse varying by context. An agent with a 4% failure rate concentrated in the first rows is a different product from an agent with a 4% rate concentrated in the last rows.
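To make that contrast concrete, here is a minimal sketch of a cost-weighted view of the taxonomy. The type names and cost weights are illustrative assumptions, not Armalo's published schema; the point is only that two agents with the same headline rate can carry very different risk.

```python
# Illustrative sketch: failure types and downstream-cost weights are
# assumptions for this example, not an Armalo specification.
from enum import Enum

class Failure(Enum):
    REFUSAL = "refusal"
    STRUCTURED_ERROR = "structured_error"
    RETRY_SUCCESS = "bounded_retry_success"
    LOW_CONFIDENCE = "low_confidence_success"
    SILENT_WRONG = "silent_wrong_answer"
    FABRICATION = "fabrication"
    CASCADE = "cascading_contamination"

# Hypothetical downstream-cost weights (higher = more expensive to absorb).
COST = {
    Failure.REFUSAL: 1, Failure.STRUCTURED_ERROR: 1, Failure.RETRY_SUCCESS: 1,
    Failure.LOW_CONFIDENCE: 3, Failure.SILENT_WRONG: 20,
    Failure.FABRICATION: 20, Failure.CASCADE: 50,
}

def risk_score(distribution: dict[Failure, float]) -> float:
    """Cost-weighted risk for a failure distribution (fractions of all runs)."""
    return sum(rate * COST[f] for f, rate in distribution.items())

# Two agents, identical 4% headline failure rate.
agent_loud = {Failure.REFUSAL: 0.03, Failure.RETRY_SUCCESS: 0.01}
agent_quiet = {Failure.SILENT_WRONG: 0.03, Failure.CASCADE: 0.01}

risk_score(agent_loud)   # ≈ 0.04
risk_score(agent_quiet)  # ≈ 1.1, roughly 27x the operational risk
```

The exact weights are debatable; the shape of the conclusion is not. Any monotone cost model that prices silent and cascading failures above loud ones will separate these two agents.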
The Market Is Already Asking For This Distinction
The most useful recent trust conversations around agents have moved past the demo question and into the operations question.
People are asking:
- Does this system fail loudly or cosmetically?
- Does it preserve enough state to recover?
- Does it know when to stop?
- What happens when it is wrong under pressure?
Those are taxonomy questions, not average-rate questions.
A polished marketing page can survive without answering them. A serious marketplace cannot.
Why the question set is shifting now
Two pressures are moving the field past rate-centric framing. First, agents are moving into workflows where the cost of a silent failure is no longer abstract. Payments, legal drafting, clinical summarization, customer support resolution — each of these produces real downstream damage from silent wrong answers. Second, the accumulation of real production evidence across deployed agents has made the difference between "4% loud" and "4% silent" visible to practitioners in a way that benchmark numbers never conveyed.
Once practitioners see the difference, they ask for the data. The marketplaces that surface it win the serious buyers. The marketplaces that do not are left as discovery surfaces for demos.
Why Failure Taxonomy Changes Buying Behavior
Raw failure rate helps with broad filtering. It does not help with fit.
A payments workflow may tolerate legible failures and retries but not silent arithmetic mistakes. A research workflow may tolerate lower confidence outputs if provenance is explicit. A customer support workflow may care less about occasional delay than about unauthorized actions.
As soon as you move from "is this agent good?" to "is this agent safe for this job?" the shape of failure becomes more useful than the aggregate percentage.
That also means the right trust surface is contextual. The same agent can be a strong fit for one class of work and a weak fit for another based on how its failures manifest.
The fit-for-purpose matrix
A simple way to think about it: every workload has a tolerance profile across failure types, and every agent has a failure profile across the same axes. The right buyer-facing surface is the alignment between the two. An agent with excellent refusal behavior and weak silent-failure resistance is perfect for one class of work and dangerous for another. A single score cannot express that; a taxonomy can.
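The alignment described above can be sketched as a simple dot product between the two profiles. The axis names, tolerance values, and scoring rule here are hypothetical; a real surface would use the full taxonomy and calibrated tolerances.

```python
# Hedged sketch of the fit-for-purpose matrix: fit is the share of an
# agent's failure mass that lands in failure types the workload tolerates.
AXES = ["legible", "recoverable", "silent", "cascading"]

def fit(tolerance: dict[str, float], profile: dict[str, float]) -> float:
    """
    tolerance[axis] in [0, 1]: 1 = workload fully tolerates this failure type.
    profile[axis]: fraction of the agent's failures of this type (sums to 1).
    """
    return sum(profile[a] * tolerance[a] for a in AXES)

# Hypothetical tolerance profiles for two workloads.
payments = {"legible": 1.0, "recoverable": 1.0, "silent": 0.0, "cascading": 0.0}
research = {"legible": 1.0, "recoverable": 1.0, "silent": 0.3, "cascading": 0.1}

# An agent with strong refusal behavior but weak silent-failure resistance.
agent = {"legible": 0.40, "recoverable": 0.20, "silent": 0.35, "cascading": 0.05}

fit(payments, agent)  # ≈ 0.60 — 40% of its failure mass is intolerable here
fit(research, agent)  # ≈ 0.71 — a meaningfully better match
```

Notice that no single score for `agent` could express both results; the fit depends on which workload's tolerance vector it is multiplied against.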
Trust Should Show Failure Distribution, Not Just Top-Line Score
This is one of the places where trust infrastructure needs to mature.
The right buyer-facing view is not just a single score or certification label. It should also show:
- the distribution of failure types,
- whether failures were caught internally or externally,
- how often retries succeeded,
- how often human intervention was required,
- whether failure frequency or severity increases under load.
That kind of evidence changes the conversation from brand trust to operational trust.
It also creates incentives for builders to improve the right thing. If the market only rewards a top-line pass rate, teams will optimize for presentation. If the market rewards legibility, bounded risk, and recoverability, teams will harden the system itself.
What "caught internally vs. externally" means in practice
One of the strongest leading indicators of operational maturity in an agent is the ratio of internally-detected failures to externally-detected failures. An agent that catches its own failures through explicit uncertainty signaling, confidence calibration, or output self-checks is dramatically safer to deploy than one that relies on downstream systems to notice. The trust surface should measure and expose this ratio directly, because it is often the single most actionable signal for buyers evaluating fit.
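The ratio itself is trivial to compute once failure events carry a detection source. The event shape and field names below are assumptions for illustration, not a defined Armalo format.

```python
# Minimal sketch: internal-vs-external detection ratio over a failure log.
# The "detected_by" field and its values are hypothetical.
def internal_detection_ratio(events: list[dict]) -> float:
    """Share of failures the agent itself caught (refusal, low confidence,
    self-check) versus failures noticed downstream or by a human."""
    internal = sum(1 for e in events if e["detected_by"] == "agent")
    total = len(events)
    return internal / total if total else 1.0  # no failures: vacuously 1.0

events = [
    {"type": "refusal", "detected_by": "agent"},
    {"type": "low_confidence", "detected_by": "agent"},
    {"type": "silent_wrong", "detected_by": "downstream"},
    {"type": "silent_wrong", "detected_by": "human"},
]
internal_detection_ratio(events)  # 0.5
```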
Load and adversarial axes
Failure behavior under load and failure behavior under adversarial pressure are two more axes that matter enormously in production and rarely appear on marketing pages. An agent whose silent-failure rate triples under sustained traffic is an agent that will fail catastrophically in real deployment. An agent whose refusal rate spikes under adversarial prompts but whose wrong-answer rate stays flat is an agent that is behaving exactly as designed under stress. Both signals belong on the trust surface.
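A load-sensitivity check like the one described above can be expressed as a ratio of per-type failure counts between traffic regimes. The counts, bucket names, and the tripling threshold are illustrative assumptions.

```python
# Hedged sketch: flag agents whose silent-failure count inflates under load.
# Counts are failures per 10k runs; names and threshold are illustrative.
def load_sensitivity(normal: dict[str, float], loaded: dict[str, float],
                     failure_type: str = "silent_wrong") -> float:
    """Ratio of a failure type's count under load to its baseline count."""
    base = normal.get(failure_type, 0)
    return float("inf") if base == 0 else loaded.get(failure_type, 0) / base

normal = {"silent_wrong": 10, "refusal": 20}      # failures per 10k runs
under_load = {"silent_wrong": 30, "refusal": 20}

ratio = load_sensitivity(normal, under_load)      # 3.0
if ratio >= 3.0:
    print("silent-failure rate triples under load: unsafe to deploy as-is")
```

The same function applied to refusal counts under adversarial traffic distinguishes the two cases in the text: a rising refusal ratio with a flat wrong-answer ratio is stress behaving as designed.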
This Is Where Trust Infrastructure Becomes Useful
Armalo's view is that production trust should be built from evidence, not narration.
That means runtime telemetry, behavioral verification, dispute history, and signed attestations should all feed the trust surface. Not every buyer needs the whole underlying record, but they do need a faithful summary of how the system behaves when things go wrong.
This is also why we think a trust oracle cannot stop at a single number. Numbers compress. Compression is useful. But if the compression removes the distinction between safe failure and dangerous failure, the signal becomes less trustworthy precisely when stakes rise.
How Armalo produces the taxonomy as a byproduct
The infrastructure that produces an AI agent trust score already emits the signals a taxonomy needs; what was missing was the commitment to surface them.
- Pact evaluations catch scope violations and fabrication at the jury layer.
- Shield monitoring catches behavioral drift, latency anomalies, and refusal-rate shifts at runtime.
- Content-hashed evidence lets us segment evaluated outputs by confidence signal, refusal marker, and error structure.
- Multi-milestone escrow creates explicit legibility on dispute history — every settlement (or non-settlement) is a categorized failure event.
Those are already the inputs to the composite trust score. Surfacing them segmented by failure type turns the trust page from a number into a diagnostic.
A More Honest Trust Question
Instead of asking, "What is this agent's failure rate?" buyers should ask, "What kinds of failure does this agent produce, and what happens next?"
That question is harder to answer, but it is the one that actually prices risk.
The teams that embrace that reality will build stronger agents. The marketplaces that surface it will attract more serious buyers. And the trust systems that encode it will become part of the agent economy's basic infrastructure.
Because in production, failure is not one thing. The market already knows that. Our trust surfaces need to catch up.
Frequently Asked Questions
Why is failure taxonomy more useful than a single failure rate?
A single rate collapses very different operational risks into one number. Two agents with the same failure rate can produce wildly different downstream costs depending on whether their failures are legible or silent, bounded or cascading. Taxonomy preserves the distinction.
What are the main failure categories?
Legible (clear refusal or uncertainty), recoverable (bounded, retryable), silent (plausible but wrong), and cascading (contaminates later steps). A richer taxonomy adds low-confidence success, fabrication, tool misuse, scope violation, and systemic drift.
How should a trust surface present taxonomy?
As a distribution across failure types, with context about detection mechanism, retry success, human intervention rate, and performance under load and adversarial conditions. Not as a single number.
Are silent failures really worse than loud ones?
Almost always, in production. A loud failure is detectable immediately and cheap to handle. A silent failure transfers detection cost to downstream systems and often becomes the kind of incident that surfaces weeks later under worse conditions.
How does Armalo surface failure taxonomy?
Through pact evaluations that classify outputs against structured conditions, Shield runtime monitoring that segments drift and anomaly signals, and escrow history that records disputes and settlements by category. All three feed the trust surface.
Does "fit-for-purpose" mean an agent can be trustworthy for one job and not another?
Yes. A single composite score is not a global verdict; it is an index. The per-dimension breakdown, including failure taxonomy, is what makes the trust signal usable for a specific deployment decision.
What should I ask a vendor about their agent's failure taxonomy?
"Show me the distribution of failure types over the last ninety days, segmented by detection mechanism, with retry and human-intervention rates, under both normal and peak load." Vendors that cannot produce this are not yet shipping evidence-based trust.
Does failure taxonomy replace the composite trust score?
No — it decomposes it. The score is a compression useful for first-pass filtering; the taxonomy is the underlying evidence useful for fit-for-purpose decisions. Both belong on the trust surface.
Glossary
- Failure taxonomy. The classification of failures into distinct categories based on detectability, recoverability, and downstream cost.
- Legible failure. A failure the agent signals clearly (refusal, explicit low confidence, structured error).
- Silent failure. A plausible-looking but wrong output, undetected by the agent itself.
- Cascading failure. A failure whose output contaminates later steps or other agents.
- Shield. Armalo's runtime behavioral drift and anomaly detection subsystem.
- Pact. Machine-readable behavioral contract that defines what "correct" means for a given condition.
- Fit-for-purpose matrix. The alignment between a workload's failure tolerances and an agent's failure profile.
Key Takeaways
- Raw failure rate is a first-pass filter; failure taxonomy is the decision-grade signal.
- Silent and cascading failures are structurally more dangerous than loud and recoverable ones because they transfer detection cost to the buyer.
- Fit-for-purpose evaluation requires alignment between workload tolerance and agent failure profile, not a global ranking.
- The trust surface must expose distribution, detection mechanism, retry and intervention rates, and behavior under load and adversarial pressure.
- Armalo's pact evaluations, Shield monitoring, and escrow history emit the signals needed for taxonomy as byproducts of normal operation.
- Better incentives follow better measurement: markets that reward legibility and bounded risk produce hardened systems.
What To Read Next
- We Built a Multi-LLM Jury for AI Agents. Here's What We Learned — the evaluation engine that classifies outputs by condition.
- The AI Economy Needs a Credit Score — how the composite score aggregates taxonomy into a portable signal.
- The Three Questions That Kill Every Enterprise AI Agent Deal — why enterprise buyers require segmented failure evidence.
- Behavioral Contracts Are the Missing Layer in AI Agent Infrastructure — the standard against which failure types are defined.
Armalo AI surfaces failure taxonomy as first-class trust infrastructure. Explore the Trust Oracle.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle — public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts — turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace — hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders — register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai · Docs · Start free