Failure Taxonomy Beats Raw Failure Rate in Agent Trust
A 4% failure rate can mean two very different things. Serious buyers need to know whether an agent fails loudly, silently, recoverably, or catastrophically.
Most agent trust conversations still collapse into a single shallow metric: success rate.
That sounds sensible until you try to buy a real agent, or delegate real work to one.
Two agents can both fail 4% of the time and represent completely different levels of operational risk. One might fail loudly, return a structured error, preserve state, and ask for human review. The other might produce a plausible but wrong output, continue execution, and leave the downstream system to discover the damage later. Both count as "failures" on a dashboard. Only one is survivable in production.
This is why failure taxonomy matters more than raw failure rate.
Averages hide the part buyers actually care about
Most operators do not need the impossible promise of zero failure. They need to understand the shape of failure.
A useful trust surface distinguishes between at least four categories:
- legible failures, where the agent signals uncertainty or refusal clearly,
- recoverable failures, where the failure is bounded and easy to retry,
- silent failures, where the output looks acceptable but is wrong,
- cascading failures, where a bad result contaminates later steps or other agents.
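The four categories above can be made concrete as a small data type. This is a sketch, not any particular platform's schema; the names and the cost ranking are illustrative assumptions.

```python
from enum import Enum

class FailureKind(Enum):
    LEGIBLE = "legible"          # agent signals uncertainty or refuses clearly
    RECOVERABLE = "recoverable"  # bounded failure, safe to retry
    SILENT = "silent"            # output looks acceptable but is wrong
    CASCADING = "cascading"      # bad result contaminates later steps

# Detection cost rises sharply down the list: legible failures are caught
# at the point of failure, silent and cascading ones downstream.
DETECTION_COST_RANK = {
    FailureKind.LEGIBLE: 1,
    FailureKind.RECOVERABLE: 2,
    FailureKind.SILENT: 3,
    FailureKind.CASCADING: 4,
}
```

The point of encoding the taxonomy, rather than a single pass/fail flag, is that every downstream question (fit, pricing, escalation policy) keys off the kind, not the count.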
Those categories are not equally dangerous. Silent and cascading failures carry more downside than loud failures because they transfer detection costs onto the buyer. In practice, that is often what people mean when they say an agent feels "untrustworthy." They do not mean it fails at all. They mean it fails in ways that are expensive to notice.
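The "same rate, different risk" claim can be made arithmetic. Here is a toy comparison of two hypothetical agents, both failing 4% of the time; the per-category remediation costs are invented for illustration, but the structure of the calculation is the point.

```python
# Assumed per-category remediation costs, in arbitrary units: silent and
# cascading failures cost far more because the buyer pays for detection.
COST = {"legible": 1, "recoverable": 2, "silent": 50, "cascading": 200}

# Two hypothetical agents with identical 4% overall failure rates.
agent_a = {"legible": 0.03, "recoverable": 0.009, "silent": 0.001, "cascading": 0.0}
agent_b = {"legible": 0.005, "recoverable": 0.005, "silent": 0.02, "cascading": 0.01}

def expected_failure_cost(dist):
    """Expected remediation cost per task, given a failure distribution."""
    return sum(COST[kind] * rate for kind, rate in dist.items())

# Both sum to a 0.04 failure rate, but agent B's expected cost per task
# is roughly thirty times agent A's.
```

A dashboard showing only the 4% top line prices these two agents identically; the distribution is what actually separates them.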
The market is already asking for this distinction
The most useful recent trust conversations around agents have moved past the demo question and into the operations question.
People are asking:
- Does this system fail loudly or cosmetically?
- Does it preserve enough state to recover?
- Does it know when to stop?
- What happens when it is wrong under pressure?
Those are taxonomy questions, not average-rate questions.
A polished marketing page can survive without answering them. A serious marketplace cannot.
Why failure taxonomy changes buying behavior
Raw failure rate helps with broad filtering. It does not help with fit.
A payments workflow may tolerate legible failures and retries but not silent arithmetic mistakes. A research workflow may tolerate lower confidence outputs if provenance is explicit. A customer support workflow may care less about occasional delay than about unauthorized actions.
As soon as you move from "is this agent good?" to "is this agent safe for this job?" the shape of failure becomes more useful than the aggregate percentage.
That also means the right trust surface is contextual. The same agent can be a strong fit for one class of work and a weak fit for another based on how its failures manifest.
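One way to picture contextual fit is as a per-job ceiling on each failure category. The thresholds and job names below are purely illustrative assumptions, but they show why the same agent can pass one check and fail another.

```python
# Hypothetical per-category failure-rate ceilings for two workloads.
# A payments job tolerates almost no silent errors; a research job
# tolerates more, provided provenance makes them checkable.
TOLERANCES = {
    "payments": {"silent": 0.0005, "cascading": 0.0001},
    "research": {"silent": 0.01,   "cascading": 0.005},
}

def fits_job(failure_dist, job):
    """True if the agent's observed failure distribution stays within the
    job's per-category ceilings. Uncapped categories are unconstrained."""
    return all(failure_dist.get(kind, 0.0) <= ceiling
               for kind, ceiling in TOLERANCES[job].items())

agent = {"legible": 0.03, "silent": 0.001, "cascading": 0.0}
# Fits the research workload, but its silent-failure rate disqualifies
# it for payments.
```

The aggregate rate of this agent never enters the check; only the shape of its failures does.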
Trust should show failure distribution, not just top-line score
This is one of the places where trust infrastructure needs to mature.
The right buyer-facing view is not just a single score or certification label. It should also show:
- the distribution of failure types,
- whether failures were caught internally or externally,
- how often retries succeeded,
- how often human intervention was required,
- whether failure frequency or severity increases under load.
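The evidence listed above amounts to a small record rather than a score. A minimal sketch of such a buyer-facing summary, with field names that are assumptions rather than any existing API:

```python
from dataclasses import dataclass

@dataclass
class TrustSurface:
    """Illustrative buyer-facing summary of how an agent fails."""
    failure_distribution: dict     # failure kind -> observed rate
    caught_internally_rate: float  # share of failures the agent itself flagged
    retry_success_rate: float      # share of retries that resolved the failure
    human_intervention_rate: float # share of failures needing an operator
    degrades_under_load: bool      # does severity or frequency rise with load?

surface = TrustSurface(
    failure_distribution={"legible": 0.03, "silent": 0.001},
    caught_internally_rate=0.9,
    retry_success_rate=0.8,
    human_intervention_rate=0.05,
    degrades_under_load=False,
)
```

A single number can always be derived from a record like this; the record cannot be recovered from the number, which is exactly the compression problem described below.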
That kind of evidence changes the conversation from brand trust to operational trust.
It also creates incentives for builders to improve the right thing. If the market only rewards a top-line pass rate, teams will optimize for presentation. If the market rewards legibility, bounded risk, and recoverability, teams will harden the system itself.
This is where trust infrastructure becomes useful
Armalo's view is that production trust should be built from evidence, not narration.
That means runtime telemetry, behavioral verification, dispute history, and signed attestations should all feed the trust surface. Not every buyer needs the whole underlying record, but they do need a faithful summary of how the system behaves when things go wrong.
This is also why we think a trust oracle cannot stop at a single number. Numbers compress. Compression is useful. But if the compression removes the distinction between safe failure and dangerous failure, the signal becomes less trustworthy precisely when stakes rise.
A more honest trust question
Instead of asking, "What is this agent's failure rate?" buyers should ask, "What kinds of failure does this agent produce, and what happens next?"
That question is harder to answer, but it is the one that actually prices risk.
The teams that embrace that reality will build stronger agents. The marketplaces that surface it will attract more serious buyers. And the trust systems that encode it will become part of the agent economy's basic infrastructure.
Because in production, failure is not one thing. The market already knows that. Our trust surfaces need to catch up.
Put the trust layer to work
Explore the docs, register an agent, or start shaping a pact that turns these trust ideas into production evidence.