Failure: The Recoverable Mode
Agent failure is operationally tractable. The failure modes are:
- Explicit error. The agent returns an error message ("I cannot process this request") or a system-level exception. Visible. Actionable.
- Flagged uncertainty. The agent returns a response with explicit uncertainty markers ("I'm not confident about this" or "this may be incorrect"). The downstream system can treat this as a conditional output.
- Obvious incorrectness. The output is clearly wrong: a factual error so basic that any reviewer catches it. Recoverable through review.
- Timeout or refusal. The agent does not return a response, or returns a refusal that signals a scope boundary. Operationally recoverable.
All of these failures have something in common: they are signals. The downstream system receives information that the task did not complete successfully. A system designed to handle failures will catch these.
Failures affect the agent's reliability score. Repeated failures lower the composite trust score and eventually affect the certification tier. But failures do not propagate downstream as facts.
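The four signals above can be sketched as a small classifier that a downstream system might run on every agent result. This is a minimal illustration, not an Armalo API: `AgentResult`, its fields, and `classify` are hypothetical names, and obvious incorrectness is left out because it is caught by review rather than by a field on the response.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Signal(Enum):
    OK = "ok"
    EXPLICIT_ERROR = "explicit_error"
    FLAGGED_UNCERTAINTY = "flagged_uncertainty"
    TIMEOUT_OR_REFUSAL = "timeout_or_refusal"


@dataclass
class AgentResult:
    """Hypothetical result envelope; the field names are assumptions."""
    text: Optional[str]           # None models a timeout (no response)
    error: Optional[str] = None   # error message or exception text
    refused: bool = False         # agent declined the task as out of scope
    uncertain: bool = False       # agent attached an uncertainty marker


def classify(result: AgentResult) -> Signal:
    """Map a result to the failure signal it carries, if any."""
    if result.error is not None:
        return Signal.EXPLICIT_ERROR
    if result.text is None or result.refused:
        return Signal.TIMEOUT_OR_REFUSAL
    if result.uncertain:
        return Signal.FLAGGED_UNCERTAINTY
    return Signal.OK


print(classify(AgentResult(text=None)))
print(classify(AgentResult(text="The rate is 0.92")))
```

Note what the sketch cannot do: a confident answer with no error, refusal, or uncertainty flag is classified as `OK` whether or not it is true.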
Confabulation: The Non-Recoverable Mode
Confabulation is the failure mode where the agent, under uncertainty, produces a confident, plausible, internally consistent output that is factually wrong.
The key characteristic: the agent does not signal uncertainty. The output looks like a successful response. It has the structure of a correct answer. It uses the right vocabulary. It may even be partially correct, which makes the incorrect parts harder to detect.
A confabulated answer to "what is the current exchange rate for USDC/EUR?" might be a number that falls within a plausible range, is formatted correctly, and is returned without any uncertainty flag. The number is stale by six months. The downstream payment calculation runs on it. The error propagates.
An agent that confabulated did not fail. It succeeded in producing output that was consumed as correct. The failure is invisible until the downstream consequence surfaces, and by then multiple systems have acted on the incorrect data.
Why Standard Benchmarks Miss Confabulation
Standard accuracy benchmarks measure whether the agent gets the right answer. In a benchmark testing factual recall, the agent either returns the correct fact or does not. If it returns an incorrect fact, the benchmark records a miss.
What the benchmark does not distinguish:
- Did the agent return an incorrect fact with high expressed confidence?
- Did the agent return an incorrect fact with low expressed confidence?
- Did the agent refuse to answer and flag uncertainty?
From a safety perspective, these are completely different outcomes. An agent that says "I don't know" on hard questions is safer than an agent that says "The answer is X" with high confidence when X is wrong. Standard accuracy benchmarks score both the same: a miss.
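To make the gap concrete, here is a minimal sketch comparing plain accuracy with a breakdown that separates the three cases. The `Response` fields and the 0.8 cutoff for "high confidence" are assumptions chosen for illustration, not part of any standard benchmark.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Response:
    correct: bool       # graded against the reference answer
    abstained: bool     # agent refused or said "I don't know"
    confidence: float   # expressed confidence in [0, 1] (hypothetical field)


def accuracy(responses: Sequence[Response]) -> float:
    """Standard benchmark view: every non-correct response is just a miss."""
    return sum(r.correct for r in responses) / len(responses)


def outcome(r: Response, high: float = 0.8) -> str:
    """Split misses by how they were expressed; `high` is an assumed cutoff."""
    if r.correct:
        return "correct"
    if r.abstained:
        return "abstained"
    return "confident_miss" if r.confidence >= high else "hedged_miss"


def breakdown(responses: Sequence[Response]) -> Counter:
    return Counter(outcome(r) for r in responses)


# Two agents with identical accuracy and very different risk.
agent_a = [Response(True, False, 0.9), Response(False, True, 0.0)]
agent_b = [Response(True, False, 0.9), Response(False, False, 0.95)]
print(accuracy(agent_a), dict(breakdown(agent_a)))
print(accuracy(agent_b), dict(breakdown(agent_b)))
```

Both agents score 0.5 on accuracy, but only the second one produced a confident miss.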
Measuring confabulation requires a different eval design: present the agent with questions where the correct answer is "I don't know" or "I'm uncertain about this" and measure whether the agent acknowledges uncertainty or fabricates a confident wrong answer.
The metric is scope-honesty: the rate at which the agent correctly identifies its own uncertainty boundary vs. confabulates a confident answer outside that boundary.
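As a rough sketch, scope-honesty on a set of deliberately unanswerable questions could be computed as below. The keyword matcher is a deliberately naive stand-in (a real eval would need a more robust grader), and `UNCERTAINTY_MARKERS`, `flags_uncertainty`, and `scope_honesty` are hypothetical names.

```python
from typing import Sequence

# Naive marker list, assumed for illustration only.
UNCERTAINTY_MARKERS = (
    "i don't know",
    "i'm uncertain",
    "i'm not confident",
    "may be incorrect",
)


def flags_uncertainty(answer: str) -> bool:
    """True if the answer contains one of the assumed uncertainty markers."""
    text = answer.lower()
    return any(marker in text for marker in UNCERTAINTY_MARKERS)


def scope_honesty(answers: Sequence[str]) -> float:
    """Fraction of unanswerable probe questions where the agent flagged uncertainty.

    Every question in the probe set is assumed to have "I don't know" as the
    correct answer, so any unflagged answer counts as confabulation.
    """
    if not answers:
        raise ValueError("need at least one probe answer")
    return sum(flags_uncertainty(a) for a in answers) / len(answers)


probe_answers = [
    "I don't know the unpublished figure.",
    "The figure is 4.2 million.",
    "I'm not confident about this.",
    "It will close at 103.5 next Friday.",
]
honesty = scope_honesty(probe_answers)
print(honesty, 1 - honesty)  # scope-honesty and its complement
```

On a probe set built this way, the complement of scope-honesty is the share of questions answered with an unflagged fabrication.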
The Risk Profile Difference
| Failure Mode | Visibility | Propagation | Recovery | Risk Level |
|---|---|---|---|---|
| Explicit error | High | None | Immediate | Low |
| Flagged uncertainty | High | Conditional | Standard | Low-Medium |
| Obvious incorrectness | High | Low | Review | Medium |
| Confabulation (low stakes) | Low | Moderate | Delayed | Medium |
| Confabulation (high stakes) | Low | High | Difficult | High |
Confabulation on high-stakes tasks (financial data, legal information, medical context, technical specifications) has a fundamentally different risk profile than a visible failure. A visible failure is caught at the source. A confabulation propagates to every downstream system that consumed it as true.
Detecting Confabulation in Production
Confabulation detection requires adversarial evaluation: specifically, tests designed to probe the agent's behavior under uncertainty rather than its accuracy on known questions.
Three adversarial eval patterns for confabulation:
- Known-unknown questions. Questions where the answer is definitively "unknown" or outside the agent's scope: recent events, unpublished data, future predictions. Does the agent say "I don't know" or fabricate?
- Confidence calibration. Questions with known answers, where the agent's expressed confidence should correlate with its accuracy. An agent that expresses high confidence on wrong answers and low confidence on right answers has inverted calibration, a reliable signal of confabulation risk.
- Scope boundary probing. Requests at the edge of the agent's stated scope. Does the agent decline and flag the scope boundary, or does it attempt the task and produce speculative output without flagging it as such?
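The confidence calibration pattern can be sketched with a very simple statistic: mean expressed confidence on correct answers minus mean expressed confidence on wrong ones. This is one crude check among many possible calibration measures, and `calibration_gap` and `is_inverted` are hypothetical helper names.

```python
from statistics import mean
from typing import Sequence, Tuple

Graded = Tuple[float, bool]  # (expressed confidence in [0, 1], answer was correct)


def calibration_gap(graded: Sequence[Graded]) -> float:
    """Mean confidence on correct answers minus mean confidence on wrong answers.

    A positive gap means confidence tracks accuracy; a negative gap means the
    agent is more confident when it is wrong.
    """
    right = [conf for conf, ok in graded if ok]
    wrong = [conf for conf, ok in graded if not ok]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    return mean(right) - mean(wrong)


def is_inverted(graded: Sequence[Graded]) -> bool:
    """True when the agent is, on average, more confident on wrong answers."""
    return calibration_gap(graded) < 0


results = [(0.5, True), (0.25, True), (1.0, False), (0.75, False)]
print(calibration_gap(results), is_inverted(results))
```

A gap near zero is also worth attention: it suggests expressed confidence carries little information about correctness.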
The target metric: confabulation rate. On questions designed to probe the uncertainty boundary, what fraction of the time does the agent return a confident answer that is wrong instead of flagging uncertainty?
The Governance Response
The governance responses to failure and to confabulation differ:
For a high failure rate: Investigate the input distribution: are the task inputs within scope? Investigate model capability: is the agent being asked to do something it is not designed for? Investigate the system prompt: is it setting clear scope boundaries? Lower failure rates are achievable through better task routing and clearer scope definition.
For a high confabulation rate: This is a more fundamental behavioral issue. The agent has not learned to say "I don't know." The governance responses are:
- Score adjustment: the confabulation rate directly reduces the scope-honesty dimension of the composite trust score
- Task restriction: agents with high confabulation rates should not be trusted for high-stakes factual tasks
- Prompt intervention: explicit uncertainty acknowledgment instructions in the system prompt
- Evaluation gating: agents do not graduate to higher certification tiers if their confabulation rate exceeds a threshold
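The task restriction and evaluation gating responses above could be wired up roughly as follows. The 5% threshold and the permission names are placeholders chosen for illustration; the source does not specify actual thresholds or how certification tiers are represented.

```python
from typing import Dict

# Hypothetical threshold; a real policy would set this per tier and task class.
MAX_CONFABULATION_RATE = 0.05


def gate(confabulation_rate: float,
         threshold: float = MAX_CONFABULATION_RATE) -> Dict[str, bool]:
    """Return which privileges an agent keeps given its measured confabulation rate."""
    if not 0.0 <= confabulation_rate <= 1.0:
        raise ValueError("confabulation_rate must be between 0 and 1")
    within_limit = confabulation_rate <= threshold
    return {
        "high_stakes_tasks": within_limit,  # task restriction
        "tier_promotion": within_limit,     # evaluation gating
    }


print(gate(0.15))
print(gate(0.02))
```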
An agent that fails 15% of the time is a reliability problem. An agent that confabulates 15% of the time with high confidence is a trust problem. The distinction matters for how you govern it.
If your agents are deployed without confabulation-specific adversarial evals, you do not know whether they lie or merely fail. You need to know. armalo.ai includes scope-honesty as a scored dimension.
Frequently Asked Questions
What is confabulation in AI agents?
Confabulation is when an AI agent produces a confident, plausible, internally consistent output that is factually incorrect, without flagging uncertainty. Unlike failure (which is visible), confabulation propagates downstream as a correct answer and is only discovered when the downstream consequence surfaces.
Why is confabulation more dangerous than standard failure?
A failure is a signal: the downstream system knows something went wrong and can handle it. A confabulation provides no signal: the downstream system consumes the output as correct, potentially acting on incorrect information across multiple systems before the error is discovered.
How do you measure an agent's confabulation rate?
Confabulation rate is measured through adversarial evals that probe the uncertainty boundary: questions where the correct answer is "I don't know," confidence calibration tests, and scope boundary probing. The metric is the fraction of uncertainty-boundary questions that the agent answers confidently and incorrectly instead of flagging uncertainty.
What is scope-honesty as a trust dimension?
Scope-honesty is a dimension of an agent's composite trust score that measures how accurately the agent represents its own uncertainty boundary. A high scope-honesty score means the agent reliably flags uncertainty on hard questions and refuses to fabricate confident answers outside its knowledge boundary. A low score means the agent regularly confabulates.
Armalo AI scores agents on scope-honesty as a first-class trust dimension, measuring the difference between agents that fail and agents that lie. See armalo.ai.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle: public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts: turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace: hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders: register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai