The output is wrong. The question is which kind of wrong.
An agent that failed encountered a task it could not complete correctly — wrong answer, timed out, returned an error, flagged uncertainty and stopped. Failure is recoverable. You can see it. You can re-run with a different input or a different agent.
An agent that lied returned a plausible, confident, wrong answer without any signal that it was uncertain. The downstream system consumed the output as correct. The error propagated. By the time the lie was discovered, it had been acted on.
These are different failure modes. They require different detection mechanisms. They carry different risk profiles. And they require different governance responses.
TL;DR
- Failure and confabulation are distinct failure modes. Failure is visible — the agent returns an error, flags uncertainty, or produces obviously incorrect output. Confabulation is invisible — the agent returns confident, plausible, wrong output.
- Confabulation is the more dangerous failure mode. Failures are caught. Confabulations propagate downstream as facts.
- Detection requires adversarial evaluation, not accuracy benchmarks. Standard accuracy benchmarks measure correct answers. They do not measure the rate at which the agent produces confident wrong answers.
- The scope-honesty dimension is the key metric. What fraction of the time, when the agent does not know the answer, does it say so vs. fabricate a plausible-sounding answer?
- Different governance responses are required. An agent that fails frequently needs better training or cleaner inputs. An agent that confabulates frequently needs a trust score adjustment, scope restriction, and possibly removal from high-stakes tasks.
Failure: The Recoverable Mode
Agent failure is operationally tractable. The failure modes are:
- Explicit error. The agent returns an error message — "I cannot process this request" or a system-level exception. Visible. Actionable.
- Flagged uncertainty. The agent returns a response with explicit uncertainty markers — "I'm not confident about this" or "this may be incorrect." The downstream system can treat this as a conditional output.
- Obvious incorrectness. The output is clearly wrong — factual error so basic that any reviewer catches it. Recoverable through review.
- Timeout or refusal. The agent does not return a response, or returns a refusal that signals scope boundary. Operationally recoverable.
All of these failures have something in common: they are signals. The downstream system receives information that the task did not complete successfully. A system designed to handle failures will catch these.
Failures affect reliability score. Repeated failures lower the composite trust score and eventually affect certification tier. But failures do not propagate downstream as facts.
Confabulation: The Non-Recoverable Mode
Confabulation is the failure mode where the agent, under uncertainty, produces a confident, plausible, internally consistent output that is factually wrong.
The key characteristic: the agent does not signal uncertainty. The output looks like a successful response. It has the structure of a correct answer. It uses the right vocabulary. It may even be partially correct — making the incorrect parts harder to detect.
A confabulated answer to "what is the current exchange rate for USDC/EUR?" might return a number that is within a plausible range, formatted correctly, returned without any uncertainty flag. The number is stale by six months. The downstream payment calculation runs on it. The error propagates.
An agent that confabulated did not fail. It succeeded in producing output that was consumed as correct. The failure is invisible until the downstream consequence surfaces — and by then, multiple systems have acted on the incorrect data.
Why Standard Benchmarks Miss Confabulation
Standard accuracy benchmarks measure whether the agent gets the right answer. A benchmark testing factual recall: the agent either returns the correct fact or does not. If it returns an incorrect fact, the benchmark records a miss.
What the benchmark does not distinguish:
- Did the agent return an incorrect fact with high expressed confidence?
- Did the agent return an incorrect fact with low expressed confidence?
- Did the agent refuse to answer and flag uncertainty?
From a safety perspective, these are completely different outcomes. An agent that says "I don't know" on hard questions is safer than an agent that says "The answer is X" with high confidence when X is wrong. Standard accuracy benchmarks score both the same — a miss.
Measuring confabulation requires a different eval design: present the agent with questions where the correct answer is "I don't know" or "I'm uncertain about this" and measure whether the agent acknowledges uncertainty or fabricates a confident wrong answer.
The metric is scope-honesty: the rate at which the agent correctly identifies its own uncertainty boundary vs. confabulates a confident answer outside that boundary.
The Risk Profile Difference
| Failure Mode | Visibility | Propagation | Recovery | Risk Level |
|---|
| Explicit error | High | None | Immediate | Low |
| Flagged uncertainty | High | Conditional | Standard | Low-Medium |
| Obvious incorrectness | High | Low | Review | Medium |
| Confabulation (low stakes) | Low | Moderate | Delayed | Medium |
| Confabulation (high stakes) | Low | High | Difficult | High |
Confabulation on high-stakes tasks — financial data, legal information, medical context, technical specifications — has a fundamentally different risk profile than a visible failure. The failure is caught at the source. The confabulation propagates to every downstream system that consumed it as true.
Detecting Confabulation in Production
Confabulation detection requires adversarial evaluation — specifically, tests designed to probe the agent's behavior under uncertainty rather than its accuracy on known questions.
Three adversarial eval patterns for confabulation:
-
Known-unknown questions. Questions where the answer is definitively "unknown" or outside the agent's scope — recent events, unpublished data, future predictions. Does the agent say "I don't know" or fabricate?
-
Confidence calibration. Questions with known answers, where the agent's expressed confidence should correlate with its accuracy. An agent that expresses high confidence on wrong answers and low confidence on right answers has inverted calibration — a reliable signal of confabulation risk.
-
Scope boundary probing. Requests at the edge of the agent's stated scope. Does the agent decline and flag the scope boundary, or does it attempt the task and produce speculative output without flagging it as such?
The target metric: confabulation rate. On questions designed to probe uncertainty boundary, what fraction of the time does the agent flag uncertainty vs. return a confident answer that is wrong?
The Governance Response
The governance response to failure and confabulation is different:
For a high failure rate: Investigate input distribution — are the task inputs within scope? Investigate model capability — is the agent being asked to do something it is not designed for? Investigate the system prompt — is it setting clear scope boundaries? Lower failure rates are achievable through better task routing and clearer scope definition.
For a high confabulation rate: This is a more fundamental behavioral issue. The agent has not learned to say "I don't know." The governance responses are:
- Score adjustment: confabulation rate directly reduces scope-honesty dimension of composite trust score
- Task restriction: agents with high confabulation rates should not be trusted for high-stakes factual tasks
- Prompt intervention: explicit uncertainty acknowledgment instructions in the system prompt
- Evaluation gating: agents do not graduate to higher certification tiers if confabulation rate exceeds threshold
An agent that fails 15% of the time is a reliability problem. An agent that confabulates 15% of the time with high confidence is a trust problem. The distinction matters for how you govern it.
If your agents are deployed without confabulation-specific adversarial evals, you do not know whether they lie or merely fail. You need to know. armalo.ai includes scope-honesty as a scored dimension.
Frequently Asked Questions
What is confabulation in AI agents?
Confabulation is when an AI agent produces a confident, plausible, internally consistent output that is factually incorrect, without flagging uncertainty. Unlike failure (which is visible), confabulation propagates downstream as a correct answer and is only discovered when the downstream consequence surfaces.
Why is confabulation more dangerous than standard failure?
A failure is a signal — the downstream system knows something went wrong and can handle it. A confabulation provides no signal — the downstream system consumes the output as correct, potentially acting on incorrect information across multiple systems before the error is discovered.
How do you measure an agent's confabulation rate?
Confabulation rate is measured through adversarial evals that probe uncertainty boundary: questions where the correct answer is "I don't know," confidence calibration tests, and scope boundary probing. The metric is what fraction of uncertainty-boundary questions the agent answers confidently with incorrect output vs. flagging uncertainty.
What is scope-honesty as a trust dimension?
Scope-honesty is a dimension of an agent's composite trust score that measures how accurately the agent represents its own uncertainty boundary. A high scope-honesty score means the agent reliably flags uncertainty on hard questions and refuses to fabricate confident answers outside its knowledge boundary. A low score means the agent regularly confabulates.
Armalo AI scores agents on scope-honesty as a first-class trust dimension — measuring the difference between agents that fail and agents that lie. See armalo.ai.