Agentic Hallucinations Are Different. Here's How to Detect Them.
LLM hallucinations in chat are annoying. In autonomous agents, they cause financial loss, legal exposure, and broken workflows. Here's the taxonomy and detection architecture that actually works.
Hallucination discourse is dominated by chatbot examples. A model claims a historical figure said something they didn't. A model invents a court case citation. These are annoying, sometimes embarrassing, and occasionally harmful. They are also, relative to what happens with agentic hallucinations, fairly benign.
When an AI agent hallucinates, it doesn't just produce wrong text. It takes wrong actions. It calls tools that don't exist with arguments that were invented. It reports results from operations it never actually performed. It writes fabricated data into memory that contaminates every downstream decision. The failure mode isn't a wrong sentence — it's a wrong workflow, a wrong transaction, a broken trust chain.
Understanding this distinction is the prerequisite for building detection systems that actually work. Agentic hallucinations require a fundamentally different detection architecture than conversational hallucinations, because the failure surface is fundamentally different.
TL;DR
- Agentic hallucinations are action-layer failures: Unlike conversational hallucinations, they manifest as invented tool calls, fabricated results, and false memory entries — not just wrong text.
- Four distinct hallucination types require four distinct detection mechanisms: Fact fabrication, tool call invention, result forgery, and memory contamination each have different signatures and different detection approaches.
- Deterministic eval checks are the first line of defense: Format validation, schema checking, and reference matching catch the majority of hallucinations before they propagate.
- Post-hoc detection is insufficient: By the time you audit an agent's outputs for hallucinations, the damage may already be done downstream.
- Memory contamination is the most dangerous and least discussed type: A hallucinated fact written to persistent memory becomes a reference point for future reasoning, compounding the error over time.
The Taxonomy of Agentic Hallucinations
Not all agentic hallucinations look the same, and treating them as a single phenomenon produces detection systems that catch some types while missing others entirely.
Fact fabrication is the most familiar type and occurs when an agent makes claims about external facts that it cannot verify through its actual tools. An agent tasked with summarizing a company's financial performance might state specific revenue figures without querying the actual financial data API. The claim is plausible, formatted correctly, and presented with apparent confidence — but it's invented. Detection requires comparing agent claims against actual retrieved data, not just evaluating the plausibility of the text.
Tool call invention is the agentic-specific hallucination type that has no analog in conversational AI. It occurs when an agent claims to have called a tool — or actually calls a tool — that doesn't exist in its available toolkit, or calls a real tool with invented parameters. This can manifest as an agent reporting the results of a database query it never made, or as a runtime error when the agent attempts to invoke a nonexistent function. Detection requires tool call schema validation at the API gateway level, not just at the application layer.
Result forgery is among the most dangerous types because it is so hard to catch in real time. It occurs when an agent calls a real tool with valid parameters but reports different results than the tool actually returned. An agent might call a payment processing API, receive a failure response, and report to the orchestrator that the payment succeeded. The tool call log exists and looks valid; the schema check passes; the only way to detect this is to compare the agent's claimed result against the actual API response from the provider.
Memory contamination is the slowest-burning and least-discussed hallucination type. It occurs when an agent writes fabricated information to its persistent memory store — vector databases, key-value stores, or any form of long-term context. Once a hallucinated fact is in memory, it becomes a reference point for future reasoning. Other agents querying that memory receive the contaminated fact. Decisions built on contaminated memory can produce incorrect actions that are traced back to "retrieved context" rather than hallucination, masking the original failure. Detection requires memory attestation — cryptographic verification that memory entries correspond to actual observed events, not synthesized claims.
Why Conversational Detection Techniques Fail for Agents
The standard techniques for detecting conversational hallucinations don't translate well to the agentic setting.
Semantic similarity checks — comparing model output against retrieved documents for entailment — work when the hallucination is a claim about static facts. They fail when the hallucination is about a dynamic event (an API call result, a transaction outcome) because there's no reference document to compare against at detection time.
LLM-based fact checking — asking a second model to evaluate whether a claim is plausible — is entirely unsuitable for agentic hallucination detection. A plausibility check can't determine whether an agent actually called a tool or just claimed to call a tool. Plausibility is the wrong signal.
Human review at scale is impossible in agentic pipelines. Agents running automated workflows may make hundreds of tool calls per task. No human review cycle can operate at that frequency.
The detection architecture for agentic hallucinations must be event-driven, automated, and integrated into the execution pipeline — not applied post-hoc.
Detection Architecture: Four Mechanisms for Four Types
Fact fabrication is best detected through retrieval verification — comparing agent claims against the actual content of retrieved documents at the time of claim generation. If an agent claims that a company's Q3 revenue was $4.2 billion, and the actual retrieved financial document shows $3.8 billion, that's a detectable fact fabrication. The implementation requires logging retrieval results alongside agent outputs and running entailment checks at evaluation time.
Armalo's deterministic eval checks include retrieval-claim consistency verification as a standard evaluation step. The check extracts factual claims from agent outputs (using structured extraction), identifies which claims are about retrieved content, and compares those claims against the actual retrieved text using semantic entailment. Claims about facts not in the retrieved context are flagged as potential fabrications.
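A minimal sketch of that flow in Python, under stated assumptions: `extract_claims` stands in for the structured-extraction step and `entails` for a real NLI entailment model (both reduced to trivial placeholders so the sketch runs end to end); the orchestration logic is the point, not the placeholder heuristics.

```python
from dataclasses import dataclass

@dataclass
class ClaimCheck:
    claim: str
    supported: bool

def extract_claims(agent_output: str) -> list[str]:
    # Placeholder: a production system would use structured LLM
    # extraction to pull discrete factual claims from the output.
    return [s.strip() for s in agent_output.split(".") if s.strip()]

def entails(premise: str, hypothesis: str) -> float:
    # Placeholder: a production system would call an NLI model here.
    # Crude token overlap keeps the sketch self-contained.
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return len(p & h) / max(len(h), 1)

def check_retrieval_consistency(agent_output: str,
                                retrieved_docs: list[str],
                                threshold: float = 0.8) -> list[ClaimCheck]:
    """Flag claims that no retrieved document entails --
    candidate fact fabrications."""
    return [
        ClaimCheck(claim, any(entails(doc, claim) >= threshold
                              for doc in retrieved_docs))
        for claim in extract_claims(agent_output)
    ]
```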
Tool call invention is detected through schema validation at the execution layer. Every tool call an agent makes is validated against the declared tool schema before execution — does this tool exist, does this argument match the expected type, is this parameter value within valid bounds? Calls to nonexistent tools are rejected with a logged error. Calls with invalid arguments are rejected before execution. This prevents the agent from receiving a plausible-looking error message that it might then interpret or report incorrectly.
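As a sketch of what gateway-side validation can look like, using the widely available `jsonschema` package; the registry shape and the `query_orders` tool are illustrative assumptions, not Armalo's API.

```python
import jsonschema

# Hypothetical tool registry: declared name -> JSON Schema for its arguments.
TOOL_SCHEMAS = {
    "query_orders": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 100},
        },
        "required": ["customer_id"],
        "additionalProperties": False,
    },
}

class ToolCallRejected(Exception):
    pass

def validate_tool_call(tool_name: str, arguments: dict) -> None:
    """Reject the call before execution if the tool doesn't exist
    or the arguments don't match its declared schema."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        # Tool call invention: the agent named a nonexistent tool.
        raise ToolCallRejected(f"unknown tool: {tool_name!r}")
    try:
        jsonschema.validate(instance=arguments, schema=schema)
    except jsonschema.ValidationError as e:
        # Parameter hallucination: real tool, invented arguments.
        raise ToolCallRejected(f"invalid arguments for {tool_name}: {e.message}")
```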
The more subtle variant — an agent claiming in its output to have called a tool it didn't actually call — is detected through execution trace auditing. Every output that claims a tool was called is cross-referenced against the actual execution trace. If the output claims a database was queried but no database query appears in the execution trace, that's a result claim without corresponding execution — a hallucination flag.
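A sketch of that cross-reference, assuming the claimed tool names have already been extracted from the agent's output and that each trace entry records the tool it invoked:

```python
def audit_claimed_calls(claimed_tools: set[str],
                        execution_trace: list[dict]) -> set[str]:
    """Return tools the output claims to have called that never
    appear in the runtime's execution trace -- hallucination flags."""
    executed = {entry["tool"] for entry in execution_trace}
    return claimed_tools - executed

flags = audit_claimed_calls(
    claimed_tools={"query_orders", "send_email"},
    execution_trace=[{"tool": "query_orders", "status": "ok"}],
)
# flags == {"send_email"}: claimed in the output, never executed.
```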
Result forgery is detected through response binding. When a tool call is made, the actual response from that tool is bound to the agent's reported result at collection time, not at evaluation time. The binding creates an immutable record: this tool was called, this was the actual response, and this is what the agent reported. Discrepancy between actual response and reported result is detected automatically.
Implementation requires that the agent runtime — not the agent itself — be responsible for persisting tool call results. An agent that can freely report tool results is an agent that can forge them. The architecture must make result fabrication mechanically impossible, not just policy-prohibited.
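One way to sketch that binding, with the runtime hashing the actual response at collection time so any later discrepancy with the agent's report is mechanically detectable; the record shape here is an assumption, not a specified format.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field

@dataclass(frozen=True)
class BoundResult:
    """Immutable record written by the runtime, not the agent."""
    call_id: str
    tool: str
    response_hash: str
    response: dict
    timestamp: float = field(default_factory=time.time)

def _digest(payload: dict) -> str:
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def bind_response(call_id: str, tool: str, response: dict) -> BoundResult:
    """Called by the runtime the moment the tool responds."""
    return BoundResult(call_id, tool, _digest(response), response)

def detect_forgery(bound: BoundResult, reported: dict) -> bool:
    """True if the agent's reported result diverges from what the
    tool actually returned at collection time."""
    return _digest(reported) != bound.response_hash
```

The key design choice is that `bind_response` runs inside the runtime, so the agent never has write access to the bound record it is later checked against.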
Memory contamination is the hardest to detect because it requires distinguishing between legitimate memory writes (recording observed events) and hallucinated memory writes (recording invented events). Armalo's memory attestation system addresses this by requiring that memory writes include a provenance chain: what event this memory entry is based on, what tool call or observation generated it, and a cryptographic signature linking the memory entry to the source event.
A memory entry without a valid provenance chain is flagged as unattested and treated with reduced trust in downstream reasoning. Agents are built to prefer attested memories over unattested ones when both are available.
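The attestation format isn't public; as a minimal sketch under assumptions, a symmetric HMAC key held by the runtime stands in for whatever signature scheme is actually used:

```python
import hashlib
import hmac
import json

# Assumed: a secret held by the runtime and never exposed to the agent.
RUNTIME_KEY = b"replace-with-runtime-held-secret"

def _payload(entry: dict, source_event_id: str, content_hash: str) -> bytes:
    return json.dumps(
        {"entry": entry, "source": source_event_id, "hash": content_hash},
        sort_keys=True,
    ).encode()

def sign_memory_entry(entry: dict, source_event_id: str,
                      source_content: bytes) -> dict:
    """Attach a provenance chain: source event reference, a hash of
    the source material, and an HMAC binding the entry to both."""
    content_hash = hashlib.sha256(source_content).hexdigest()
    signature = hmac.new(
        RUNTIME_KEY, _payload(entry, source_event_id, content_hash),
        hashlib.sha256,
    ).hexdigest()
    return {**entry, "provenance": {
        "source_event": source_event_id,
        "content_hash": content_hash,
        "signature": signature,
    }}

def is_attested(entry: dict) -> bool:
    """Verify the provenance chain before trusting the entry."""
    prov = entry.get("provenance")
    if not prov:
        return False
    bare = {k: v for k, v in entry.items() if k != "provenance"}
    expected = hmac.new(
        RUNTIME_KEY, _payload(bare, prov["source_event"], prov["content_hash"]),
        hashlib.sha256,
    ).hexdigest()
    return hmac.compare_digest(expected, prov["signature"])
```

An asymmetric signature would let third parties verify entries without holding the runtime's key; the HMAC above is simply the smallest construction that shows the binding.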
Hallucination Type × Detection Framework
| Hallucination Type | Detection Method | Detection Layer | Risk Level | Real-World Consequence |
|---|---|---|---|---|
| Fact fabrication | Retrieval-claim consistency check | Evaluation pipeline | Medium | Wrong decisions based on invented data |
| Tool call invention | Schema validation at execution gateway | Runtime | High | Failed operations reported as successful |
| Result forgery | Response binding — actual vs. reported | Runtime + audit | Critical | Financial transactions reported incorrectly |
| Memory contamination | Provenance-chain attestation | Memory write layer | Critical | Cascading errors across future sessions |
| Parameter hallucination | Type/range validation before execution | Runtime | Medium | API errors, unintended side effects |
| Confidence fabrication | Calibration checking on uncertainty claims | Evaluation pipeline | Medium | Over-reliance on unreliable outputs |
Why Post-Hoc Detection Isn't Enough
The framing of hallucination detection as an evaluation or auditing problem misses the point. By the time you audit an agent's outputs for hallucinations, several things may have already happened: downstream systems have been called based on hallucinated results, memory stores have been contaminated, users have received incorrect information, or financial transactions have been processed based on invented data.
Post-hoc detection is useful for pattern analysis — understanding what types of hallucinations an agent is prone to, and improving the agent's pacts or training data accordingly. But it's insufficient as a runtime safety mechanism.
The correct architecture is hallucination prevention at execution time, combined with fast post-hoc detection for audit and improvement. Prevention is primarily architectural — making result forgery mechanically impossible, requiring memory provenance chains, validating tool calls before execution. Post-hoc detection catches the hallucinations that slip through and provides the data needed to close those gaps.
The Connection to Trust Scoring
Hallucination detection isn't just a safety mechanism — it's a trust signal. Armalo's composite trust score includes accuracy (14%) and safety (11%) as major dimensions, both of which are directly impacted by an agent's hallucination rate.
An agent with a high hallucination rate on fact claims scores poorly on accuracy. An agent that invents tool calls or forges results scores poorly on safety. An agent with contaminated memory scores poorly on reliability, because its downstream decisions are built on a corrupted foundation.
Importantly, the self-audit dimension (9% of the composite score, via Metacal™) rewards agents that correctly identify their own uncertain claims. An agent that says "I'm not certain of this figure — please verify" when it's reaching beyond its retrieved context scores better than an agent that states uncertain claims with false confidence. The scoring architecture creates incentives for honest uncertainty expression, which is the correct behavioral signal.
Frequently Asked Questions
How is agentic hallucination different from chatbot hallucination in practice? Chatbot hallucinations are wrong text that a human reads and might act on incorrectly. Agentic hallucinations are wrong actions that systems execute automatically. The damage amplification is enormous: a chatbot hallucination affects one person who reads the response; an agentic hallucination can affect every downstream system, user, or database touched by that agent's workflow.
Can better prompting eliminate agentic hallucinations? Prompting can reduce the frequency of some hallucination types, particularly fact fabrication. It cannot prevent tool call invention or result forgery — those are execution-layer failures, not generation-layer failures. And prompting cannot prevent memory contamination once a hallucinated fact is in the memory store. Architecture beats prompting for the most dangerous hallucination types.
What's the difference between a hallucination and an error? An error is when an agent attempts a valid operation and fails (API timeout, network error, invalid credentials). A hallucination is when an agent fabricates the attempt or the result. Errors are generally detectable and recoverable. Hallucinations can be invisible until the damage propagates.
How does memory attestation work technically? Memory writes include a provenance chain: a reference to the source event (tool call ID, document retrieval ID, or observation timestamp), a content hash of the source material, and a cryptographic signature binding the memory entry to that source. Memory reads can verify the provenance chain before using the memory entry in reasoning. Entries without valid chains are flagged as unattested.
What's the performance overhead of runtime hallucination detection? Schema validation and response binding add negligible overhead — these are synchronous checks on structured data. Retrieval-claim consistency checking is the most expensive component, typically adding 100-500ms depending on the complexity of the claims being verified. For most agentic workflows, this overhead is acceptable given the risk profile.
How do you prevent false positives from overly aggressive hallucination detection? Threshold tuning is critical. Detection systems that flag too aggressively will interrupt valid agent operation, creating usability problems that cause developers to disable the detection. Armalo tunes hallucination thresholds using calibration data from known-good agent outputs, targeting a false positive rate below 2% for runtime detection.
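A minimal sketch of that calibration step, assuming a detector that emits a suspicion score per output and a corpus of scores from known-good outputs:

```python
def calibrate_threshold(benign_scores: list[float],
                        target_fpr: float = 0.02) -> float:
    """Choose a flagging threshold from detector scores on known-good
    outputs so that roughly `target_fpr` of benign outputs would be
    flagged (higher score = more hallucination-like)."""
    if not benign_scores:
        raise ValueError("need calibration data")
    ranked = sorted(benign_scores)
    # Threshold at the (1 - target_fpr) quantile of the benign distribution.
    cutoff = min(len(ranked) - 1, int(len(ranked) * (1 - target_fpr)))
    return ranked[cutoff]
```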
What happens when hallucination is detected at runtime? The behavior depends on the severity and detection point. Tool call schema violations halt execution before the tool is called. Response binding mismatches trigger an alert and pause the workflow pending human review. Memory write provenance failures write the entry as unattested rather than blocking the write. In all cases, the event is logged with full context for post-hoc analysis.
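Expressed as a dispatch sketch; the enum names and return actions are illustrative of the policy just described, not a real API.

```python
import logging
from enum import Enum, auto

logger = logging.getLogger("hallucination-detection")

class Detection(Enum):
    SCHEMA_VIOLATION = auto()   # caught before the tool runs
    BINDING_MISMATCH = auto()   # actual vs. reported result differ
    UNATTESTED_MEMORY = auto()  # provenance chain missing or invalid

def log_event(detection: Detection, context: dict) -> None:
    # Always log full context for post-hoc analysis.
    logger.warning("detection=%s context=%s", detection.name, context)

def handle(detection: Detection, context: dict) -> str:
    """Severity-appropriate response, mirroring the policy above."""
    log_event(detection, context)
    if detection is Detection.SCHEMA_VIOLATION:
        return "halt_before_execution"
    if detection is Detection.BINDING_MISMATCH:
        return "pause_workflow_for_human_review"
    return "write_as_unattested"  # degrade trust rather than block
```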
Key Takeaways
- Agentic hallucinations manifest as wrong actions, not just wrong text — the failure surface is fundamentally different from conversational hallucinations.
- The four hallucination types (fact fabrication, tool call invention, result forgery, memory contamination) have different signatures and require different detection mechanisms.
- Memory contamination is the most dangerous type because hallucinated facts become reference points for future reasoning, compounding the error across sessions.
- Result forgery is only detectable if the architecture makes response binding mandatory — agents cannot be trusted to self-report tool results accurately.
- Post-hoc detection is useful for pattern analysis but insufficient as a safety mechanism — runtime prevention must be the first line of defense.
- Better prompting cannot fix execution-layer hallucinations; only architectural constraints (schema validation, response binding, provenance chains) can address those types.
- Hallucination rate directly impacts trust scoring — accuracy, safety, and self-audit dimensions all penalize agents that fabricate claims rather than acknowledging uncertainty.
Armalo Team is the engineering and research team behind Armalo AI, the trust layer for the AI agent economy. Armalo provides behavioral pacts, multi-LLM evaluation, composite trust scoring, and USDC escrow for AI agents. Learn more at armalo.ai.