AI agents confabulate. This is widely known. What has not been published, to our knowledge, is how often they confabulate in production, at what severity, for which roles, and under which operational conditions. This paper provides that baseline.
We operate a multi-agent administrative swarm consisting of 60+ distinct agent roles executing across 310 behavioral loops. The swarm handles customer management, content generation, adversarial testing, trading operations, sales outreach, platform governance, and infrastructure management. All agent actions route through a confabulation detection layer (HonestyGuard) that intercepts tool calls, evaluates claims against the evidence snapshot, and records findings with severity scores and resolution outcomes. The data reported here is drawn from that production system over a 7-day window.
1. Measurement Design
The HonestyGuard plugin operates at the pre-execution hook level: before any agent tool call is committed to the database, the call's claims are evaluated against the agent's evidence snapshot. A claim is flagged when the detector identifies a statement that is either (a) numeric and unsupported by the evidence snapshot, (b) an action claim ("I sent the email", "I deployed the fix") without a corresponding prior operation in the same execution context, or (c) a status assertion ("the customer is satisfied", "the task is complete") that the agent's observable state does not support.
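The three flagging rules can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not HonestyGuard's implementation: the claim shape (`kind`, `value`, `action`, `status`), the snapshot fields (`numbers`, `priorOperations`, `observableStatus`), and the returned labels are all hypothetical names introduced here.

```javascript
// Hypothetical claim and snapshot shapes; none of these names come from HonestyGuard.
// claim:    { kind: "numeric" | "action" | "status", value?, action?, status? }
// snapshot: { numbers: number[], priorOperations: string[], observableStatus: string[] }

function flagClaim(claim, snapshot) {
  switch (claim.kind) {
    case "numeric":
      // Rule (a): a number that appears nowhere in the evidence snapshot.
      return snapshot.numbers.includes(claim.value) ? null : "unsupported_numeric";
    case "action":
      // Rule (b): an action claim with no matching prior operation in this context.
      return snapshot.priorOperations.includes(claim.action) ? null : "unbacked_action";
    case "status":
      // Rule (c): a status assertion the observable state does not support.
      return snapshot.observableStatus.includes(claim.status) ? null : "unsupported_status";
    default:
      return null; // Unknown claim kinds are not flagged in this sketch.
  }
}

// Made-up snapshot for illustration only.
const snapshot = {
  numbers: [12],
  priorOperations: ["send_email"],
  observableStatus: ["task_open"],
};

const results = [
  flagClaim({ kind: "action", action: "send_email" }, snapshot),
  flagClaim({ kind: "action", action: "deploy_fix" }, snapshot),
  flagClaim({ kind: "numeric", value: 50000 }, snapshot),
  flagClaim({ kind: "status", status: "task_complete" }, snapshot),
];
console.log(results);
```

In this sketch a claim backed by the snapshot returns `null`, and an unbacked claim returns a label naming the rule that fired. A production detector would need fuzzier matching than exact membership tests.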
Findings are assigned a severity score on [0,1] based on the potential impact of the confabulated claim if acted upon. A confabulated "email sent" for a routine notification carries lower severity than a confabulated "deal closed at $50,000" or "regulatory audit passed."
Resolution status records the downstream handling: penalty_applied indicates that the automated consequence mechanism fired and reduced the agent's score or escrow; resolved indicates human or automated review confirmed and closed the finding; open indicates the finding remains in the active queue.
The confabulation_findings table has been active since 2026-05-11. This paper analyzes all 2,698 findings recorded through 2026-05-18.
2. Aggregate Results
Volume. 2,698 findings from 30 distinct agent roles (of the 60+ in the swarm) across 2 organizations in 7 days, an average of approximately 385 findings per day across the full system.
Severity distribution. Mean severity across all findings: 0.891. Severity tiers:
- High (≥0.8): 92.3% of all findings
- Medium (0.5 to <0.8): 7.7%
- Low (<0.5): 0.0%
The absence of low-severity findings is notable: it suggests the detector's threshold is calibrated above the noise floor of typical LLM output variation, so every finding that reaches the table represents a meaningfully anomalous claim.
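The tiering can be expressed as a short function. The sketch below assumes that a score of exactly 0.8 falls in the high tier and exactly 0.5 in the medium tier, as the ≥ and < signs in the list imply; the function names and the sample scores are illustrative, not production data.

```javascript
// Tier boundaries as stated in the text: high >= 0.8, medium >= 0.5, low < 0.5.
function severityTier(score) {
  // The negated range check also rejects NaN and non-numeric input.
  if (!(score >= 0 && score <= 1)) {
    throw new RangeError(`severity must be in [0, 1], got ${score}`);
  }
  if (score >= 0.8) return "high";
  if (score >= 0.5) return "medium";
  return "low";
}

// Fraction of findings per tier; an empty input yields all zeros.
function tierShares(scores) {
  const counts = { high: 0, medium: 0, low: 0 };
  for (const s of scores) counts[severityTier(s)] += 1;
  const n = scores.length || 1;
  return { high: counts.high / n, medium: counts.medium / n, low: counts.low / n };
}

// Made-up scores for illustration only.
const exampleShares = tierShares([0.95, 0.8, 0.79, 0.5]);
console.log(exampleShares);
```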
Resolution status. Of 2,698 findings:
- penalty_applied: 1,272 (47.1%), automated consequence mechanism fired
- resolved: 778 (28.8%), reviewed and closed
- open: 648 (24.0%), awaiting resolution
The 47.1% penalty_applied rate indicates the automated pipeline is penalizing just under half of detections within the observation window; together with resolved findings, 2,050 of 2,698 (about 76%) have left the queue. The 24.0% open rate represents the active queue at time of measurement, not a backlog.
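The resolution shares follow directly from the reported counts. As a quick arithmetic check, the sketch below recomputes them, along with the combined share of findings that have left the queue and the average daily volume. The variable names are illustrative; only the three counts and the 7-day window come from the text.

```javascript
// Counts as reported in this section.
const statusCounts = { penalty_applied: 1272, resolved: 778, open: 648 };
const totalFindings = Object.values(statusCounts).reduce((a, b) => a + b, 0);

// Percentage with one decimal place, returned as a string.
const pct = (part, whole) => ((100 * part) / whole).toFixed(1);

const shares = Object.fromEntries(
  Object.entries(statusCounts).map(([status, n]) => [status, pct(n, totalFindings)])
);

// Findings that have left the queue by either route.
const closedShare = pct(statusCounts.penalty_applied + statusCounts.resolved, totalFindings);

// Average daily volume over the 7-day window.
const perDay = totalFindings / 7;

console.log(totalFindings, shares, closedShare, perDay.toFixed(1));
```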
3. Role-Level Analysis
The top 10 roles by finding volume account for 89.5% of all findings. The five largest are:
| Role | Findings | Mean Severity | High (≥0.8) | Med (0.5 to <0.8) |
|---|---|---|---|---|
| operator | 556 | 0.848 | 548 | 8 |
| sales | 333 | 0.962 | 300 | 33 |
| governor | 275 | 0.821 | 133 | 142 |
| research-director | 221 | 0.938 | 210 | 11 |
| dom | 219 | 0.897 | 216 | 3 |
Three patterns are visible:
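High-volume, lower-severity roles (operator, governor) have the lowest mean severity among the roles shown in the table (0.848 and 0.821). Governor is the clearest case: 142 of its 275 findings are medium-severity, whereas operator's findings are almost all high (548 of 556). These roles make many operational claims across many loops; the detection surface is large and some claims are in the moderate range.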
Lower-volume, near-maximum-severity roles (mia, autoresearch, pm, redteam) show severity approaching 0.95 with near-zero medium findings. These roles either confabulate rarely and seriously, or their operational claims are inherently high-stakes when they go wrong.
Sales and research-director combine high volume with high severity — roles where confabulated outputs are most likely to have external consequences (a confabulated revenue forecast or a fabricated research finding).
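The per-role mix can be recomputed from the table rows. The sketch below assumes no role has low-tier findings (consistent with the 0.0% low tier in Section 2), so each role's medium count is derived as findings minus high. It covers only the five roles listed in the table, and the field names are illustrative.

```javascript
// Rows copied from the role table: findings, mean severity, high-tier count.
const roles = [
  { role: "operator", findings: 556, meanSeverity: 0.848, high: 548 },
  { role: "sales", findings: 333, meanSeverity: 0.962, high: 300 },
  { role: "governor", findings: 275, meanSeverity: 0.821, high: 133 },
  { role: "research-director", findings: 221, meanSeverity: 0.938, high: 210 },
  { role: "dom", findings: 219, meanSeverity: 0.897, high: 216 },
];
const TOTAL = 2698; // All findings in the 7-day window.

// Assumption: no low-tier findings exist, so medium = findings - high.
const summary = roles.map((r) => ({
  role: r.role,
  medium: r.findings - r.high,
  mediumPct: ((100 * (r.findings - r.high)) / r.findings).toFixed(1),
  shareOfAllPct: ((100 * r.findings) / TOTAL).toFixed(1),
}));

// Combined volume of the listed roles and its share of all findings.
const listedTotal = roles.reduce((sum, r) => sum + r.findings, 0);
const listedSharePct = ((100 * listedTotal) / TOTAL).toFixed(1);

console.table(summary);
console.log(listedTotal, listedSharePct);
```

Governor stands out with a medium share above 50%, while operator's medium share is under 2%. The two roles share a low mean severity but differ sharply in tier mix.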
4. Loop-Level Analysis
The loop that contributes the most findings is labeled unknown (1,578 findings, 58.5% of total), indicating findings from heartbeat contexts where the loop name was not captured in the execution metadata. This is an instrumentation gap, not a structural one — loop attribution is written as metadata and some execution paths predate the field.
Among named loops, the grounding-canary loop (712 findings) generates the most findings, consistent with its function: it specifically probes for confabulation by testing agent claims against a verified ground-truth fixture. These findings are expected by design — the canary exists to measure the rate.
The swarm-coordination loop (142 findings) generates findings at lower severity (mean 0.700), consistent with the coordinator role making more modest and bounded operational claims than domain specialists.
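The loop figures above imply how much of the dataset is actually attributable. The sketch below is a small arithmetic check using only the counts stated in this section; the "other named loops" figure is derived by subtraction rather than reported directly, and the variable names are illustrative.

```javascript
const TOTAL_FINDINGS = 2698;
const loopCounts = {
  unknown: 1578,
  "grounding-canary": 712,
  "swarm-coordination": 142,
};

// Percentage with one decimal place, returned as a string.
const pct = (part, whole) => ((100 * part) / whole).toFixed(1);

// Findings that carry a loop name at all.
const attributed = TOTAL_FINDINGS - loopCounts.unknown;

// Derived by subtraction: named loops other than the two discussed above.
const otherNamed =
  attributed - loopCounts["grounding-canary"] - loopCounts["swarm-coordination"];

const unknownShare = pct(loopCounts.unknown, TOTAL_FINDINGS);
const canaryOfAll = pct(loopCounts["grounding-canary"], TOTAL_FINDINGS);
const canaryOfAttributed = pct(loopCounts["grounding-canary"], attributed);

console.log({ attributed, otherNamed, unknownShare, canaryOfAll, canaryOfAttributed });
```

Because the canary deliberately probes for confabulation, its share of attributed findings is worth keeping in view when reading per-loop comparisons.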
5. Implications for Trust Infrastructure
The high-severity concentration (92.3% of findings ≥0.8) suggests that the HonestyGuard calibration is sufficiently conservative: it is not flagging routine output variation. The findings it is catching are treated here as real confabulations: high-stakes claims that the evidence snapshot cannot support.
The 47.1% penalty_applied rate indicates the automated consequence loop is functioning: nearly half of detected confabulations reached the penalty stage within the same operational window. This is the mechanism that makes confabulation economically costly for agents rather than consequence-free.
The 24.0% open rate represents active queue work — findings that are detected but whose consequences are still being processed. This is expected for a live production system. The goal is not 0% open; the goal is that findings do not age indefinitely. Monitoring the open-to-resolved conversion rate over time is the correct health metric.
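One way to track this health metric is sketched below. It takes the conversion rate to be the fraction of findings that are no longer open, and reports the age of the oldest open finding as a guard against indefinite aging. Both definitions, the record fields (`status`, `detectedAt`), and the sample records are assumptions made for illustration; the real schema of confabulation_findings is not shown in this paper.

```javascript
// Hypothetical finding records; field names are assumptions, not the real schema.
function queueHealth(findings, nowMs) {
  const open = findings.filter((f) => f.status === "open");
  const closed = findings.length - open.length;
  const MS_PER_HOUR = 3.6e6;
  // Age of the oldest open finding, in hours (0 when the queue is empty).
  const oldestOpenHours =
    open.length === 0
      ? 0
      : Math.max(...open.map((f) => (nowMs - Date.parse(f.detectedAt)) / MS_PER_HOUR));
  return {
    conversionRate: findings.length === 0 ? 0 : closed / findings.length,
    oldestOpenHours,
  };
}

// Made-up records for illustration only.
const sample = [
  { status: "penalty_applied", detectedAt: "2026-05-17T00:00:00Z" },
  { status: "resolved", detectedAt: "2026-05-17T06:00:00Z" },
  { status: "open", detectedAt: "2026-05-17T12:00:00Z" },
  { status: "open", detectedAt: "2026-05-17T18:00:00Z" },
];
const health = queueHealth(sample, Date.parse("2026-05-18T00:00:00Z"));
console.log(health);
```

Tracking the oldest open age alongside the rate separates a queue that is merely busy from one where individual findings are stalling.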
6. Limits and Future Work
Instrumentation coverage. 58.5% of findings are loop-unattributed. As coverage improves, per-loop rates will become more informative.
Baseline drift. This is a 7-day snapshot. Rates will shift as new agent roles are added, as model providers update weights, and as the HonestyGuard calibration is refined. A 90-day rolling window would provide a more stable baseline.
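A rolling baseline of this kind could be computed as a trailing mean over daily finding counts. The sketch below is a minimal version, assuming one count per calendar day with no gaps. The function name and the sample counts are illustrative, and a short window is used so the result is easy to verify by hand.

```javascript
// Trailing mean over daily counts; returns one value per full window.
function rollingMean(dailyCounts, windowDays) {
  if (!Number.isInteger(windowDays) || windowDays < 1) {
    throw new RangeError("windowDays must be a positive integer");
  }
  const means = [];
  let sum = 0;
  for (let i = 0; i < dailyCounts.length; i++) {
    sum += dailyCounts[i];
    // Drop the day that just fell out of the window.
    if (i >= windowDays) sum -= dailyCounts[i - windowDays];
    // Emit a mean only once a full window is available.
    if (i >= windowDays - 1) means.push(sum / windowDays);
  }
  return means;
}

// Made-up daily counts for illustration only.
const means = rollingMean([10, 20, 30, 40], 2);
console.log(means);
```

For the 90-day case the call would be `rollingMean(counts, 90)`, which returns an empty array until 90 days of data exist.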
False negative rate. The data presented here is from detected confabulations. The detector has a precision characteristic (how many flagged findings are true positives) but also a recall characteristic (what fraction of true confabulations it catches). This paper does not estimate the recall rate; a controlled red-team study would be required.
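For reference, both quantities reduce to simple ratios once a labeled sample exists. The sketch below shows how a controlled red-team run could be scored. The counts in the example are invented purely to exercise the function and say nothing about HonestyGuard's actual precision or recall.

```javascript
// precision = TP / (TP + FP): share of flagged findings that are true confabulations.
// recall    = TP / (TP + FN): share of true confabulations that were flagged.
function precisionRecall({ truePositives, falsePositives, falseNegatives }) {
  const flagged = truePositives + falsePositives;
  const actual = truePositives + falseNegatives;
  return {
    precision: flagged === 0 ? null : truePositives / flagged,
    recall: actual === 0 ? null : truePositives / actual,
  };
}

// Hypothetical run: 100 planted confabulations, 90 flags raised, 80 of them correct.
const score = precisionRecall({ truePositives: 80, falsePositives: 10, falseNegatives: 20 });
console.log(score);
```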
Cross-platform comparison. No comparable dataset is available from other production multi-agent systems. Establishing industry-wide baselines will require coordinated instrumentation across multiple deployments.
Replication
All numbers in this paper are reproducible by running:
node scripts/research-experiments/confabulation-rates-production-2026.mjs

The script queries the confabulation_findings table and writes the raw output to apps/web/content/research/data/confabulation-rates-production-2026.json. The numbers above are drawn directly from that file. Apart from display rounding of percentages and means, no values have been interpolated or derived from a model; they reflect direct aggregate queries against the production database.