Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-05-18-confabulation-rates-production-baseline. The paper is publicly available and citable.

2,698 Confabulation Findings: The First Empirical Baseline for Production AI Agent Honesty

Q: What is the paper "2,698 Confabulation Findings: The First Empirical Baseline for Production AI Agent Honesty" about?

We report the first empirical baseline for confabulation rates in a production multi-agent system. Across 2,698 findings from 30 distinct agent roles over a 7-day window, we measure a mean severity of 0.891 on a [0,1] scale, with 92.3% of all findings classified as high-severity (≥0.8). The penalty_applied resolution status accounts for 47.1% of findings, indicating that automated consequence mechanisms are active and process nearly half of detections within the same operational period. The operator role generates the highest absolute volume (556 findings, 20.6% of total), while the sales and research-director roles show the highest mean severity among the high-volume roles (0.962 and 0.938 respectively). We present these numbers as a baseline: no prior published work reports confabulation rates at the individual tool-call level for production agent deployments. All data is reproducible from the committed measurement producer.

AI agents confabulate. This is widely known. What is not known — because no one has published it — is how often they confabulate in production, at what severity, for which roles, and under which operational conditions. This paper provides that baseline.

We operate a multi-agent administrative swarm consisting of 60+ distinct agent roles executing across 310 behavioral loops. The swarm handles customer management, content generation, adversarial testing, trading operations, sales outreach, platform governance, and infrastructure management. All agent actions route through a confabulation detection layer (HonestyGuard) that intercepts tool calls, evaluates claims against the evidence snapshot, and records findings with severity scores and resolution outcomes. The data reported here is drawn from that production system over a 7-day window.

1. Measurement Design

The HonestyGuard plugin operates at the pre-execution hook level: before any agent tool call is committed to the database, the call's claims are evaluated against the agent's evidence snapshot. A claim is flagged when the detector identifies a statement that is either (a) numeric and unsupported by the evidence snapshot, (b) an action claim ("I sent the email", "I deployed the fix") without a corresponding prior operation in the same execution context, or (c) a status assertion ("the customer is satisfied", "the task is complete") that the agent's observable state does not support.
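The three flagging conditions can be sketched as a small rule check. This is an illustrative sketch only, not the HonestyGuard implementation: the `Claim` and `EvidenceSnapshot` types, their fields, and the `flag_reason` function are hypothetical names, and extracting claims from tool-call text is assumed to have happened upstream.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvidenceSnapshot:
    # Hypothetical shape of what the agent actually observed or did.
    numbers: set = field(default_factory=set)      # numeric values present in the evidence
    operations: set = field(default_factory=set)   # operations already executed in this context
    state: dict = field(default_factory=dict)      # observable status fields


@dataclass
class Claim:
    kind: str             # "numeric", "action", or "status"
    key: str              # operation name or status field the claim refers to
    value: object = None  # claimed number or claimed status value


def flag_reason(claim: Claim, evidence: EvidenceSnapshot) -> Optional[str]:
    """Return why the claim is flagged, or None if the evidence supports it."""
    if claim.kind == "numeric" and claim.value not in evidence.numbers:
        return "numeric claim unsupported by evidence snapshot"
    if claim.kind == "action" and claim.key not in evidence.operations:
        return "action claim without a prior operation in this context"
    if claim.kind == "status" and evidence.state.get(claim.key) != claim.value:
        return "status assertion not supported by observable state"
    return None


# Synthetic example: the agent drafted an email but never sent it.
evidence = EvidenceSnapshot(
    numbers={12},
    operations={"draft_email"},
    state={"task": "in_progress"},
)
print(flag_reason(Claim("action", "send_email"), evidence))
print(flag_reason(Claim("status", "task", "in_progress"), evidence))
```

In this sketch the "I sent the email" claim is flagged because no `send_email` operation exists in the snapshot, while the status claim passes because it matches the observable state.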

Findings are assigned a severity score on [0,1] based on the potential impact of the confabulated claim if acted upon. A confabulated "email sent" for a routine notification carries lower severity than a confabulated "deal closed at $50,000" or "regulatory audit passed."

Resolution status records the downstream handling: penalty_applied indicates that the automated consequence mechanism fired and reduced the agent's score or escrow; resolved indicates human or automated review confirmed and closed the finding; open indicates the finding remains in the active queue.

The confabulation_findings table has been active since 2026-05-11. This paper analyzes all 2,698 findings recorded through 2026-05-18.

2. Aggregate Results

Volume. 2,698 findings from 30 distinct agent roles across 2 organizations in 7 days. This corresponds to an average of approximately 385 findings per day across the full system.

Severity distribution. Mean severity across all findings: 0.891. Severity tiers:

High (≥0.8): 92.3% of all findings
Medium (≥0.5 and <0.8): 7.7%
Low (<0.5): 0.0%
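The tier boundaries can be written as a small classification function. This is a sketch under one stated assumption: a score of exactly 0.8 belongs to the high tier, as the "≥0.8" definition implies, so the medium tier is treated as half-open. The function name `severity_tier` is illustrative.

```python
def severity_tier(severity: float) -> str:
    """Map a severity score in [0, 1] to a tier using half-open intervals."""
    if not 0.0 <= severity <= 1.0:
        raise ValueError(f"severity out of range: {severity}")
    if severity >= 0.8:
        return "high"
    if severity >= 0.5:
        return "medium"
    return "low"


for score in (0.95, 0.8, 0.79, 0.5, 0.2):
    print(score, severity_tier(score))
```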

The absence of low-severity findings is significant: the detector's threshold is calibrated above the noise floor of typical LLM output variation. Every finding that makes it into the table represents a meaningfully anomalous claim.

Resolution status. Of 2,698 findings:

penalty_applied: 1,272 (47.1%) — automated consequence mechanism fired
resolved: 778 (28.8%) — reviewed and closed
open: 648 (24.0%) — awaiting resolution

At 47.1%, the penalty_applied rate shows that the automated pipeline processed nearly half of all detections within the observation window; together with the resolved findings, about 76% of detections had been handled by the time of measurement. The 24.0% open rate represents the active queue at time of measurement, not a backlog.
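The resolution shares follow directly from the three reported counts. A minimal recomputation, using only the numbers in this section (variable names are illustrative):

```python
# Counts as reported in the resolution status breakdown.
counts = {"penalty_applied": 1272, "resolved": 778, "open": 648}
total = sum(counts.values())

# Percentage share of each status, rounded to one decimal place.
shares = {status: round(100 * n / total, 1) for status, n in counts.items()}

# Findings that have left the queue, whether by penalty or by review.
handled = counts["penalty_applied"] + counts["resolved"]
handled_share = round(100 * handled / total, 1)

print(total)
print(shares)
print(handled_share)
```

The penalty_applied share comes out at 47.1%, just under half of all findings; penalty_applied and resolved together account for 76.0%.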

3. Role-Level Analysis

The top 10 roles by finding volume account for 89.5% of all findings. The five highest-volume roles are:

Role	Findings	Mean Severity	High (≥0.8)	Med (0.5–0.8)
operator	556	0.848	548	8
sales	333	0.962	300	33
governor	275	0.821	133	142
research-director	221	0.938	210	11
dom	219	0.897	216	3
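The listed rows allow two quick derived figures: each role's share of the 2,698 findings, and the fraction of its findings that fall in the high tier. A minimal sketch using only the Findings and High columns of the table (variable names are illustrative):

```python
TOTAL_FINDINGS = 2698

# (role, findings, high-severity findings), copied from the table.
rows = [
    ("operator", 556, 548),
    ("sales", 333, 300),
    ("governor", 275, 133),
    ("research-director", 221, 210),
    ("dom", 219, 216),
]

# Each role's share of all findings, in percent.
share_of_total = {role: round(100 * n / TOTAL_FINDINGS, 1) for role, n, _ in rows}

# Fraction of each role's findings that are high-severity, in percent.
high_share = {role: round(100 * high / n, 1) for role, n, high in rows}

# Combined share of the listed roles.
listed_share = round(100 * sum(n for _, n, _ in rows) / TOTAL_FINDINGS, 1)

for role, n, _ in rows:
    print(role, n, share_of_total[role], high_share[role])
print(listed_share)
```

Among the listed roles, governor is the only one where fewer than half of findings are high-severity (48.4%); the five listed roles together account for 59.5% of all findings.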

Three patterns are visible:

High-volume, moderate-severity roles (operator, governor) generate medium-severity findings alongside high-severity ones, most visibly governor, where 142 of 275 findings fall in the medium tier. These roles make many operational claims across many loops; the detection surface is large and some claims land in the moderate range.

Lower-volume, near-maximum-severity roles (mia, autoresearch, pm, redteam) show severity approaching 0.95 with near-zero medium findings. These roles either confabulate rarely and seriously, or their operational claims are inherently high-stakes when they go wrong.

Sales and research-director combine high volume with high severity — roles where confabulated outputs are most likely to have external consequences (a confabulated revenue forecast or a fabricated research finding).

4. Loop-Level Analysis

The loop that contributes the most findings is labeled unknown (1,578 findings, 58.5% of total), indicating findings from heartbeat contexts where the loop name was not captured in the execution metadata. This is an instrumentation gap, not a structural one — loop attribution is written as metadata and some execution paths predate the field.

Among named loops, the grounding-canary loop (712 findings) generates the most findings, consistent with its function: it specifically probes for confabulation by testing agent claims against a verified ground-truth fixture. These findings are expected by design — the canary exists to measure the rate.

The swarm-coordination loop (142 findings) generates findings at lower severity (mean 0.700), consistent with the coordinator role making more modest and bounded operational claims than domain specialists.

5. Implications for Trust Infrastructure

The high-severity concentration (92.3% of findings ≥0.8) suggests that the HonestyGuard calibration is sufficiently conservative: it is not flagging routine output variation. The findings it is catching are real confabulations — high-confidence claims that the evidence snapshot cannot support.

The 47.1% penalty_applied rate indicates the automated consequence loop is functioning: nearly half of detected confabulations reach the penalty stage within the same operational window. This is the mechanism that makes confabulation economically costly for agents rather than consequence-free.

The 24.0% open rate represents active queue work — findings that are detected but whose consequences are still being processed. This is expected for a live production system. The goal is not 0% open; the goal is that findings do not age indefinitely. Monitoring the open-to-resolved conversion rate over time is the correct health metric.
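One way to make the open-to-resolved conversion rate concrete is sketched here. The paper does not define the metric formally, so the definition used is an assumption: the fraction of findings open at the start of a window that were closed by its end. The `Finding` record, its fields, and both function names are hypothetical, and the data is synthetic.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    detected_at: datetime
    closed_at: Optional[datetime]  # None while the finding is still open


def conversion_rate(findings, start, end):
    """Fraction of findings open at `start` that were closed by `end`."""
    open_at_start = [
        f for f in findings
        if f.detected_at <= start and (f.closed_at is None or f.closed_at > start)
    ]
    if not open_at_start:
        return None  # nothing was open, so the rate is undefined
    closed = sum(
        1 for f in open_at_start
        if f.closed_at is not None and f.closed_at <= end
    )
    return closed / len(open_at_start)


def oldest_open_age(findings, now):
    """Age of the oldest still-open finding, or None if the queue is empty."""
    ages = [now - f.detected_at for f in findings if f.closed_at is None]
    return max(ages, default=None)


day = timedelta(days=1)
t0 = datetime(2026, 1, 1)  # arbitrary synthetic timestamp
findings = [
    Finding(t0 - 2 * day, t0 + 1 * day),  # open at start, closed inside the window
    Finding(t0 - 1 * day, None),          # open at start, still open
    Finding(t0 - 3 * day, t0 - 1 * day),  # closed before the window began
    Finding(t0 + 1 * day, t0 + 2 * day),  # detected after the window began
]

print(conversion_rate(findings, t0, t0 + 7 * day))
print(oldest_open_age(findings, t0 + 7 * day))
```

Tracking the age of the oldest open finding alongside the conversion rate addresses the stated goal directly: a healthy queue can be non-empty as long as nothing in it ages indefinitely.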

6. Limits and Future Work

Instrumentation coverage. 58.5% of findings are loop-unattributed. As coverage improves, per-loop rates will become more informative.

Baseline drift. This is a 7-day snapshot. Rates will shift as new agent roles are added, as model providers update weights, and as the HonestyGuard calibration is refined. A 90-day rolling window would provide a more stable baseline.

False negative rate. The data presented here is from detected confabulations. The detector has a precision characteristic (how many flagged findings are true positives) but also a recall characteristic (what fraction of true confabulations it catches). This paper does not estimate the recall rate; a controlled red-team study would be required.
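For reference, the two detector characteristics mentioned here are the standard precision and recall ratios. The sketch below uses synthetic counts purely for illustration; none of these numbers come from the production data, and recall in particular cannot be computed from detected findings alone because it needs the count of missed confabulations.

```python
def precision(true_positives: int, false_positives: int):
    """Fraction of flagged findings that are genuine confabulations."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else None


def recall(true_positives: int, false_negatives: int):
    """Fraction of genuine confabulations that the detector flagged."""
    actual = true_positives + false_negatives
    return true_positives / actual if actual else None


# Synthetic counts, as a controlled red-team study might produce.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp))
print(recall(tp, fn))
```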

Cross-platform comparison. No comparable dataset is available from other production multi-agent systems. Establishing industry-wide baselines will require coordinated instrumentation across multiple deployments.

Replication

All numbers in this paper are reproducible from the published measurement artifact named in the claims registry. That artifact is the reproducibility anchor: reviewers can recompute the aggregates from it without public exposure of internal runner paths.

The measurement script queries the confabulation_findings table and writes the raw output to the published measurement artifact. The numbers above are drawn directly from that file. Apart from rounding for display, no values have been interpolated or derived from a model; they reflect direct aggregate queries against the production database.
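The kind of aggregate query involved can be sketched against an in-memory SQLite database. Only the table name confabulation_findings comes from this paper; the column names, the use of SQLite, and the three synthetic rows are assumptions made for illustration, and the actual measurement script is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: only the table name is taken from the paper.
conn.execute(
    "CREATE TABLE confabulation_findings ("
    "role TEXT, loop_name TEXT, severity REAL, resolution_status TEXT)"
)

# Synthetic rows, not production data.
synthetic_rows = [
    ("operator", "unknown", 1.0, "penalty_applied"),
    ("operator", "grounding-canary", 0.75, "open"),
    ("sales", "unknown", 1.0, "resolved"),
]
conn.executemany(
    "INSERT INTO confabulation_findings VALUES (?, ?, ?, ?)", synthetic_rows
)

# Per-role volume, mean severity, and tier counts, as in the role-level table.
ROLE_QUERY = """
SELECT role,
       COUNT(*) AS findings,
       ROUND(AVG(severity), 3) AS mean_severity,
       SUM(CASE WHEN severity >= 0.8 THEN 1 ELSE 0 END) AS high,
       SUM(CASE WHEN severity >= 0.5 AND severity < 0.8 THEN 1 ELSE 0 END) AS medium
FROM confabulation_findings
GROUP BY role
ORDER BY findings DESC, role
"""
per_role = conn.execute(ROLE_QUERY).fetchall()

for row in per_role:
    print(row)
```

Each output row has the same shape as a row of the role-level table: role, finding count, mean severity, high-tier count, and medium-tier count.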