Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-06-12-proof-debt-ledger-agent-research. The paper is publicly available and citable.

Proof Debt Is the New Technical Debt: A Ledger for Agent Research Claims

Q: What is the paper "Proof Debt Is the New Technical Debt: A Ledger for Agent Research Claims" about?

Agent systems increasingly ship traces, evals, receipts, and research claims faster than their evidence can be re-checked. We introduce proof debt: the state where a claim still has a source, but the source, producer, or scope has changed enough that repeating the claim would overstate what is currently proven. We ran a deterministic ledger over Armalo's public research claims registry, public research papers, and future Labs preregistration agenda. The ledger audited 282 claim units. It found zero missing-source errors and zero unregistered post-effective-date research papers, but 20 refresh-required code-reference claims, for a stale_overclaim_rate of 0.0709. The reusable artifact is a four-bucket proof-debt scorecard: registry integrity, freshness integrity, public paper coverage, and future Labs materialization. The result suggests that claim registries catch fabricated or missing evidence, but fast-moving agent systems also need refresh-before-repeat gates for code-backed claims.

Agent research does not usually fail because no evidence ever existed. The quieter failure is that evidence gets old while the claim keeps traveling. A benchmark result, code-reference claim, model-routing statement, or production measurement may have been true at publication time and still become risky when the underlying source changes. We call that gap proof debt.

Proof debt is not ordinary technical debt. Technical debt lives in the implementation. Proof debt lives in the claim layer: the paper, dashboard, benchmark card, investor memo, sales deck, trust score, or agent receipt that repeats a result after its proof boundary has moved.

This matters because the agent ecosystem is becoming evidence-heavy. OpenAI's Agents SDK emphasizes traces, evaluations, tools, and human review as part of agent workflow improvement ([OpenAI Agents SDK](https://developers.openai.com/api/docs/guides/agents), retrieved 2026-06-12). Anthropic's Model Context Protocol makes tool and data connections portable across agent systems ([Anthropic MCP](https://www.anthropic.com/news/model-context-protocol), retrieved 2026-06-12). SWE-bench Verified demonstrates how agent claims become more useful when tasks, scoring, and review quality are explicit ([SWE-bench Verified](https://www.swebench.com/verified.html), retrieved 2026-06-12). NIST's AI Risk Management Framework frames trustworthy AI around validity, reliability, accountability, transparency, and governance ([NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework), retrieved 2026-06-12).

Those trends all point in the same direction: agent systems need claims that carry freshness, source authority, and repeatability, not just impressive numbers.

Method

We ran scripts/research-experiments/proof-debt-ledger.mjs, a deterministic scanner over committed public Armalo research artifacts. The unit of analysis is a public research claim or a post-effective-date public research paper.

The script reads:

apps/web/content/research/claims-registry.json
apps/web/content/research/*.md
docs/research/2026-06-12-agentic-ai-future-labs-preregistrations.md

The script writes:

apps/web/content/research/data/proof-debt-ledger.json

The registry already requires every empirical claim to carry one of four provenance kinds: measurement, code-ref, derivation, or projection. The proof-debt ledger adds freshness checks and a refresh-before-repeat requirement on top of that provenance model.

The primary metric is:

stale_overclaim_rate = open proof-debt items / audited claim units
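As a worked instance of the metric, the following minimal sketch plugs in the counts reported in this paper (20 open items over 282 audited units). The function name `staleOverclaimRate` is illustrative and is not taken from proof-debt-ledger.mjs.

```javascript
// Hypothetical helper: illustrates the ratio, not the ledger's actual code.
function staleOverclaimRate(openProofDebtItems, auditedClaimUnits) {
  if (auditedClaimUnits <= 0) {
    throw new RangeError("auditedClaimUnits must be positive");
  }
  return openProofDebtItems / auditedClaimUnits;
}

// Counts reported in this paper: 20 open items, 282 audited claim units.
const rate = staleOverclaimRate(20, 282);
console.log(rate.toFixed(4)); // 0.0709
```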

The ledger marks an item as open proof debt when one of these conditions holds:

Condition	Meaning	Action
Missing source	A claim lacks a source pointer or the referenced source is absent	Restore evidence or remove the claim
Missing producer	A measurement claim lacks a producer script	Add producer path or downgrade the claim
Producer changed after data	The script changed after the measurement data was produced	Re-run the producer to regenerate the data before repeating
Code changed after paper	A code-backed claim references source that changed after publication	Re-check the current source before repeating
Unregistered paper	A post-effective-date research paper is absent from the registry	Register claims or document no empirical claims
Ambiguous projection	A projection is not clearly labeled as estimated, illustrative, conditional, or unmeasured	Relabel or rewrite
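The claim-level conditions in the table could be checked roughly as follows. This is a sketch under stated assumptions: the claim fields (`sourcePath`, `sourceExists`, `producerPath`, the `...At` dates, `label`) and every condition name except `code_changed_after_paper` are hypothetical, and the paper-level "Unregistered paper" condition is left out because it applies to papers rather than claims.

```javascript
// Hypothetical claim shape; field names are illustrative, not the registry schema.
const PROJECTION_LABELS = ["estimated", "illustrative", "conditional", "unmeasured"];

// Compares two ISO-8601 date strings. Missing dates count as "no change detected".
function isAfter(a, b) {
  if (!a || !b) return false;
  return Date.parse(a) > Date.parse(b);
}

// Returns the list of open proof-debt conditions for one claim.
function openConditions(claim) {
  const conditions = [];
  if (!claim.sourcePath || !claim.sourceExists) conditions.push("missing_source");
  if (claim.provenance === "measurement") {
    if (!claim.producerPath) {
      conditions.push("missing_producer");
    } else if (isAfter(claim.producerChangedAt, claim.dataProducedAt)) {
      conditions.push("producer_changed_after_data");
    }
  }
  if (claim.provenance === "code-ref" && isAfter(claim.sourceChangedAt, claim.paperPublishedAt)) {
    conditions.push("code_changed_after_paper");
  }
  if (claim.provenance === "projection" && !PROJECTION_LABELS.includes(claim.label)) {
    conditions.push("ambiguous_projection");
  }
  return conditions;
}

const example = {
  provenance: "code-ref",
  sourcePath: "src/example.ts", // hypothetical path
  sourceExists: true,
  sourceChangedAt: "2026-06-10",
  paperPublishedAt: "2026-05-01",
};
console.log(openConditions(example)); // [ 'code_changed_after_paper' ]
```

An empty result means the claim carries no open proof debt under these checks. A non-empty result does not mean the claim is false.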

This is a conservative method. It does not claim that a code-backed claim is false when the source file changes. It says the claim has proof debt: repeating it now requires a refresh.

Evidence

The run produced this scorecard:

Bucket	Result	Evidence
Registry integrity	pass	0 registered claims have missing required evidence
Freshness integrity	refresh_required	20 registered claims need source or producer refresh before repetition
Public paper coverage	pass	0 post-effective-date papers are absent from the claims registry
Future Labs materialization	pending_evidence	1 of 6 future experiment targets has both producer and evidence artifacts

The headline numbers from apps/web/content/research/data/proof-debt-ledger.json are:

Metric	Value
Audited claim units	282
Registered claims	282
Registered claims current	255
Open proof-debt items	20
Missing-source or missing-producer errors	0
Refresh-required code-reference claims	20
Post-effective-date unregistered papers	0
Future experiment targets materialized	1 of 6
stale_overclaim_rate	0.0709

The measured failure mode is narrow and useful. The registry is doing its job: no missing evidence, no missing producers, no unregistered new public research papers. The open debt is freshness debt on code references. Claims like scoring dimension counts, rate limits, verifier gates, and source-derived constants may be valid, but the referenced code files changed after the paper dates. Those claims should not be repeated without a refresh pass.

Interpretation

The surprising result is not that Armalo has 20 proof-debt items. The surprising result is that all 20 are the same class: code_changed_after_paper.

That tells us the current integrity system catches the first generation of research failure: fabricated numbers, missing data files, missing producer scripts, and unregistered papers. It does not yet catch the second generation: correct-at-publication claims that need revalidation after source changes.

This is exactly the kind of failure mode agent systems will hit as they become more autonomous. Agents will create receipts, papers, eval reports, benchmark cards, launch notes, trust-score explanations, and audit summaries. If those artifacts cite live code, runtime gates, model routers, tool policies, or scores, the proof boundary changes whenever those sources change.

The fix is not to stop publishing. The fix is to make every repeated claim pass a small ledger check:

1. Does the source still exist?
2. Did the producer change after the evidence artifact?
3. Did the code source change after the paper?
4. Is the claim dated, refreshed, downgraded, or removed?
5. If it is a projection, is it visibly labeled as a projection?
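One possible reading of this checklist as a gate is sketched below. The `answers` object and its field names are hypothetical. The treatment of question 4 is an assumption: drift in the producer or code source is considered resolved once the claim has been dated, refreshed, downgraded, or removed.

```javascript
// Hypothetical shape: one field per checklist question. None of these names
// come from proof-debt-ledger.mjs.
const RESOLVED = ["dated", "refreshed", "downgraded", "removed"];

function passesLedgerCheck(answers) {
  const blockers = [];
  // Q1: does the source still exist?
  if (!answers.sourceExists) blockers.push("source_missing");
  // Q2 and Q3: did the producer or the code source drift?
  const drifted = answers.producerChangedAfterEvidence || answers.codeChangedAfterPaper;
  // Q4 (assumed reading): drift is acceptable only once the claim has been
  // dated, refreshed, downgraded, or removed.
  if (drifted && !RESOLVED.includes(answers.disposition)) blockers.push("refresh_required");
  // Q5: projections must be visibly labeled.
  if (answers.isProjection && !answers.projectionLabeled) blockers.push("unlabeled_projection");
  return { passes: blockers.length === 0, blockers };
}

const verdict = passesLedgerCheck({
  sourceExists: true,
  producerChangedAfterEvidence: false,
  codeChangedAfterPaper: true,
  disposition: null,
  isProjection: false,
  projectionLabeled: false,
});
console.log(verdict); // { passes: false, blockers: [ 'refresh_required' ] }
```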

Reusable Framework

The reusable object is a proof-debt scorecard.

Scorecard bucket	Question	Promotion rule
Registry integrity	Do all empirical claims have a valid provenance and source?	Missing evidence blocks publication
Freshness integrity	Did evidence producers or code sources change after the claim?	Changed sources require refresh before repetition
Public paper coverage	Are all post-effective-date papers in the claim registry?	Unregistered papers cannot be promoted
Future Labs materialization	Do proposed experiments have producer and evidence artifacts?	Forecast-only claims stay forecast-only until materialized

The scorecard is intentionally small. It can attach to a research paper, benchmark card, product proof page, agent receipt, or governance review. It does not decide whether the claim is strategically important. It decides whether the proof is current enough to repeat.
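A scorecard of this shape could be derived from a handful of counts, as in the sketch below. The input field names and the `fail` status are assumptions made for illustration. The statuses `pass`, `refresh_required`, and `pending_evidence` are the ones reported in the Evidence section, and the example input uses the counts from this run.

```javascript
// Hypothetical builder: maps ledger counts to the four scorecard buckets.
// The "fail" status is an assumption; the paper does not name a failing status.
function buildScorecard(counts) {
  return {
    registryIntegrity: counts.missingEvidence === 0 ? "pass" : "fail",
    freshnessIntegrity: counts.refreshRequired === 0 ? "pass" : "refresh_required",
    publicPaperCoverage: counts.unregisteredPapers === 0 ? "pass" : "fail",
    futureLabsMaterialization:
      counts.targetsMaterialized === counts.targetsTotal ? "pass" : "pending_evidence",
  };
}

// Counts reported for this run.
const scorecard = buildScorecard({
  missingEvidence: 0,
  refreshRequired: 20,
  unregisteredPapers: 0,
  targetsMaterialized: 1,
  targetsTotal: 6,
});
console.log(scorecard);
```

Keeping the builder a pure function of counts means the same scorecard can be recomputed wherever the ledger JSON is available.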

Boundary And Falsification

The ledger reads only committed public files and git metadata. It does not query private customer data, raw prompts, tenant records, credentials, or production database rows. The public boundary is therefore clear: everything the ledger emits (aggregate counts, public file paths, claim phrases, provenance classes, and refresh actions) is publishable.

The main limitation is that file-level freshness is a conservative proxy. A file can change without invalidating the specific line or constant a paper cited. The ledger therefore marks such claims as refresh-required, not false. The right follow-up is a line-aware verifier that checks whether the cited symbol, constant, or route behavior changed, not merely whether the file changed.
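A line-aware check for the simplest case, a cited constant, might look like the sketch below. It is not part of the current ledger. The function names, the status strings, and the constant `RATE_LIMIT_PER_MINUTE` are hypothetical, and the regular expression only handles plain `const`, `let`, or `var` declarations without type annotations.

```javascript
// Escapes a string for literal use inside a RegExp.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Reports whether a cited constant still has the cited value in the current source.
function citedConstantStatus(sourceText, symbol, citedValue) {
  // Matches declarations such as `export const NAME = 60;`
  const pattern = new RegExp(
    `\\b(?:const|let|var)\\s+${escapeRegExp(symbol)}\\s*=\\s*([^;\\n]+)`
  );
  const match = pattern.exec(sourceText);
  if (!match) return "symbol_missing";
  return match[1].trim() === String(citedValue) ? "still_matches" : "value_changed";
}

// The file changed (a comment was added) but the cited constant did not.
const current = "// refactored imports\nexport const RATE_LIMIT_PER_MINUTE = 60;\n";
console.log(citedConstantStatus(current, "RATE_LIMIT_PER_MINUTE", 60)); // still_matches

// The cited constant itself changed.
const changed = "export const RATE_LIMIT_PER_MINUTE = 120;\n";
console.log(citedConstantStatus(changed, "RATE_LIMIT_PER_MINUTE", 60)); // value_changed
```

A check like this would let a file-level flag be cleared when only unrelated lines moved, while still flagging real changes to the cited value.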

The paper's central claim, that code-backed claims need refresh-before-repeat gates, would be falsified if a line-aware verifier shows that the 20 flagged code-reference claims still match current source exactly, or if a future run finds that such gates do not reduce repeated stale claims. The policy simulation in the data file says the unresolved stale_overclaim_rate would fall to 0 if all ledger actions were applied before repetition. That is a deterministic simulation, not a measured later-cycle outcome.

Replication

Run:

node scripts/research-experiments/proof-debt-ledger.mjs

This writes:

apps/web/content/research/data/proof-debt-ledger.json

Then run:

pnpm research:audit
pnpm exec tsx -e '
import fs from "node:fs";
import { auditResearchArtifactQuality } from "./packages/db/src/research-artifact-quality-guard";
const content = fs.readFileSync("apps/web/content/research/2026-06-12-proof-debt-ledger-agent-research.md", "utf8");
const paper = {
  slug: "2026-06-12-proof-debt-ledger-agent-research",
  title: "Proof Debt Is the New Technical Debt: A Ledger for Agent Research Claims",
  abstract: "Deterministic proof-debt ledger over public Armalo research claims.",
  content,
};
const experiment = {
  slug: "proof-debt-ledger",
  paperSlug: paper.slug,
  methodology: "Measure, compare, and evaluate public research claim provenance, source freshness, registry coverage, and future experiment materialization on committed files with a deterministic ledger and reproducible JSON output.",
  config: {
    primaryMetric: "stale_overclaim_rate",
    promotionGate: "Promote only when a later cycle shows at least 30 percent lower stale_overclaim_rate without reducing true-positive claim coverage and with all evidence artifacts present.",
    evidenceArtifact: "apps/web/content/research/data/proof-debt-ledger.json",
    publicBoundary: "Public boundary excludes private customer data, secrets, raw prompts, proprietary internal payloads, tenant identifiers, and unsafe operational details while publishing aggregate methods and outcomes.",
  },
};
const report = auditResearchArtifactQuality([paper], [experiment], { minimumPaperWords: 650, minimumExternalSources: 2 });
console.log(JSON.stringify(report, null, 2));
if (!report.passed) process.exit(1);
'

The measurement script and data file are the source of the quantitative claims in this paper. The claims are registered in apps/web/content/research/claims-registry.json.

Conclusion

Proof debt is now measurable. In this run, the first-order research integrity gates worked: no missing sources, no missing producers, and no unregistered new research papers. The remaining gap is freshness. Agent systems that publish claims against live code and runtime policies need a refresh-before-repeat ledger, because source drift turns yesterday's valid evidence into tomorrow's overclaim.