Agent research does not usually fail because no evidence ever existed. The quieter failure is that evidence gets old while the claim keeps traveling. A benchmark result, code-reference claim, model-routing statement, or production measurement may have been true at publication time and still become risky when the underlying source changes. We call that gap proof debt.
Proof debt is not ordinary technical debt. Technical debt lives in the implementation. Proof debt lives in the claim layer: the paper, dashboard, benchmark card, investor memo, sales deck, trust score, or agent receipt that repeats a result after its proof boundary has moved.
This matters because the agent ecosystem is becoming evidence-heavy. OpenAI's Agents SDK emphasizes traces, evaluations, tools, and human review as part of agent workflow improvement ([OpenAI Agents SDK](https://developers.openai.com/api/docs/guides/agents), retrieved 2026-06-12). Anthropic's Model Context Protocol makes tool and data connections portable across agent systems ([Anthropic MCP](https://www.anthropic.com/news/model-context-protocol), retrieved 2026-06-12). SWE-bench Verified demonstrates how agent claims become more useful when tasks, scoring, and review quality are explicit ([SWE-bench Verified](https://www.swebench.com/verified.html), retrieved 2026-06-12). NIST's AI Risk Management Framework frames trustworthy AI around validity, reliability, accountability, transparency, and governance ([NIST AI RMF](https://www.nist.gov/itl/ai-risk-management-framework), retrieved 2026-06-12).
Those trends all point in the same direction: agent systems need claims that carry freshness, source authority, and repeatability, not just impressive numbers.
Method
We ran scripts/research-experiments/proof-debt-ledger.mjs, a deterministic scanner over committed public Armalo research artifacts. The unit of analysis is a public research claim or a post-effective-date public research paper.
The script reads:
- apps/web/content/research/claims-registry.json
- apps/web/content/research/*.md
- docs/research/2026-06-12-agentic-ai-future-labs-preregistrations.md
The script writes:
apps/web/content/research/data/proof-debt-ledger.json
The registry already requires every empirical claim to carry one of four provenance kinds: measurement, code-ref, derivation, or projection. The proof-debt ledger adds freshness and repetition pressure on top of that provenance model.
The primary metric is:
stale_overclaim_rate = open proof-debt items / audited claim units
The ledger marks an item as open proof debt when one of these conditions holds:
| Condition | Meaning | Action |
|---|---|---|
| Missing source | A claim lacks a source pointer or the referenced source is absent | Restore evidence or remove the claim |
| Missing producer | A measurement claim lacks a producer script | Add producer path or downgrade the claim |
| Producer changed after data | The script changed after the measurement data was produced | Re-run the data before repeating |
| Code changed after paper | A code-backed claim references source that changed after publication | Re-check the current source before repeating |
| Unregistered paper | A post-effective-date research paper is absent from the registry | Register claims or document no empirical claims |
| Ambiguous projection | A projection is not clearly labeled as estimated, illustrative, conditional, or unmeasured | Relabel or rewrite |
This is a conservative method. It does not claim that a code-backed claim is false when the source file changes. It says the claim has proof debt: repeating it now requires a refresh.
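The conditions in the table can be read as a small classifier over one claim record. The following is a minimal sketch under stated assumptions, not the logic of proof-debt-ledger.mjs: the claim shape, its field names, and every reason label other than `code_changed_after_paper` are hypothetical. The unregistered-paper condition is omitted because it applies to papers rather than to individual claims.

```javascript
// Hypothetical claim record; field names are illustrative, not the registry schema.
// Dates are ISO 8601 strings (YYYY-MM-DD), which order correctly under string comparison.
function classifyProofDebt(claim) {
  const reasons = [];

  // Missing source: no pointer, or the pointer resolves to nothing.
  if (!claim.source || claim.sourceExists === false) {
    reasons.push("missing_source");
  }

  if (claim.kind === "measurement") {
    if (!claim.producer) {
      // Missing producer: a measurement with no script behind it.
      reasons.push("missing_producer");
    } else if (claim.producerChangedAt > claim.dataProducedAt) {
      // Producer changed after data: the data predates the current script.
      reasons.push("producer_changed_after_data");
    }
  }

  // Code changed after paper: refresh required, not proof that the claim is false.
  if (claim.kind === "code-ref" && claim.sourceChangedAt > claim.paperDate) {
    reasons.push("code_changed_after_paper");
  }

  // Ambiguous projection: a projection without a visible label.
  if (claim.kind === "projection" && !claim.labeledAsProjection) {
    reasons.push("ambiguous_projection");
  }

  return reasons; // an empty array means no open proof debt for this claim
}

const exampleClaim = {
  kind: "code-ref",
  source: "path/to/cited-file.ts", // hypothetical path
  sourceExists: true,
  sourceChangedAt: "2026-06-10",
  paperDate: "2026-05-01",
};
console.log(classifyProofDebt(exampleClaim));
```

Returning a list of reasons rather than a boolean keeps the conservative reading intact: the caller sees why a refresh is required without the classifier asserting that the claim is wrong.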
Evidence
The run produced this scorecard:
| Bucket | Result | Evidence |
|---|---|---|
| Registry integrity | pass | 0 registered claims have missing required evidence |
| Freshness integrity | refresh_required | 20 registered claims need source or producer refresh before repetition |
| Public paper coverage | pass | 0 post-effective-date papers are absent from the claims registry |
| Future Labs materialization | pending_evidence | 1 of 6 future experiment targets has both producer and evidence artifacts |
The headline numbers from apps/web/content/research/data/proof-debt-ledger.json are:
| Metric | Value |
|---|---|
| Audited claim units | 282 |
| Registered claims | 282 |
| Registered claims current | 255 |
| Open proof-debt items | 20 |
| Missing-source or missing-producer errors | 0 |
| Refresh-required code-reference claims | 20 |
| Post-effective-date unregistered papers | 0 |
| Future experiment targets materialized | 1 of 6 |
| stale_overclaim_rate | 0.0709 |
The measured failure mode is narrow and useful. The registry is doing its job: no missing evidence, no missing producers, no unregistered new public research papers. The open debt is freshness debt on code references. Claims like scoring dimension counts, rate limits, verifier gates, and source-derived constants may be valid, but the referenced code files changed after the paper dates. Those claims should not be repeated without a refresh pass.
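The headline rate can be rechecked by hand from the two counts in the table. The snippet below is only a worked arithmetic check; the function name is an assumption and is not taken from the producer script.

```javascript
// Counts copied from the headline table above.
const auditedClaimUnits = 282;
const openProofDebtItems = 20;

// stale_overclaim_rate = open proof-debt items / audited claim units
function staleOverclaimRate(openItems, auditedUnits) {
  if (auditedUnits <= 0) {
    throw new RangeError("auditedUnits must be positive");
  }
  return openItems / auditedUnits;
}

const headlineRate = staleOverclaimRate(openProofDebtItems, auditedClaimUnits);
console.log(headlineRate.toFixed(4)); // prints "0.0709"
```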
Interpretation
The surprising result is not that Armalo has 20 proof-debt items. The surprising result is that all 20 are the same class: code_changed_after_paper.
That tells us the current integrity system catches the first generation of research failure: fabricated numbers, missing data files, missing producer scripts, and unregistered papers. It does not yet catch the second generation: correct-at-publication claims that need revalidation after source changes.
This is exactly the kind of failure mode agent systems will hit as they become more autonomous. Agents will create receipts, papers, eval reports, benchmark cards, launch notes, trust-score explanations, and audit summaries. If those artifacts cite live code, runtime gates, model routers, tool policies, or scores, the proof boundary changes whenever those sources change.
The fix is not to stop publishing. The fix is to make every repeated claim pass a small ledger check:
1. Does the source still exist?
2. Did the producer change after the evidence artifact?
3. Did the code source change after the paper?
4. Is the claim dated, refreshed, downgraded, or removed?
5. If it is a projection, is it visibly labeled as a projection?
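The five questions above can be wired into a gate that runs before a claim is repeated. This is a sketch under assumptions: the check record, its boolean fields, and the returned action names are all hypothetical, and the ordering (hard failures first) is a design choice rather than something the ledger prescribes.

```javascript
// Hypothetical check record: one answer per question in the list above.
// Returns the first action still owed, or "ok_to_repeat" when nothing is owed.
function repeatDecision(check) {
  if (!check.sourceExists) return "restore_or_remove";          // question 1
  if (check.producerChangedAfterEvidence) return "rerun_data";  // question 2
  if (check.codeChangedAfterPaper) return "recheck_source";     // question 3
  if (!check.dispositionRecorded) return "record_disposition";  // question 4
  if (check.isProjection && !check.labeledAsProjection) return "relabel"; // question 5
  return "ok_to_repeat";
}

console.log(
  repeatDecision({
    sourceExists: true,
    producerChangedAfterEvidence: false,
    codeChangedAfterPaper: true,
    dispositionRecorded: true,
    isProjection: false,
  })
); // prints "recheck_source"
```

Returning a single next action keeps the gate cheap to run on every repetition; a caller that wants the full list of issues could collect them instead of returning early.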
Reusable Framework
The reusable object is a proof-debt scorecard.
| Scorecard bucket | Question | Promotion rule |
|---|---|---|
| Registry integrity | Do all empirical claims have a valid provenance and source? | Missing evidence blocks publication |
| Freshness integrity | Did evidence producers or code sources change after the claim? | Changed sources require refresh before repetition |
| Public paper coverage | Are all post-effective-date papers in the claim registry? | Unregistered papers cannot be promoted |
| Future Labs materialization | Do proposed experiments have producer and evidence artifacts? | Forecast-only claims stay forecast-only until materialized |
The scorecard is intentionally small. It can attach to a research paper, benchmark card, product proof page, agent receipt, or governance review. It does not decide whether the claim is strategically important. It decides whether the proof is current enough to repeat.
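As a data structure, the scorecard is four status fields derived from a handful of counts. The sketch below assumes hypothetical field names and a hypothetical `blocked` status for the two buckets whose failure blocks publication or promotion; `pass`, `refresh_required`, and `pending_evidence` are the statuses reported in this run.

```javascript
// Hypothetical counts object; the real ledger JSON may use different keys.
function buildScorecard(counts) {
  return {
    registryIntegrity: counts.missingEvidence === 0 ? "pass" : "blocked",
    freshnessIntegrity: counts.refreshRequired === 0 ? "pass" : "refresh_required",
    publicPaperCoverage: counts.unregisteredPapers === 0 ? "pass" : "blocked",
    futureLabsMaterialization:
      counts.targetsMaterialized === counts.targetsTotal ? "pass" : "pending_evidence",
  };
}

// Counts taken from the evidence tables in this paper.
const runScorecard = buildScorecard({
  missingEvidence: 0,
  refreshRequired: 20,
  unregisteredPapers: 0,
  targetsMaterialized: 1,
  targetsTotal: 6,
});
console.log(runScorecard);
```

With this run's counts, the sketch reproduces the four statuses in the scorecard table: two passes, one refresh_required, and one pending_evidence.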
Boundary And Falsification
The ledger reads only committed public files and git metadata. It does not query private customer data, raw prompts, tenant records, credentials, or production database rows. The public boundary is therefore strong: aggregate counts, public file paths, claim phrases, provenance classes, and refresh actions are publishable.
The main limitation is that file-level freshness is a conservative proxy. A file can change without invalidating the specific line or constant a paper cited. The ledger therefore marks refresh-required, not false. The right follow-up is a line-aware verifier that checks whether the cited symbol, constant, or route behavior changed, not merely whether the file changed.
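One way to picture the line-aware follow-up is a check that compares only the lines mentioning the cited symbol, instead of the whole file. The sketch below is a deliberately crude textual approximation under assumptions: the function name and inputs are hypothetical, no such verifier exists in the run described here, and a real one would more likely parse the source than match substrings.

```javascript
// Returns true when every line that mentions `symbol` is identical
// (ignoring surrounding whitespace) in both versions of the file.
function citedSymbolUnchanged(textAtPublication, textNow, symbol) {
  const linesMentioning = (text) =>
    text
      .split("\n")
      .map((line) => line.trim())
      .filter((line) => line.includes(symbol));

  const before = linesMentioning(textAtPublication);
  const after = linesMentioning(textNow);

  // A symbol that was never present cannot be confirmed as unchanged.
  if (before.length === 0) return false;

  return (
    before.length === after.length &&
    before.every((line, i) => line === after[i])
  );
}

// Hypothetical file contents: the file changed, but the cited constant did not.
const published = "const MAX_RATE = 100;\nconst retries = 1;";
const current = "const MAX_RATE = 100;\nconst retries = 3;";
console.log(citedSymbolUnchanged(published, current, "MAX_RATE")); // prints true
```

A file-level check would flag this example as refresh-required, while the line-level check would clear it, which is the gap between the two proxies described above.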
The claim would be falsified if a line-aware verifier shows that the 20 flagged code-reference claims still match current source exactly, or if a future run finds that refresh-before-repeat gates do not reduce repeated stale claims. The policy simulation in the data file says unresolved stale_overclaim_rate would fall to 0 if all ledger actions were applied before repetition. That is a deterministic simulation, not a measured later-cycle outcome.
Replication
Run:
node scripts/research-experiments/proof-debt-ledger.mjs
This writes:
apps/web/content/research/data/proof-debt-ledger.json
Then run:
pnpm research:audit
pnpm exec tsx -e "import fs from 'node:fs'; import { auditResearchArtifactQuality } from './packages/db/src/research-artifact-quality-guard'; const content=fs.readFileSync('apps/web/content/research/2026-06-12-proof-debt-ledger-agent-research.md','utf8'); const paper={slug:'2026-06-12-proof-debt-ledger-agent-research',title:'Proof Debt Is the New Technical Debt: A Ledger for Agent Research Claims',abstract:'Deterministic proof-debt ledger over public Armalo research claims.',content}; const experiment={slug:'proof-debt-ledger',paperSlug:paper.slug,methodology:'Measure, compare, and evaluate public research claim provenance, source freshness, registry coverage, and future experiment materialization on committed files with a deterministic ledger and reproducible JSON output.',config:{primaryMetric:'stale_overclaim_rate',promotionGate:'Promote only when a later cycle shows at least 30 percent lower stale_overclaim_rate without reducing true-positive claim coverage and with all evidence artifacts present.',evidenceArtifact:'apps/web/content/research/data/proof-debt-ledger.json',publicBoundary:'Public boundary excludes private customer data, secrets, raw prompts, proprietary internal payloads, tenant identifiers, and unsafe operational details while publishing aggregate methods and outcomes.'}}; const report=auditResearchArtifactQuality([paper],[experiment],{minimumPaperWords:650,minimumExternalSources:2}); console.log(JSON.stringify(report,null,2)); if(!report.passed) process.exit(1);"
The measurement script and data file are the source of the quantitative claims in this paper. The claims are registered in apps/web/content/research/claims-registry.json.
Conclusion
Proof debt is now measurable. In this run, the first-order research integrity gates worked: no missing sources, no missing producers, and no unregistered new research papers. The remaining gap is freshness. Agent systems that publish claims against live code and runtime policies need a refresh-before-repeat ledger, because source drift turns yesterday's valid evidence into tomorrow's overclaim.