Post-Ship Agent Work Measurement: A Receipt-Centered Evaluation Method
Eval Methodology · May 26, 2026 · 5 min read
Armalo Labs
Key Finding
Post-ship agent evaluation needs receipts, not only benchmark scores.
Abstract
A public-safe method for evaluating agent work after deployment by checking receipt coverage, attribution, downgrade behavior, and proof boundaries.
research-lab · agent-evals · receipts · runtime-trust
Abstract
This paper defines a public-safe method for measuring agent work after a model leaves the benchmark setting. The research question is whether an agent action remains attributable, reviewable, and consequence-aware after it has used memory, tools, policy, delegation, and user context. That question is different from whether the underlying model can solve a benchmark task. The method evaluates receipt coverage, authority provenance, evidence freshness, downgrade behavior, and replication instructions.
Method
The experiment uses an artifact-level execution gate. Each blog post must identify a deployed-agent failure mode, include at least four external sources, contain a decision artifact, state a public proof boundary, and link to a companion experiment and paper. Each experiment must declare a primary metric, promotion gate, evidence artifact, and public boundary. The wave passes only when the verifier can join the publication, experiment, and paper without relying on private claims.
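The post-level requirements above can be read as a checklist that either passes or returns the unmet items. The following is a minimal sketch under assumptions: the `Post` class and its field names are hypothetical, since the paper does not publish a post schema, and only the four-source minimum is taken directly from the text.

```python
from dataclasses import dataclass, field

MIN_EXTERNAL_SOURCES = 4  # the method requires at least four external sources


@dataclass
class Post:
    # Hypothetical fields; the paper does not publish a post schema.
    failure_mode: str = ""
    external_sources: list[str] = field(default_factory=list)
    decision_artifact: str = ""
    proof_boundary: str = ""
    experiment_link: str = ""
    paper_link: str = ""


def post_gate_failures(post: Post) -> list[str]:
    """Return unmet gate requirements; an empty list means the post passes."""
    checks = [
        (bool(post.failure_mode), "no deployed-agent failure mode"),
        (len(post.external_sources) >= MIN_EXTERNAL_SOURCES, "fewer than four external sources"),
        (bool(post.decision_artifact), "no decision artifact"),
        (bool(post.proof_boundary), "no public proof boundary"),
        (bool(post.experiment_link), "no companion experiment link"),
        (bool(post.paper_link), "no companion paper link"),
    ]
    return [message for passed, message in checks if not passed]
```

Returning the list of failures rather than a bare boolean keeps the gate reviewable: a blocked post carries the reason it was blocked.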
| Measurement | Required evidence | Failure state |
| --- | --- | --- |
| Receipt coverage | actor, action, evidence, outcome | agent work is narrative-only |
| Authority provenance | permission source and scope | action is impressive but unauditable |
| Freshness | expiry or recertification rule | stale proof keeps granting authority |
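The freshness row can be made concrete with an expiry check. This is a minimal sketch, assuming each proof records an issue time and a maximum age; `Proof`, `issued_at`, and `max_age` are hypothetical names, because the paper only requires that some expiry or recertification rule exists.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Proof:
    # Hypothetical shape: the paper asks only for an expiry or recertification rule.
    issued_at: datetime
    max_age: timedelta


def still_grants_authority(proof: Proof, now: datetime) -> bool:
    """A proof grants authority only while it is inside its expiry window."""
    return now - proof.issued_at <= proof.max_age


proof = Proof(
    issued_at=datetime(2026, 5, 1, tzinfo=timezone.utc),
    max_age=timedelta(days=30),
)
print(still_grants_authority(proof, datetime(2026, 5, 26, tzinfo=timezone.utc)))  # True
print(still_grants_authority(proof, datetime(2026, 7, 1, tzinfo=timezone.utc)))   # False
```

Passing `now` explicitly, instead of reading the clock inside the function, keeps the check replayable by a later reviewer.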
Cite this work
Armalo Labs (2026). Post-Ship Agent Work Measurement: A Receipt-Centered Evaluation Method. Armalo Labs Technical Series, Armalo AI. https://www.armalo.ai/labs/research/research-lab-post-ship-agent-work-measurement
Armalo Labs Technical Series · ISSN pending
For this wave, execution means the publication system produced five posts, five experiments, and five public research papers, then ran a verifier over the linkage and quality gates. It does not claim that every future Armalo research lane is fully automated or that private production traces are public. The result is narrower and more useful: the Research Lab authority claim now has a reproducible public artifact loop.
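The linkage part of that verifier can be sketched as a join over identifiers. This is an illustration under assumptions, not the verifier Armalo runs: the dictionary keys `id`, `experiment_id`, and `paper_id` are hypothetical, and only the count of five per artifact type comes from the text.

```python
def unlinked_posts(posts: list[dict], experiment_ids: set[str], paper_ids: set[str]) -> list[str]:
    """Return ids of posts whose companion experiment or paper cannot be found."""
    return [
        post["id"]
        for post in posts
        if post.get("experiment_id") not in experiment_ids
        or post.get("paper_id") not in paper_ids
    ]


def wave_passes(posts: list[dict], experiment_ids: set[str], paper_ids: set[str], expected: int = 5) -> bool:
    """The wave passes when counts match and every post joins to real artifacts."""
    counts_ok = len(posts) == len(experiment_ids) == len(paper_ids) == expected
    return counts_ok and not unlinked_posts(posts, experiment_ids, paper_ids)
```

The join uses only public identifiers, which mirrors the requirement that the verifier must not rely on private claims.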
Method Extension
The measurement method treats post-ship agent work as a chain of receipts rather than a final answer. The chain starts with the agent mandate, continues through tool calls and handoffs, and ends with an outcome record that can be reviewed without trusting the agent's narration. This aligns with the broader research direction in long-horizon agent evaluation, including [SWE-bench](https://www.swebench.com/)-style task-completion scoring and emerging agent-memory benchmarks such as [MemoryArena](https://digitaleconomy.stanford.edu/publication/memoryarena-benchmarking-agent-memory-in-interdependent-multi-session-agentic-tasks/). Those benchmarks expose the same pressure: a system is only useful if later reviewers can tell what changed, why it changed, and whether the evidence survived the workflow.
The reusable framework is a receipt coverage matrix:
| Receipt class | Required proof | Failure signal | Operating consequence |
| --- | --- | --- | --- |
| Mandate receipt | requested scope, principal, deadline | action exceeds mandate | narrow authority or require review |
| Tool receipt | tool name, input boundary, result hash | result cannot be replayed | quarantine downstream claim |
| Outcome receipt | artifact, reviewer, verification command | no independent evidence | block promotion or publication |
| Learning receipt | memory/writeback, owner, expiry | stale learning reused | downgrade future reliance |
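The matrix can be encoded as data so that a missing proof field maps directly to its operating consequence. The sketch below is illustrative only: the snake_case field names are assumptions, and it checks only whether the required proof is present. Detecting the failure signals themselves, such as an action that exceeds its mandate, would need logic the paper does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReceiptClass:
    required_proof: tuple[str, ...]
    consequence: str


# Rows follow the matrix; the snake_case field names are assumptions.
MATRIX = {
    "mandate": ReceiptClass(("requested_scope", "principal", "deadline"), "narrow authority or require review"),
    "tool": ReceiptClass(("tool_name", "input_boundary", "result_hash"), "quarantine downstream claim"),
    "outcome": ReceiptClass(("artifact", "reviewer", "verification_command"), "block promotion or publication"),
    "learning": ReceiptClass(("memory_writeback", "owner", "expiry"), "downgrade future reliance"),
}


def consequences(receipts: dict[str, dict]) -> dict[str, str]:
    """Map each receipt class with missing proof fields to its operating consequence."""
    triggered = {}
    for name, row in MATRIX.items():
        receipt = receipts.get(name, {})
        if any(not receipt.get(proof_field) for proof_field in row.required_proof):
            triggered[name] = row.consequence
    return triggered
```

For example, a tool receipt that records a tool name but no result hash would return `{"tool": "quarantine downstream claim"}` while leaving the other classes untouched.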
Evidence And Falsification
The first execution in this wave is intentionally modest: it checks whether a public artifact set can prove that posts, papers, and experiments are linked and reviewable. The method would be falsified if a post claims an experiment exists but the evidence artifact cannot be found, if an experiment names no metric, or if a paper omits the boundary that keeps public claims honest. A stronger future execution should run the same matrix over live agent actions, not just publication artifacts, and should report receipt coverage as a percentage of completed actions.
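Reporting receipt coverage as a percentage of completed actions could look like the sketch below. It assumes each completed action is a dictionary and that "covered" means all four receipt fields from the measurement table (actor, action, evidence, outcome) are present; the paper does not fix that definition, so treat it as one possible reading.

```python
REQUIRED_FIELDS = ("actor", "action", "evidence", "outcome")


def receipt_coverage(completed_actions: list[dict]) -> float:
    """Percentage of completed actions whose receipt has every required field."""
    if not completed_actions:
        raise ValueError("no completed actions to measure")
    covered = sum(
        all(action.get(name) for name in REQUIRED_FIELDS)
        for action in completed_actions
    )
    return 100.0 * covered / len(completed_actions)
```

Raising on an empty input avoids reporting a coverage figure when nothing was measured.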
The buyer implication is concrete. A procurement or platform team should not ask only whether an agent completed a task. It should ask whether the task can be reconstructed after deployment pressure. If the answer is no, the evaluation is still pre-production theater.
Operating Depth Addendum
In practice this method should be run as a small review meeting, not as a content checklist. Pick one shipped agent action, collect the mandate, tool receipts, final artifact, reviewer decision, and any learning writeback, then mark which links are independently replayable. The review should produce one of three outcomes: promote the workflow, narrow the workflow, or keep the workflow live but require a missing receipt before broader autonomy. That outcome discipline is what separates post-ship measurement from a pleasant dashboard.
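The three review outcomes can be written down as an explicit decision so that each meeting ends with exactly one of them. In this sketch the link names and the rule for choosing between "narrow" and "hold" (one missing link versus several) are illustrative assumptions; the paper leaves that judgment to the reviewers.

```python
from enum import Enum


class Outcome(Enum):
    PROMOTE = "promote the workflow"
    NARROW = "narrow the workflow"
    HOLD = "keep live, require the missing receipt before broader autonomy"


# Links a review collects for one shipped action (names are hypothetical).
LINKS = ("mandate", "tool_receipts", "final_artifact", "reviewer_decision", "learning_writeback")


def review(replayable: dict[str, bool]) -> tuple[Outcome, list[str]]:
    """Return the review outcome plus the links that were not independently replayable."""
    missing = [link for link in LINKS if not replayable.get(link, False)]
    if not missing:
        return Outcome.PROMOTE, []
    if len(missing) == 1:
        # Assumption: a single gap is treated as recoverable without narrowing.
        return Outcome.HOLD, missing
    return Outcome.NARROW, missing
```

Returning the missing links alongside the outcome gives the workflow owner a concrete list to close before the next review.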
The same paper should be updated when the verifier changes. If the verification command begins checking runtime actions instead of publication artifacts, the paper should name that stronger proof and soften any claim that depended only on static files. A public lab artifact earns trust by preserving that boundary over time.
Replication
This is a framework paper: its quantitative content is the structure of the measurement and receipt-coverage matrices plus the artifact counts and gate thresholds of the publication wave it describes. The wave's five posts, five experiments, and five papers are committed artifacts, and the wave verifier enforces the four-external-source minimum, schema validity, linkage, and word-count gates before release. To replicate the method itself, run the receipt coverage matrix over one shipped agent action and record which receipt classes are independently replayable. Every numeric claim in this paper is registered in Armalo's research claims registry with an explicit provenance type.