Post-Ship Agent Work Measurement: A Receipt-Centered Evaluation Method | Armalo Labs | Armalo AI

Eval Methodology · May 26, 2026 · 5 min read

Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/research-lab-post-ship-agent-work-measurement. The paper is publicly available and citable.

Armalo Labs

Key Finding

Post-ship agent evaluation needs receipts, not only benchmark scores.

Abstract

A public-safe method for evaluating agent work after deployment by checking receipt coverage, attribution, downgrade behavior, and proof boundaries.

research-lab · agent-evals · receipts · runtime-trust

Abstract

This paper defines a public-safe method for measuring agent work after a model leaves the benchmark setting. The research question is whether an agent action remains attributable, reviewable, and consequence-aware after it has used memory, tools, policy, delegation, and user context. That question is different from whether the underlying model can solve a benchmark task. The method evaluates receipt coverage, authority provenance, evidence freshness, downgrade behavior, and replication instructions.

Method

The experiment uses an artifact-level execution gate. Each blog post must identify a deployed-agent failure mode, include at least four external sources, contain a decision artifact, state a public proof boundary, and link to a companion experiment and paper. Each experiment must declare a primary metric, promotion gate, evidence artifact, and public boundary. The wave passes only when the verifier can join the publication, experiment, and paper without relying on private claims.
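The gate described above can be sketched as a checklist. This is a minimal illustration, not the verifier Armalo Labs uses: the class names, field names, and functions are hypothetical, and the "join" between publication, experiment, and paper is reduced to checking that the links are present.

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_EXTERNAL_SOURCES = 4  # "at least four external sources"


@dataclass
class BlogPost:  # hypothetical record; field names are assumptions
    failure_mode: str
    external_sources: list[str]
    decision_artifact: str
    proof_boundary: str
    experiment_link: str
    paper_link: str


@dataclass
class Experiment:  # hypothetical record; field names are assumptions
    primary_metric: str
    promotion_gate: str
    evidence_artifact: str
    public_boundary: str


def post_gaps(post: BlogPost) -> list[str]:
    """Return the names of required blog post elements that are missing."""
    required = ("failure_mode", "decision_artifact", "proof_boundary",
                "experiment_link", "paper_link")
    gaps = [name for name in required if not getattr(post, name).strip()]
    if len(post.external_sources) < MIN_EXTERNAL_SOURCES:
        gaps.append("external_sources")
    return gaps


def experiment_gaps(exp: Experiment) -> list[str]:
    """Return the names of required experiment declarations that are missing."""
    required = ("primary_metric", "promotion_gate",
                "evidence_artifact", "public_boundary")
    return [name for name in required if not getattr(exp, name).strip()]


def wave_passes(posts: list[BlogPost], experiments: list[Experiment]) -> bool:
    """A wave passes only if it is non-empty and every artifact has no gaps."""
    return (
        bool(posts)
        and bool(experiments)
        and all(not post_gaps(p) for p in posts)
        and all(not experiment_gaps(e) for e in experiments)
    )
```

Returning the list of gaps, rather than a bare boolean, keeps each failure reviewable, which matches the paper's emphasis on evidence over narrative.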

Measurement	Required evidence	Failure state
Receipt coverage	actor, action, evidence, outcome	agent work is narrative-only
Authority provenance	permission source and scope	action is impressive but unauditable
Freshness	expiry or recertification rule	stale proof keeps granting authority
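One way to read the table is as three checks applied to a single receipt. The sketch below assumes a hypothetical `Receipt` record and treats a missing or past expiry as stale. The paper also allows a recertification rule, which is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Receipt:  # hypothetical shape; field names are assumptions
    actor: str
    action: str
    evidence: str
    outcome: str
    permission_source: str
    permission_scope: str
    expires_at: Optional[datetime]


def failure_states(receipt: Receipt, now: datetime) -> list[str]:
    """Map missing evidence to the failure states named in the table."""
    failures = []
    # Receipt coverage: actor, action, evidence, outcome must all be present.
    if not all([receipt.actor, receipt.action, receipt.evidence, receipt.outcome]):
        failures.append("agent work is narrative-only")
    # Authority provenance: permission source and scope must both be present.
    if not (receipt.permission_source and receipt.permission_scope):
        failures.append("action is impressive but unauditable")
    # Freshness: assumption that no expiry, or a past expiry, counts as stale.
    if receipt.expires_at is None or receipt.expires_at <= now:
        failures.append("stale proof keeps granting authority")
    return failures


now = datetime(2026, 5, 26, tzinfo=timezone.utc)
receipt = Receipt(
    actor="agent-7",
    action="send_refund",
    evidence="",  # no evidence attached
    outcome="refund issued",
    permission_source="policy-v2",
    permission_scope="refunds under limit",
    expires_at=datetime(2026, 5, 25, tzinfo=timezone.utc),  # already expired
)
print(failure_states(receipt, now))
```

An empty list means the receipt clears all three measurements. Any returned string names the row of the table that failed.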

Cite this work

Armalo Labs (2026). Post-Ship Agent Work Measurement: A Receipt-Centered Evaluation Method. Armalo Labs Technical Series, Armalo AI. https://www.armalo.ai/labs/research/research-lab-post-ship-agent-work-measurement

Armalo Labs Technical Series · ISSN pending

Receipt class	Required proof	Failure signal	Operating consequence
Mandate receipt	requested scope, principal, deadline	action exceeds mandate	narrow authority or require review
Tool receipt	tool name, input boundary, result hash	result cannot be replayed	quarantine downstream claim
Outcome receipt	artifact, reviewer, verification command	no independent evidence	block promotion or publication
Learning receipt	memory/writeback, owner, expiry	stale learning reused	downgrade future reliance
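The receipt classes above could be encoded as a lookup from class to required proof and operating consequence. This is a simplified sketch with hypothetical key names: it only checks that the required proof fields are present, and does not evaluate the richer failure signals in the table, such as whether a result can actually be replayed.

```python
from __future__ import annotations

from typing import Optional

# Required proof per receipt class, taken from the table.
# The snake_case key names are assumptions made for this example.
REQUIRED_PROOF = {
    "mandate": ("requested_scope", "principal", "deadline"),
    "tool": ("tool_name", "input_boundary", "result_hash"),
    "outcome": ("artifact", "reviewer", "verification_command"),
    "learning": ("memory_writeback", "owner", "expiry"),
}

# Operating consequence per receipt class, taken from the table.
CONSEQUENCE = {
    "mandate": "narrow authority or require review",
    "tool": "quarantine downstream claim",
    "outcome": "block promotion or publication",
    "learning": "downgrade future reliance",
}


def operating_consequence(receipt_class: str, fields: dict[str, str]) -> Optional[str]:
    """Return the consequence to apply if any required proof is missing, else None."""
    missing = [key for key in REQUIRED_PROOF[receipt_class] if not fields.get(key)]
    return CONSEQUENCE[receipt_class] if missing else None


# A tool receipt without a result hash should trigger quarantine.
print(operating_consequence("tool", {"tool_name": "search", "input_boundary": "public web"}))
```

Keeping proof requirements and consequences in data rather than branching logic makes it straightforward to add a receipt class without touching the check itself.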
