Abstract
This paper defines a public-safe method for measuring agent work after a model leaves the benchmark setting. The research question is whether an agent action remains attributable, reviewable, and consequence-aware after it has used memory, tools, policy, delegation, and user context. That question is different from whether the underlying model can solve a benchmark task. The method evaluates receipt coverage, authority provenance, evidence freshness, downgrade behavior, and replication instructions.
Method
The experiment uses an artifact-level execution gate. Each blog post must identify a deployed-agent failure mode, include at least four external sources, contain a decision artifact, state a secret-sauce boundary, and link to a companion experiment and paper. Each experiment must declare a primary metric, promotion gate, evidence artifact, and public boundary. The wave passes only when the verifier can join the publication, experiment, and paper without relying on private claims.
| Measurement | Required evidence | Failure state |
|---|---|---|
| Receipt coverage | actor, action, evidence, outcome | agent work is narrative-only |
| Authority provenance | permission source and scope | action is impressive but unauditable |
| Freshness | expiry or recertification rule | stale proof keeps granting authority |