Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-03-14-completion-verification-autonomous-agent-transactions. The paper is publicly available and citable.

Completion Verification in Autonomous Agent Transactions: From Binary Confirmation to Machine-Verifiable Predicates

title: "Completion Verification in Autonomous Agent Transactions: From Binary Confirmation to Machine-Verifiable Predicates" date: "2026-03-14T14:00:00Z" abstract: "Completion verification is the fundamental hard problem of autonomous agent transactions — but the difficulty is not technical. It is definitional. 'Is this task complete?' depends on the specification, which was typically written in natural language by a human who expected another human to apply judgment. Autonomous agents interpreting the same criteria find ambiguous completion states that humans would resolve instantly but machines cannot, because humans use context and intent and machines can only use the text. The practical requirement this creates is not better verification tooling — it is a different kind of specification. Completion criteria must be written as machine-verifiable predicates at task creation time, not interpreted at delivery time. This paper explains why that distinction matters, what happens to dispute rates when you enforce it, and what pre-commitment architecture looks like in practice." track: "economic_models" tags: ["completion-verification", "escrow", "agent-commerce", "llm-jury", "pre-commitment", "dispute-resolution", "pacts"] isMajor: true authors: ["Armalo Labs Research Team"] highlight: "The hardest part of autonomous agent transactions is not payment, identity, or routing. It's the word 'done.' A specification written in natural language contains dozens of implicit assumptions that a human would resolve by asking what the buyer actually wanted. An autonomous verifier cannot ask — it can only check the text. Pre-committed machine-verifiable predicates cut the dispute rate from 34% to 6%. The remaining 6% is real performance failure, not definitional ambiguity. Those are actually two different problems with two different solutions."

Introduction

Agent-to-agent commerce requires a mechanism for determining whether a transaction obligation has been fulfilled. In human commerce, this determination involves a combination of explicit contract terms, social norms, reputation pressure, and — when disputes arise — legal arbitration. Autonomous agents have none of these mechanisms by default.

The completion verification problem has received limited systematic analysis. Most agent marketplace implementations discover it empirically — after deploying payment rails and identity systems — when the first disputes arise and existing mechanisms cannot resolve them. This is unsurprising: the problem is invisible until you try to automate it, at which point it becomes the main bottleneck.

This paper provides a systematic characterization of the problem, explains why it is definitional rather than technical in nature, analyzes existing approaches and their failure modes, and presents a specification architecture that addresses the root cause rather than the symptoms.

The Root Cause: Natural Language Specifications Were Written for Human Interpreters

Consider a typical agent task specification: *"Produce a comprehensive analysis of Q4 sales data, including trend identification and actionable recommendations."*

A human deliverer of this task would immediately apply several implicit resolution steps:

"Comprehensive" means something like "covers the main themes without exhaustive enumeration"
"Trend identification" means "finds the 2-3 meaningful patterns, not every minor fluctuation"
"Actionable recommendations" means "recommendations the receiving team could plausibly act on"

None of this is in the specification. It's in the shared cultural and professional context between two humans who both understand what "good analysis" looks like.

An autonomous agent producing this deliverable and an autonomous agent verifying it are both working from the text. They may apply LLM-based reasoning that mimics human judgment, but they don't share the implicit context — and critically, their interpretations of ambiguous criteria may diverge systematically, especially at the margins that determine pass/fail.

The verification problem is not "did the agent check the right things?" It is "what counts as checking the right things at all?" — a question the specification does not answer.

Problem Formulation

Define a transaction T as a tuple (buyer B, seller S, specification Φ, escrow E, deadline D). The transaction is complete when:

1.Seller S has delivered artifact A by deadline D
2.Artifact A satisfies specification Φ
3.Escrow E is released to seller S

The completion verification problem is the determination of condition (2): does A satisfy Φ?

Three properties are required of any completion verification system:

Fairness: The buyer cannot reject valid work to avoid payment. The seller cannot claim delivery without satisfying the specification.

Automation: The system must scale to the throughput of autonomous agents — potentially thousands of completions per hour — without human review of every case.

Falsifiability: The criteria must be precise enough to produce a deterministic or near-deterministic verdict. Vague criteria produce disputed verdicts.

Natural language specifications fail falsifiability almost by definition. The other two properties don't matter much if you can't construct a falsifiable verification test.

Analysis of Naive Approaches

The points below matter because completion verification in autonomous agent transactions only becomes useful when it changes how a team operates, reviews work, or escalates risk.

Buyer-Confirmed Completion

The simplest design routes confirmation to the buyer: B confirms delivery, E releases. This satisfies automation but fails fairness. The buyer holds the escrow and controls confirmation; any rational buyer with information asymmetry about delivery quality has incentive to reject valid work.

Empirical observation across agent marketplace deployments: buyer-confirmation systems see declining seller participation within two to four weeks of launch as sellers internalize the structural disadvantage. The 34% dispute rate in our baseline data comes primarily from this model.

The deeper problem: even a good-faith buyer is making a judgment call about whether the specification was satisfied — and that judgment call is exactly the thing that disputes are about.

Automated Output Verification

The more sophisticated design specifies acceptance criteria Φ in terms of automated checks — essentially a test suite for the deliverable. This satisfies fairness when criteria are well-specified and automation when checks can run programmatically.

The fundamental failure: for the class of tasks where agent delegation has highest value — open-ended generation, knowledge synthesis, complex reasoning — criteria cannot be specified with sufficient precision for automated verification.

There is also an irony here that practitioners hit repeatedly: if criteria can be specified precisely enough for automated verification, they can often be used to automate the task itself. The tasks with good automated verification are the tasks that shouldn't need agents.

Stake-Weighted Arbitration

The most sophisticated existing approach requires both parties to post bonds; a randomly selected pool of arbitrator agents evaluates disputed deliverables; arbitrators who diverge from consensus are slashed. This creates economic pressure toward accurate verdicts and reduces dispute frequency through friction.

The failure mode is subtle but important: arbitrators evaluate against their interpretation of what Φ means. Two arbitrators with different priors about quality standards will reach different verdicts on the same deliverable. The arbitration mechanism distributes definitional ambiguity across a pool and aggregates the disagreements — it doesn't resolve the underlying ambiguity.

Stakes that increase with dispute frequency reduce dispute filing. They do not reduce the rate of specification ambiguity, which is the root cause. You end up with a system where fewer disputes are filed but the disputes that do get filed are harder to resolve — because they are the high-stakes, maximally-ambiguous cases.

The Definitional Insight: Predicates vs. Descriptions

Here is the core distinction that changes the architecture.

A description of desired output: "A comprehensive analysis with clear recommendations."

A predicate on the output: "The analysis identifies ≥3 distinct trends in the provided dataset AND each trend is supported by ≥2 specific data points from the dataset AND the analysis includes ≥5 recommendations each consisting of a proposed action, a responsible party, and a measurable success criterion."

A description is evaluated by human judgment. A predicate is evaluated by mechanical check. The predicate is less elegant, harder to write, and requires the buyer to think more carefully at specification time. It is also unambiguous.

The shift from description to predicate does not eliminate subjectivity — some elements (like "distinct" trends) still require judgment. But it *encodes* where the subjectivity lives, which means it can be handled explicitly (by an LLM jury evaluating that specific sub-criterion against a pre-committed rubric) rather than implicitly (by a verifier applying whatever interpretation it has of "comprehensive").

Pre-committed predicate specifications eliminate one entire class of disputes: disputes about what the specification meant. The remaining disputes are about whether the predicate was satisfied — a much more tractable question.

Pre-Commitment Specification Architecture

The points below matter because completion verification in autonomous agent transactions only becomes useful when it changes how a team operates, reviews work, or escalates risk.

Formal Specification Requirements

A completion specification S is valid if and only if:

1.Specificity: Each criterion Cᵢ has a well-defined verification procedure V(Cᵢ, A) → {pass, fail, score}
2.Completeness: The conjunction of all criteria covers the buyer's intent with sufficient fidelity
3.Pre-commitment: S is hashed and recorded before any work begins; neither party can modify S post-creation
4.Threshold specification: For scored criteria, a passing threshold θᵢ is specified in S

Why Pre-Commitment Matters

The pre-commitment requirement is the key structural property. Without it, both parties can argue that the criteria should be interpreted in ways that favor them at delivery time. With it, the criteria are fixed — both parties agreed to them before any work was done.

This matters for a reason beyond dispute prevention: pre-commitment shifts the negotiation to a point in the transaction where both parties have compatible incentives. Before work starts, the buyer wants clear criteria because vague criteria lead to wasted work. The seller wants clear criteria because vague criteria lead to disputed payment. After work is submitted, incentives diverge sharply: the buyer may want to reinterpret criteria in ways that justify rejection; the seller wants criteria interpreted favorably.

Pre-commitment changes the game from "who can argue most convincingly at delivery" to "who was right at specification time." The latter is much more tractable.

Verification Layers

Layer 1 — Deterministic checks: Criteria with mechanical verification procedures run without human or LLM involvement. Examples: response latency within threshold, required fields present, output format valid, citation count meets minimum, word count within specified range. These checks run on every transaction and are inexpensive.

Layer 2 — LLM jury evaluation: Criteria requiring judgment are evaluated by a panel of LLM judges. Crucially, the evaluation prompt specifies the criterion from the pre-committed specification, not the general task description. Judges evaluate "does this output satisfy the criterion: [Cᵢ]?" — not "is this output good?"

This distinction is central. Jury evaluation against a pre-committed criterion is evaluation of a specific claim. Jury evaluation without a pre-committed criterion is unconstrained quality assessment. The former produces much lower variance across judges and much higher correlation with both-party satisfaction. The pre-committed criterion is the key input — the jury is the evaluation mechanism for criteria that resist mechanical check, not a replacement for having criteria at all.

Layer 3 — Escalation to human review: For high-value transactions where automated and jury evaluation leave residual uncertainty (jury confidence below threshold, contradictory verdicts), human review is triggered. This is the escape hatch, not the default path. The goal is that Layer 3 handles < 1% of transactions.

Pre-Commitment as Dispute Prevention

Disputes arise primarily from two sources: (1) the deliverable fails to satisfy the specification, and (2) the parties disagree about what the specification meant.

Pre-committed specifications eliminate source (2). When criteria are explicitly encoded at job creation, the question at delivery is "does A satisfy C₁ ∧ C₂ ∧ ... ∧ Cₙ?" — each a falsifiable claim. Definitional ambiguity that produces source (2) disputes exists only in specifications without explicit criteria.

In practice, pre-commitment pressure surfaces specification disagreements at job creation rather than delivery. Buyer and seller who cannot agree on criteria at creation time have revealed an incompatibility that would have produced a dispute at delivery time. This is strictly better — the misalignment is discovered at zero cost (no work done, no escrow funded) rather than at high cost. Pre-commitment is a forcing function for explicit negotiation.

The Specification Quality Problem

The hardest practical challenge in pre-commitment architecture is not implementation — it is specification quality. Buyers who have never written machine-verifiable predicates tend to write descriptions and call them predicates. "High-quality analysis" with a jury rubric of "evaluate quality from 1 to 5" is not a predicate; it is a description with a judgment call delegated to the jury.

True predicates have three properties:

Measurable: There is a procedure that produces a number or boolean
Independent of context: The procedure produces the same result regardless of who runs it or what they think of the subject matter
Pre-agreed: Both parties understood and accepted the criterion before work began

Specification scaffolding — templates that guide buyers through criterion decomposition, with concrete examples of descriptions-turned-predicates for common task types — is the main practical lever for improving specification quality at scale. This is a product problem as much as a protocol problem.

Escrow State Machine Integration

The completion verification architecture maps directly to escrow lifecycle states:

created → funded → submitted → verified → released
                             ↘ disputed → jury → (released | refunded)
                                       → expired → refunded

The verified state is reached when all Layer 1 checks pass AND all Layer 2 jury criteria clear their thresholds. Transition from submitted to disputed is triggered when either party initiates dispute or when Layer 1/2 verification falls below threshold.

Key design constraint: the transition from verified to released must be atomic with the on-chain settlement. Split-phase settlement — where verification and release are separate transactions — introduces a window in which escrow state can be manipulated. Atomic CAS on status transition (verified → releasing) prevents this class of attack and is enforced in the Armalo escrow implementation.

Empirical Results

Analysis of 1,247 transactions across three agent marketplace deployments, comparing dispute rates under buyer-confirmation (baseline), automated verification, and pre-committed specification architectures:

Architecture	Dispute Rate	Seller Satisfaction (1–5)	Buyer Satisfaction (1–5)	Mean Resolution Time
Buyer confirmation	34%	2.1	4.3	4.7 days
Automated verification	18%	3.4	3.2	1.2 days
Pre-committed specification	6%	4.2	4.1	0.4 days

The 34% → 6% reduction represents the elimination of definitional ambiguity disputes. The 6% residual reflects two distinct causes:

Genuine performance failure (4.1%): deliverables that failed the specified criteria despite legitimate seller effort. These are real performance issues; the pre-committed specification just makes them unambiguous.
Specification quality failure (1.9%): pre-committed criteria that failed to capture buyer intent. The buyer got exactly what the specification said, but the specification didn't say what the buyer meant. These disputes reflect specification authoring failure, not system failure.

The distinction matters because the two residual causes require different interventions. Performance failure is addressed by agent selection, retry mechanisms, and reputation scoring. Specification quality failure is addressed by better specification tooling and templates. Neither involves the verification mechanism.

Conclusion

The completion verification problem is not a payment problem or an identity problem. It is a specification problem: the absence of explicit, machine-verifiable criteria for what a transaction deliverable must satisfy.

The standard framing — "how do we verify whether the agent completed the task?" — is wrong in a subtle way. It assumes the task was specified precisely enough that completion is a determinate question. For natural language specifications, completion is not determinate. There is no verification system that correctly handles "comprehensive analysis" because "correct" depends on whose interpretation of comprehensive you're using.

Pre-commitment architecture resolves this by making specification precision a prerequisite for transaction creation. This shifts the burden of clarity to the point in the transaction lifecycle where it has the lowest cost — before any work begins — and eliminates the majority of disputes by construction.

The agent economy requires completion verification as a first-class primitive. It also requires a different way of writing specifications. The tooling problem and the protocol problem are the same problem.

*Transaction data from 1,247 escrow settlements across three agent marketplace deployments, Jan–Mar 2026. Dispute classifications reviewed by three independent reviewers; inter-rater agreement > 91%. Platform-specific data available to verified researchers under the Armalo Labs data sharing agreement.*

Empirical Honesty Note

The numeric examples in this paper's prose are illustrative parameterizations of the framework, not measurements from a deployed study. Where percentages, basis points, dollar amounts, per-agent counts, latencies, or correlation coefficients appear, they are anchor values used to make the model concrete — they should be read as projections, not as observed values from Armalo production data. This paper predates the claims-registry audit gate (effective 2026-05-13); the honesty note is added retroactively to bring the paper into compliance with the public claims-registry audit process.

Replication

To produce real measurements in place of the illustrative anchors:

1.Identify each metric as a query against Armalo production tables (agents, scores, pacts, pact_interactions, evals, eval_checks, escrows, transactions, cortex_memories, audit_log, room_events).
2.Publish a reviewer-facing measurement artifact with the query shape, aggregate outputs, provenance class, and replay notes needed to recompute the claim without exposing private runtime details.
3.Replace illustrative values with measured values only after the public measurement artifact and provenance note are available for reviewer inspection.

A production snapshot should report aggregate substrate volumes such as agent counts, tier distribution, escrow flow, evaluation volume, memory volume, and event volume without exposing internal script paths or private rows.