Defining 'Done': The Hardest Problem in AI Agent Commerce
Every AI agent marketplace eventually hits the same wall: the payment rails work, the identity layer works, even Sybil resistance works — but nobody can agree on what 'done' means. This is the completion verification problem, and it is harder than it looks.
TL;DR
- Completion verification — deciding whether an AI agent "finished" a task — is harder than payments, identity, or Sybil resistance
- Three naive approaches all fail: trusting the buyer, automated output checking, and stake-weighted arbitration
- The solution is pre-committed completion specifications: defining "done" before the work begins, not after
- Pre-commitment eliminates most disputes before they happen by removing the ambiguity that creates them
The Problem Nobody Talks About
Everyone building AI agent marketplaces eventually hits the same wall. The payment rails work — USDC settlement is fast, cheap, and reliable. Identity works — DID systems and API key authentication are solved problems. Even Sybil resistance has mature solutions.
And then an agent delivers work, the buyer says it is not done, and the entire system breaks down.
The completion verification problem is this: in any AI agent transaction, someone needs to determine whether the deliverable meets the conditions of the agreement. For human freelance transactions, this is a vibes check that runs on social pressure and reputation risk. For agent-to-agent or human-to-agent transactions, that mechanism does not exist. You need to build it explicitly.
Building it turns out to be the hardest part of the entire stack.
Why Naive Solutions Fail
Approach 1: Trust the buyer
The simplest design: the buyer confirms completion, escrow releases. Clean state machine. Five states.
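That five-state flow can be sketched as a small transition table. The state names and transitions here are illustrative assumptions, not a real protocol:

```typescript
// Buyer-confirmation escrow: a minimal five-state machine (names assumed).
type EscrowState = "created" | "funded" | "delivered" | "released" | "disputed";

const transitions: Record<EscrowState, EscrowState[]> = {
  created: ["funded"],
  funded: ["delivered"],
  // Buyer confirms -> released; buyer rejects -> disputed.
  delivered: ["released", "disputed"],
  released: [],
  disputed: [],
};

function advance(state: EscrowState, next: EscrowState): EscrowState {
  if (!transitions[state].includes(next)) {
    throw new Error(`invalid transition: ${state} -> ${next}`);
  }
  return next;
}
```

Note that the only path from "delivered" to "released" runs through the buyer, which is exactly where the leverage problem appears.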
The problem: buyers hold the money. When buyer confirmation is the release condition, the buyer has infinite leverage to reject valid work and keep their funds. There is no human reputation pressure — the agent has no recourse outside the dispute mechanism, and dispute mechanisms are expensive for both parties.
In practice: sellers stop accepting jobs within a week.
Approach 2: Automated output checking
The more sophisticated design: define acceptance criteria upfront, run programmatic verification against the deliverable. A test suite, essentially, but for agent outputs.
The problem: for most real tasks, you cannot write acceptance criteria precise enough to be machine-checkable. "Write a good API wrapper" is not a test suite. "Summarize this document accurately" requires judgment to evaluate. And crucially: if you can write a complete automated test suite for the deliverable, you probably do not need to hire another agent to produce it — you can specify the work precisely enough to automate it yourself.
The tasks that benefit most from agent delegation are exactly the tasks where automated acceptance criteria are hardest to write.
Approach 3: Stake-weighted arbitration
The cleverest design: both parties stake a bond. A pool of arbitrator agents reviews disputed work. Arbitrators stake their reputation on verdicts and get slashed for diverging from consensus. Disputes become expensive, which reduces their frequency.
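The mechanism can be sketched as stake-weighted majority voting with slashing for divergence. The vote shape and slash rate are illustrative assumptions:

```typescript
// Stake-weighted arbitration sketch: majority by stake wins; diverging
// arbitrators lose a fraction of their stake. Names and rates assumed.
interface ArbitratorVote {
  arbitrator: string;
  stake: number;
  verdict: "complete" | "incomplete";
}

function resolveDispute(votes: ArbitratorVote[], slashRate = 0.1) {
  const weight = (v: "complete" | "incomplete") =>
    votes.filter(x => x.verdict === v).reduce((sum, x) => sum + x.stake, 0);
  const verdict =
    weight("complete") >= weight("incomplete") ? "complete" : "incomplete";
  // Arbitrators who diverged from the majority are penalized.
  const slashed = votes
    .filter(v => v.verdict !== verdict)
    .map(v => ({ arbitrator: v.arbitrator, penalty: v.stake * slashRate }));
  return { verdict, slashed };
}
```

Notice what the code does not contain: any definition of what "complete" means. The aggregation is mechanical, but each verdict is still an arbitrator's private interpretation.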
The problem: this reduces disputes, but does not eliminate the underlying ambiguity. Arbitrators are evaluating against an implicit specification — what they think "done" means based on context and norms. Two arbitrators can reach opposite verdicts on the same deliverable because they hold different interpretations of what was agreed.
The arbitration mechanism does not actually answer the question of what "done" means. It outsources the ambiguity to a group of third parties and aggregates their disagreements.
The Solution: Pre-Committed Completion Specifications
All three approaches fail because they try to determine "done" after the fact — after the work is delivered, after the payment is in dispute. The solution is to determine "done" before the work begins.
A pre-committed completion specification is exactly what it sounds like: a document, created and agreed to at the time of job posting, that specifies the criteria by which the deliverable will be evaluated. Not a general description of the task — a specific, falsifiable statement of what the deliverable must satisfy to trigger escrow release.
What this looks like in practice
For a task like "write an API wrapper for the Stripe payment API":
Weak specification (fails): "A well-written, working API wrapper."
Strong pre-committed specification:
- All methods in the Stripe Charges, PaymentIntents, and Customers APIs are wrapped
- Each method includes TypeScript types for inputs and return values
- Response latency on test calls does not exceed 500ms
- Documentation covers every public method
- Tests cover the happy path and at least 2 error cases per method
The strong specification can be verified. Each criterion has a clear pass/fail determination. The buyer and seller agree to this specification before the work begins — so at delivery time, neither party is deciding what "done" means. They are checking whether the deliverable satisfies criteria they both committed to in advance.
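A spec like this can be represented directly as a list of checkable criteria. The `Deliverable` shape below is an illustrative assumption built from the Stripe-wrapper example, not a prescribed schema:

```typescript
// A pre-committed completion spec as machine-checkable criteria.
// Field names and the Deliverable shape are illustrative assumptions.
interface Deliverable {
  wrappedApis: string[];
  typedMethods: boolean;           // TS types on inputs and return values
  testCallLatencyMs: number;
  documentedMethodRatio: number;   // documented public methods / total
  errorCasesPerMethod: number;
}

interface Criterion {
  description: string;
  check: (d: Deliverable) => boolean;
}

const spec: Criterion[] = [
  { description: "Charges, PaymentIntents, and Customers APIs wrapped",
    check: d => ["Charges", "PaymentIntents", "Customers"]
      .every(api => d.wrappedApis.includes(api)) },
  { description: "TypeScript types for inputs and return values",
    check: d => d.typedMethods },
  { description: "Test-call latency does not exceed 500ms",
    check: d => d.testCallLatencyMs <= 500 },
  { description: "Documentation covers every public method",
    check: d => d.documentedMethodRatio === 1 },
  { description: "At least 2 error cases tested per method",
    check: d => d.errorCasesPerMethod >= 2 },
];

// Escrow releases only when every pre-agreed criterion passes.
const readyToRelease = (d: Deliverable) => spec.every(c => c.check(d));
```

Each criterion is a pure pass/fail function, so the question at delivery time is a computation, not a negotiation.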
When criteria require judgment
Some criteria genuinely cannot be made into yes/no checks. "The documentation is clear and readable" requires judgment. "The API design is idiomatic" requires judgment.
For these criteria, the pre-committed specification can include a jury condition: an independent evaluation by multiple LLM reviewers against the specified criterion. The jury condition is written into the completion spec before work begins: "Documentation quality will be evaluated by an independent panel against the criterion: can a developer unfamiliar with Stripe implement a basic payment flow from the documentation alone?"
The key properties of the jury condition:
- The evaluation criterion is specified before the work begins
- The evaluators are independent (not the buyer, not the seller)
- The evaluation produces a score, not a binary verdict, with a threshold specified in advance
With a jury condition in the pre-committed spec, subjective criteria are handled without arbitrariness. The buyer cannot reject work because they "don't like it" — they can only reject it if the jury verdict falls below the threshold both parties agreed to in advance.
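The jury condition reduces to a small mechanical check once scores exist. Reviewer count, the 0–10 scale, and the mean aggregation below are illustrative assumptions:

```typescript
// Jury condition sketch: independent reviewer scores are aggregated and
// compared against a threshold agreed before work began. Scale assumed 0-10.
interface JuryCondition {
  criterion: string;
  threshold: number; // pre-agreed pass threshold
}

function juryVerdict(condition: JuryCondition, scores: number[]): boolean {
  if (scores.length === 0) throw new Error("no reviewer scores");
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  // Pass/fail is mechanical: neither party decides after the fact.
  return mean >= condition.threshold;
}
```

The design choice that matters is that `threshold` is fixed at spec time; only the scores arrive at delivery time.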
Why Pre-Commitment Eliminates Most Disputes
The majority of disputes in agent commerce are not disputes about facts — they are disputes about interpretation. Did the deliverable meet the agreement? That depends on what both parties understood the agreement to mean.
Pre-committed specifications eliminate interpretation disputes by construction. If both parties agreed to explicit criteria before the work began, the question at delivery time is not "does this meet the agreement?" — it is "does this satisfy criterion X?" The second question has an answer. The first question is a negotiation.
In practice, this shifts most "disputes" to the specification phase — before any work happens. A buyer and seller who cannot agree on what "done" means have a fundamental misalignment. It is better to discover this at job creation than at delivery. The dispute that would have happened after 40 hours of agent work now happens in 5 minutes of spec negotiation.
For the disputes that remain — genuine disagreements about whether a spec criterion was satisfied — the jury mechanism provides a resolution path that is fair to both sides because the evaluation criteria were pre-agreed.
The Deeper Principle
The completion verification problem is ultimately a problem of commitment order. Ambiguous commitments made before work begins produce disputes after delivery. Specific commitments made before work begins produce checkable criteria.
The completion specification is not a feature. It is a load-bearing primitive for any agent commerce system that wants to scale beyond trust-based transactions. Every agent marketplace that has tried to skip this primitive has eventually rebuilt it — either as explicit specification tooling, or as elaborate arbitration infrastructure designed to resolve the ambiguities the specification would have prevented.
The agent economy runs on verifiable commitments. Defining "done" before the work begins is how you make commitments verifiable.