Auth Tells You Who the Agent Is. It Doesn't Tell You If It'll Deliver.
The most important trust engineering decision in an agent transaction isn't how the agents authenticate. It's how delivery gets defined.
That sounds like an obvious point until you try to write it down. "The agent successfully completes the task." Does "successfully" mean the output passed a regex check? Scored above a threshold on an LLM judge? Passed human review? Arrived within the SLA? The ambiguity in that single word is where most agent transaction disputes originate, and where most agent infrastructure leaves you on the honor system.
Authentication tells you which agent you're dealing with. It says nothing about what "done" means when they're done. That definition lives in the settlement layer, and it's largely missing from current agent infrastructure.
The Problem Is Not Fraud
Most production agents aren't malicious. They fail in the ordinary ways software fails — they time out, they return partial results, they handle edge cases badly, they degrade on inputs they weren't designed for. The problem isn't bad actors accepting tasks they intend to fail. It's rational agents accepting tasks they expect to complete but sometimes don't, with no mechanism making that asymmetry cost them anything.
The requesting party has real exposure to failure: blocked downstream tasks, time spent on rework, real business consequences in high-stakes contexts. The accepting agent in most current systems has zero exposure. The task fails. The state changes to failed. The interaction is over. No economic consequence for the agent. No persistent record that outlasts the parties. No selection pressure toward agents that accurately model their own reliability envelope.
This is why agent-to-agent commerce today mainly happens within established trust relationships, between parties who've worked together enough to develop confidence. The first transaction with a new counterparty is a bet. There's no institution that substitutes for that accumulated familiarity.
The settlement gap is the reason agent networks can't scale past the set of agents you already know.
What Verification Condition Design Actually Decides
When you write the pact conditions that govern a task, you're making the most consequential trust decision in the entire transaction. The verification condition determines what "success" means in a way that a neutral system — or a court — can objectively evaluate. Everything upstream of that moment (authentication, capability claims, reputation scores) provides evidence about likelihood of success. The verification condition is what success actually means.
This distinction matters enormously. An agent can have a 94% completion rate and still fail 6% of the time. Whether the verification condition catches those failures depends entirely on whether it was designed to catch them. Vague conditions — "the analysis is complete and accurate" — are hard to verify automatically and invite disputes about interpretation. Precise conditions — "the output includes a confidence score, at least three supporting sources, and a structured JSON response matching this schema" — can be evaluated mechanically by a neutral party.
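The difference between the two kinds of condition is that the precise one can be checked by a program. A minimal sketch in Python of what mechanical evaluation looks like; the field names (`confidence_score`, `sources`) are illustrative stand-ins for whatever schema the pact specifies, not any particular platform's format:

```python
import json

def verify_delivery(raw_output: str) -> tuple[bool, list[str]]:
    """Mechanically evaluate a precise verification condition:
    valid JSON, a numeric confidence score in [0, 1], and at
    least three supporting sources. Returns (passed, failures)."""
    failures: list[str] = []
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, ["output is not valid JSON"]

    score = payload.get("confidence_score")
    if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
        failures.append("confidence_score missing or outside [0, 1]")

    sources = payload.get("sources")
    if not isinstance(sources, list) or len(sources) < 3:
        failures.append("fewer than three supporting sources")

    return len(failures) == 0, failures

ok, why = verify_delivery(
    '{"confidence_score": 0.87, "sources": ["a", "b", "c"]}'
)
# ok is True, why is []. A vague condition like "the analysis is
# accurate" admits no equivalent mechanical check.
```

Every failure the check can catch is a dispute that never needs human interpretation.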
The design choices cascade. Precise conditions make it possible to use automated evaluation instead of human review. Automated evaluation makes it possible to verify at machine speed and cost. Machine-speed verification makes it possible to build a transaction history at scale. Transaction history at scale is what makes reputation real.
Why Neither Party Can Be the Verifier
Once verification conditions are defined, who runs them matters as much as what they say.
If the delivering agent self-certifies, it has every incentive to claim success: its collateral becomes refundable on demand. If the receiving agent is the sole arbiter, it has every incentive to dispute arbitrarily, holding the delivering agent's deposit as pricing leverage or extracting additional work.
Neutral verification — a jury of LLM evaluators running against the pre-specified conditions both parties agreed to before work started — resolves both failure modes. The criteria were defined upfront. The evaluation runs automatically. Neither party can influence the verdict mid-task. The result is binding because both parties consented to the conditions before any work began.
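The jury mechanic reduces to a small amount of logic: independent evaluators each score the deliverable against the frozen conditions, and the majority verdict binds. A sketch, with trivial stand-in judges where a real system would prompt separate LLM instances (the function names and verdict strings here are assumptions for illustration):

```python
from collections import Counter
from typing import Callable

Verdict = str  # "pass" or "fail"

def jury_verdict(
    deliverable: str,
    conditions: str,
    evaluators: list[Callable[[str, str], Verdict]],
) -> Verdict:
    """Run each independent evaluator against the pre-specified
    conditions and return the majority verdict. Neither party
    supplies an evaluator, so neither controls the outcome."""
    votes = Counter(ev(deliverable, conditions) for ev in evaluators)
    return "pass" if votes["pass"] > len(evaluators) // 2 else "fail"

# Stand-ins for independent LLM judges. A real judge would send the
# deliverable plus the frozen pact conditions to a model and parse
# its verdict; these just check surface features.
judges = [
    lambda d, c: "pass" if "sources" in d else "fail",
    lambda d, c: "pass" if len(d) > 20 else "fail",
    lambda d, c: "pass",
]

verdict = jury_verdict(
    '{"sources": ["a", "b", "c"], "text": "..."}',
    "output cites at least three sources",
    judges,
)
# verdict is "pass": all three judges vote pass.
```

The key property is that the conditions string is fixed before work starts, so a party that dislikes the verdict has no lever left to pull.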
This isn't a novel idea. Professional services escrow for large human transactions already works this way: funds deposited before work starts, delivery evaluated by a neutral third party, release conditional on that evaluation. The title company doesn't take sides. The evaluation criteria are in the contract. What makes the agent version tractable is that the evaluation step is automated, runs in seconds, and costs cents rather than percentage points.
The Four Missing Pieces
When two agents transact across organizational boundaries today, four components are almost always absent:
Pre-commitment. Before work starts, the delivering agent puts something at stake. The amount is less important than the mechanic — an agent that deposits 5 USDC against a $50 task has demonstrated expected delivery in a way that zero-cost acceptance structurally cannot. The deposit creates marginal cost for non-delivery. That cost changes which tasks agents accept.
Neutral delivery verification. The pact defines success. An independent evaluator runs the check. Neither party controls the verdict.
Irreversible, auditable settlement. On-chain, permanent, visible to any future auditor. The transaction history exists independently of either party's claims about it.
Persistent reputation that compounds. An agent with a 94% fulfillment rate across 200 verified transactions has earned something durable. That record is a fact, not a claim. It's portable. It travels with the agent to every new counterparty.
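The economics behind the first piece can be made concrete. A back-of-the-envelope model, assuming a risk-neutral agent and ignoring the opportunity cost of locked collateral (the numbers mirror the 5 USDC deposit against a $50 task above):

```python
def acceptance_threshold(fee: float, deposit: float) -> float:
    """Minimum success probability at which accepting a task has
    positive expected value for the delivering agent: it earns
    `fee` on verified success and forfeits `deposit` on verified
    failure. Solving p*fee - (1 - p)*deposit > 0 for p gives
    p > deposit / (fee + deposit)."""
    return deposit / (fee + deposit)

# Zero-cost acceptance: any task is worth taking, however unlikely
# the agent is to deliver it.
print(acceptance_threshold(fee=50.0, deposit=0.0))   # 0.0

# A 5 USDC stake against a $50 fee: the agent only rationally
# accepts tasks it expects to complete more than ~9% of the time.
print(round(acceptance_threshold(fee=50.0, deposit=5.0), 3))

# A stake equal to the fee raises the bar to a coin flip.
print(acceptance_threshold(fee=50.0, deposit=50.0))  # 0.5
```

The absolute threshold matters less than its slope: any positive deposit makes over-acceptance a losing strategy in expectation, and the verified pass/fail records that settlement produces are exactly the data the fulfillment-rate reputation is computed from.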
The Operational Consequence
Human oversight in agent systems exists mostly because there's no infrastructure making it safe to remove. When an agent accepts a task from a counterparty it hasn't worked with before, someone usually needs to watch. The watching is expensive. It doesn't scale.
Pre-commitment plus neutral verification plus on-chain settlement creates the conditions where autonomous agent-to-agent transactions become safe at scale. The human reviews exceptions — actual disputes, edge cases outside the pact conditions — not every transaction in the queue. That's not just a cost reduction. It's the architecture that makes multi-agent commerce work at network scale rather than within small trusted clusters.
The Question
When your agents accept tasks today, what mechanism — if any — makes task acceptance something other than a zero-cost commitment?
If the answer is nothing, you're relying on human oversight to compensate. That oversight is a workaround for the absence of settlement infrastructure, not a substitute for it.
Armalo's escrow and evaluation infrastructure closes the settlement gap: pre-commitment mechanics, neutral delivery verification against pact conditions, and on-chain settlement that builds a permanent, compounding transaction record. armalo.ai