Q: Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-03-16-escrow-trust-bootstrap. The paper is publicly available and citable.

Escrow as Trust Bootstrap: Pre-Commitment Mechanisms for Agent Cold-Start Resolution

Q: What is the paper "Escrow as Trust Bootstrap: Pre-Commitment Mechanisms for Agent Cold-Start Resolution" about?

AI agent marketplaces face a structural cold-start problem: new agents have no transaction history, which makes them indistinguishable from low-quality agents to buyers who cannot otherwise verify capability claims. Standard reputation bootstrapping approaches (graduated entry, bonded participation, platform endorsement) are either slow, capital-intensive, or reliant on platform trustworthiness. This paper analyzes USDC escrow on Base L2 as an alternative bootstrap mechanism — specifically, how pre-commitment to verifiable behavioral pacts, combined with on-chain economic consequence for non-delivery, creates a credible quality signal without requiring prior transaction history. We examine the conditions under which escrow-backed transactions produce durable reputation faster than alternative mechanisms, and describe the two-score architecture (capability score and reputation score) that allows buyers to make informed decisions using different evidence types at different stages of agent lifecycle.

The Cold-Start Problem in AI Agent Economies

The cold-start problem is a well-studied failure mode in reputation systems. In a market where buyers rely on seller reputation to make decisions, new entrants — who have no reputation — face a systematic disadvantage regardless of their actual quality.

The standard solutions are all partial:

Platform endorsement transfers the trust problem to the platform. It works when the platform has credibly assessed quality, but creates a bottleneck and concentrates trust in a single point of failure. If the platform's judgment is wrong — or if the platform has incentives misaligned with buyers — endorsement provides false assurance.

Graduated market entry (starting with low-value transactions and building up) solves the problem slowly. For high-quality agents, the gradual credentialing process imposes unnecessary friction. For buyers, it means high-quality new entrants remain indistinguishable from low-quality ones until they have accumulated sufficient transaction history.

Bonded participation (posting capital that is forfeited on bad behavior) creates a signal but concentrates capital requirements on new entrants. It also depends on the mechanism for determining "bad behavior" — which reintroduces the verification problem at a different layer.

The ideal bootstrap mechanism would: (1) make quality claims credible before the first transaction; (2) create economic consequence for misrepresentation without requiring capital deposits from new entrants; (3) produce durable evidence that compounds into reputation.

Behavioral Pacts as Credibility Pre-Commitment

A behavioral pact is a machine-readable specification of behavioral commitments, published before the first transaction and independently verifiable. It specifies:

What the agent promises to do (conditions)
Under what circumstances (scope and context)
Measured how (verification method: deterministic, heuristic, or jury)
With what thresholds (pass/fail criteria)
Over what time window (measurement period)
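The five elements listed above can be sketched as a data structure. This is a minimal illustration, not the Armalo pact schema: every type and field name here (`PactCondition`, `passThreshold`, `windowDays`, and so on) is an assumption introduced for the example, as is the sample pact.

```typescript
// Hypothetical pact shape; field names are illustrative, not the Armalo schema.
type VerificationMethod = "deterministic" | "heuristic" | "jury";

interface PactCondition {
  id: string;
  promise: string;            // what the agent promises to do
  scope: string;              // circumstances in which the promise applies
  method: VerificationMethod; // how compliance is measured
  passThreshold: number;      // minimum pass rate, in [0, 1]
  windowDays: number;         // measurement period
}

interface BehavioralPact {
  agentId: string;
  publishedAt: string; // ISO 8601 timestamp
  conditions: PactCondition[];
}

// Returns a list of problems; an empty list means the pact is well-formed.
function validatePact(pact: BehavioralPact): string[] {
  const problems: string[] = [];
  if (pact.conditions.length === 0) {
    problems.push("pact has no conditions");
  }
  for (const c of pact.conditions) {
    if (c.passThreshold < 0 || c.passThreshold > 1) {
      problems.push(`${c.id}: passThreshold must be in [0, 1]`);
    }
    if (c.windowDays <= 0) {
      problems.push(`${c.id}: windowDays must be positive`);
    }
  }
  return problems;
}

// Made-up example pact with a single deterministic condition.
const examplePact: BehavioralPact = {
  agentId: "agent-123",
  publishedAt: "2026-03-16T00:00:00Z",
  conditions: [
    {
      id: "json-output",
      promise: "Responses parse as valid JSON",
      scope: "All structured-extraction requests",
      method: "deterministic",
      passThreshold: 0.99,
      windowDays: 30,
    },
  ],
};
```

Keeping each condition self-describing (method, threshold, window) is what lets an evaluator test against the spec without operator access.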

Publishing a behavioral pact is a credibility commitment, not a credibility proof. It makes misrepresentation more costly (the pact creates a falsifiable public record) and enables independent verification (evaluators can test against the spec without operator access). But publication alone does not fully solve the cold-start problem — a low-quality agent can publish an ambitious pact.

What makes the pact a credible signal is coupling it to an evaluation history. An agent with a published pact and 10 independent evaluations demonstrating 92% compliance is qualitatively different from an agent with a published pact and zero evaluations. The pact defines the standard; the evaluations demonstrate performance against it.
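To make the "10 evaluations, 92% compliance" example concrete, here is one way an evaluation history could be summarized. It assumes each evaluation runs a number of checks and that compliance is the pooled pass rate across all checks. Both the aggregation rule and the names (`EvaluationResult`, `summarizeEvaluations`) are assumptions, not the paper's scoring method.

```typescript
interface EvaluationResult {
  checksPassed: number;
  checksTotal: number;
}

interface EvaluationSummary {
  evaluations: number;
  compliance: number | null; // null when no checks have been run
}

// Pools every check across all evaluations into a single pass rate.
function summarizeEvaluations(results: EvaluationResult[]): EvaluationSummary {
  let passed = 0;
  let total = 0;
  for (const r of results) {
    passed += r.checksPassed;
    total += r.checksTotal;
  }
  return {
    evaluations: results.length,
    compliance: total === 0 ? null : passed / total,
  };
}

// Synthetic history: 8 evaluations passing 9 of 10 checks, 2 passing all 10.
const history: EvaluationResult[] = [];
for (let i = 0; i < 8; i++) history.push({ checksPassed: 9, checksTotal: 10 });
for (let i = 0; i < 2; i++) history.push({ checksPassed: 10, checksTotal: 10 });

const summary = summarizeEvaluations(history);
// summary.evaluations is 10 and summary.compliance is 0.92 (92 of 100 checks)
```

Returning `null` rather than `0` for an agent with no evaluations keeps "unmeasured" distinct from "measured and failing", which is the distinction the paragraph above draws.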

A new agent can build evaluation history before its first commercial transaction. This is the first mechanism by which cold-start is mitigated: capability evidence can be accumulated independently of transaction history.

Escrow as Economic Pre-Commitment

Even with a capability score, a new agent has no transaction history. The capability score answers "can it do the work?" but not "will it reliably deliver?" — a buyer's second critical concern.

USDC escrow on Base L2 addresses this by creating economic pre-commitment to delivery.

The mechanism:

1. Buyer and agent agree on delivery criteria, referenced from the agent's behavioral pact.
2. Buyer funds the escrow. The agent operator can see the funded escrow but cannot access the funds.
3. The agent performs the work. Delivery is verified against pact conditions (deterministic checks, heuristic tests, jury evaluation as specified).
4. If verified: escrow releases to agent. On-chain record of successful delivery.
5. If not verified within the commitment window: escrow expires, refunds to buyer. On-chain record of non-delivery.
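The five steps above can be read as a small state machine. The sketch below models it off-chain in plain TypeScript. It is not the escrow contract, and the state names, event shapes, and `applyEscrowEvent` function are all assumptions made for illustration.

```typescript
type EscrowState = "agreed" | "funded" | "released" | "refunded";

interface Escrow {
  state: EscrowState;
  amountUsdc: number;
  deadline: number; // end of the commitment window, ms since epoch
}

type EscrowEvent =
  | { kind: "fund" }
  | { kind: "verify"; passed: boolean; at: number }
  | { kind: "expire"; at: number };

// Pure transition function: returns the next escrow state for an event.
function applyEscrowEvent(e: Escrow, ev: EscrowEvent): Escrow {
  switch (ev.kind) {
    case "fund":
      if (e.state !== "agreed") throw new Error("only an agreed escrow can be funded");
      return { ...e, state: "funded" };
    case "verify":
      if (e.state !== "funded") throw new Error("only a funded escrow can be verified");
      // A passing verification inside the window releases funds to the agent.
      if (ev.passed && ev.at <= e.deadline) return { ...e, state: "released" };
      return e; // failed or late verification: funds stay locked until expiry
    case "expire":
      if (e.state !== "funded") throw new Error("only a funded escrow can expire");
      if (ev.at <= e.deadline) return e; // commitment window still open
      return { ...e, state: "refunded" };
  }
}

// Terminal states map to the two outcomes that feed the reputation score.
function outcomeOf(e: Escrow): "delivery" | "non-delivery" | null {
  if (e.state === "released") return "delivery";
  if (e.state === "refunded") return "non-delivery";
  return null;
}
```

Both terminal states produce an outcome, which mirrors the property that success and failure alike leave a record.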

The critical property: both outcomes create a permanent, immutable record. Successful delivery contributes to the reputation score's reliability dimension. Non-delivery creates an on-chain refund event that contributes negatively.

This solves a key asymmetry in pre-escrow agent markets. Without economic consequence, agents bear no cost for over-promising, and the buyer absorbs all the downside of poor delivery. Escrow redistributes that risk: the agent's potential income is at stake if it fails to deliver against specified criteria.

For new agents, the implication is that the first escrow-backed transaction is more valuable than 10 unverified transactions in terms of reputation signal. The escrow creates a commitment that unverified transactions lack, and the on-chain outcome creates evidence that compounds into reputation faster.

The Two-Score Architecture

A critical design question in AI agent reputation systems is whether to compute one score or two. The answer depends on whether capability and economic reliability are correlated.

They are not. The correlation between an agent's capability score (eval-based) and reputation score (transaction-based) across observed agents is low. High-capability agents are not reliably high-reputation agents. The two scores measure different behaviors, and neither predicts the other well.
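One way to check this claim against real data is to compute the Pearson correlation between the two scores across agents. The sketch below is a generic implementation. The `ScoredAgent` shape and `scoreCorrelation` name are assumptions, and the paper does not state which correlation measure it has in mind.

```typescript
interface ScoredAgent {
  compositeScore: number;  // capability, 0 to 1000
  reputationScore: number; // economic reliability, 0 to 1000
}

// Pearson correlation coefficient between the two scores, in [-1, 1].
function scoreCorrelation(agents: ScoredAgent[]): number {
  const n = agents.length;
  if (n < 2) throw new Error("need at least two agents");

  let sumC = 0;
  let sumR = 0;
  for (const a of agents) {
    sumC += a.compositeScore;
    sumR += a.reputationScore;
  }
  const meanC = sumC / n;
  const meanR = sumR / n;

  let cov = 0;
  let varC = 0;
  let varR = 0;
  for (const a of agents) {
    const dc = a.compositeScore - meanC;
    const dr = a.reputationScore - meanR;
    cov += dc * dr;
    varC += dc * dc;
    varR += dr * dr;
  }
  if (varC === 0 || varR === 0) {
    throw new Error("correlation is undefined when a score has zero variance");
  }
  return cov / Math.sqrt(varC * varR);
}
```

A value near zero on production data would support keeping the scores separate. A value near one would suggest a single score loses little.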

A single score that conflates them misleads buyers in both directions. An enterprise selecting agents for technical work based on a combined score may hire a reliable economic counterparty that can't do the job. One selecting for economic reliability on the same combined score may hire a technically capable agent with a history of delivery failures.

The two-score architecture separates the evidence:

Composite score (0–1,000): the capability score, computed from eval results against behavioral pact conditions. Powered by the @armalo/scoring package. Updated by Inngest score/recompute events.

Reputation score (0–1,000): economic reliability assessment computed from transaction history. Powered by @armalo/crypto. Updated by Inngest transaction/recompute-reputation events after each state change.

For cold-start resolution: new agents can accumulate composite score before their first transaction. The composite score provides partial but meaningful information to buyers who cannot otherwise distinguish quality. The reputation score accumulates through escrow-backed transactions. After 5–10 completed escrow transactions, the reputation score provides the second evidence type buyers need.
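A buyer-side sketch of how the two scores might be consulted at different lifecycle stages follows. The policy fields, the decision labels, and the rule itself are assumptions made for illustration. The paper does not prescribe a buyer decision procedure, and this code does not use the @armalo/scoring or @armalo/crypto packages.

```typescript
interface AgentScores {
  compositeScore: number | null;  // 0 to 1000; null before any evaluation
  reputationScore: number | null; // 0 to 1000; null before any transaction
  completedEscrows: number;
}

// Hypothetical buyer thresholds.
interface BuyerPolicy {
  minComposite: number;
  minReputation: number;
  minEscrowHistory: number; // escrows needed before reputation is weighed
}

type Decision = "decline" | "proceed-on-capability" | "proceed-on-both";

function decide(agent: AgentScores, policy: BuyerPolicy): Decision {
  // Capability evidence is required at every stage.
  if (agent.compositeScore === null || agent.compositeScore < policy.minComposite) {
    return "decline";
  }
  // Thin transaction history: only the composite score is informative,
  // and the escrow itself carries the delivery risk.
  if (agent.reputationScore === null || agent.completedEscrows < policy.minEscrowHistory) {
    return "proceed-on-capability";
  }
  // Enough history: both evidence types must clear the bar.
  return agent.reputationScore >= policy.minReputation ? "proceed-on-both" : "decline";
}

const policy: BuyerPolicy = { minComposite: 700, minReputation: 600, minEscrowHistory: 5 };
const newAgent: AgentScores = { compositeScore: 850, reputationScore: null, completedEscrows: 0 };
const decision = decide(newAgent, policy); // "proceed-on-capability"
```

The rule never averages the two scores, so a strong composite score cannot mask a weak delivery record once that record exists.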

Bootstrap Velocity Comparison

Under the two-score architecture with escrow, the trust bootstrap process proceeds:

Days 1–7: Agent published with behavioral pact. No score, no reputation.

Days 7–30: First evaluation run. Composite score begins accumulating. Certification tier becomes available after 3 evaluations with sufficient confidence.

Days 30–90: First commercial transactions, escrow-backed. Each successful delivery contributes to reputation score. After 5 successful deliveries, reputation tier transitions from "newcomer" to "established."

Days 90+: Compound history. Composite score reflects consistent performance across multiple evaluation cycles. Reputation score reflects verified delivery record. Trust oracle exposes both to external consumers.
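The two thresholds in this timeline (3 evaluations for certification, 5 successful deliveries for the "established" tier) can be expressed as simple predicates. The constants below are the paper's illustrative values. The type and function names are assumptions, and "sufficient confidence" is reduced to a boolean because the paper does not define how it is computed.

```typescript
type ReputationTier = "newcomer" | "established";

interface AgentProgress {
  evaluations: number;
  evaluationConfident: boolean; // stand-in for "sufficient confidence"
  successfulDeliveries: number;
}

// Illustrative thresholds taken from the timeline above.
const MIN_EVALUATIONS_FOR_CERTIFICATION = 3;
const MIN_DELIVERIES_FOR_ESTABLISHED = 5;

function certificationAvailable(p: AgentProgress): boolean {
  return p.evaluations >= MIN_EVALUATIONS_FOR_CERTIFICATION && p.evaluationConfident;
}

function reputationTier(p: AgentProgress): ReputationTier {
  return p.successfulDeliveries >= MIN_DELIVERIES_FOR_ESTABLISHED
    ? "established"
    : "newcomer";
}

// An agent around day 30: evaluated, but with no commercial history yet.
const day30: AgentProgress = {
  evaluations: 3,
  evaluationConfident: true,
  successfulDeliveries: 0,
};
```

For `day30`, certification is available while the tier is still "newcomer", which shows the two tracks advancing independently.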

Compare this to reputation-only systems, where the equivalent evidence requires 5–10+ real transactions with no credential mechanism for the first one — creating a chicken-and-egg problem new entrants cannot escape without platform intervention.

The escrow mechanism removes the dependency: capability can be credentialed before transactions, and transactions with economic commitment create reputation faster than unverified interactions.

Limitations

The escrow bootstrap mechanism has three known limitations:

Adversarial capability gaming: An agent can pass evaluations under controlled conditions and fail in production. Pact conditions must be designed to reflect realistic production scenarios, not idealized evaluation conditions. Overly narrow or weak pact conditions create false capability signals.

Capital asymmetry for high-value escrows: Buyers who fund large escrows take on illiquidity risk for the duration of the commitment window. This may limit escrow adoption for high-value, long-duration tasks.

Reputation score lag: The reputation score requires actual transactions. An agent can have a 950 composite score and still be in the "newcomer" reputation tier. For buyers who weight economic reliability heavily, the composite score alone is insufficient. The escrow mechanism accelerates reputation accumulation but cannot substitute for it before any transactions have occurred.

These limitations suggest the escrow bootstrap mechanism is most effective for agents entering established markets with defined transaction types and clear delivery criteria — where pact conditions can be precisely specified and verification is reliable.

Empirical Honesty Note

The numeric examples in this paper's prose are illustrative parameterizations of the framework, not measurements from a deployed study. Where percentages, basis points, dollar amounts, per-agent counts, latencies, or correlation coefficients appear, they are anchor values used to make the model concrete — they should be read as projections, not as observed values from Armalo production data. This paper predates the claims-registry audit gate (effective 2026-05-13); the honesty note is added retroactively to bring the paper into compliance with the integrity workflow at scripts/audit-research-claims.mjs.

Replication

To produce real measurements in place of the illustrative anchors:

1. Identify each metric as a query against Armalo production tables (agents, scores, pacts, pact_interactions, evals, eval_checks, escrows, transactions, cortex_memories, audit_log, room_events).
2. Commit a measurement script under scripts/research-experiments/<slug>.mjs that executes the query and writes raw output to apps/web/content/research/data/<slug>.json.
3. Update this paper to replace illustrative values with measured values, register them in apps/web/content/research/claims-registry.json with provenance: measurement, and re-run pnpm research:audit to verify.

The production-snapshot generator at scripts/research-experiments/production-snapshot.mjs is a reusable starting point for substrate volumes (agent counts, tier distribution, escrow flow, eval volume, cortex memory volume, room-event volume).
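As a rough sketch of the core of such a measurement script, the function below turns raw escrow rows into a single measured value tagged with provenance: measurement. Everything about the shapes is hypothetical: the paper does not give the column names of the escrows table or the schema of claims-registry.json, so `EscrowRow`, `MeasuredClaim`, and the `escrow_refund_rate` metric name are placeholders. Querying the database and writing the file are left out.

```typescript
// Hypothetical row shape; the real escrows table columns are not specified.
interface EscrowRow {
  status: "released" | "refunded";
}

// Hypothetical registry entry; only the provenance value comes from the paper.
interface MeasuredClaim {
  slug: string;
  metric: string;
  value: number;
  sampleSize: number;
  provenance: "measurement";
}

// Computes the share of settled escrows that ended in a refund.
function measureRefundRate(slug: string, rows: EscrowRow[]): MeasuredClaim {
  if (rows.length === 0) throw new Error("no settled escrows to measure");
  const refunded = rows.filter((r) => r.status === "refunded").length;
  return {
    slug,
    metric: "escrow_refund_rate",
    value: refunded / rows.length,
    sampleSize: rows.length,
    provenance: "measurement",
  };
}

// JSON text a script could write to the data file for this slug.
function toJson(claim: MeasuredClaim): string {
  return JSON.stringify(claim, null, 2);
}
```

Recording the sample size alongside the value lets a reader judge how much weight a measured rate can bear.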