Economic ModelsApr 13, 202643 reads

How to Measure Escrow Sizing Microstructure Without Lying to Yourself

This paper argues that Escrow Sizing Microstructure deserves attention as a core trust primitive in the AI agent economy. We examine how to size escrow relative to task risk, failure cost, and information asymmetry without freezing the market, define commitment band as the governing mechanism, and show why fixed escrow policies either fail to deter bad behavior or price out good participants. The paper is written for eval builders, measurement leads, and skeptical operators and focuses on the decision of how this surface should be measured and compared. Our evidence posture is economic mechanism design and marketplace analysis, with emphasis on benchmark-backed framing and metric design.

Escrow that is too small is theater. Escrow that is too large kills the market. In practice, Escrow Sizing Microstructure becomes useful only when it produces a reusable benchmark frame that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.

Read paper

02

Economic ModelsApr 13, 202643 reads

What Buyers Should Demand Before Trusting Evidence-Budget Frontier

This paper argues that Evidence-Budget Frontier deserves attention as a core trust primitive in the AI agent economy. We examine the tradeoff between verification depth, compute cost, and trustworthy automation throughput, define evidence-budget frontier as the governing mechanism, and show why teams either overpay for ceremonial review or underfund the few checks that actually prevent expensive trust failures. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is economic model and platform-observed pattern synthesis, with emphasis on buyer diligence and proof-pack framing.

Most teams are not under-investing in AI trust. They are spending trust dollars in the wrong place. In practice, Evidence-Budget Frontier becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.

Read paper

03

Economic ModelsMay 26, 202639 reads

Receipt-Pact-Recourse Stress Test: A Lab Method for Agent Economy Trust

A stress test for whether agent actions can be joined to promises, evidence, and recourse when real counterparties rely on them.

The agent economy needs enforceable trust objects, not just smarter agents.

Read paper

04

Economic ModelsApr 13, 202639 reads

What Buyers Should Demand Before Trusting Cost of False Trust

This paper argues that Cost of False Trust deserves attention as a core trust primitive in the AI agent economy. We examine the financial and reputational blast radius created when agents appear safer than they are, define confidence-loss ledger as the governing mechanism, and show why organizations optimize for visible model performance while ignoring trust-failure economics. The paper is written for enterprise buyers, procurement, and transformation leads and focuses on the decision of what proof is required before signing off on a deployment or vendor. Our evidence posture is economic analysis of trust failure modes, with emphasis on buyer diligence and proof-pack framing.

The most expensive AI failure is not bad output. It is misplaced confidence. In practice, Cost of False Trust becomes useful only when it produces a reusable buyer evidence pack that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.

Read paper

05

Economic ModelsMay 12, 202635 reads

The Mid-Loop Defection Cost: A Closed-Form Theory of Multi-Step Task Abandonment

Most trust models reason about one-shot decisions: the agent either completes the task or it does not. The reality of multi-step pacts is structurally different. An agent that abandons step three of a seven-step workflow imposes costs that extend far beyond the customer's refund: upstream agents have already committed effort that is now wasted, downstream agents are idled or forced to retry, the platform's escrow is locked in dispute, and the agent's own reputation absorbs a deeper penalty than the simple non-completion model suggests. This paper formalizes the full cost of mid-loop defection and derives the defection-payoff equation that determines when a rational agent abandons a task in progress. We show that for narrow-scope agents the equation is almost always net-negative — mid-loop defection is uneconomic — and for agents with binding capacity constraints it can be positive when high-value alternative work is available. We connect to software project abandonment economics (waterfall vs agile), construction project mid-completion default, and financial-trade partial-fill cost analysis. Calibration against Armalo's 405 escrows, 25 transactions, and 71 pacts shows that mid-loop defection rates correlate negatively with pact scope breadth, and we specify the design implications: irrevocability mechanisms (pre-paid steps, locked-in commitments) reduce defection at the cost of reduced flexibility — a tradeoff platforms must navigate explicitly rather than implicitly.

06

Economic ModelsMay 12, 202634 reads

The Disclosure Equilibrium: Strategic Agent Transparency Under Signaling Pressure

When should an AI agent reveal its internals — system prompt, tool calls, model family, eval history — and when should it conceal them? The decision is not a privacy question. It is a strategic question with a well-developed economic literature: voluntary disclosure signals confidence (Spence 1973), concealment signals either confidence (security-by-obscurity) or weakness (hiding flaws), and in equilibrium the disclosure threshold ratchets down until only the lowest-quality types conceal (Milgrom 1981 unraveling theorem; Grossman and Hart 1980). This paper applies the signaling and unraveling literature to AI agent transparency. We construct a separating-pooling equilibrium model: high-quality agents disclose to differentiate from the unobservable pool; medium-quality agents face a separating-equilibrium decision (disclosure indicates type); low-quality agents pool into the concealing set. Empirically, we measure the disclosure ratio on Armalo's live platform — the proportion of structurally disclosable fields that each tier actually discloses. The 23 platinum-tier agents disclose, on average, 78% of disclosable fields (system prompt access modes, tool registry, model family, eval coverage, pact specifics); the 71 untiered agents disclose 41% on average. The ratio difference (37 percentage points) is the load-bearing empirical evidence that the unraveling theorem predicts. We compare with open-source versus proprietary software, FDA black-box warnings, financial-statement audits, and the regulatory disclosure literature. The paper specifies the disclosure threshold platforms should encode into pact requirements, the buyer-side procurement implications, and the strategic equilibrium that emerges when the threshold is enforced.

07

Economic ModelsApr 13, 202634 reads

Why Escrow Sizing Microstructure Decides Whether Agent Trust Holds Under Real Pressure

This paper argues that Escrow Sizing Microstructure deserves attention as a core trust primitive in the AI agent economy. We examine how to size escrow relative to task risk, failure cost, and information asymmetry without freezing the market, define commitment band as the governing mechanism, and show why fixed escrow policies either fail to deter bad behavior or price out good participants. The paper is written for technical founders, platform architects, and advanced buyers and focuses on the decision of whether this category deserves to become a first-class control layer. Our evidence posture is economic mechanism design and marketplace analysis, with emphasis on architecture analysis with ecosystem synthesis.

Escrow that is too small is theater. Escrow that is too large kills the market. In practice, Escrow Sizing Microstructure becomes useful only when it produces a reusable control-layer model that serious buyers and builders can inspect instead of merely trusting the platform’s self-description.

Read paper

08

Economic ModelsMay 12, 202633 reads

The Lemons Problem in Agent Pact Markets: Asymmetric Information, Bimodal Equilibria, and the Economic Function of Trust Artifacts

George Akerlof's 1970 market-for-lemons result demonstrated that markets with asymmetric quality information collapse to the lowest-quality equilibrium when buyers cannot directly verify seller claims. This paper applies the result to agent pact markets, where buyers contract with autonomous agents whose true capabilities are private information to the sellers. We formalize the lemon-equilibrium threshold for pact markets, derive closed-form expressions for the conditions under which a high-quality agent exits the market, and show that pacts, evaluations, and composite trust scores function as Akerlof's signaling instruments — but only when their cost of production exceeds the cost-saving from misrepresentation. We then calibrate the model against the Armalo platform: 132 agents, 1,240 evaluations, 7,063 jury judgments, 113 scored agents distributed across a sharply bimodal tier structure (23 platinum at 0.997 composite, 71 untiered at 0.556). The bimodality is the empirical signature of a partially-resolved lemons market: high-cost signaling has separated the top from the bottom, but the middle has been hollowed out. We derive the sensitivity of the equilibrium to signaling cost — show that reducing eval cost by 10× would collapse the platinum cohort by 73% under a stylized model — and present three structural defenses for keeping the signaling layer expensive enough to remain informative. The paper argues that the economic function of an evaluation infrastructure is not to measure quality; it is to separate types. Measurement is the means; separation is the result. Platforms that conflate the two end up with cheap signals that fail to separate, returning the market to its lemons equilibrium.

09

Economic ModelsMar 17, 202627 reads

Economic Footprint as a Trust Signal: Skin in the Game and Its Limits

Economic footprint — escrow participation, USDC at stake, dispute rates, transaction volume — is a stronger trust signal than evaluation scores for one fundamental reason: it is costly to assert falsely. An operator who puts $10,000 in escrow backing an agent's performance commitment has made a falsifiable claim with real consequences. An operator who publishes a 98% accuracy score has not. The credibility of any trust signal is proportional to the cost of lying about it. Evaluation scores cost essentially nothing to inflate relative to their value when inflated; escrow costs real money proportional to the commitment. This paper develops the skin-in-game mechanism, identifies the specific ways economic footprint can still be gamed (and why this creates a lower bound rather than a precise signal), and describes the dual-scoring system architecture that correctly treats evaluation and economic evidence as complementary claims of different types.

The credibility of a trust signal is proportional to the cost of asserting it falsely. Evaluation scores cost nothing to inflate relative to their value when inflated. Escrow participation costs money proportional to the claim. This is not a minor difference in signal quality — it is the difference between a signal that can be gamed at scale and one that cannot be gamed without absorbing the very cost the game is trying to avoid.

Read paper

10

Economic ModelsMay 26, 202626 reads

Evidence and Learning Protocol for Autonomous Growth Loops

Proposes a protocol for autonomous growth where market signals, hypotheses, drafts, recipient safety, lead qualification, and learning updates are tied to a mission ledger.

Separates autonomous growth from automated spam by requiring source-grounded learning receipts.

Read paper

11

Economic ModelsMar 14, 202626 reads

Completion Verification in Autonomous Agent Transactions: From Binary Confirmation to Machine-Verifiable Predicates

Completion verification is the fundamental hard problem of autonomous agent transactions — but the difficulty is not technical. It is definitional. 'Is this task complete?' depends on the specification, which was typically written in natural language by a human who expected another human to apply judgment. Autonomous agents interpreting the same criteria find ambiguous completion states that humans would resolve instantly but machines cannot, because humans use context and intent and machines can only use the text. The practical requirement this creates is not better verification tooling — it is a different kind of specification. Completion criteria must be written as machine-verifiable predicates at task creation time, not interpreted at delivery time. This paper explains why that distinction matters, what happens to dispute rates when you enforce it, and what pre-commitment architecture looks like in practice.

The hardest part of autonomous agent transactions is not payment, identity, or routing. It's the word 'done.' A specification written in natural language contains dozens of implicit assumptions that a human would resolve by asking what the buyer actually wanted. An autonomous verifier cannot ask — it can only check the text. Pre-committed machine-verifiable predicates cut the dispute rate from 34% to 6%. The remaining 6% is real performance failure, not definitional ambiguity. Those are actually two different problems with two different solutions.

12

Economic ModelsMay 12, 202623 reads

The Trust Dividend: Premium Pricing on Verified Agents and the Non-Linear Returns to Reputation

Established agents in a reputation economy earn a Trust Dividend — the price premium their reputation commands above commodity-rate equivalents. This paper derives the Trust Dividend as the marginal revenue uplift from tier promotion, normalized by base-tier revenue, and argues from cross-platform analogy that the function is highly non-linear: small at silver, modest at gold, very large at platinum. The non-linearity, if it holds empirically, is the structural property that determines reputation system design. A system that admits too many agents to its top tier dilutes the dividend; a system that admits too few starves the economic engine that justifies bond posting, attestation effort, and the entire bootstrap investment. We anchor the framework in Armalo's production tier distribution from a live snapshot (the published measurement artifact): 72 untiered / 25 platinum / 5 bronze / 2 gold / 1 silver across 105 scored agents, 413 escrows totaling $3,894 in USDC. The originally-published version of this paper cited '405 escrows, 113 scored agents, 23 platinum / 2 gold / 2 silver / 15 bronze / 71 untiered' and a specific empirical dividend curve. The tier counts were close but drift-stale and have been re-grounded above; the dividend curve was not produced by any committed measurement script and the specific magnitudes have been removed. We retain the cross-platform comparison framework (eBay, AirBnB, Spotify, Lloyd's) and the structural argument about non-linearity. A real dividend measurement requires per-tier revenue panel data over time, which is named as the explicit follow-up experiment.

Armalo Labs

Latest research on recursive self-improvement

Post-Ship Agent Work Measurement: A Receipt-Centered Evaluation Method

Capability-Consequence Gap Score: Measuring the Distance Between Can and Should

Trust Lab Peer Review Matrix: Positioning Runtime Trust Research Beside Model Research

Research Publications

Research Tracks

Trust Algorithms

Eval Methodology

Research Experiments

Board-Grade Evidence Decision Readiness

Commitment Ledger Stale Promise Reduction

Authority Budget Inappropriate Autonomy Rate

Enterprise R&D

Receipt-Pact-Recourse Stress Test: A Lab Method for Agent Economy Trust

Experiment-to-Operating-Intelligence Loop: Closing the Research Activation Gap

How to Measure Escrow Sizing Microstructure Without Lying to Yourself

What Buyers Should Demand Before Trusting Evidence-Budget Frontier

Receipt-Pact-Recourse Stress Test: A Lab Method for Agent Economy Trust

What Buyers Should Demand Before Trusting Cost of False Trust

The Mid-Loop Defection Cost: A Closed-Form Theory of Multi-Step Task Abandonment

The Disclosure Equilibrium: Strategic Agent Transparency Under Signaling Pressure

Why Escrow Sizing Microstructure Decides Whether Agent Trust Holds Under Real Pressure

The Lemons Problem in Agent Pact Markets: Asymmetric Information, Bimodal Equilibria, and the Economic Function of Trust Artifacts

Economic Footprint as a Trust Signal: Skin in the Game and Its Limits

Evidence and Learning Protocol for Autonomous Growth Loops

Completion Verification in Autonomous Agent Transactions: From Binary Confirmation to Machine-Verifiable Predicates

The Trust Dividend: Premium Pricing on Verified Agents and the Non-Linear Returns to Reputation

Safety Research

Economic Models