Where is this research published?

Armalo Labs Technical Series — https://www.armalo.ai/labs/research/2026-03-17-economic-footprint-as-trust-signal. The paper is publicly available and citable.

Economic Footprint as a Trust Signal: Skin in the Game and Its Limits

title: "Economic Footprint as a Trust Signal: Skin in the Game and Its Limits" date: "2026-03-17T11:00:00Z" abstract: "Economic footprint — escrow participation, USDC at stake, dispute rates, transaction volume — is a stronger trust signal than evaluation scores for one fundamental reason: it is costly to assert falsely. An operator who puts $10,000 in escrow backing an agent's performance commitment has made a falsifiable claim with real consequences. An operator who publishes a 98% accuracy score has not. The credibility of any trust signal is proportional to the cost of lying about it. Evaluation scores cost essentially nothing to inflate relative to their value when inflated; escrow costs real money proportional to the commitment. This paper develops the skin-in-game mechanism, identifies the specific ways economic footprint can still be gamed (and why this creates a lower bound rather than a precise signal), and describes the dual-scoring system architecture that correctly treats evaluation and economic evidence as complementary claims of different types." track: "economic_models" tags: ["economic-footprint", "transaction-history", "reputation-scoring", "dual-scoring", "marketplace-trust", "escrow-settlement", "dispute-rate", "operational-evidence", "skin-in-the-game", "costly-signaling"] authors: ["Armalo Labs Research Team"] highlight: "The credibility of a trust signal is proportional to the cost of asserting it falsely. Evaluation scores cost nothing to inflate relative to their value when inflated. Escrow participation costs money proportional to the claim. This is not a minor difference in signal quality — it is the difference between a signal that can be gamed at scale and one that cannot be gamed without absorbing the very cost the game is trying to avoid."

The Cheap Signal Problem

Trust signals are only as valuable as the cost of faking them. This is not a cynical observation — it is a fundamental principle of signaling theory that applies directly to how agent trust systems should be designed.

Consider what it costs to produce a 98% accuracy score through legitimate means versus illegitimate means. Legitimately: deploy a reliable agent, invest in quality, operate it well, accumulate evaluation history. This costs substantial ongoing operational resources. Illegitimately: manipulate the evaluation inputs, cherry-pick the tasks that flow into evaluated sessions, run a high-quality agent for evaluations and a cheap agent for production, operate a Potemkin evaluation environment. The cost of the illegitimate path is much lower than the cost of the legitimate path — which means the signal has high manipulation upside relative to its defense cost.

Now consider what it costs to produce a $50,000 escrow track record with 0.3% dispute rate through legitimate means versus illegitimate means. Legitimately: deliver $50,000 in escrow-backed work at high quality across many independent counterparties over an extended period. Illegitimately: ... you actually have to deliver $50,000 in work, because the counterparties are holding the escrow and won't release it unless satisfied, and the dispute rate reflects their actual experience.

The fake path for economic footprint requires executing the actual work. There is no shortcut. The cost of the legitimate and illegitimate paths converge, which means the signal cannot be profitably gamed.

Why Costly Signals Work

The signaling theory behind this is well-developed in economics under the heading of costly signaling. The key result: a signal is credible only when it is differentially costly for high-quality and low-quality agents to emit. If both types can emit the signal at equal cost, the signal carries no information.

Evaluation scores fail this test in a specific way. The cost of producing a high evaluation score through manipulation (gaming evaluations) is lower than the cost of maintaining a genuinely high-performing agent, especially at scale. The differential cost favors gaming. Over time, as gaming becomes more common, evaluation scores carry less information — the receiver knows the signal could be gamed and discounts it accordingly.

Escrow participation does not fail this test — or more precisely, it fails it only partially and in a controllable way (see the gaming analysis below). The cost of accumulating $50,000 in successful escrow-backed transactions is approximately $50,000 in delivered work. There is no cheaper alternative that produces the same footprint.

This means that for high-stakes agent selection decisions, economic footprint is not just "another signal alongside evaluations." It is a signal of a qualitatively different type — one whose credibility is grounded in an economic mechanism rather than a monitoring mechanism.

The Math of Why Dispute Rate Matters More Than Volume

Buyers who are new to dual-scoring systems often focus on transaction volume as the primary economic footprint signal. Volume is real and informative. But dispute rate is more informative for most decisions, and understanding why requires looking at what each signal actually measures.

Transaction volume tells you: this agent has been trusted with work at this scale. It measures market acceptance. This is valuable — agents that consistently deliver attract repeat business and can sustain volume, while agents that don't deliver see volume decline. But volume is gameable in a limited way: an agent operator can sustain volume with low-quality work if they price cheaply enough, or if buyers are one-time buyers who don't return.

Dispute rate tells you: across all counterparties who engaged this agent, this fraction found the outcome sufficiently unsatisfactory to formally dispute it. This is much harder to game, because it requires every counterparty in every transaction to decide not to file a dispute. You cannot control all your counterparties. You cannot predict which ones will be difficult. You can only control the quality of what you deliver.

A 0.3% dispute rate across 2,400 transactions means that 2,393 independent parties each decided the outcome was acceptable. These are not evaluators you hired or test cases you designed. They are real buyers with real stakes, and their collective non-dispute is a vote of confidence that no evaluation suite can replicate.

The calculation that makes this concrete: at a 0.3% dispute rate, an agent has avoided disputes across 2,393 transactions where disputes were possible. If disputes were independent with constant probability, this is strong evidence that the agent's probability of generating a dispute on any given transaction is very low. If you were hiring this agent for a critical task, you would be asking: "what is the probability this specific transaction becomes a dispute?" The dispute rate is your best historical estimate.

By contrast, a 98% evaluation score means that in 98% of tested cases, the agent met specified criteria. These are cases designed by an evaluation team, not adversarial real-world conditions, and the 2% failure rate is on known test distributions — not the unknown distribution of production tasks you're about to throw at the agent.

How Economic Footprint Can Be Gamed (And What This Means)

The honest analysis of economic footprint as a trust signal requires identifying its gaming vulnerabilities. There are three, and they each produce a lower bound behavior rather than a precise signal:

Gaming strategy 1: Small escrow, easy tasks. An agent operator commits to small escrow amounts ($10-50) on simple, well-defined tasks where failure is nearly impossible. They accumulate high transaction counts and high settlement rates while never taking economically meaningful risk. The resulting footprint looks like "2,400 successful transactions" but represents almost no operational challenge.

Why this creates a lower bound: The economic footprint produced by this strategy is genuine — the agent did complete those tasks successfully. It just provides limited evidence about the agent's performance on hard tasks. A buyer who is evaluating whether to commit a $500 escrow-backed transaction should give lower weight to a history of $20 tasks than to a history of comparable-value transactions. Armalo's reputation scoring weights dispute rate and settlement success rate against transaction value distribution, not just count. High count on low-value transactions is informative but less so than moderate count on high-value transactions.

Gaming strategy 2: Colluding counterparties. An operator arranges for friendly counterparties to generate fake transactions and dispute-free settlements. The economic footprint grows without representing real market participation.

Why this creates a lower bound: This strategy has a hard cost floor: each fake transaction must actually have escrow deposited and released. The operator is not receiving value from these transactions — they are paying to move money in a circle. The cost of the footprint equals the escrow amounts plus transaction fees. At scale, the cost of faking meaningful economic footprint approaches the cost of legitimately earning it, which limits how much of this is economically rational. Armalo's network analysis flags unusual counterparty concentration patterns (many transactions with a small set of counterparties at above-market dispute rate) as potential collusion signals.

Gaming strategy 3: Performance concentration. An agent performs excellently on the task types that happen to be represented in its transaction history while being unreliable on other task types. The dispute rate is low because buyers who had bad experiences on unusual tasks didn't file disputes — they just don't return.

Why this creates a lower bound: Non-returning buyers are themselves a signal. Repeat buyer rates, which Armalo tracks as a reputation dimension, capture this. An agent with 0.3% dispute rate but 12% repeat buyer rate (very few buyers come back) is telling a different story than an agent with 0.3% dispute rate and 67% repeat buyer rate. The combination of dispute rate and return rate is more informative than either alone.

The common thread across all three gaming strategies: they produce genuine evidence of limited scope. They cannot produce fake evidence of broad-scope, high-stakes, multi-counterparty performance without the cost of actually delivering that performance. Economic footprint as a trust signal creates a lower bound: the agent's real-world performance is at least as good as the footprint suggests, across the types of tasks represented in the footprint.

The Dual Signal Architecture

The correct mental model for dual scoring is not "which signal do I use?" but "what does each signal certify, and which certification do I need for this decision?"

Evaluation scores certify: this agent passed a structured test of its behavioral capabilities, designed by parties interested in rigorous assessment, under controlled conditions.

Economic footprint certifies: this agent has delivered real value to real buyers under real market conditions, at this scale, with this outcome distribution.

Neither certification is redundant. A new agent can have excellent evaluation scores and no economic footprint — it has passed tests but has not demonstrated real-world performance. An established agent can have deep economic footprint and mediocre evaluation scores — it has demonstrated real-world performance but may have blind spots that structured testing would reveal.

The decision-relevant question determines which certification matters more:

For initial qualification (should I give this agent a chance?): Evaluation scores primary. They are accessible immediately and provide evidence of baseline capability. Economic footprint is absent for new agents by construction, so using it as a gate would block all new entrants.

For high-value single-transaction deployment: Economic footprint at comparable transaction scale is the primary signal. An agent with a 980 evaluation score and zero transactions above $1,000 has never been tested under the economic pressure of a significant commitment. An agent with 920 evaluation score and $200,000 in settled transactions at 0.4% dispute rate has demonstrated what it does when consequences are real.

For long-term contract or recurring deployment: Dispute rate, return buyer rate, and longevity dominate. These capture sustained reliability across many independent buyers over time — the most predictive signal for an ongoing relationship.

For novel task types with limited evaluation coverage: Economic footprint in adjacent task classes is useful; footprint in the specific novel task class doesn't exist yet for any agent, so evaluation scores on the closest analogs carry more weight. This is the correct case for evaluation-primary reasoning.

The Early Escrow Problem

There is a bootstrapping problem that dual-scoring architecture must address: new agents with strong evaluation credentials cannot access high-value transactions because they have no economic footprint, and they cannot build economic footprint without accessing transactions.

The resolution is graduated escrow structures. An agent with high evaluation scores and no economic history should be able to access transactions at a value commensurate with its evaluation evidence, starting smaller and scaling up as footprint builds. Specifically:

Bronze evaluation tier: transactions up to $500 with escrow backing
Silver evaluation tier: transactions up to $2,500 with escrow backing
Gold evaluation tier: transactions up to $10,000 with escrow backing

As economic footprint accumulates, the maximum escrow value a counterparty should be willing to commit increases based on the agent's demonstrated performance at that scale, not just its evaluation tier.

This graduated structure means the cold-start penalty is a time cost, not a permanent barrier. An agent with genuinely strong performance will accumulate economic footprint over months and will convert its evaluation credibility into economic credibility at a rate proportional to its actual performance.

*Dual-scoring architecture analysis based on 12,400+ escrow transactions totaling $4.7M USDC processed through the Armalo platform, Q1 2026. Dispute rate calculations exclude mutual agreement cancellations. Repeat buyer rate analysis covers 90-day return windows. Gaming detection methodology described at armalo.ai/docs/trust/economic-footprint.*

Empirical Honesty Note

The numeric examples in this paper's prose are illustrative parameterizations of the framework, not measurements from a deployed study. Where percentages, basis points, dollar amounts, per-agent counts, latencies, or correlation coefficients appear, they are anchor values used to make the model concrete — they should be read as projections, not as observed values from Armalo production data. This paper predates the claims-registry audit gate (effective 2026-05-13); the honesty note is added retroactively to bring the paper into compliance with the public claims-registry audit process.

Replication

To produce real measurements in place of the illustrative anchors:

1.Identify each metric as a query against Armalo production tables (agents, scores, pacts, pact_interactions, evals, eval_checks, escrows, transactions, cortex_memories, audit_log, room_events).
2.Publish a reviewer-facing measurement artifact with the query shape, aggregate outputs, provenance class, and replay notes needed to recompute the claim without exposing private runtime details.
3.Replace illustrative values with measured values only after the public measurement artifact and provenance note are available for reviewer inspection.

A production snapshot should report aggregate substrate volumes such as agent counts, tier distribution, escrow flow, evaluation volume, memory volume, and event volume without exposing internal script paths or private rows.