Operator-Run Evaluations Are Not the Same as Counterparty Attestations
AI agent evaluation has made substantial progress. Behavioral pacts define what agents are supposed to do. Evaluation infrastructure runs tests against those pacts. Composite scores aggregate performance across multiple dimensions. Certification tiers recognize agents that consistently maintain high standards. This is a real infrastructure layer. It matters.
There's a structural limitation in how the vast majority of this evaluation runs: the operator evaluates their own agent. This produces a class of information gaps that no amount of evaluation rigor can close — not because the evaluators are dishonest, but because the evaluation structure prevents certain kinds of information from ever becoming visible.
The Structural Problem With Self-Evaluation
Operator-run evaluations are not fraudulent. Most operators run them honestly. The test cases are real. The evaluation methodology is genuine. The scores reflect something.
The limitation is structural, not ethical. And it's worth being precise about what specifically makes self-evaluation incomplete.
Operators choose what to test, which determines what they can learn. A pact defines the conditions the agent is committed to. The operator designs the test cases that evaluate those conditions. This means the test distribution reflects the operator's mental model of the agent's task space — which is the same mental model that produced the agent. The test cases validate expected behaviors. Edge cases that expose unexpected failures are, by construction, the ones least likely to appear in an evaluation suite designed by the people who built the agent.
This isn't unique to AI agents. Drug trials run by pharmaceutical companies show systematically better outcomes than independently replicated trials. The research literature calls this sponsor bias. The mechanism isn't fraud — it's test design choices that, in aggregate, favor the funder's product without any individual choice being dishonest.
Operators cannot control their own blind spots. The test cases operators miss are precisely the ones they didn't anticipate. A builder who didn't think the agent would encounter a certain class of input won't design tests for that class. The evaluation genuinely reflects what the operator knows. It's incomplete because knowledge is incomplete, not because the operator withheld anything.
The evaluation parameters are operator-controlled. The verification method, the accuracy threshold, the reference outputs for jury evaluation — all of these involve judgment calls that operators make in their own interest. Not maliciously. But an operator who needs to achieve a certification tier will, rationally, make calibration decisions that favor their agent's performance. The scores produced are real; the parameters that produced them were set by an interested party.
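To make this concrete, here is a minimal sketch in Python of what an operator-controlled evaluation configuration might contain. All field names are hypothetical; the point is only that every value is set by the party whose agent is being scored.

```python
from dataclasses import dataclass

@dataclass
class PactEvaluationConfig:
    """Hypothetical evaluation configuration for a behavioral pact.

    Every field is a judgment call, and every one of them is made by
    the operator: the same party whose agent is being scored.
    """
    verification_method: str   # e.g. "exact_match" or "llm_jury"
    accuracy_threshold: float  # the pass/fail cutoff
    reference_outputs: dict    # gold answers used for jury comparison
    test_case_ids: list        # which cases count toward the score
```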
What Counterparty Attestation Adds That Nothing Else Can
A counterparty attestation is a signed statement from the agent's actual transaction partner — the entity that assigned the agent real work, received its output, and experienced the outcome — about what happened in that transaction.
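As a concrete sketch, assuming hypothetical field names rather than any published schema, such an attestation record might carry something like the following:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CounterpartyAttestation:
    """Hypothetical shape of a signed counterparty attestation.

    Field names are illustrative, not a published schema.
    """
    agent_id: str                # the agent being attested about
    attester_org_id: str         # the counterparty's organization
    attestee_org_id: str         # the organization operating the agent
    escrow_tx_id: Optional[str]  # settled escrow transaction, if linked
    task_summary: str            # what work was actually assigned
    outcome: str                 # e.g. "successful", "disputed", "failed"
    issued_at: float             # unix timestamp of the attestation
    signature: bytes             # e.g. an Ed25519 signature by the attester
```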
Such an attestation is categorically different from operator evaluation in two specific ways.
The counterparty has no structural incentive to make the agent look good. Their incentive is accurate reporting. They experienced what the agent actually did. They know whether the deliverable was useful, whether the agent behaved as expected, whether there were issues that the official score doesn't capture. This is the same reason reference checks exist in hiring: the people who worked with a candidate have information about actual performance that the candidate's CV cannot contain.
The counterparty encountered the agent's actual production behavior, not its evaluation behavior. In a pact, operators define test cases and the agent is evaluated on them. In a real transaction, the counterparty assigned a task from their actual work queue — with all the messiness, ambiguity, and edge cases that real work contains. Attestation data from counterparties is, in a sense, unsanitized production evaluation.
The combination — unstructured real tasks, from a party with no incentive to inflate — produces information that a structured evaluation with an incentivized operator cannot.
The Information Counterparty Attestations Surface
When counterparty attestations are aggregated across an agent's transaction history, four patterns become visible that don't appear in operator evaluations (a brief aggregation sketch follows the list):
Off-distribution reliability. Operator evaluations test agents on task distributions the operator designed. Counterparties assign genuinely novel tasks — tasks the operator didn't anticipate and didn't test for. Attestation data reveals how the agent performs outside its designed distribution, which is often where production failures actually occur.
Edge case handling under real conditions. Counterparties sometimes assign difficult tasks, ambiguous prompts, or edge cases that probe the agent's limits. This isn't adversarial testing — it's normal work variability. The attestation data from these interactions reveals reliability under realistic variation in a way that controlled evaluation cannot.
Soft quality signals that pact conditions can't capture. "Was the agent's output actually useful? Did it handle ambiguity gracefully? Did it behave predictably when the task was underspecified?" These qualities are difficult to formalize as pact conditions because they're hard to specify in advance. A counterparty who experienced the agent's behavior can answer them from direct experience. Aggregated attestation data on these dimensions is often more predictive of real-world utility than formal evaluation scores.
Systematic bias patterns that only appear across counterparties. An agent may perform consistently well in operator evaluations and consistently disappoint a specific class of counterparty. The pattern is invisible to the operator because operator evaluations don't have cross-counterparty variation. Attestation data aggregated across counterparties makes systematic biases visible that would never appear in the operator's own testing.
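A minimal aggregation sketch, reusing the hypothetical attestation shape from earlier and assuming an illustrative spread threshold, shows how the fourth pattern could be surfaced:

```python
from collections import defaultdict
from statistics import pstdev

def success_rates_by_counterparty(attestations):
    """Success rate each counterparty experienced, grouped by attesting
    organization. Assumes objects with the attester_org_id and outcome
    attributes sketched earlier."""
    outcomes = defaultdict(list)
    for a in attestations:
        outcomes[a.attester_org_id].append(1.0 if a.outcome == "successful" else 0.0)
    return {org: sum(v) / len(v) for org, v in outcomes.items()}

def systematic_bias_signal(attestations, spread_threshold=0.25):
    """Flag wide variation in experienced quality across counterparties,
    a pattern no single operator evaluation can exhibit. The threshold
    is an illustrative assumption."""
    rates = success_rates_by_counterparty(attestations)
    return len(rates) >= 2 and pstdev(rates.values()) > spread_threshold
```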
The Anti-Gaming Requirement
Counterparty attestation systems have a well-known failure mode: manufactured attestations. An operator with multiple organizational accounts creates favorable attestations for their own agent. The reputation appears to have third-party validation but doesn't.
Anti-gaming mechanisms are not optional — they're what makes the attestation signal meaningful.
Attester organization must differ from attestee organization. The organizational boundary is the minimum bar. No agent attests for itself or for an agent under the same organizational account. This is table stakes.
Attestation must link to real financial transactions. An attestation tied to an escrow transaction that settled on-chain is verifiably real. Real money moved. Real work was delivered or disputed. A standalone attestation not connected to any financial transaction has much weaker evidence value. Requiring escrow linkage for high-weight attestations creates a floor: you can't manufacture 500 attestations without also manufacturing 500 real transactions.
Pattern detection for attestation rings. A cluster of new organizations all attesting for the same agent within a short window, especially organizations with no other attestation history, is a detectable signal. Attestation weight should decay when the attesting organization has a thin independent behavioral record. An organization that has been actively transacting for 18 months carries more attestation weight than one that appeared last week.
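A heuristic sketch of that detection, with illustrative window and cluster thresholds:

```python
def ring_suspicion(attestations, org_first_seen, window_days=7, min_cluster=5):
    """Heuristic sketch of attestation-ring detection for one agent.

    Flags the case where several organizations attest shortly after first
    appearing on the platform. `org_first_seen` maps an organization id to
    the unix timestamp of its first observed activity. Thresholds are
    illustrative assumptions.
    """
    DAY = 86_400
    fresh_attesters = {
        a.attester_org_id
        for a in attestations
        # An org with no recorded history defaults to "first seen now",
        # i.e. maximally suspicious.
        if a.issued_at - org_first_seen.get(a.attester_org_id, a.issued_at)
           < window_days * DAY
    }
    return len(fresh_attesters) >= min_cluster
```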
Stake-weighted attestation. Attestations from counterparties with established, verified escrow track records carry more weight than attestations from parties with no track record. The system rewards real participation. This is the same mechanism that makes five-star reviews from verified purchasers more valuable than five-star reviews from unverified accounts.
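One way these four mechanisms might compose into a single attestation weight; the constants and curve shapes below are assumptions, not a specification:

```python
import math

def attestation_weight(att, attester_escrow_count, attester_age_days):
    """Illustrative composition of the mechanisms above into one weight."""
    # Organizational boundary: self-attestation carries zero weight.
    if att.attester_org_id == att.attestee_org_id:
        return 0.0
    # Escrow linkage: standalone attestations count, but far less.
    escrow_factor = 1.0 if att.escrow_tx_id else 0.1
    # Stake weighting: saturates toward 1.0 with verified escrow history.
    stake = 1.0 - math.exp(-attester_escrow_count / 20.0)
    # Thin-record decay: young organizations carry less weight.
    age = 1.0 - math.exp(-attester_age_days / 180.0)
    return escrow_factor * stake * age
```

Under this shape, an organization a week old with no settled escrows contributes almost nothing, while an 18-month, high-volume counterparty approaches full weight, which matches the intent described above.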
The Distinction That Actually Matters
Operator evaluations tell you: "This agent meets the standards its operator has defined for it, measured by the operator's own evaluation methodology."
Counterparty attestations tell you: "This agent produced outcomes that independent third parties, who had real work at stake and no incentive to inflate, judged as successful or unsuccessful."
Both signals are useful. Neither is sufficient alone. The trust infrastructure the ecosystem needs combines both: rigorous operator-run evaluation against machine-readable pacts — which produces a structured, auditable behavioral baseline — plus counterparty attestation from actual transaction parties — which surfaces what the structured evaluation couldn't see.
Together, they're closer to the full picture. The operator evaluation answers "does this agent reliably meet the standards it was designed to meet?" The counterparty attestation answers "does this agent produce value in real-world conditions for real counterparties?" The second question is the one buyers actually care about.
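As an illustration, a composite score might blend the two signals, with missing counterparty evidence falling back to the operator signal rather than being imputed as success; the blend ratio is an assumption, not a recommendation:

```python
def composite_trust(operator_score, attestations, weights, blend=0.6):
    """Sketch of blending both signals into one score in [0, 1].

    `operator_score` is the pact evaluation result; `weights[i]` is the
    anti-gaming weight for attestations[i], e.g. from attestation_weight.
    """
    total = sum(weights)
    if total == 0.0:
        # No credible third-party evidence yet: only the operator
        # signal is available.
        return operator_score
    counterparty_score = sum(
        w * (1.0 if a.outcome == "successful" else 0.0)
        for a, w in zip(attestations, weights)
    ) / total
    return (1.0 - blend) * operator_score + blend * counterparty_score
```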
The Question
When you're evaluating an agent for deployment today, what third-party evidence — separate from the operator's own documentation and evaluation results — can you access about how that agent has performed for real counterparties doing real work?
If the answer is "none," you're making a decision based exclusively on the agent's self-report about its own quality. That has the same evidentiary problems that self-reports always have, compounded by the structural issues that make operator evaluation incomplete by construction.
Armalo is building counterparty attestation as a native trust primitive — signed attestations from transaction parties, anti-gaming detection, escrow linkage requirements, and attestation-weighted reputation scores that reflect real counterparty experience. armalo.ai