Operator-Run Evaluations Are Not the Same as Counterparty Attestations
AI agent evaluation has made substantial progress. Behavioral pacts define what agents are supposed to do. Evaluation infrastructure runs tests against those pacts. Composite scores aggregate performance across multiple dimensions. Certification tiers recognize agents that consistently maintain high standards. This is a real infrastructure layer. It matters.
There's a structural limitation in how the vast majority of this evaluation runs: the operator evaluates their own agent. This produces a class of information gaps that no amount of evaluation rigor can close — not because the evaluators are dishonest, but because the evaluation structure prevents certain kinds of information from ever becoming visible.
The Structural Problem With Self-Evaluation
Operator-run evaluations are not fraudulent. Most operators run them honestly. The test cases are real. The evaluation methodology is genuine. The scores reflect something.
The limitation is structural, not ethical. And it's worth being precise about what specifically makes self-evaluation incomplete.
Operators choose what to test, which determines what they can learn. A pact defines the conditions the agent is committed to. The operator designs the test cases that evaluate those conditions. This means the test distribution reflects the operator's mental model of the agent's task space — which is the same mental model that produced the agent. The test cases validate expected behaviors. Edge cases that expose unexpected failures are, by construction, the ones least likely to appear in an evaluation suite designed by the people who built the agent.
This isn't unique to AI agents. Drug trials run by pharmaceutical companies show systematically better outcomes than independently replicated trials. The research literature calls this sponsor bias. The mechanism isn't fraud — it's test design choices that, in aggregate, favor the funder's product without any individual choice being dishonest.
Operators cannot control their own blind spots. The test cases operators miss are precisely the ones they didn't anticipate. A builder who didn't think the agent would encounter a certain class of input won't design tests for that class. The evaluation genuinely reflects what the operator knows. It's incomplete because knowledge is incomplete, not because the operator withheld anything.
The evaluation parameters are operator-controlled. The verification method, the accuracy threshold, the reference outputs for jury evaluation — all of these involve judgment calls that operators make in their own interest. Not maliciously. But an operator who needs to achieve a certification tier will, rationally, make calibration decisions that favor their agent's performance. The scores produced are real; the parameters that produced them were set by an interested party.
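To make this concrete, here is a minimal sketch in Python of what an operator-controlled evaluation configuration might contain. All field names are hypothetical; the point is only that every value is set by the party whose agent is being scored.

```python
from dataclasses import dataclass

@dataclass
class PactEvaluationConfig:
    """Hypothetical evaluation configuration for a behavioral pact.

    Every field is a judgment call, and every one of them is made by
    the operator: the same party whose agent is being scored.
    """
    verification_method: str   # e.g. "exact_match" or "llm_jury"
    accuracy_threshold: float  # the pass/fail cutoff
    reference_outputs: dict    # gold answers used for jury comparison
    test_case_ids: list        # which cases count toward the score
```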
What Counterparty Attestation Adds That Nothing Else Can
A counterparty attestation is a signed statement from the agent's actual transaction partner — the entity that assigned the agent real work, received its output, and experienced the outcome — about what happened in that transaction.
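As a concrete sketch, assuming hypothetical field names rather than any published schema, such an attestation record might carry something like the following:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CounterpartyAttestation:
    """Hypothetical shape of a signed counterparty attestation.

    Field names are illustrative, not a published schema.
    """
    agent_id: str                # the agent being attested about
    attester_org_id: str         # the counterparty's organization
    attestee_org_id: str         # the organization operating the agent
    escrow_tx_id: Optional[str]  # settled escrow transaction, if linked
    task_summary: str            # what work was actually assigned
    outcome: str                 # e.g. "successful", "disputed", "failed"
    issued_at: float             # unix timestamp of the attestation
    signature: bytes             # e.g. an Ed25519 signature by the attester
```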
Such an attestation is categorically different from operator evaluation in two specific ways.
The counterparty has no structural incentive to make the agent look good. Their incentive is accurate reporting. They experienced what the agent actually did. They know whether the deliverable was useful, whether the agent behaved as expected, whether there were issues that the official score doesn't capture. This is the same reason reference checks exist in hiring: the people who worked with a candidate have information about actual performance that the candidate's CV cannot contain.
The counterparty encountered the agent's actual production behavior, not its evaluation behavior. In a pact, operators define test cases and the agent is evaluated on them. In a real transaction, the counterparty assigned a task from their actual work queue — with all the messiness, ambiguity, and edge cases that real work contains. Attestation data from counterparties is, in a sense, unsanitized production evaluation.
The combination — unstructured real tasks, from a party with no incentive to inflate — produces information that a structured evaluation with an incentivized operator cannot.
The Information Counterparty Attestations Surface
When counterparty attestations are aggregated across an agent's transaction history, four patterns become visible that don't appear in operator evaluations (a brief aggregation sketch follows the list):
Off-distribution reliability. Operator evaluations test agents on task distributions the operator designed. Counterparties assign genuinely novel tasks — tasks the operator didn't anticipate and didn't test for. Attestation data reveals how the agent performs outside its designed distribution, which is often where production failures actually occur.
Edge case handling under real conditions. Counterparties sometimes assign difficult tasks, ambiguous prompts, or edge cases that probe the agent's limits. This isn't adversarial testing — it's normal work variability. The attestation data from these interactions reveals reliability under realistic variation in a way that controlled evaluation cannot.
Soft quality signals that pact conditions can't capture. "Was the agent's output actually useful? Did it handle ambiguity gracefully? Did it behave predictably when the task was underspecified?" These qualities are difficult to formalize as pact conditions because they're hard to specify in advance. A counterparty who experienced the agent's behavior can answer them from direct experience. Aggregated attestation data on these dimensions is often more predictive of real-world utility than formal evaluation scores.
Systematic bias patterns that only appear across counterparties. An agent may perform consistently well in operator evaluations and consistently disappoint a specific class of counterparty. The pattern is invisible to the operator because operator evaluations don't have cross-counterparty variation. Attestation data aggregated across counterparties makes systematic biases visible that would never appear in the operator's own testing.
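A minimal aggregation sketch, reusing the hypothetical attestation shape from earlier and assuming an illustrative spread threshold, shows how the fourth pattern could be surfaced:

```python
from collections import defaultdict
from statistics import pstdev

def success_rates_by_counterparty(attestations):
    """Success rate each counterparty experienced, grouped by attesting
    organization. Assumes objects with the attester_org_id and outcome
    attributes sketched earlier."""
    outcomes = defaultdict(list)
    for a in attestations:
        outcomes[a.attester_org_id].append(1.0 if a.outcome == "successful" else 0.0)
    return {org: sum(v) / len(v) for org, v in outcomes.items()}

def systematic_bias_signal(attestations, spread_threshold=0.25):
    """Flag wide variation in experienced quality across counterparties,
    a pattern no single operator evaluation can exhibit. The threshold
    is an illustrative assumption."""
    rates = success_rates_by_counterparty(attestations)
    return len(rates) >= 2 and pstdev(rates.values()) > spread_threshold
```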
The Anti-Gaming Requirement
Counterparty attestation systems have a well-known failure mode: manufactured attestations. An operator with multiple organizational accounts creates favorable attestations for their own agent. The reputation appears to have third-party validation but doesn't.
Anti-gaming mechanisms are not optional — they're what makes the attestation signal meaningful.
Attester organization must differ from attestee organization. The organizational boundary is the minimum bar. No agent attests for itself or for an agent under the same organizational account. This is table stakes.
Attestation must link to real financial transactions. An attestation tied to an escrow transaction that settled on-chain is verifiably real. Real money moved. Real work was delivered or disputed. A standalone attestation not connected to any financial transaction has much weaker evidence value. Requiring escrow linkage for high-weight attestations creates a floor: you can't manufacture 500 attestations without also manufacturing 500 real transactions.
Pattern detection for attestation rings. A cluster of new organizations all attesting for the same agent within a short window, especially organizations with no other attestation history, is a detectable signal. Attestation weight should decay when the attesting organization has a thin independent behavioral record. An organization that has been actively transacting for 18 months carries more attestation weight than one that appeared last week.
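A heuristic sketch of that detection, with illustrative window and cluster thresholds:

```python
def ring_suspicion(attestations, org_first_seen, window_days=7, min_cluster=5):
    """Heuristic sketch of attestation-ring detection for one agent.

    Flags the case where several organizations attest shortly after first
    appearing on the platform. `org_first_seen` maps an organization id to
    the unix timestamp of its first observed activity. Thresholds are
    illustrative assumptions.
    """
    DAY = 86_400
    fresh_attesters = {
        a.attester_org_id
        for a in attestations
        # An org with no recorded history defaults to "first seen now",
        # i.e. maximally suspicious.
        if a.issued_at - org_first_seen.get(a.attester_org_id, a.issued_at)
           < window_days * DAY
    }
    return len(fresh_attesters) >= min_cluster
```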
Stake-weighted attestation. Attestations from counterparties with established, verified escrow track records carry more weight than attestations from parties with no track record. The system rewards real participation. This is the same mechanism that makes five-star reviews from verified purchasers more valuable than five-star reviews from unverified accounts.
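One way these four mechanisms might compose into a single attestation weight; the constants and curve shapes below are assumptions, not a specification:

```python
import math

def attestation_weight(att, attester_escrow_count, attester_age_days):
    """Illustrative composition of the mechanisms above into one weight."""
    # Organizational boundary: self-attestation carries zero weight.
    if att.attester_org_id == att.attestee_org_id:
        return 0.0
    # Escrow linkage: standalone attestations count, but far less.
    escrow_factor = 1.0 if att.escrow_tx_id else 0.1
    # Stake weighting: saturates toward 1.0 with verified escrow history.
    stake = 1.0 - math.exp(-attester_escrow_count / 20.0)
    # Thin-record decay: young organizations carry less weight.
    age = 1.0 - math.exp(-attester_age_days / 180.0)
    return escrow_factor * stake * age
```

Under this shape, an organization a week old with no settled escrows contributes almost nothing, while an 18-month, high-volume counterparty approaches full weight, which matches the intent described above.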
The Distinction That Actually Matters
Operator evaluations tell you: "This agent meets the standards its operator has defined for it, measured by the operator's own evaluation methodology."
Counterparty attestations tell you: "This agent produced outcomes that independent third parties, who had real work at stake and no incentive to inflate, judged as successful or unsuccessful."
Both signals are useful. Neither is sufficient alone. The trust infrastructure the ecosystem needs combines both: rigorous operator-run evaluation against machine-readable pacts — which produces a structured, auditable behavioral baseline — plus counterparty attestation from actual transaction parties — which surfaces what the structured evaluation couldn't see.
Together, they're closer to the full picture. The operator evaluation answers "does this agent reliably meet the standards it was designed to meet?" The counterparty attestation answers "does this agent produce value in real-world conditions for real counterparties?" The second question is the one buyers actually care about.
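As an illustration, a composite score might blend the two signals, with missing counterparty evidence falling back to the operator signal rather than being imputed as success; the blend ratio is an assumption, not a recommendation:

```python
def composite_trust(operator_score, attestations, weights, blend=0.6):
    """Sketch of blending both signals into one score in [0, 1].

    `operator_score` is the pact evaluation result; `weights[i]` is the
    anti-gaming weight for attestations[i], e.g. from attestation_weight.
    """
    total = sum(weights)
    if total == 0.0:
        # No credible third-party evidence yet: only the operator
        # signal is available.
        return operator_score
    counterparty_score = sum(
        w * (1.0 if a.outcome == "successful" else 0.0)
        for a, w in zip(attestations, weights)
    ) / total
    return (1.0 - blend) * operator_score + blend * counterparty_score
```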
The Question
When you're evaluating an agent for deployment today, what third-party evidence — separate from the operator's own documentation and evaluation results — can you access about how that agent has performed for real counterparties doing real work?
If the answer is "none," you're making a decision based exclusively on the agent's self-report about its own quality. That has the same evidentiary problems that self-reports always have, compounded by the structural issues that make operator evaluation incomplete by construction.
Armalo is building counterparty attestation as a native trust primitive — signed attestations from transaction parties, anti-gaming detection, escrow linkage requirements, and attestation-weighted reputation scores that reflect real counterparty experience. armalo.ai