An AgentCard arrives with impressive fields: 99.2% accuracy, sub-200ms latency, handles financial data, no known vulnerabilities. The format is clean. The numbers look good.
One question: who verified them?
The answer, in almost every case, is the agent itself — or the operator who built it. The AgentCard format does not include a signature from a third party who ran the evals, captured the results, and attested to the numbers independently.
This is not a flaw in the A2A spec. It is a deliberate scope boundary. AgentCards are discovery metadata. They are not behavioral certification. The spec was never meant to solve this problem.
But someone has to.
TL;DR
- AgentCards are capability claims, not performance evidence. The format encodes what an agent says it can do, written by the agent or its operator.
- Self-reported accuracy numbers carry no evidentiary weight. Without third-party verification, a 99% accuracy claim is marketing copy.
- Behavioral track records require three things: a pact (what was promised), an eval (third-party verification), and a score (composite view across time).
- The difference matters most at the tail. An agent can look excellent on clean inputs and fail badly on adversarial ones. AgentCards have no mechanism to represent tail behavior.
- Score-gated delegation is the practical answer. Query a trust score before delegating. If the agent has no verified record, treat it as Bronze-equivalent — with appropriate task scope limits.
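The pact/eval/score triple above can be sketched as plain data types. These shapes are illustrative assumptions for this article, not a published Armalo or A2A schema:

```typescript
// Illustrative shapes for the three primitives; field names are
// assumptions, not a published Armalo or A2A schema.

// A pact: what the agent promised, in checkable terms.
interface Pact {
  agentId: string;
  promisedAccuracy: number;   // e.g. 0.99
  maxLatencyMs: number;       // declared latency ceiling
  declaredScopes: string[];   // task types the agent claims to handle
}

// An eval: one third-party verification run against that pact.
interface EvalResult {
  pactAgentId: string;
  passed: boolean;
  measuredAccuracy: number;
  p95LatencyMs: number;
  adversarial: boolean;       // was this a red-team run?
  verifierSignature: string;  // attestation from the eval runner
}

// A score: the composite view across many evals over time.
interface TrustScore {
  agentId: string;
  compositeScore: number;     // 0-1000
  tier: 'bronze' | 'silver' | 'gold' | 'platinum';
  evalCount: number;
  passRate: number;
}
```

The key structural point: the pact is written by the operator, but the eval and the score are written by someone else.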
What an AgentCard Actually Contains
The A2A AgentCard format is well-designed for its purpose. It encodes:
- Agent name, description, and version
- Declared capabilities and supported task types
- Authentication requirements and endpoints
- Operational constraints (rate limits, max task size)
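For concreteness, a card covering those fields might look like the following. This is a simplified sketch keyed to the bullets above, not the exact field names from the A2A spec:

```typescript
// Simplified sketch of an AgentCard-style document; fields mirror the
// bullets above, not the verbatim A2A spec schema.
const agentCard = {
  name: 'fin-analyzer',
  description: 'Analyzes financial filings and answers questions about them',
  version: '2.1.0',
  capabilities: ['document-qa', 'summarization'],
  authentication: { schemes: ['bearer'] },
  constraints: { rateLimitPerMin: 60, maxTaskBytes: 1_000_000 },
  // Note what is absent: no third-party signature, no eval history,
  // no violation log. Every field here is self-reported.
};
```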
What it does not encode — by design:
- Third-party verification of any declared capability
- Adversarial eval history or red-team results
- Historical pass rate across a representative task distribution
- Behavioral violation log (scope breaches, refused tasks, fabricated outputs)
- Financial accountability for commitment failures
An AgentCard is a resume. Strong resumes are useful. They are not a substitute for a background check.
Why Self-Report Is Not Enough
The problem with self-reported accuracy is not that operators are dishonest. Most are not. The problem is that self-report is structurally optimistic in ways that matter:
Selection bias in test sets. Operators measure accuracy on the inputs they anticipated. Production traffic includes inputs they did not anticipate — edge cases, ambiguous queries, malformed data, adversarial prompts. Self-reported accuracy numbers systematically underweight these.
Static measurement, dynamic behavior. A model update, a new tool integration, a change in the system prompt — any of these can shift accuracy meaningfully. An AgentCard number reflects a point-in-time measurement that may be months stale.
No adversarial eval. Whether an agent is susceptible to prompt injection, scope extension attacks, or output fabrication is not something a standard accuracy benchmark surfaces. These require adversarial evaluation. Almost no operator runs this before writing an AgentCard.
No consequence for inflation. If an agent advertises 99% accuracy and delivers 82%, there is no scoring mechanism, no certification impact, no financial consequence. The AgentCard can simply be updated or left as-is.
What a Behavioral Track Record Looks Like
A behavioral track record answers questions self-report cannot:
| Signal | AgentCard Claim | Behavioral Track Record |
|---|---|---|
| Accuracy | Self-reported % | Third-party eval pass rate, N runs |
| Latency | Declared max | P50/P95/P99 from live evals |
| Safety | "Complies with guidelines" | Adversarial safety eval results, signed |
| Scope | Listed capabilities | Scope breach rate from red-team evals |
| Reliability | Not specified | Uptime + consistent-output rate over 90 days |
| Certification | None | Bronze/Silver/Gold/Platinum from verified record |
The distinction is not academic. An orchestrator delegating a financial task to an agent should know whether the agent has a 94% pass rate across 3,400 third-party evals or a 99% self-reported rate across 40 internal tests. These are different agents.
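One way to see why eval count matters, even taking the internal numbers at face value: a Wilson score lower confidence bound on the pass rate tightens as N grows. This is a standard statistics sketch, not part of any A2A or Armalo API:

```typescript
// Wilson score lower bound at ~95% confidence (z = 1.96).
// Larger N -> tighter bound -> more evidentiary weight.
function wilsonLowerBound(passes: number, runs: number, z = 1.96): number {
  if (runs === 0) return 0;
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = p + z2 / (2 * runs);
  const margin = z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));
  return (center - margin) / denom;
}

// 94% pass rate over 3,400 third-party evals vs a perfect 40/40 internally:
const verified = wilsonLowerBound(3196, 3400); // ~0.93
const internal = wilsonLowerBound(40, 40);     // ~0.91
```

Even a flawless 40-run internal suite supports a weaker lower bound than a 94% rate over 3,400 independent runs — and that is before accounting for the selection bias in the internal test set.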
The Practical Pattern: Score Before You Delegate
The pattern is simple: before delegating any consequential task to an external agent — especially one discovered via A2A — query its trust score.
A trust score from a third-party scoring system gives you:
- A composite 0-1000 number derived from behavioral history, not claims
- A certification tier (Bronze/Silver/Gold/Platinum) based on verified eval count and pass rate
- Dimensional breakdowns: accuracy, safety, reliability, latency, scope-honesty
- A security posture score from adversarial evals
```typescript
// Before delegating via A2A
const trust = await armalo.getTrustAttestation(agentId);

if (trust.compositeScore < 700) {
  // Treat as untrusted — scope the task accordingly or decline
  throw new Error(`Agent ${agentId} score ${trust.compositeScore} below threshold`);
}

if (trust.certificationTier === 'bronze' && task.sensitivity === 'high') {
  throw new Error('Bronze-tier agent not authorized for high-sensitivity tasks');
}

await delegate(agentEndpoint, task);
```
An agent with no behavioral track record gets treated as Bronze by default — capable of low-stakes tasks, not trusted with consequential ones. That is not punitive. It is the same standard you would apply to a new contractor with no verifiable references.
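The Bronze-by-default policy can be expressed as a small lookup. The tier names follow the article; the sensitivity ladder is an illustrative assumption:

```typescript
// Maximum task sensitivity each certification tier is trusted with.
// Tier names follow the article; the sensitivity levels are illustrative.
type Tier = 'bronze' | 'silver' | 'gold' | 'platinum';
type Sensitivity = 'low' | 'medium' | 'high' | 'critical';

const sensitivityRank: Record<Sensitivity, number> = {
  low: 0, medium: 1, high: 2, critical: 3,
};

const maxSensitivityForTier: Record<Tier, Sensitivity> = {
  bronze: 'low',       // no verified record -> low-stakes tasks only
  silver: 'medium',
  gold: 'high',
  platinum: 'critical',
};

// An agent with no track record at all is treated as Bronze.
function allowedToDelegate(tier: Tier | undefined, task: Sensitivity): boolean {
  const effective = tier ?? 'bronze';
  return sensitivityRank[task] <= sensitivityRank[maxSensitivityForTier[effective]];
}
```

The `undefined` branch is the point: absence of a track record is itself a signal, and the policy encodes it explicitly instead of defaulting to trust.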
Why This Matters for A2A Adoption
A2A is gaining adoption fast. The behavioral verification gap will become visible at the same pace. The first high-profile incident involving an authenticated-but-untrustworthy agent delegated via A2A will generate a lot of attention.
The teams who built behavioral verification in from the start — who queried trust scores before delegation, who required verified eval history for high-sensitivity tasks — will be the ones who did not have an incident.
The teams who trusted the AgentCard will be the case study.
Building on A2A? The behavioral layer goes above the protocol, not inside it. The primitives are at armalo.ai.
Frequently Asked Questions
What is an A2A AgentCard?
An AgentCard is a standardized metadata document in Google's A2A protocol that describes an agent's capabilities, supported task types, authentication requirements, and operational constraints. It is self-reported by the agent operator and is not verified by any third party.
Why aren't AgentCard accuracy claims reliable?
AgentCard fields are written by the agent or its operator. There is no third-party verification mechanism in the A2A spec, no adversarial eval requirement, and no scoring consequence for inaccurate claims. Self-reported numbers are systematically optimistic due to selection bias in test sets and static measurement of dynamic systems.
What is a behavioral track record?
A behavioral track record is a verifiable history of an agent's performance across third-party evaluations — accuracy pass rate, safety eval results, latency measurements, scope breach rate — signed and published by an independent scoring authority, not the agent's operator.
How do I require behavioral verification before A2A delegation?
Query a trust score before delegating. Systems like Armalo provide a composite 0-1000 score derived from verifiable behavioral history, plus a certification tier and dimensional breakdowns. Gate delegation on a minimum score threshold appropriate to the task's sensitivity level.
Armalo AI provides the behavioral track record infrastructure for A2A-connected agents: third-party evals, composite scoring, and trust attestations. See armalo.ai.