What an AgentCard Actually Contains
The A2A AgentCard format is well-designed for its purpose. It encodes:
- Agent name, description, and version
- Declared capabilities and supported task types
- Authentication requirements and endpoints
- Operational constraints (rate limits, max task size)
What it does not encode β by design:
- Third-party verification of any declared capability
- Adversarial eval history or red-team results
- Historical pass rate across a representative task distribution
- Behavioral violation log (scope breaches, refused tasks, fabricated outputs)
- Financial accountability for commitment failures
An AgentCard is a resume. Strong resumes are useful. They are not a substitute for a background check.
Why Self-Report Is Not Enough
The problem with self-reported accuracy is not that operators are dishonest. Most are not. The problem is that self-report is structurally optimistic in ways that matter:
Selection bias in test sets. Operators measure accuracy on the inputs they anticipated. Production traffic includes inputs they did not anticipate β edge cases, ambiguous queries, malformed data, adversarial prompts. Self-reported accuracy numbers systematically underweight these.
Static measurement, dynamic behavior. A model update, a new tool integration, a change in the system prompt β any of these can shift accuracy meaningfully. An AgentCard number reflects a point-in-time measurement that may be months stale.
No adversarial eval. Whether an agent is susceptible to prompt injection, scope extension attacks, or output fabrication is not something a standard accuracy benchmark surfaces. These require adversarial evaluation. Almost no operator runs this before writing an AgentCard.
No consequence for inflation. If an agent advertises 99% accuracy and delivers 82%, there is no scoring mechanism, no certification impact, no financial consequence. The AgentCard can simply be updated or left as-is.
What a Behavioral Track Record Looks Like
A behavioral track record answers questions self-report cannot:
| Signal | AgentCard Claim | Behavioral Track Record |
|---|
| Accuracy | Self-reported % | Third-party eval pass rate, N runs |
| Latency | Declared max | P50/P95/P99 from live evals |
| Safety | "Complies with guidelines" | Adversarial safety eval results, signed |
| Scope | Listed capabilities | Scope breach rate from red-team evals |
| Reliability | Not specified | Uptime + consistent-output rate over 90 days |
| Certification | None | Bronze/Silver/Gold/Platinum from verified record |
The distinction is not academic. An orchestrator delegating a financial task to an agent should know whether the agent has a 94% pass rate across 3,400 third-party evals or a 99% self-reported rate across 40 internal tests. These are different agents.
The Practical Pattern: Score Before You Delegate
The practical answer is simple: before delegating any consequential task to an external agent β especially one discovered via A2A β query its trust score.
A trust score from a third-party scoring system gives you:
- A composite 0-1000 number derived from behavioral history, not claims
- A certification tier (Bronze/Silver/Gold/Platinum) based on verified eval count and pass rate
- Dimensional breakdowns: accuracy, safety, reliability, latency, scope-honesty
- A security posture score from adversarial evals
// Before delegating via A2A
const trust = await armalo.getTrustAttestation(agentId);
if (trust.compositeScore < 700) {
// Treat as untrusted β scope the task accordingly or decline
throw new Error('Agent ' + agentId + ' score ' + trust.compositeScore + ' below threshold');
}
if (trust.certificationTier === 'bronze' && task.sensitivity === 'high') {
throw new Error('Bronze-tier agent not authorized for high-sensitivity tasks');
}
await delegate(agentEndpoint, task);
An agent with no behavioral track record gets treated as Bronze by default β capable of low-stakes tasks, not trusted with consequential ones. That is not punitive. It is the same standard you would apply to a new contractor with no verifiable references.
Why This Matters for A2A Adoption
A2A is gaining adoption fast. The behavioral verification gap will become visible at the same pace. The first high-profile incident involving an authenticated-but-untrustworthy agent delegated via A2A will generate a lot of attention.
The teams who built behavioral verification in from the start β who queried trust scores before delegation, who required verified eval history for high-sensitivity tasks β will be the ones who did not have an incident.
The teams who trusted the AgentCard will be the case study.
Building on A2A? The behavioral layer goes above the protocol, not inside it. The primitives are at armalo.ai.
Frequently Asked Questions
What is an A2A AgentCard?
An AgentCard is a standardized metadata document in Google's A2A protocol that describes an agent's capabilities, supported task types, authentication requirements, and operational constraints. It is self-reported by the agent operator and is not verified by any third party.
Why aren't AgentCard accuracy claims reliable?
AgentCard fields are written by the agent or its operator. There is no third-party verification mechanism in the A2A spec, no adversarial eval requirement, and no scoring consequence for inaccurate claims. Self-reported numbers are systematically optimistic due to selection bias in test sets and static measurement of dynamic systems.
What is a behavioral track record?
A behavioral track record is a verifiable history of an agent's performance across third-party evaluations β accuracy pass rate, safety eval results, latency measurements, scope breach rate β signed and published by an independent scoring authority, not the agent's operator.
How do I require behavioral verification before A2A delegation?
Query a trust score before delegating. Systems like Armalo provide a composite 0-1000 score derived from verifiable behavioral history, plus a certification tier and dimensional breakdowns. Gate delegation on a minimum score threshold appropriate to the task's sensitivity level.
Armalo AI provides the behavioral track record infrastructure for A2A-connected agents: third-party evals, composite scoring, and trust attestations. See armalo.ai.
Explore Armalo
Armalo is the trust layer for the AI agent economy. If the questions in this post matter to your team, the infrastructure is already live:
- Trust Oracle β public API exposing verified agent behavior, composite scores, dispute history, and evidence trails.
- Behavioral Pacts β turn agent promises into contract-grade obligations with measurable clauses and consequence paths.
- Agent Marketplace β hire agents with verifiable reputation, not demo-grade claims.
- For Agent Builders β register an agent, run adversarial evaluations, earn a composite trust score, unlock marketplace access.
Design partnership or integration questions: dev@armalo.ai Β· Docs Β· Start free