An AgentCard arrives with impressive fields: 99.2% accuracy, sub-200ms latency, handles financial data, no known vulnerabilities. The format is clean. The numbers look good.
One question: who verified them?
The answer, in almost every case, is the agent itself — or the operator who built it. The AgentCard format does not include a signature from a third party who ran the evals, captured the results, and attested to the numbers independently.
This is not a flaw in the A2A spec. It is a deliberate scope boundary. AgentCards are discovery metadata. They are not behavioral certification. The spec was never meant to solve this problem.
But someone has to.
TL;DR
- AgentCards are capability claims, not performance evidence. The format encodes what an agent says it can do, written by the agent or its operator.
- Self-reported accuracy numbers carry no evidentiary weight. Without third-party verification, a 99% accuracy claim is marketing copy.
- Behavioral track records require three things: a pact (what was promised), an eval (third-party verification), and a score (composite view across time).
- The difference matters most at the tail. An agent can look excellent on clean inputs and fail badly on adversarial ones. AgentCards have no mechanism to represent tail behavior.
- Score-gated delegation is the practical answer. Query a trust score before delegating. If the agent has no verified record, treat it as Bronze-equivalent — with appropriate task scope limits.
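The pact/eval/score triple above can be sketched as plain data types. These shapes are illustrative assumptions for this article, not a published Armalo or A2A schema:

```typescript
// Illustrative shapes for the three primitives; field names are
// assumptions, not a published Armalo or A2A schema.

// A pact: what the agent promised, in checkable terms.
interface Pact {
  agentId: string;
  promisedAccuracy: number;   // e.g. 0.99
  maxLatencyMs: number;       // declared latency ceiling
  declaredScopes: string[];   // task types the agent claims to handle
}

// An eval: one third-party verification run against that pact.
interface EvalResult {
  pactAgentId: string;
  passed: boolean;
  measuredAccuracy: number;
  p95LatencyMs: number;
  adversarial: boolean;       // was this a red-team run?
  verifierSignature: string;  // attestation from the eval runner
}

// A score: the composite view across many evals over time.
interface TrustScore {
  agentId: string;
  compositeScore: number;     // 0-1000
  tier: 'bronze' | 'silver' | 'gold' | 'platinum';
  evalCount: number;
  passRate: number;
}
```

The key structural point: the pact is written by the operator, but the eval and the score are written by someone else.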
What an AgentCard Actually Contains
The A2A AgentCard format is well-designed for its purpose. It encodes:
- Agent name, description, and version
- Declared capabilities and supported task types
- Authentication requirements and endpoints
- Operational constraints (rate limits, max task size)
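For concreteness, a card covering those fields might look like the following. This is a simplified sketch keyed to the bullets above, not the exact field names from the A2A spec:

```typescript
// Simplified sketch of an AgentCard-style document; fields mirror the
// bullets above, not the verbatim A2A spec schema.
const agentCard = {
  name: 'fin-analyzer',
  description: 'Analyzes financial filings and answers questions about them',
  version: '2.1.0',
  capabilities: ['document-qa', 'summarization'],
  authentication: { schemes: ['bearer'] },
  constraints: { rateLimitPerMin: 60, maxTaskBytes: 1_000_000 },
  // Note what is absent: no third-party signature, no eval history,
  // no violation log. Every field here is self-reported.
};
```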
What it does not encode — by design:
- Third-party verification of any declared capability
- Adversarial eval history or red-team results
- Historical pass rate across a representative task distribution
- Behavioral violation log (scope breaches, refused tasks, fabricated outputs)
- Financial accountability for commitment failures
An AgentCard is a resume. Strong resumes are useful. They are not a substitute for a background check.
Why Self-Report Is Not Enough
The problem with self-reported accuracy is not that operators are dishonest. Most are not. The problem is that self-report is structurally optimistic in ways that matter:
Selection bias in test sets. Operators measure accuracy on the inputs they anticipated. Production traffic includes inputs they did not anticipate — edge cases, ambiguous queries, malformed data, adversarial prompts. Self-reported accuracy numbers systematically underweight these.
Static measurement, dynamic behavior. A model update, a new tool integration, a change in the system prompt — any of these can shift accuracy meaningfully. An AgentCard number reflects a point-in-time measurement that may be months stale.
No adversarial eval. Whether an agent is susceptible to prompt injection, scope extension attacks, or output fabrication is not something a standard accuracy benchmark surfaces. These require adversarial evaluation. Almost no operator runs this before writing an AgentCard.
No consequence for inflation. If an agent advertises 99% accuracy and delivers 82%, there is no scoring mechanism, no certification impact, no financial consequence. The AgentCard can simply be updated or left as-is.
What a Behavioral Track Record Looks Like
A behavioral track record answers questions self-report cannot:
| Signal | AgentCard Claim | Behavioral Track Record |
|---|---|---|
| Accuracy | Self-reported % | Third-party eval pass rate, N runs |
| Latency | Declared max | P50/P95/P99 from live evals |
| Safety | "Complies with guidelines" | Adversarial safety eval results, signed |
| Scope | Listed capabilities | Scope breach rate from red-team evals |
| Reliability | Not specified | Uptime + consistent-output rate over 90 days |
| Certification | None | Bronze/Silver/Gold/Platinum from verified record |
The distinction is not academic. An orchestrator delegating a financial task to an agent should know whether the agent has a 94% pass rate across 3,400 third-party evals or a 99% self-reported rate across 40 internal tests. These are different agents.
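One way to see why eval count matters, even taking the internal numbers at face value: a Wilson score lower confidence bound on the pass rate tightens as N grows. This is a standard statistics sketch, not part of any A2A or Armalo API:

```typescript
// Wilson score lower bound at ~95% confidence (z = 1.96).
// Larger N -> tighter bound -> more evidentiary weight.
function wilsonLowerBound(passes: number, runs: number, z = 1.96): number {
  if (runs === 0) return 0;
  const p = passes / runs;
  const z2 = z * z;
  const denom = 1 + z2 / runs;
  const center = p + z2 / (2 * runs);
  const margin = z * Math.sqrt((p * (1 - p)) / runs + z2 / (4 * runs * runs));
  return (center - margin) / denom;
}

// 94% pass rate over 3,400 third-party evals vs a perfect 40/40 internally:
const verified = wilsonLowerBound(3196, 3400); // ~0.93
const internal = wilsonLowerBound(40, 40);     // ~0.91
```

Even a flawless 40-run internal suite supports a weaker lower bound than a 94% rate over 3,400 independent runs — and that is before accounting for the selection bias in the internal test set.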
The Practical Pattern: Score Before You Delegate
The pattern is simple: before delegating any consequential task to an external agent — especially one discovered via A2A — query its trust score.
A trust score from a third-party scoring system gives you:
- A composite 0-1000 number derived from behavioral history, not claims
- A certification tier (Bronze/Silver/Gold/Platinum) based on verified eval count and pass rate
- Dimensional breakdowns: accuracy, safety, reliability, latency, scope-honesty
- A security posture score from adversarial evals
```typescript
// Before delegating via A2A
const trust = await armalo.getTrustAttestation(agentId);

if (trust.compositeScore < 700) {
  // Treat as untrusted — scope the task accordingly or decline
  throw new Error(`Agent ${agentId} score ${trust.compositeScore} below threshold`);
}

if (trust.certificationTier === 'bronze' && task.sensitivity === 'high') {
  throw new Error('Bronze-tier agent not authorized for high-sensitivity tasks');
}

await delegate(agentEndpoint, task);
```
An agent with no behavioral track record gets treated as Bronze by default — capable of low-stakes tasks, not trusted with consequential ones. That is not punitive. It is the same standard you would apply to a new contractor with no verifiable references.
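The Bronze-by-default policy can be expressed as a small lookup. The tier names follow the article; the sensitivity ladder is an illustrative assumption:

```typescript
// Maximum task sensitivity each certification tier is trusted with.
// Tier names follow the article; the sensitivity levels are illustrative.
type Tier = 'bronze' | 'silver' | 'gold' | 'platinum';
type Sensitivity = 'low' | 'medium' | 'high' | 'critical';

const sensitivityRank: Record<Sensitivity, number> = {
  low: 0, medium: 1, high: 2, critical: 3,
};

const maxSensitivityForTier: Record<Tier, Sensitivity> = {
  bronze: 'low',       // no verified record -> low-stakes tasks only
  silver: 'medium',
  gold: 'high',
  platinum: 'critical',
};

// An agent with no track record at all is treated as Bronze.
function allowedToDelegate(tier: Tier | undefined, task: Sensitivity): boolean {
  const effective = tier ?? 'bronze';
  return sensitivityRank[task] <= sensitivityRank[maxSensitivityForTier[effective]];
}
```

The `undefined` branch is the point: absence of a track record is itself a signal, and the policy encodes it explicitly instead of defaulting to trust.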
Why This Matters for A2A Adoption
A2A is gaining adoption fast. The behavioral verification gap will become visible at the same pace. The first high-profile incident involving an authenticated-but-untrustworthy agent delegated via A2A will generate a lot of attention.
The teams who built behavioral verification in from the start — who queried trust scores before delegation, who required verified eval history for high-sensitivity tasks — will be the ones who did not have an incident.
The teams who trusted the AgentCard will be the case study.
Building on A2A? The behavioral layer goes above the protocol, not inside it. The primitives are at armalo.ai.
Frequently Asked Questions
What is an A2A AgentCard?
An AgentCard is a standardized metadata document in Google's A2A protocol that describes an agent's capabilities, supported task types, authentication requirements, and operational constraints. It is self-reported by the agent operator and is not verified by any third party.
Why aren't AgentCard accuracy claims reliable?
AgentCard fields are written by the agent or its operator. There is no third-party verification mechanism in the A2A spec, no adversarial eval requirement, and no scoring consequence for inaccurate claims. Self-reported numbers are systematically optimistic due to selection bias in test sets and static measurement of dynamic systems.
What is a behavioral track record?
A behavioral track record is a verifiable history of an agent's performance across third-party evaluations — accuracy pass rate, safety eval results, latency measurements, scope breach rate — signed and published by an independent scoring authority, not the agent's operator.
How do I require behavioral verification before A2A delegation?
Query a trust score before delegating. Systems like Armalo provide a composite 0-1000 score derived from verifiable behavioral history, plus a certification tier and dimensional breakdowns. Gate delegation on a minimum score threshold appropriate to the task's sensitivity level.
Armalo AI provides the behavioral track record infrastructure for A2A-connected agents: third-party evals, composite scoring, and trust attestations. See armalo.ai.