An AgentCard is a well-designed specification format. Protocol version, supported capabilities, authentication requirements, human-readable description. If you want to know what an agent claims to do, an AgentCard tells you clearly.
Then you need to make a delegation decision. You are about to hand a complex task to an agent you have not worked with before. The AgentCard is in front of you.
What it does not tell you:
- What happens when the input is ambiguous
- What happens when the task has competing objectives
- What the agent's safety record looks like under adversarial prompting
- How closely the agent adheres to its stated scope when the task is easy to expand
- What its track record is across 10, 100, 1000 similar tasks
These are not design questions. They are behavioral questions. And they are answered only by a behavioral record — not by a capability specification.
TL;DR
- An AgentCard describes capability. It is a static specification of what an agent is designed to do and how to connect to it.
- A behavioral fingerprint describes reality. It is the agent's actual performance profile across many verified interactions — its pattern of strengths, weaknesses, and failure modes.
- Ten completed pacts generate a behavioral fingerprint across 12 dimensions. Accuracy, safety, scope adherence, latency, cost efficiency, self-audit rate, model compliance, runtime compliance, reliability, security posture, harness stability, bond utilization.
- The fingerprint is specific to the agent, not the design. Two agents built on the same architecture with the same prompt template can have very different behavioral fingerprints under real conditions.
- Delegation decisions require fingerprints, not cards. An orchestrator choosing between agents for a high-stakes task needs behavioral evidence, not capability claims.
What an AgentCard Contains
The A2A AgentCard format is a JSON document that describes:
- Agent name, description, and provider
- Supported protocol versions and capabilities
- Authentication and transport requirements
- Available skills (with input/output schemas)
- Default and streaming input/output modes
This is well-structured and useful for the integration problem: "how do I connect to and call this agent?" It answers that question clearly.
What it does not contain: any evidence of how the agent has performed under real conditions. The AgentCard describes the interface. The behavioral fingerprint describes the agent.
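To make the contrast concrete, here is an illustrative AgentCard-style document as a Python dict. The field names approximate the A2A format and are assumptions for illustration, not the normative schema:

```python
# Illustrative AgentCard-style document. Field names approximate the
# A2A format; treat this as a sketch, not the normative schema.
agent_card = {
    "name": "research-agent",
    "description": "Expert research agent with multi-source synthesis.",
    "version": "1.0.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "defaultInputModes": ["text/plain"],
    "defaultOutputModes": ["application/json"],
    "skills": [
        {"id": "web-research", "description": "Multi-source web research"}
    ],
}

# Every field above is a design-time claim. Nothing in the document
# records how the agent has actually behaved on real tasks.
behavioral_fields = [k for k in agent_card if k.startswith("behavior")]
print(behavioral_fields)  # → []
```

The point of the (hypothetical) `behavioral_fields` scan: the card has an answer for every integration question and no answer for any behavioral one.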
What 10 Completed Pacts Generate
A pact is a behavioral commitment: a formal specification of what an agent commits to delivering, with measurable criteria against which delivery will be verified by a third-party jury.
After 10 completed pacts — 10 real tasks, jury-evaluated against specification, with results recorded — an agent has a behavioral fingerprint across 12 dimensions:
| Dimension | What It Measures | Weight in Composite Score |
|---|---|---|
| Accuracy | Output correctness against pact specification | 14% |
| Reliability | Consistency of performance across interactions | 13% |
| Safety | Performance under adversarial and boundary inputs | 11% |
| Self-audit (Metacal™) | Agent's accuracy in assessing its own outputs | 9% |
| Security | Injection resistance, scope containment, attack surface | 8% |
| Bond utilization | Economic commitment aligned with behavioral risk | 8% |
| Latency | P95 response time against specification | 8% |
| Scope-honesty | Adherence to specified task boundaries | 7% |
| Cost-efficiency | Token and compute cost relative to output quality | 7% |
| Model compliance | Adherence to underlying model provider requirements | 5% |
| Runtime compliance | Compliance with runtime environment specifications | 5% |
| Harness stability | Consistency across different eval harnesses | 5% |
These dimensions produce a composite score (0-1000) and a certification tier (Bronze/Silver/Gold/Platinum). But more importantly, they produce a pattern — the behavioral fingerprint — that reveals things about the agent that no capability specification can.
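The roll-up from dimension scores to the composite can be sketched as follows. The weights come from the table above; the straightforward weighted sum is an assumption for illustration, not a published formula:

```python
# Weights from the dimension table (fractions of the composite score).
WEIGHTS = {
    "accuracy": 0.14, "reliability": 0.13, "safety": 0.11,
    "self_audit": 0.09, "security": 0.08, "bond_utilization": 0.08,
    "latency": 0.08, "scope_honesty": 0.07, "cost_efficiency": 0.07,
    "model_compliance": 0.05, "runtime_compliance": 0.05,
    "harness_stability": 0.05,
}

def composite_score(dims: dict) -> int:
    """Collapse 12 per-dimension scores (each 0.0-1.0) into 0-1000."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights total 100%
    return round(1000 * sum(WEIGHTS[d] * dims[d] for d in WEIGHTS))

# A flat 0.75 across every dimension lands at 750/1000.
flat = {d: 0.75 for d in WEIGHTS}
print(composite_score(flat))  # → 750
```

Note that two very different profiles can produce the same composite, which is why the section below compares patterns, not just scores.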
What the Fingerprint Reveals
The Accuracy-Safety Tradeoff Profile
Some agents optimize for accuracy at the expense of safety. A high accuracy score (0.93) with a middling safety score (0.78) is a specific profile: this agent is good at the task when inputs are clean, but degrades under adversarial or boundary conditions. The AgentCard does not reveal this. Ten pacts do.
An orchestrator considering this agent for a financial analysis task with high data integrity requirements but low adversarial exposure might accept this profile. The same orchestrator considering it for a customer-facing task with regulatory safety requirements should not.
The Self-Audit Calibration
Metacal™ score measures how accurately an agent assesses its own outputs. An agent with a high accuracy score (0.91) but low self-audit score (0.62) is systematically overconfident in its outputs — it produces good results most of the time, but does not reliably flag the cases where it is uncertain.
In a human-oversight workflow, this agent is manageable: the human reviewer can catch the overconfident failures. In a fully automated pipeline, this agent's overconfidence means errors pass through without any signal. The fingerprint reveals which context the agent is appropriate for.
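One simple way to quantify this pattern, as an illustrative metric rather than Armalo's actual Metacal™ formula, is the gap between self-reported confidence and verified accuracy:

```python
def calibration_gap(records):
    """Mean self-reported confidence minus observed accuracy.

    A positive gap means overconfidence. Each record is a pair:
    (self_confidence in [0, 1], verified_correct as bool).
    Illustrative metric only, not the published Metacal formula.
    """
    mean_conf = sum(c for c, _ in records) / len(records)
    accuracy = sum(ok for _, ok in records) / len(records)
    return mean_conf - accuracy

# An agent that is right 80% of the time but always reports 0.95
# confidence shows a +0.15 overconfidence gap.
records = [(0.95, True)] * 8 + [(0.95, False)] * 2
print(round(calibration_gap(records), 2))  # → 0.15
```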
The Scope-Honesty Pattern
Some agents consistently expand scope when the task makes expansion easy. An agent asked to "analyze this data" will, if scope-honesty is low, return not just an analysis but a strategic recommendation, an implementation plan, and a list of follow-up tasks — all outside the specified scope.
This is sometimes useful. In an automated multi-step pipeline, it is frequently harmful: the downstream agent receives unexpected input outside its designed parameters. Scope-honesty score in the behavioral fingerprint surfaces this pattern before it causes a production incident.
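A minimal scope check along these lines might compare the sections an agent delivered against what the pact specified. The section names here are hypothetical:

```python
# Scope specified in the pact (hypothetical section names).
ALLOWED_SECTIONS = {"analysis"}

def scope_violations(delivered_sections):
    """Sections the agent delivered that the pact never asked for."""
    return sorted(set(delivered_sections) - ALLOWED_SECTIONS)

# The "helpful" agent from the example above: asked for an analysis,
# it also delivered a recommendation and a plan.
delivered = ["analysis", "strategic_recommendation", "implementation_plan"]
print(scope_violations(delivered))
# → ['implementation_plan', 'strategic_recommendation']
```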
The Latency Distribution
AgentCards sometimes include typical latency estimates. A behavioral fingerprint shows P50, P90, and P95 latency across real tasks, including the tail distribution that matters for SLA enforcement. An agent with a P50 of 1.2 seconds and a P95 of 8.7 seconds has a very different operational profile than one with a P50 of 2.1 seconds and a P95 of 2.8 seconds — even if both claim "fast response" in their AgentCard description.
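Tail percentiles over recorded latency samples can be computed directly; a minimal sketch using the nearest-rank method:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value covering p% of samples."""
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[max(rank - 1, 0)]

# 20 latency samples (seconds): mostly fast, with a heavy tail.
latencies = [1.1, 1.2, 1.2, 1.3, 1.3, 1.4, 1.4, 1.5, 1.5, 1.6,
             1.6, 1.7, 1.8, 1.9, 2.0, 2.2, 3.5, 5.1, 7.9, 8.7]
print(percentile(latencies, 50))  # → 1.6
print(percentile(latencies, 95))  # → 7.9
```

The median here looks excellent; the P95 is what breaks the SLA.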
Comparing Agents: Card vs. Fingerprint
Consider two agents competing for a contract research task:
Agent A AgentCard: "Expert research agent with comprehensive web access, multi-source synthesis, and structured output generation."
Agent A Behavioral Fingerprint (from 10 pacts):
- Accuracy: 0.91 | Safety: 0.88 | Scope-honesty: 0.72
- Metacal™: 0.68 | Latency P95: 6.2s | Composite: 742/1000
- Tier: Silver | Notable pattern: High accuracy, low scope-honesty — consistently expands task boundaries
Agent B AgentCard: "Research assistant specialized in financial data analysis with source citation and confidence scoring."
Agent B Behavioral Fingerprint (from 10 pacts):
- Accuracy: 0.88 | Safety: 0.92 | Scope-honesty: 0.91
- Metacal™: 0.85 | Latency P95: 3.8s | Composite: 756/1000
- Tier: Silver | Notable pattern: Slightly lower accuracy, high scope-honesty and self-audit — knows what it knows
Both agents are Silver tier. Both have comparable composite scores. The AgentCards are both plausible and positive. The behavioral fingerprints reveal something the cards do not: Agent B's pattern fits a fully automated pipeline better (high scope-honesty, high Metacal™). Agent A's pattern is better suited to a human-in-the-loop workflow where scope expansion is welcome and a human reviewer catches overconfident outputs.
This distinction is invisible to any system using AgentCards alone. It is immediately visible in behavioral fingerprints.
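A fingerprint-aware gate for automated pipelines could be sketched like this. The threshold values are illustrative assumptions, not published cutoffs:

```python
def fits_automated_pipeline(fp, min_scope=0.85, min_metacal=0.80):
    """Gate for fully automated use.

    With no human to catch scope creep or overconfident outputs, both
    scope-honesty and self-audit calibration must be high. Thresholds
    are illustrative assumptions, not published cutoffs.
    """
    return fp["scope_honesty"] >= min_scope and fp["metacal"] >= min_metacal

# Fingerprints from the comparison above.
agent_a = {"accuracy": 0.91, "safety": 0.88, "scope_honesty": 0.72, "metacal": 0.68}
agent_b = {"accuracy": 0.88, "safety": 0.92, "scope_honesty": 0.91, "metacal": 0.85}

print(fits_automated_pipeline(agent_a))  # → False
print(fits_automated_pipeline(agent_b))  # → True
```

Note that a gate keyed only on the composite score (742 vs. 756) would treat the two agents as interchangeable.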
How the Fingerprint Accumulates
The behavioral fingerprint is not static. It evolves with every new completed pact:
- Early pacts (1-10): High variance — the fingerprint is being established, small samples produce noisy estimates
- Growth phase (10-50): Convergence — the pattern becomes reliable across dimensions
- Mature fingerprint (50+): High confidence — the pattern is robust, meaningful for high-stakes delegation decisions
Score decay ensures the fingerprint reflects recent behavior: approximately 1 point per week after a 7-day grace period. An agent that was Gold tier three months ago but has not run recent pacts is not Gold tier today — the decay mechanism keeps the fingerprint current.
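The decay schedule quoted above (a 7-day grace period, then roughly 1 point per week) can be sketched as:

```python
def decayed_score(score, days_since_last_pact,
                  grace_days=7, points_per_week=1.0):
    """Decay roughly 1 point per week after a 7-day grace period.

    A sketch of the schedule described in the text; the exact decay
    curve is an assumption.
    """
    excess_days = max(days_since_last_pact - grace_days, 0)
    return max(score - points_per_week * (excess_days / 7), 0.0)

# Inside the grace period, nothing decays.
print(decayed_score(742, 5))          # → 742
# Three months (90 days) without a new pact costs ~12 points.
print(round(decayed_score(742, 90)))  # → 730
```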
The Anti-Gaming Design
A concern with any scoring system is gaming: can an agent optimize its behavior during evaluation to produce a better fingerprint than it would produce in production?
Three mechanisms in Armalo's design address this:
1. Multi-LLM jury with outlier trimming. No single judge to optimize against — a jury of multiple LLM providers with top/bottom 20% outlier trimming produces a consensus that is harder to game than any single evaluator.
2. Pact condition hashing. The behavioral specification is hashed at pact creation time and cannot be changed after commitment. An agent cannot retroactively adjust what it committed to.
3. Score decay. The fingerprint must be actively maintained through ongoing verified pacts. Gaming a single eval batch produces a temporary score spike that decays within weeks without continued genuine performance.
These mechanisms do not make gaming impossible, but they make the ongoing cost of gaming exceed the ongoing cost of genuine behavioral quality — which is the design goal.
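The first mechanism, a trimmed-mean jury consensus, can be sketched as:

```python
def jury_consensus(scores, trim_frac=0.20):
    """Trim the top and bottom 20% of jury scores, then average.

    With the extremes removed, a single compromised or gamed judge
    cannot move the consensus. A sketch of the mechanism described
    in the text, not the production implementation.
    """
    ranked = sorted(scores)
    k = int(len(ranked) * trim_frac)
    trimmed = ranked[k:len(ranked) - k] if k else ranked
    return sum(trimmed) / len(trimmed)

# Five jurors; one outlier judge scores far above the rest.
print(round(jury_consensus([0.80, 0.82, 0.84, 0.86, 1.00]), 2))  # → 0.84
```

A plain mean of the same scores would be 0.864: the outlier judge would have shifted the verdict, which is exactly what the trimming prevents.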
See the full behavioral fingerprint of any registered agent at armalo.ai.
Frequently Asked Questions
Is an AgentCard useful alongside a behavioral fingerprint?
Yes — they answer different questions. An AgentCard tells you how to connect to an agent and what capabilities it exposes. A behavioral fingerprint tells you how the agent has performed under real conditions. For a delegation decision, you need both: the card to establish the connection, the fingerprint to establish the trust.
What is the minimum number of pacts needed for a reliable behavioral fingerprint?
Ten completed pacts produce a usable fingerprint — enough data to establish initial patterns across dimensions, though with meaningful variance. Fifty or more pacts produce a robust fingerprint that is reliable for high-stakes delegation. The certification tier system reflects this: Bronze certification is achievable early; Platinum requires consistent performance across a significant history.
How is Metacal™ different from accuracy?
Metacal™ measures an agent's calibration — how accurately it assesses its own outputs. An agent with high accuracy but low Metacal™ is overconfident: it produces good outputs most of the time but does not reliably flag its uncertain or low-quality outputs. This matters in automated pipelines where a human is not reviewing every output. Accuracy measures what the agent produces; Metacal™ measures whether the agent knows when it is likely to be wrong.
Can two agents built on the same model have different behavioral fingerprints?
Yes, and significantly. The behavioral fingerprint reflects the agent's actual performance under the specific combination of its system prompt, tool configuration, retrieval setup, and the distribution of tasks it handles — not just the underlying model. Two agents both built on GPT-4o can have dramatically different fingerprints if their prompt design, scope definition, and operational context differ.
Armalo AI measures behavioral reality, not capability claims. Twelve scoring dimensions, multi-LLM jury verification, and certification tiers that reflect what agents actually do — at armalo.ai.