Two agents. Both authenticated. Both presenting valid A2A credentials. Both making conflicting claims about the task outcome.
Agent A says it delivered what the pact required. Agent B, acting as a verifier, says it did not.
A2A cannot resolve this. Authentication was not designed to. The protocol established that both agents are who they claim to be. It has nothing to say about which claim about task performance is correct.
This is not a corner case. It is the default state of any multi-agent system where autonomous agents are delegating consequential tasks to each other. The disagreement scenario is not hypothetical — it is the mechanism by which behavioral reliability failures surface.
TL;DR
- A2A authenticates both agents equally. When two authenticated agents make conflicting claims, the protocol provides no arbitration mechanism.
- Behavioral disputes are not identity disputes. A2A solves identity. Disputes about task outcome require an evidence layer that A2A does not provide.
- Without a behavioral evidence layer, disputes default to whoever controls the logs. This is structurally bad for agent accountability.
- The jury model is the practical answer. Multi-LLM evaluation with outlier trimming produces a verifiable verdict that neither agent can manipulate.
- Pre-task pacts are what make post-task disputes resolvable. Without an immutable commitment from before the task, there is nothing to adjudicate against.
The Anatomy of an Agent Behavioral Dispute
Agent behavioral disputes have a predictable structure:
1. Task delegation. Orchestrator A delegates a task to Agent B via A2A. The task is defined loosely — "summarize this document and extract action items."
2. Task execution. Agent B executes. Returns output.
3. Output evaluation. Something downstream — a verifier agent, an automated eval, a human reviewer — flags the output as non-compliant. Maybe it fabricated an action item. Maybe the summary is incomplete. Maybe latency exceeded the agreed ceiling.
4. Dispute. Agent B's operator says the output was compliant. The verifier says it was not. Both have logs. The logs tell different stories because they were captured by different systems with different granularity.
5. Resolution. With no pre-agreed evaluation criteria, no third-party evaluator, and no behavioral evidence standard, resolution happens politically — whoever has more leverage wins, or the dispute is abandoned.
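The lifecycle above can be modeled minimally in code. Every name here is illustrative, not part of the A2A protocol or any published schema:

```python
from dataclasses import dataclass, field

@dataclass
class DelegatedTask:
    """Hypothetical record of one delegated task and the claims made about it."""
    task_id: str
    spec: str                      # loosely defined: "summarize this document..."
    output: str = ""
    delegator_claim: str = ""      # e.g. "compliant"
    verifier_claim: str = ""       # e.g. "non-compliant"
    delegator_logs: list = field(default_factory=list)
    verifier_logs: list = field(default_factory=list)

    def in_dispute(self) -> bool:
        # Both sides are authenticated and both have logs —
        # the dispute exists purely because the claims conflict.
        return bool(self.delegator_claim and self.verifier_claim
                    and self.delegator_claim != self.verifier_claim)
```

Note that nothing in this record lets either party win: the two log lists are captured by the two interested parties, which is exactly the gap the rest of the article addresses.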
Authentication does not help. Both agents' credentials are valid. A2A confirmed this before the task started.
Why "Control the Logs" Is a Bad Default
When disputes default to whoever controls the logs, several bad outcomes become structural:
Log selection. An agent operator facing a dispute has an incentive to surface logs that support their position. Without a third-party capture mechanism, the "true" log is the one that survives the dispute.
Retroactive interpretation. Without a pre-committed specification of what success looks like, both parties can construct plausible narratives for the same output. "The summary was accurate" and "the summary omitted critical context" can both be true at the same time — without a pre-agreed definition of completeness, neither is falsifiable.
No scoring consequence. If the agent cannot be evaluated against a pre-committed standard by a third party, the dispute has no scoring impact. The agent's trust record is unchanged regardless of outcome. The incentive to honor commitments is not reinforced.
This is not a hypothetical failure mode. It is the current default for most multi-agent systems — and it means that behavioral reliability incidents are systematically underreported because there is no mechanism to record them as such.
What Pre-Task Pacts Change
The resolution problem is much easier when a behavioral pact exists before the task starts.
A behavioral pact is a machine-readable commitment that specifies:
- What success looks like. Not "summarize this document" but "produce a summary under 500 words covering all action items, no fabricated items, latency under 3 seconds."
- How success will be evaluated. Which eval checks will run, which LLM judges will assess subjective dimensions, what the pass threshold is.
- Who evaluates. The third-party evaluator is specified in the pact, not chosen after the dispute starts.
- What happens on failure. Scoring consequence and, for financial tasks, escrow clawback.
The pact hash is immutable once signed. Neither party can revise what was agreed after the outcome is known. This is what makes post-task evaluation meaningful — there is something fixed to evaluate against.
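One common way to make such a commitment tamper-evident is a hash over a canonical encoding of the pact. The sketch below uses canonical JSON plus SHA-256; the field names and values are assumptions for illustration, not a published pact schema:

```python
import hashlib
import json

def pact_hash(pact: dict) -> str:
    """Hash a behavioral pact over a canonical JSON encoding.

    Sorting keys and fixing separators makes the digest independent of
    dict ordering and whitespace, so both parties derive the same hash
    from the same agreed content.
    """
    canonical = json.dumps(pact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative pact mirroring the criteria described above.
pact = {
    "success_criteria": {
        "max_words": 500,
        "require_all_action_items": True,
        "fabricated_items_allowed": 0,
        "max_latency_s": 3.0,
    },
    "evaluation": {"judges": ["judge-a", "judge-b", "judge-c"],
                   "pass_threshold": 0.8},
    "evaluator": "third-party-eval-authority",
    "on_failure": {"scoring": "composite_score_update", "escrow": "clawback"},
}
digest = pact_hash(pact)
```

Any post-hoc edit to the pact — loosening the word limit, swapping a judge — produces a different digest, which is what makes the signed hash a fixed point to adjudicate against.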
The Jury Model for Dispute Resolution
For subjective behavioral claims — output quality, contextual accuracy, instruction-following — human evaluation does not scale and single-model evaluation is gameable. The jury model addresses both:
Multi-LLM evaluation. Multiple independent language models evaluate the output against the pact specification. No single judge can be targeted with adversarial inputs designed to pass that specific judge.
Outlier trimming. The top and bottom 20% of jury scores are discarded. A single compromised or miscalibrated judge cannot swing the verdict.
Signed verdict. The jury result is signed by the evaluation authority, not the agent's operator. Neither party to the dispute produced the evidence.
Scoring impact. The verdict updates the agent's composite score. Behavioral violations have a durable consequence — not just on this task, but on the agent's future delegation eligibility.
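The trimming step can be sketched as a trimmed mean. The 20% figure comes from the text above; the 0-to-1 scoring scale is an assumption:

```python
def jury_verdict(scores: list[float], pass_threshold: float,
                 trim_frac: float = 0.2) -> tuple[float, bool]:
    """Drop the top and bottom `trim_frac` of jury scores, then average.

    A single compromised or miscalibrated judge lands in the trimmed
    tails and cannot swing the verdict.
    """
    if not scores:
        raise ValueError("empty jury")
    ordered = sorted(scores)
    k = int(len(ordered) * trim_frac)
    kept = ordered[k:len(ordered) - k] if k else ordered
    mean = sum(kept) / len(kept)
    return mean, mean >= pass_threshold
```

With scores `[0.8, 0.85, 0.9, 0.82, 0.05]`, the plain mean fails a 0.8 threshold because one outlier judge scored 0.05; the trimmed mean discards that outlier and passes.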
Dispute resolution flow:

```
Task completes
└─ Was a pact created before the task?
   ├─ No  → dispute defaults to logs + leverage (bad)
   └─ Yes → third-party eval runs against pact spec
            └─ Multi-LLM jury produces signed verdict
               └─ Score updated regardless of dispute outcome
```
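The flow reduces to a small decision function. This is a sketch; the return values are illustrative labels, not API strings from any real system:

```python
from typing import Optional

def resolve_dispute(pact_exists: bool,
                    jury_passed: Optional[bool] = None) -> str:
    """Mirror the resolution flow: no pact means no standard to adjudicate against."""
    if not pact_exists:
        return "unresolvable: defaults to logs + leverage"
    if jury_passed is None:
        raise ValueError("a pact exists, so a jury verdict is required")
    # Either way the verdict feeds the agent's score, so the outcome is recorded.
    return "compliant: score updated" if jury_passed else "violation: score updated"
```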
What This Means for A2A Architecture
Teams building multi-agent systems on A2A should design for dispute resolution before the first dispute happens. The practical requirements:
1. Pact-first delegation. Any consequential task delegated via A2A gets a behavioral pact before the task starts. The pact spec is agreed by both parties and hashed.
2. Third-party capture. Task inputs, outputs, and metadata are captured by a system neither party controls — not the orchestrator's logs, not the agent's logs.
3. Pre-agreed evaluation criteria. The eval checks, the judge models, and the pass threshold are specified in the pact — not chosen after the outcome is known.
4. Score-linked consequences. Dispute outcomes update the involved agents' trust scores. An agent that consistently disputes and loses should have a lower trust score than one that does not.
None of this requires changes to A2A. It all sits above the protocol at the behavioral layer. A2A gets you the authenticated connection. The behavioral layer gets you an auditable record of what happened on it.
The dispute resolution problem is solvable. It just requires building the layer above the transport. That infrastructure is at armalo.ai.
Frequently Asked Questions
Can A2A be extended to handle behavioral disputes?
A2A is a transport and discovery protocol. Extending it to handle behavioral disputes would conflate two different problem domains — transport reliability and behavioral accountability — at the same layer. The better pattern is the same one used in the internet stack: a higher-layer protocol (behavioral trust infrastructure) handles disputes using evidence generated during transport.
What makes a behavioral pact different from a regular task specification?
A behavioral pact is machine-readable, immutable once signed (the hash cannot be changed after the task starts), and includes pre-agreed evaluation criteria and evaluators. A task specification describes what to do. A pact commits both parties to how success will be measured and who will measure it.
Why use a multi-LLM jury instead of a single evaluator?
A single LLM judge can be targeted — adversarial inputs can be crafted that pass a specific judge's evaluation criteria while violating the spirit of the pact. Multi-LLM evaluation with outlier trimming removes this attack surface: no single judge determines the verdict, and judges calibrated toward extreme positions are discarded.
What happens to an agent's trust score when it loses a behavioral dispute?
A behavioral dispute resolved by a third-party evaluator produces a verdict that is factored into the agent's composite trust score. An agent that frequently fails third-party evaluations will have a lower score, a lower certification tier, and reduced eligibility for high-stakes delegated tasks. The consequence is cumulative and durable.
Armalo AI provides the behavioral dispute resolution layer above A2A: pacts, multi-LLM jury, and score-linked verdicts. See armalo.ai.