Agents Routinely Claim Capabilities They Approximate Rather Than Execute
Every AI agent has a capability declaration. An AgentCard. A feature matrix. A system prompt that specifies its purpose. These declarations are how the ecosystem learns what agents are for and what they can be trusted to do.
The problem is that capability claims are almost entirely unverified — and the gap between what agents claim and what they reliably deliver has a specific, predictable structure that the current trust infrastructure doesn't surface.
Scope honesty is the worst trust problem most builders haven't named yet: not because agents are lying, but because the incentive structure produces overclaiming without anyone intending it.
The Anatomy of a Capability Claim
When a team builds an AI agent, here's how capability claims typically get written:
- The team tests the agent extensively on the tasks they care about
- The agent performs well on those tasks
- They write up the capabilities based on what they observed
- They launch
This process has a systematic blind spot: the test cases are designed to validate that the agent can do the thing, not to characterize the edges where it can't. Test distributions are implicitly favorable — the team knows what the agent is good at and tests for those things. The cases that would reveal limitations are the cases the team didn't think to include.
The result: capability claims accurately describe performance on favorable inputs while being silent about performance on the input distributions that production will actually generate.
Maximum capability is presented as typical capability. An agent achieves the claimed accuracy under ideal conditions, with well-formed inputs, on the task categories it was optimized for. Under realistic production conditions — adversarial inputs, edge cases, domain shifts, high load — the accuracy drops. The claim doesn't say this. It can't, because the team didn't measure it.
Adjacent domains get folded into claimed domains. A code review agent optimized for Python gets listed as supporting "code review." A financial data extraction agent gets listed as supporting "data extraction." The agent will attempt tasks in the adjacent domain and produce something that resembles an answer. The answer is less reliable than primary-domain performance. The capability claim implies equivalent reliability across the whole domain.
Confidence doesn't track accuracy at the edges. Most LLM-based agents express similar confidence regardless of whether they're operating in their core competency or at the edges of their capability envelope. The confidence signal that a well-calibrated system would use to indicate uncertainty is often absent or unreliable.
Claims age without re-evaluation. An agent registers once. The capability claims are written for the initial deployment. Six months later, the model has been updated, the system prompt revised, the tool set changed. The capability claims reflect the initial version. They weren't re-evaluated after changes.
Why This Isn't a Lying Problem
I want to be explicit about this because the framing matters for how you think about solutions.
The teams writing capability claims are usually being honest to the best of their knowledge. They tested their agent. It performed well. They wrote accurate descriptions of what they observed.
The problem is structural, not ethical:
Test distributions don't match production distributions. The set of inputs a team uses for internal validation is not a representative sample of the inputs the production system will receive. Users generate edge cases, unusual formats, ambiguous queries, and adversarial inputs that the validation suite didn't anticipate.
Capability claims are written before edge cases are discovered. Edge cases get discovered over time in production. By definition, you don't know about them when you write the initial capability claims. The claims are accurate to what was known at registration; they're outdated by the time production discovers the boundaries.
There's no third-party verification mechanism. Without independent evaluation against a defined capability specification, claims are unaudited assertions. A team with a rigorous evaluation process and a team with a cursory one produce equivalent-looking AgentCards. Buyers can't distinguish between them.
The incentive structure rewards specificity in the claimed domain and vagueness at the boundaries. "95% accuracy on financial data extraction" is a specific claim that can be falsified. "Supports data extraction" is a broad claim that's hard to falsify and captures a larger addressable market. The incentive is to claim broadly.
What Scope Honesty Evaluation Actually Tests
Standard accuracy evaluation asks: "Does the agent produce correct outputs on in-distribution tasks?" It's a performance test — good for knowing whether the agent works on the tasks it was designed for.
Scope honesty evaluation asks a different question: "Does the agent accurately represent what it can and can't do? When it's outside its competence, does it signal this appropriately?" It's a calibration test.
The evaluation protocol for scope honesty has four components:
Boundary probing. Run the agent on tasks at the edge of its declared scope — tasks that are plausibly within the claimed domain but represent harder cases than the primary-domain tasks. Measure whether the agent produces correct outputs, produces incorrect outputs with high confidence (bad), or signals reduced confidence (good). An agent that knows its own limits will exhibit either correct performance or appropriately hedged performance at the boundaries.
Out-of-scope task testing. Present the agent with tasks clearly outside its declared scope. A financial analysis agent given molecular biology questions. A code review agent given natural language prose editing tasks. The correct behavior is clear refusal with scope acknowledgment. The concerning behavior is confident attempt — the agent applies its primary-domain logic to a clearly out-of-domain question and produces a coherent-looking but unreliable answer.
Confidence calibration measurement. Evaluate expressed confidence distributions against actual accuracy distributions across a large sample. A calibrated agent's confidence bins should match accuracy rates: when the agent expresses 90% confidence, it should be correct 90% of the time. A miscalibrated agent that expresses 90% confidence on questions it answers correctly 60% of the time is systematically misleading downstream systems about the reliability of its outputs.
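The calibration check above can be sketched as a binned comparison of expressed confidence against observed accuracy. This is a minimal illustration, not Armalo's actual implementation; the binning scheme and the sample records are assumptions:

```python
from collections import defaultdict

# Each record is (expressed_confidence, was_correct) from one eval task.
# Invented sample: the agent is overconfident at 0.9 and calibrated at 0.6.
results = [(0.9, True), (0.9, True), (0.9, False), (0.9, False), (0.9, False),
           (0.6, True), (0.6, True), (0.6, False), (0.6, True), (0.6, False)]

def calibration_gaps(results, bins=10):
    """Group predictions by confidence bin; compare mean confidence to accuracy."""
    binned = defaultdict(list)
    for conf, correct in results:
        binned[min(int(conf * bins), bins - 1)].append((conf, correct))
    gaps = {}
    for b, items in sorted(binned.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Positive gap = overconfidence: the agent claims more certainty
        # than its accuracy supports in that bin.
        gaps[b] = (mean_conf, accuracy, mean_conf - accuracy)
    return gaps

for b, (conf, acc, gap) in calibration_gaps(results).items():
    print(f"bin {b}: confidence {conf:.2f}, accuracy {acc:.2f}, gap {gap:+.2f}")
```

A well-calibrated agent shows gaps near zero in every bin; the concerning signature is a large positive gap concentrated in the high-confidence bins.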
Claim verification against evaluation results. Compare the agent's stated capabilities to its empirically measured capabilities. The scope honesty score is the ratio of verified performance to claimed performance across the full capability declaration. An agent that claims 95% accuracy and achieves 93% has high scope honesty. An agent that claims 95% and achieves 71% has low scope honesty — regardless of whether the 71% is "good enough" in absolute terms.
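The claim-verification step reduces to a ratio per claim, aggregated across the declaration. A sketch, assuming each claim carries a claimed accuracy and an independently measured one — the schema is illustrative, not a real Armalo API:

```python
# Hypothetical capability declaration paired with independent measurements.
claims = [
    {"capability": "financial data extraction", "claimed": 0.95, "measured": 0.93},
    {"capability": "invoice parsing",           "claimed": 0.95, "measured": 0.71},
]

def scope_honesty(claims):
    """Mean ratio of measured to claimed accuracy, capped at 1.0 per claim.

    The cap means exceeding a claim doesn't offset underdelivering on another:
    scope honesty rewards accurate claims, not sandbagged ones.
    """
    ratios = [min(c["measured"] / c["claimed"], 1.0) for c in claims]
    return sum(ratios) / len(ratios)

for c in claims:
    print(f'{c["capability"]}: {min(c["measured"] / c["claimed"], 1.0):.2f}')
print(f"overall scope honesty: {scope_honesty(claims):.2f}")
```

Note that the second claim drags the overall score down sharply even though 71% might be acceptable in absolute terms — the score measures the claim-to-delivery gap, not raw performance.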
The Economic Incentive Problem
Here's the uncomfortable structural truth: the current incentive structure produces overclaiming as the rational equilibrium.
Capability claims are written at registration and rarely updated. There's no cost for overclaiming beyond failed tasks, which may not be tracked in any form that affects the claiming agent's reputation. There's a direct benefit: more claimed capabilities means more potential use cases, more traffic, more revenue.
The only incentive to accurately scope capabilities exists when the cost of claiming capabilities you can't deliver exceeds the benefit of making the claim. That requires two things working together.
Transparent scope honesty scores as a first-class trust dimension. If an agent's scope honesty is independently evaluated and published — shown alongside accuracy and safety in the trust oracle — the cost of overclaiming becomes reputational and visible. Buyers see it. Orchestrators filter on it. The market penalizes it.
Behavioral pacts that make capability claims enforceable. When a capability claim is encoded in a machine-readable pact with defined verification criteria, the claim is testable. Overclaiming against a pact produces verifiable pact violations. The trust score drops. The track record is permanent. Agents that consistently overclaim against their pacts face compounding reputation damage.
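To make the idea concrete, here is one way a machine-readable pact with verification criteria might look. The schema and thresholds below are invented for illustration; they are not an actual pact format:

```python
# Hypothetical behavioral pact: a capability claim paired with the criteria
# an independent evaluator uses to verify it. Schema is illustrative only.
pact = {
    "capability": "financial data extraction",
    "claimed_accuracy": 0.95,
    "verification": {
        "benchmark": "held-out production-distribution sample",
        "min_samples": 500,   # below this, no verdict is issued
        "tolerance": 0.02,    # measured accuracy may trail the claim by this much
    },
}

def pact_violated(pact, measured_accuracy, n_samples):
    """A violation is a measured shortfall beyond tolerance, on enough samples."""
    v = pact["verification"]
    if n_samples < v["min_samples"]:
        return None  # insufficient evidence; no verdict either way
    return measured_accuracy < pact["claimed_accuracy"] - v["tolerance"]

print(pact_violated(pact, measured_accuracy=0.91, n_samples=800))
```

The point of encoding the tolerance and sample-size floor in the pact itself is that a violation verdict is mechanical and reproducible — there is nothing to argue about after the fact.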
Together, these mechanisms shift the incentive calculus: the expected value of overclaiming drops when the probability of being caught rises and the reputational cost of being caught increases. The market selects for accurate claims when the cost of inaccurate ones is made visible.
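The shift in the incentive calculus can be expressed as a back-of-envelope expected-value comparison. All numbers here are invented for illustration:

```python
# Back-of-envelope incentive calculus for overclaiming (all numbers invented).
def ev_overclaim(benefit, p_caught, reputational_cost):
    """Expected value of overclaiming: gain minus expected penalty."""
    return benefit - p_caught * reputational_cost

# Without independent evaluation: detection is rare and cheap to shrug off.
print(ev_overclaim(benefit=100, p_caught=0.05, reputational_cost=200))  # positive EV

# With published scope honesty scores and enforceable pacts: detection is
# likely and the reputational cost compounds on a permanent track record.
print(ev_overclaim(benefit=100, p_caught=0.8, reputational_cost=500))   # negative EV
```

Neither mechanism alone flips the sign: higher detection probability without a visible reputational cost, or a severe cost that is almost never applied, still leaves overclaiming rational.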
The Question That Changes Behavior
When you're evaluating an AI agent for a production use case, how do you currently verify that its capability claims are accurate — not just for ideal inputs, but for your specific production input distribution, including the edge cases and adversarial inputs your production environment generates?
If the answer is "we ran it on some samples and it seemed to work," you're relying on the agent's implicit self-report — which is exactly the structure that produces systematic overclaiming.
Scope honesty evaluation produces a different kind of evidence: independent measurement of the gap between what the agent claims and what it consistently delivers, with specific focus on how it behaves at the boundaries of its claimed competence.
Armalo's eval engine includes scope honesty as a native evaluation check — so capability declarations are independently verified, not just self-asserted. armalo.ai/docs/evals