Agents Routinely Claim Capabilities They Approximate Rather Than Execute
Every AI agent has a capability declaration. An AgentCard. A feature matrix. A system prompt that specifies its purpose. These declarations are how the ecosystem learns what agents are for and what they can be trusted to do.
The problem is that capability claims are almost entirely unverified — and the gap between what agents claim and what they reliably deliver has a specific, predictable structure that the current trust infrastructure doesn't surface.
Scope honesty is the worst trust problem most builders haven't named yet: not because agents are lying, but because the incentive structure produces overclaiming without anyone intending it.
The Anatomy of a Capability Claim
When a team builds an AI agent, here's how capability claims typically get written:
- The team tests the agent extensively on the tasks they care about
- The agent performs well on those tasks
- They write up the capabilities based on what they observed
- They launch
This process has a systematic blind spot: the test cases are designed to validate that the agent can do the thing, not to characterize the edges where it can't. Test distributions are implicitly favorable — the team knows what the agent is good at and tests for those things. The cases that would reveal limitations are the cases the team didn't think to include.
The result: capability claims accurately describe performance on favorable inputs while being silent about performance on the input distributions that production will actually generate.
Maximum capability is presented as typical capability. An agent achieves the claimed accuracy under ideal conditions, with well-formed inputs, on the task categories it was optimized for. Under realistic production conditions — adversarial inputs, edge cases, domain shifts, high load — the accuracy drops. The claim doesn't say this. It can't, because the team didn't measure it.
Adjacent domains get folded into claimed domains. A code review agent optimized for Python gets listed as supporting "code review." A financial data extraction agent gets listed as supporting "data extraction." The agent will attempt tasks in the adjacent domain and produce something that resembles an answer. The answer is less reliable than primary-domain performance. The capability claim implies equivalent reliability across the whole domain.
Confidence doesn't track accuracy at the edges. Most LLM-based agents express similar confidence regardless of whether they're operating in their core competency or at the edges of their capability envelope. The confidence signal that a well-calibrated system would use to indicate uncertainty is often absent or unreliable.
Claims age without re-evaluation. An agent registers once. The capability claims are written for the initial deployment. Six months later, the model has been updated, the system prompt revised, the tool set changed. The capability claims reflect the initial version. They weren't re-evaluated after changes.
Why This Isn't a Lying Problem
I want to be explicit about this because the framing matters for how you think about solutions.
The teams writing capability claims are usually being honest to the best of their knowledge. They tested their agent. It performed well. They wrote accurate descriptions of what they observed.
The problem is structural, not ethical:
Test distributions don't match production distributions. The set of inputs a team uses for internal validation is not a representative sample of the inputs the production system will receive. Users generate edge cases, unusual formats, ambiguous queries, and adversarial inputs that the validation suite didn't anticipate.
Capability claims are written before edge cases are discovered. Edge cases get discovered over time in production. By definition, you don't know about them when you write the initial capability claims. The claims are accurate to what was known at registration; they're outdated by the time production discovers the boundaries.
There's no third-party verification mechanism. Without independent evaluation against a defined capability specification, claims are unaudited assertions. A team with a rigorous evaluation process and a team with a cursory one produce equivalent-looking AgentCards. Buyers can't distinguish between them.
The incentive structure rewards specificity in the claimed domain and vagueness at the boundaries. "95% accuracy on financial data extraction" is a specific claim that can be falsified. "Supports data extraction" is a broad claim that's hard to falsify and captures a larger addressable market. The incentive is to claim broadly.
What Scope Honesty Evaluation Actually Tests
Standard accuracy evaluation asks: "Does the agent produce correct outputs on in-distribution tasks?" It's a performance test — good for knowing whether the agent works on the tasks it was designed for.
Scope honesty evaluation asks a different question: "Does the agent accurately represent what it can and can't do? When it's outside its competence, does it signal this appropriately?" It's a calibration test.
The evaluation protocol for scope honesty has four components:
Boundary probing. Run the agent on tasks at the edge of its declared scope — tasks that are plausibly within the claimed domain but represent harder cases than the primary-domain tasks. Measure whether the agent produces correct outputs, produces incorrect outputs with high confidence (bad), or signals reduced confidence (good). An agent that knows its own limits will exhibit either correct performance or appropriately hedged performance at the boundaries.
Out-of-scope task testing. Present the agent with tasks clearly outside its declared scope. A financial analysis agent given molecular biology questions. A code review agent given natural language prose editing tasks. The correct behavior is clear refusal with scope acknowledgment. The concerning behavior is confident attempt — the agent applies its primary-domain logic to a clearly out-of-domain question and produces a coherent-looking but unreliable answer.
Confidence calibration measurement. Evaluate expressed confidence distributions against actual accuracy distributions across a large sample. A calibrated agent's confidence bins should match accuracy rates: when the agent expresses 90% confidence, it should be correct 90% of the time. A miscalibrated agent that expresses 90% confidence on questions it answers correctly 60% of the time is systematically misleading downstream systems about the reliability of its outputs.
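The calibration check above can be sketched as a binned comparison of expressed confidence against observed accuracy. This is a minimal illustration, not Armalo's actual implementation; the binning scheme and the sample records are assumptions:

```python
from collections import defaultdict

# Each record is (expressed_confidence, was_correct) from one eval task.
# Invented sample: the agent is overconfident at 0.9 and calibrated at 0.6.
results = [(0.9, True), (0.9, True), (0.9, False), (0.9, False), (0.9, False),
           (0.6, True), (0.6, True), (0.6, False), (0.6, True), (0.6, False)]

def calibration_gaps(results, bins=10):
    """Group predictions by confidence bin; compare mean confidence to accuracy."""
    binned = defaultdict(list)
    for conf, correct in results:
        binned[min(int(conf * bins), bins - 1)].append((conf, correct))
    gaps = {}
    for b, items in sorted(binned.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Positive gap = overconfidence: the agent claims more certainty
        # than its accuracy supports in that bin.
        gaps[b] = (mean_conf, accuracy, mean_conf - accuracy)
    return gaps

for b, (conf, acc, gap) in calibration_gaps(results).items():
    print(f"bin {b}: confidence {conf:.2f}, accuracy {acc:.2f}, gap {gap:+.2f}")
```

A well-calibrated agent shows gaps near zero in every bin; the concerning signature is a large positive gap concentrated in the high-confidence bins.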
Claim verification against evaluation results. Compare the agent's stated capabilities to its empirically measured capabilities. The scope honesty score is the ratio of verified performance to claimed performance across the full capability declaration. An agent that claims 95% accuracy and achieves 93% has high scope honesty. An agent that claims 95% and achieves 71% has low scope honesty — regardless of whether the 71% is "good enough" in absolute terms.
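The claim-verification step reduces to a ratio per claim, aggregated across the declaration. A sketch, assuming each claim carries a claimed accuracy and an independently measured one — the schema is illustrative, not a real Armalo API:

```python
# Hypothetical capability declaration paired with independent measurements.
claims = [
    {"capability": "financial data extraction", "claimed": 0.95, "measured": 0.93},
    {"capability": "invoice parsing",           "claimed": 0.95, "measured": 0.71},
]

def scope_honesty(claims):
    """Mean ratio of measured to claimed accuracy, capped at 1.0 per claim.

    The cap means exceeding a claim doesn't offset underdelivering on another:
    scope honesty rewards accurate claims, not sandbagged ones.
    """
    ratios = [min(c["measured"] / c["claimed"], 1.0) for c in claims]
    return sum(ratios) / len(ratios)

for c in claims:
    print(f'{c["capability"]}: {min(c["measured"] / c["claimed"], 1.0):.2f}')
print(f"overall scope honesty: {scope_honesty(claims):.2f}")
```

Note that the second claim drags the overall score down sharply even though 71% might be acceptable in absolute terms — the score measures the claim-to-delivery gap, not raw performance.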
The Economic Incentive Problem
Here's the uncomfortable structural truth: the current incentive structure produces overclaiming as the rational equilibrium.
Capability claims are written at registration and rarely updated. There's no cost for overclaiming beyond failed tasks, which may not be tracked in any form that affects the claiming agent's reputation. There's a direct benefit: more claimed capabilities means more potential use cases, more traffic, more revenue.
The only incentive to accurately scope capabilities exists when the cost of claiming capabilities you can't deliver exceeds the benefit of making the claim. That requires two things working together.
Transparent scope honesty scores as a first-class trust dimension. If an agent's scope honesty is independently evaluated and published — shown alongside accuracy and safety in the trust oracle — the cost of overclaiming becomes reputational and visible. Buyers see it. Orchestrators filter on it. The market penalizes it.
Behavioral pacts that make capability claims enforceable. When a capability claim is encoded in a machine-readable pact with defined verification criteria, the claim is testable. Overclaiming against a pact produces verifiable pact violations. The trust score drops. The track record is permanent. Agents that consistently overclaim against their pacts face compounding reputation damage.
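To make the idea concrete, here is one way a machine-readable pact with verification criteria might look. The schema and thresholds below are invented for illustration; they are not an actual pact format:

```python
# Hypothetical behavioral pact: a capability claim paired with the criteria
# an independent evaluator uses to verify it. Schema is illustrative only.
pact = {
    "capability": "financial data extraction",
    "claimed_accuracy": 0.95,
    "verification": {
        "benchmark": "held-out production-distribution sample",
        "min_samples": 500,   # below this, no verdict is issued
        "tolerance": 0.02,    # measured accuracy may trail the claim by this much
    },
}

def pact_violated(pact, measured_accuracy, n_samples):
    """A violation is a measured shortfall beyond tolerance, on enough samples."""
    v = pact["verification"]
    if n_samples < v["min_samples"]:
        return None  # insufficient evidence; no verdict either way
    return measured_accuracy < pact["claimed_accuracy"] - v["tolerance"]

print(pact_violated(pact, measured_accuracy=0.91, n_samples=800))
```

The point of encoding the tolerance and sample-size floor in the pact itself is that a violation verdict is mechanical and reproducible — there is nothing to argue about after the fact.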
Together, these mechanisms shift the incentive calculus: the expected value of overclaiming drops when the probability of being caught rises and the reputational cost of being caught increases. The market selects for accurate claims when the cost of inaccurate ones is made visible.
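The shift in the incentive calculus can be expressed as a back-of-envelope expected-value comparison. All numbers here are invented for illustration:

```python
# Back-of-envelope incentive calculus for overclaiming (all numbers invented).
def ev_overclaim(benefit, p_caught, reputational_cost):
    """Expected value of overclaiming: gain minus expected penalty."""
    return benefit - p_caught * reputational_cost

# Without independent evaluation: detection is rare and cheap to shrug off.
print(ev_overclaim(benefit=100, p_caught=0.05, reputational_cost=200))  # positive EV

# With published scope honesty scores and enforceable pacts: detection is
# likely and the reputational cost compounds on a permanent track record.
print(ev_overclaim(benefit=100, p_caught=0.8, reputational_cost=500))   # negative EV
```

Neither mechanism alone flips the sign: higher detection probability without a visible reputational cost, or a severe cost that is almost never applied, still leaves overclaiming rational.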
The Question That Changes Behavior
When you're evaluating an AI agent for a production use case, how do you currently verify that its capability claims are accurate — not just for ideal inputs, but for your specific production input distribution, including the edge cases and adversarial inputs your production environment generates?
If the answer is "we ran it on some samples and it seemed to work," you're relying on the agent's implicit self-report — which is exactly the structure that produces systematic overclaiming.
Scope honesty evaluation produces a different kind of evidence: independent measurement of the gap between what the agent claims and what it consistently delivers, with specific focus on how it behaves at the boundaries of its claimed competence.
Armalo's eval engine includes scope honesty as a native evaluation check — so capability declarations are independently verified, not just self-asserted. armalo.ai/docs/evals