MCP Solved Tool Calling. It Did Not Solve What Happens When the Tool Lies.
The Model Context Protocol is a serious piece of engineering. It's become the de facto standard for giving AI agents access to external tools, data sources, and APIs. The ecosystem is growing fast — hundreds of MCP servers, all the major agent frameworks, deep IDE integration. MCP solved the interoperability problem for tool calling. That's a real problem. It's now solved.
MCP answers: how should agents invoke external tools in a standard way? It does not answer: how should agents evaluate whether the tool's response is trustworthy? This is the gap that production deployments keep hitting — not tool invocation failures, but tool output failures that look like successes.
What MCP Does and Doesn't Guarantee
When an agent calls an MCP tool, it gets a response. The response conforms to the MCP schema. The tool call succeeded. The agent proceeds.
Nothing in this flow tells the agent whether the response is accurate. Whether it's current. Whether it reflects the actual state of the external system or a stale cache from 48 hours ago. Whether the tool itself is behaving within its declared capability scope.
MCP is a pipe. It makes pipes easy to create and standardize. It says nothing about what's flowing through the pipe.
This seems obvious when stated plainly. In practice, it's the source of a class of production failures that are genuinely hard to catch — precisely because MCP's well-designed schema validation creates a false sense of verification. When the JSON is valid and the fields are present, the agent has no native signal to indicate the content is wrong.
The Confident Nonsense Problem
Here's the failure mode that hurts the most: a tool returns a response that is well-formed, confidence-evoking, and wrong.
Not wrong in an obvious way. Wrong in a way that a language model downstream will interpret as correct and reason from. The JSON is valid. The field names match the schema. The values are plausible. The model has no signal that anything is off.
This is qualitatively different from a tool error. Tool errors are loud. The schema breaks. The status code is non-200. The agent knows to fall back, retry, or escalate. Tool errors are detectable and the current infrastructure handles them well.
Confident nonsense is silent. The response looks like a success. The model proceeds. The downstream decision is made on a foundation that isn't there. And unlike hallucination — where you can debate whether the model knew better — this failure is structural. The tool reported it, the agent believed it, and neither layer knew the content was wrong.
Where This Actually Happens
A concrete scenario that recurs in production:
Stale cached data presented as live. A financial data MCP server caches pricing data. The cache invalidation logic has a bug: under high load, the invalidation job backs up and cache entries persist past their intended TTL. The agent asks for current AAPL price at 2:15pm and gets a number that was accurate at 11:48am. The response includes no staleness signal — no last_updated field, no cache_hit: true flag, no indication that this isn't real-time data. The MCP schema validation passes. The model makes a downstream decision on a 2.5-hour-old price.
The failure is invisible in the agent's logs. The tool call succeeded. The response was schema-valid. The model expressed appropriate confidence. Only the outcome was wrong.
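A defensive check at the agent boundary can at least refuse to treat unsignaled data as fresh. A minimal sketch with illustrative values, assuming a hypothetical `data_as_of` field — which, as the scenario shows, the server does not actually provide, and the check treats that absence as a failure rather than a pass:

```python
from datetime import datetime, timedelta, timezone

def is_fresh_enough(response: dict, max_age: timedelta) -> bool:
    """Reject tool responses older than the caller's staleness tolerance."""
    data_as_of = response.get("data_as_of")
    if data_as_of is None:
        # No staleness signal at all -- the dangerous default today.
        # Treat missing metadata as unverifiable, not as fresh.
        return False
    age = datetime.now(timezone.utc) - datetime.fromisoformat(data_as_of)
    return age <= max_age

# The 11:48am price from the scenario, checked hours (or days) later:
stale = {"symbol": "AAPL", "price": 192.40,
         "data_as_of": "2024-01-15T11:48:00+00:00"}
is_fresh_enough(stale, max_age=timedelta(minutes=5))  # False
```

The key design choice is the missing-metadata branch: absent a staleness signal, the guard fails closed instead of assuming the data is live.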
Beyond staleness, the patterns cluster around a few predictable areas:
Scope violations that produce plausible-looking results. A tool optimized for queries in domain X receives a domain Y question. Rather than returning an error or a low-confidence signal, the tool applies its domain X logic and produces a coherent-looking but unreliable answer. No indication of scope mismatch in the response.
Confident interpolation without disclosure. A tool is asked about a specific case it has no ground-truth data on. Rather than returning "no data," it returns a statistically plausible value based on adjacent cases. Often this is the right behavior — but the response includes no uncertainty signal distinguishing interpolated values from measured ones. The agent cannot distinguish ground truth from estimation.
Schema compliance ≠ semantic correctness. Validation catches structural problems and nothing else. A perfectly schema-compliant JSON response that is semantically incorrect passes every gate in the MCP toolchain and registers as a success.
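To make the distinction concrete, here is a toy structural validator — a stand-in for real schema validation, not MCP's actual mechanism. It checks field presence and types, which is all structural validation can do, so a semantically wrong price sails through:

```python
def structurally_valid(response: dict, schema: dict) -> bool:
    # Checks field presence and types only -- what schema validation does.
    # It has no way to know whether the values reflect reality.
    return all(
        field in response and isinstance(response[field], ftype)
        for field, ftype in schema.items()
    )

schema = {"symbol": str, "price": float}
# Semantically wrong (hours out of date), structurally perfect:
stale_response = {"symbol": "AAPL", "price": 189.91}
structurally_valid(stale_response, schema)  # True -- registers as a success
```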
Why This Is an Infrastructure Problem, Not a Prompt Problem
The instinct is to patch this at the prompt layer: "instruct the model to be skeptical of tool responses." Some teams add system prompt language like "always verify critical data from multiple sources before acting." This works at the margins. It doesn't solve the structural problem for three reasons.
First, you can't specify skepticism precisely enough in natural language to be reliable. "Be skeptical of financial data" is ambiguous — skeptical how? To what degree? The model will interpret this inconsistently across calls.
Second, the model has no independent basis for evaluating the tool's trustworthiness. It can express skepticism, but it can't actually verify what it's skeptical about. Skepticism without verification is theater.
Third, the failure mode is structural: there is no layer in the current MCP stack that evaluates tool response quality before the response enters the agent's reasoning context. Every tool response — accurate or stale, ground truth or interpolation — enters with the same epistemic weight. Schema validation is the only gate.
What the infrastructure needs:
Behavioral SLAs for MCP servers. An MCP server that claims to provide accurate financial data should have a behavioral pact: accuracy ≥ X% on a defined test suite, staleness ≤ Y minutes, explicit disclosure when responses are interpolated. These commitments should be independently verified and published alongside the tool registration — queryable before the tool is invoked.
Staleness metadata as a first-class field. The MCP spec could be extended to include a data_freshness envelope that tool servers populate: { data_as_of: "2024-01-15T14:23:00Z", is_cached: true, cache_ttl_s: 300 }. Agents could then make explicit decisions about whether to proceed based on staleness tolerance. This is the MCP equivalent of HTTP cache-control headers — infrastructure that exists in every other data transport and is conspicuously absent here.
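On the agent side, consuming that envelope could be as simple as comparing data age against the task's staleness tolerance. A sketch using the hypothetical field names above (the envelope is a proposal, not part of the current spec):

```python
from datetime import datetime, timedelta, timezone

def should_proceed(freshness: dict, tolerance: timedelta) -> bool:
    """Decide whether a response's data age is within the task's tolerance."""
    # Older Pythons' fromisoformat doesn't accept a trailing 'Z'.
    as_of = datetime.fromisoformat(
        freshness["data_as_of"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - as_of <= tolerance

envelope = {"data_as_of": "2024-01-15T14:23:00Z",
            "is_cached": True, "cache_ttl_s": 300}
should_proceed(envelope, tolerance=timedelta(minutes=5))  # False by now
```

The point is that the proceed/abort decision becomes explicit and per-task: a portfolio dashboard might tolerate hours of staleness; a trade execution path would not.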
Uncertainty propagation. If a tool response has known uncertainty characteristics, those characteristics should propagate into the agent's reasoning. "This response was generated from a cache last updated 47 minutes ago" is a different epistemic state than "this response was retrieved in real-time." Both look identical in a raw JSON response unless the protocol requires disclosure.
Jury evaluation for high-stakes tool outputs. For tool responses that will drive consequential decisions — financial transactions, medical information, legal research — route the response through independent verification before allowing it to drive actions. This doesn't make sense for every tool call, but for high-value calls, the evaluation cost is small relative to the failure cost.
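One way to keep the cost proportional is a routing policy: only tool calls tagged high-stakes pay for independent verification. A sketch under assumed names — `verify` stands in for whatever independent check you run, whether a second data source or a jury of evaluators:

```python
# Hypothetical set of tools whose outputs drive consequential actions.
HIGH_STAKES_TOOLS = {"execute_trade", "get_drug_interactions", "fetch_case_law"}

def gate(tool_name: str, response: dict, verify) -> dict:
    """Route high-stakes tool responses through independent verification."""
    if tool_name in HIGH_STAKES_TOOLS and not verify(response):
        raise RuntimeError(
            f"{tool_name}: response failed independent verification")
    return response  # low-stakes calls pass through unverified

# Usage: gate("execute_trade", resp, verify=second_source_check)
```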
MCP Is TCP
This is the cleanest analogy for understanding the gap.
TCP solved the transport problem for the internet. It specifies the delivery protocol. It guarantees that bytes arrive in order, without corruption, from the sender to the receiver. TCP doesn't guarantee that what was delivered is true. A website can serve accurate HTML or adversarial HTML — TCP treats them identically. The application layer has to care about content.
MCP is TCP for agent tool calling. It specifies the transport protocol. It guarantees delivery format. It doesn't guarantee that what was delivered is semantically correct.
The application layer — in this case, the agent's trust infrastructure — still has to care about content.
We built the HTTP stack before we built TLS. We deployed e-commerce before we built fraud detection. We had TCP-layer packet integrity long before we had application-layer content authenticity. The pattern is consistent: infrastructure layers are built bottom-up, trust is added after the transport is stable.
MCP is stable. The tool output trust layer is what gets built next.
The Practical Question
What's your current approach to evaluating whether an MCP tool response should be trusted before reasoning from it?
Not at the model layer — at the infrastructure layer. Is there a behavioral SLA for the tool? A staleness check? A confidence signal? An independent evaluation result you can query? Or does your agent reason from every well-formed tool response as if it were ground truth, with the only verification being that the JSON parsed?
The teams that have been burned by stale data or scope-violating tool responses have all built something in response. The teams that haven't been burned yet haven't needed to. The gap in the MCP spec is that it leaves this entirely to individual teams rather than making it a first-class protocol concern.
Armalo's eval infrastructure extends to MCP tool output verification. Define behavioral pacts for your MCP servers, run independent accuracy evaluations, and expose trust scores that your agents can query before reasoning from tool responses. armalo.ai