MCP Solved Tool Calling. It Did Not Solve What Happens When the Tool Lies.
The Model Context Protocol is a serious piece of engineering. It's become the de facto standard for giving AI agents access to external tools, data sources, and APIs. The ecosystem is growing fast — hundreds of MCP servers, all the major agent frameworks, deep IDE integration. MCP solved the interoperability problem for tool calling. That's a real problem. It's now solved.
MCP answers: how should agents invoke external tools in a standard way? It does not answer: how should agents evaluate whether the tool's response is trustworthy? This is the gap that production deployments keep hitting — not tool invocation failures, but tool output failures that look like successes.
What MCP Does and Doesn't Guarantee
When an agent calls an MCP tool, it gets a response. The response conforms to the MCP schema. The tool call succeeded. The agent proceeds.
Nothing in this flow tells the agent whether the response is accurate. Whether it's current. Whether it reflects the actual state of the external system or a stale cache from 48 hours ago. Whether the tool itself is behaving within its declared capability scope.
MCP is a pipe. It makes pipes easy to create and standardize. It says nothing about what's flowing through the pipe.
This seems obvious when stated plainly. In practice, it's the source of a class of production failures that are genuinely hard to catch — precisely because MCP's well-designed schema validation creates a false sense of verification. When the JSON is valid and the fields are present, the agent has no native signal to indicate the content is wrong.
The Confident Nonsense Problem
Here's the failure mode that hurts the most: a tool returns a response that is well-formed, confidence-evoking, and wrong.
Not wrong in an obvious way. Wrong in a way that a language model downstream will interpret as correct and reason from. The JSON is valid. The field names match the schema. The values are plausible. The model has no signal that anything is off.
This is qualitatively different from a tool error. Tool errors are loud. The schema breaks. The status code is non-200. The agent knows to fall back, retry, or escalate. Tool errors are detectable and the current infrastructure handles them well.
Confident nonsense is silent. The response looks like a success. The model proceeds. The downstream decision is made on a foundation that isn't there. And unlike hallucination — where you can debate whether the model knew better — this failure is structural. The tool reported it, the agent believed it, and neither layer knew the content was wrong.
Where This Actually Happens
A concrete scenario that recurs in production:
Stale cached data presented as live. A financial data MCP server caches pricing data. The cache invalidation logic has a bug: under high load, the invalidation job backs up and cache entries persist past their intended TTL. The agent asks for current AAPL price at 2:15pm and gets a number that was accurate at 11:48am. The response includes no staleness signal — no last_updated field, no cache_hit: true flag, no indication that this isn't real-time data. The MCP schema validation passes. The model makes a downstream decision on a 2.5-hour-old price.
The failure is invisible in the agent's logs. The tool call succeeded. The response was schema-valid. The model expressed appropriate confidence. Only the outcome was wrong.
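A defensive check at the agent boundary can at least refuse to treat unsignaled data as fresh. A minimal sketch with illustrative values, assuming a hypothetical `data_as_of` field — which, as the scenario shows, the server does not actually provide, and the check treats that absence as a failure rather than a pass:

```python
from datetime import datetime, timedelta, timezone

def is_fresh_enough(response: dict, max_age: timedelta) -> bool:
    """Reject tool responses older than the caller's staleness tolerance."""
    data_as_of = response.get("data_as_of")
    if data_as_of is None:
        # No staleness signal at all -- the dangerous default today.
        # Treat missing metadata as unverifiable, not as fresh.
        return False
    age = datetime.now(timezone.utc) - datetime.fromisoformat(data_as_of)
    return age <= max_age

# The 11:48am price from the scenario, checked hours (or days) later:
stale = {"symbol": "AAPL", "price": 192.40,
         "data_as_of": "2024-01-15T11:48:00+00:00"}
is_fresh_enough(stale, max_age=timedelta(minutes=5))  # False
```

The key design choice is the missing-metadata branch: absent a staleness signal, the guard fails closed instead of assuming the data is live.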
Beyond staleness, the patterns cluster around a few predictable areas:
Scope violations that produce plausible-looking results. A tool optimized for queries in domain X receives a domain Y question. Rather than returning an error or a low-confidence signal, the tool applies its domain X logic and produces a coherent-looking but unreliable answer. No indication of scope mismatch in the response.
Confident interpolation without disclosure. A tool is asked about a specific case it has no ground-truth data on. Rather than returning "no data," it returns a statistically plausible value based on adjacent cases. Often this is the right behavior — but the response includes no uncertainty signal distinguishing interpolated values from measured ones. The agent cannot distinguish ground truth from estimation.
Schema compliance ≠ semantic correctness. Validation catches structural problems and nothing else. A perfectly schema-compliant JSON response that is semantically incorrect passes every gate in the MCP toolchain and registers as a success.
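To make the distinction concrete, here is a toy structural validator — a stand-in for real schema validation, not MCP's actual mechanism. It checks field presence and types, which is all structural validation can do, so a semantically wrong price sails through:

```python
def structurally_valid(response: dict, schema: dict) -> bool:
    # Checks field presence and types only -- what schema validation does.
    # It has no way to know whether the values reflect reality.
    return all(
        field in response and isinstance(response[field], ftype)
        for field, ftype in schema.items()
    )

schema = {"symbol": str, "price": float}
# Semantically wrong (hours out of date), structurally perfect:
stale_response = {"symbol": "AAPL", "price": 189.91}
structurally_valid(stale_response, schema)  # True -- registers as a success
```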
Why This Is an Infrastructure Problem, Not a Prompt Problem
The instinct is to patch this at the prompt layer: "instruct the model to be skeptical of tool responses." Some teams add system prompt language like "always verify critical data from multiple sources before acting." This works at the margins. It doesn't solve the structural problem for three reasons.
First, you can't specify skepticism precisely enough in natural language to be reliable. "Be skeptical of financial data" is ambiguous — skeptical how? To what degree? The model will interpret this inconsistently across calls.
Second, the model has no independent basis for evaluating the tool's trustworthiness. It can express skepticism, but it can't actually verify what it's skeptical about. Skepticism without verification is theater.
Third, the failure mode is structural: there is no layer in the current MCP stack that evaluates tool response quality before the response enters the agent's reasoning context. Every tool response — accurate or stale, ground truth or interpolation — enters with the same epistemic weight. Schema validation is the only gate.
What the infrastructure needs:
Behavioral SLAs for MCP servers. An MCP server that claims to provide accurate financial data should have a behavioral pact: accuracy ≥ X% on a defined test suite, staleness ≤ Y minutes, explicit disclosure when responses are interpolated. These commitments should be independently verified and published alongside the tool registration — queryable before the tool is invoked.
Staleness metadata as a first-class field. The MCP spec could be extended to include a data_freshness envelope that tool servers populate: { data_as_of: "2024-01-15T14:23:00Z", is_cached: true, cache_ttl_s: 300 }. Agents could then make explicit decisions about whether to proceed based on staleness tolerance. This is the MCP equivalent of HTTP cache-control headers — infrastructure that exists in every other data transport and is conspicuously absent here.
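On the agent side, consuming that envelope could be as simple as comparing data age against the task's staleness tolerance. A sketch using the hypothetical field names above (the envelope is a proposal, not part of the current spec):

```python
from datetime import datetime, timedelta, timezone

def should_proceed(freshness: dict, tolerance: timedelta) -> bool:
    """Decide whether a response's data age is within the task's tolerance."""
    # Older Pythons' fromisoformat doesn't accept a trailing 'Z'.
    as_of = datetime.fromisoformat(
        freshness["data_as_of"].replace("Z", "+00:00"))
    return datetime.now(timezone.utc) - as_of <= tolerance

envelope = {"data_as_of": "2024-01-15T14:23:00Z",
            "is_cached": True, "cache_ttl_s": 300}
should_proceed(envelope, tolerance=timedelta(minutes=5))  # False by now
```

The point is that the proceed/abort decision becomes explicit and per-task: a portfolio dashboard might tolerate hours of staleness; a trade execution path would not.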
Uncertainty propagation. If a tool response has known uncertainty characteristics, those characteristics should propagate into the agent's reasoning. "This response was generated from a cache last updated 47 minutes ago" is a different epistemic state than "this response was retrieved in real-time." Both look identical in a raw JSON response unless the protocol requires disclosure.
Jury evaluation for high-stakes tool outputs. For tool responses that will drive consequential decisions — financial transactions, medical information, legal research — route the response through independent verification before allowing it to drive actions. This doesn't make sense for every tool call, but for high-value calls, the evaluation cost is small relative to the failure cost.
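One way to keep the cost proportional is a routing policy: only tool calls tagged high-stakes pay for independent verification. A sketch under assumed names — `verify` stands in for whatever independent check you run, whether a second data source or a jury of evaluators:

```python
# Hypothetical set of tools whose outputs drive consequential actions.
HIGH_STAKES_TOOLS = {"execute_trade", "get_drug_interactions", "fetch_case_law"}

def gate(tool_name: str, response: dict, verify) -> dict:
    """Route high-stakes tool responses through independent verification."""
    if tool_name in HIGH_STAKES_TOOLS and not verify(response):
        raise RuntimeError(
            f"{tool_name}: response failed independent verification")
    return response  # low-stakes calls pass through unverified

# Usage: gate("execute_trade", resp, verify=second_source_check)
```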
MCP Is TCP
This is the cleanest analogy for understanding the gap.
TCP solved the transport problem for the internet. It specifies the delivery protocol. It guarantees that bytes arrive in order, without corruption, from the sender to the receiver. TCP doesn't guarantee that what was delivered is true. A website can serve accurate HTML or adversarial HTML — TCP treats them identically. The application layer has to care about content.
MCP is TCP for agent tool calling. It specifies the transport protocol. It guarantees delivery format. It doesn't guarantee that what was delivered is semantically correct.
The application layer — in this case, the agent's trust infrastructure — still has to care about content.
We built the HTTP stack before we built TLS. We deployed e-commerce before we built fraud detection. We had TCP-layer packet integrity long before we had application-layer content authenticity. The pattern is consistent: infrastructure layers are built bottom-up, trust is added after the transport is stable.
MCP is stable. The tool output trust layer is what gets built next.
The Practical Question
What's your current approach to evaluating whether an MCP tool response should be trusted before reasoning from it?
Not at the model layer — at the infrastructure layer. Is there a behavioral SLA for the tool? A staleness check? A confidence signal? An independent evaluation result you can query? Or does your agent reason from every well-formed tool response as if it were ground truth, with the only verification being that the JSON parsed?
The teams that have been burned by stale data or scope-violating tool responses have all built something in response. The teams that haven't been burned yet haven't needed to. The gap in the MCP spec is that it leaves this entirely to individual teams rather than making it a first-class protocol concern.
Armalo's eval infrastructure extends to MCP tool output verification. Define behavioral pacts for your MCP servers, run independent accuracy evaluations, and expose trust scores that your agents can query before reasoning from tool responses. armalo.ai