RAG Solved Knowledge Access. The Harder Problem Is Knowing When the Agent Should Say It Doesn't Know.
Retrieval-Augmented Generation is now the default pattern for giving AI agents access to organizational knowledge. The pattern is well-established, well-tooled, and genuinely useful. An agent retrieves relevant context, grounds its response in the retrieved documents, and produces outputs that are more accurate and contain fewer hallucinations than pure LLM generation. RAG addresses the core hallucination problem, and the teams that built it solved a real one.
RAG answers: how should agents access external knowledge when generating responses? It does not answer: how should agents behave when the retrieved context doesn't contain what they need? The second question is where production RAG deployments accumulate their failure debt — silently, in a way that looks like success.
What RAG Gets Right
RAG replaces confabulation with retrieval. The agent checks the knowledge base before generating. The output is grounded in actual documents. Accuracy on knowledge-base questions improves dramatically over pure generation.
This improvement is real and significant. A support agent that retrieves your actual product documentation before answering is measurably more reliable than one generating from base training data. A legal research agent that retrieves relevant case law is more reliable than one guessing at precedent. The retrieval layer is load-bearing.
The honest question is what happens at the boundary. When retrieval works well, RAG works well. When retrieval fails and that failure is invisible, RAG fails in the worst possible way: confidently.
The Three Retrieval Scenarios
When a user asks a question, the RAG retrieval step produces one of three outcomes:
High-relevance retrieval. The context window fills with documents that directly address the query. The model generates from good context. The answer is accurate and well-grounded. This is the scenario RAG was designed for and handles well.
Partial-relevance retrieval. The retrieved documents are adjacent to the question but not directly answering it. They're topically related, plausible context, but the actual answer isn't in them. The model, reasoning from adjacent context, generates something that sounds right and often isn't. This failure mode is common, difficult to detect, and produces the category of errors that erodes user trust most gradually — consistently plausible-sounding answers that are subtly wrong.
No-relevance retrieval. The knowledge base doesn't contain documents relevant to the question. The retrieval step returns low-relevance results or nothing. The model, with a near-empty context window, generates from its base training distribution. This is pure hallucination with a RAG wrapper. The system logs show a successful retrieval-and-generation. The answer is fabricated.
The first scenario is handled. The second and third are where RAG fails silently, and both look identical to success in the system's logs: tool call completed, response returned, status 200.
Why Cosine Similarity Is an Insufficient Confidence Proxy
The naive solution is to use the retrieval similarity score as a confidence proxy. If the retrieved context has high cosine similarity to the query, proceed. If similarity is low, hedge or decline.
Two problems with this.
Similarity doesn't imply relevance. A query about employee termination procedures retrieves documents about employee benefits with high cosine similarity. Both are in the "employee" semantic neighborhood. Neither answers the question. The model receives high-similarity but irrelevant context and generates a confident answer about termination procedures based on benefits documents. The similarity score looked fine. The answer was wrong.
This failure mode — topically adjacent retrieval that appears relevant but isn't — is exactly the partial-relevance scenario that produces the most insidious errors. It's more common than the no-relevance case and harder to detect because the similarity scores don't flag it.
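The shared-neighborhood effect is easy to demonstrate even without real embeddings. The sketch below uses bag-of-words cosine similarity as a crude stand-in for embedding similarity; the phrases are illustrative, but the mechanism is the same one that bites production bi-encoders:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity: a crude stand-in for embedding
    similarity, enough to show the shared-neighborhood effect."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

query = "employee termination procedure"
benefits_doc = "employee benefits procedure and enrollment"
unrelated_doc = "quarterly revenue forecast"

# Shared vocabulary ("employee", "procedure") inflates the score even
# though the benefits document cannot answer the termination question.
print(round(cosine(query, benefits_doc), 2))   # → 0.52
print(round(cosine(query, unrelated_doc), 2))  # → 0.0
```

The benefits document scores far above the unrelated one, yet both are equally useless for answering the question. A threshold tuned to reject the unrelated document will happily pass the adjacent one.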
The model doesn't know what it doesn't know. A language model reasoning from plausible-but-incorrect context will often generate confident, well-structured responses. The model has no access to ground truth. It doesn't know the context window contains the wrong documents. The confidence signal in the response comes from the model's internal fluency, which is high regardless of whether the context is actually relevant to the question.
The combination — high-similarity retrieval of wrong context, followed by confident model generation — produces the worst failure mode: answers that are wrong and appear authoritative. The user has no signal to disbelieve.
What Calibrated Uncertainty Requires
This problem is structurally tractable. The ingredients are known:
Retrieval confidence stratification. Don't binary-classify retrieval quality. Stratify it: high confidence (retrieved context directly addresses the query), medium confidence (adjacent context, partial answer possible with disclosed uncertainty), low confidence (tangentially related or absent). Each stratum triggers different agent behavior. At low confidence, the agent defaults to "I don't have reliable information on this" rather than generating from weak retrieval. This requires explicit engineering decisions about the confidence thresholds and the behaviors they trigger.
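Stratification can be sketched as a small mapping from relevance scores to strata to behaviors. The threshold values here are assumptions for illustration; in practice they must be tuned against an evaluation set for your corpus:

```python
from enum import Enum

class Confidence(Enum):
    HIGH = "high"      # retrieved context directly addresses the query
    MEDIUM = "medium"  # adjacent context: answer with disclosed uncertainty
    LOW = "low"        # tangential or absent: decline

# Illustrative thresholds, not recommendations.
HIGH_THRESHOLD = 0.75
LOW_THRESHOLD = 0.40

def stratify(relevance_scores: list[float]) -> Confidence:
    """Map a retrieval pass to a confidence stratum by its best score.

    An empty list (retrieval returned nothing) lands in LOW.
    """
    best = max(relevance_scores, default=0.0)
    if best >= HIGH_THRESHOLD:
        return Confidence.HIGH
    if best >= LOW_THRESHOLD:
        return Confidence.MEDIUM
    return Confidence.LOW

def behavior_for(confidence: Confidence) -> str:
    """Each stratum triggers a different agent behavior."""
    return {
        Confidence.HIGH: "answer",
        Confidence.MEDIUM: "answer_with_disclosed_uncertainty",
        Confidence.LOW: "decline",  # "I don't have reliable information on this"
    }[confidence]
```

The point of the sketch is the explicit mapping: the thresholds and the behaviors they trigger are engineering decisions written down in code, not emergent model behavior.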
Cross-encoding for relevance verification. Cosine similarity via bi-encoder is fast but imprecise. For high-stakes retrievals, rerank the top-k results through a cross-encoder that scores the query-document pair jointly rather than independently. Cross-encoders are much better at detecting the topically-adjacent-but-irrelevant failure mode that bi-encoders miss. The compute cost is higher. The cost is worth it when the downstream decision is consequential.
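A reranking stage can be sketched library-agnostically by injecting the joint scorer as a callable. One real option for `score_pair` is a cross-encoder from the sentence-transformers library, but nothing below depends on it; the function names and interface here are assumptions for illustration:

```python
from typing import Callable

def rerank(query: str,
           candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3) -> list[tuple[str, float]]:
    """Rerank bi-encoder candidates with a joint (query, document) scorer.

    `score_pair` stands in for a cross-encoder's scoring function: it sees
    the query and document together, not as independent vectors, which is
    what lets it catch topically-adjacent-but-irrelevant documents.
    Returns the top_n documents with their joint scores, best first.
    """
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Wiring in a real model is a one-liner against this interface, at the compute cost the text describes: the cross-encoder runs one forward pass per (query, document) pair, so it is applied only to the bi-encoder's top-k, not the whole corpus.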
Explicit no-answer paths. Design the agent's behavioral contract to include explicit refusal conditions: "When retrieval confidence is below threshold X and no high-confidence documents are available, the agent MUST decline rather than generate from base distribution." This is a pact condition — it's testable and evaluatable. You can run your agent against a test suite of questions that have no answer in the knowledge base and verify that the agent consistently declines rather than hallucinating.
This is the evaluation most builders haven't run. They test whether the agent answers known questions correctly. They don't systematically test whether it declines appropriately on unknown questions. The second test reveals the false confidence rate that the first test completely misses.
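The decline evaluation itself is small once the test suite exists. A minimal sketch, assuming a hypothetical agent interface (a callable mapping a question to a dict with a boolean "declined" flag):

```python
def decline_rate(agent, unanswerable_questions: list[str]) -> float:
    """Fraction of known-unanswerable questions the agent declines.

    `agent` is a hypothetical interface for illustration: a callable
    that maps a question to a response dict with a "declined" flag.
    The hard part is curating `unanswerable_questions`, not this loop.
    """
    declined = sum(1 for q in unanswerable_questions if agent(q)["declined"])
    return declined / len(unanswerable_questions)

# The pact condition becomes a plain assertion in CI:
#   assert decline_rate(agent, no_answer_suite) >= 0.95
```

Running this in CI turns "the agent should decline" from an aspiration into a gate that fails the build when uncertainty handling regresses.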
Uncertainty acknowledgment as a first-class output. The agent's output schema should include an uncertainty field alongside the answer: { answer: "...", confidence: "medium", retrieval_basis: "adjacent_context", recommend_verification: true }. Downstream systems can route on this signal. A human review queue picks up medium-confidence answers. High-confidence answers proceed automatically. Low-confidence answers are declined or escalated without generation.
The Behavioral Pact for a RAG Agent
A RAG agent's behavioral commitments should explicitly include its uncertainty behavior, not just its accuracy on answerable questions.
Accuracy on answerable questions is the metric everyone measures. The pact condition: "Accuracy ≥ 90% on questions with high-relevance retrieval." This is standard.
Calibration on uncertain questions is the metric almost no one measures. The pact condition: "When retrieval confidence is below threshold X, agent declines in ≥ 95% of cases rather than generating speculatively." This requires a test suite of questions that have no good answer in the knowledge base — questions that the agent should decline. Building this test suite is non-trivial and often deprioritized.
False confidence rate on out-of-distribution queries is the metric that reveals the failure mode. The pact condition: "Rate of confident incorrect answers on queries with no relevant retrieval ≤ 2%." The 2% is illustrative — the right threshold depends on the stakes of the deployment. But the metric itself is what changes how you build. An agent you've never tested on out-of-distribution queries has an unknown false confidence rate. You're trusting its uncertainty handling based on hope, not evidence.
The third condition is the one most builders haven't written. Writing it forces you to build the evaluation infrastructure to measure it, which forces you to discover your actual false confidence rate before production does.
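The third pact condition can be measured with a few lines once an out-of-distribution query suite exists. A sketch, assuming a hypothetical agent interface (a callable returning a dict with "declined" and "confidence" keys):

```python
def false_confidence_rate(agent, ood_queries: list[str]) -> float:
    """Rate of confident answers on queries with no relevant retrieval.

    By construction no grounded answer exists for these queries, so any
    non-declined, high-confidence response counts as false confidence.
    `agent` is a hypothetical interface for illustration: a callable
    returning a dict with "declined" and "confidence" keys.
    """
    confident = sum(
        1 for q in ood_queries
        if not (r := agent(q))["declined"] and r["confidence"] == "high"
    )
    return confident / len(ood_queries)

# Pact condition (the 2% threshold is illustrative, per the text):
#   assert false_confidence_rate(agent, ood_suite) <= 0.02
```

Until this number is measured, the false confidence rate is unknown, which is the "hope, not evidence" situation the condition is meant to close.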
The Question
How does your RAG agent behave when the answer to a user's question isn't in your knowledge base? Is there a tested, evaluated fallback path — with a test suite of unanswerable questions and verified decline behavior? Or is the out-of-distribution behavior an emergent property of the model that you haven't formally characterized?
The answer to that question tells you more about your production risk than your retrieval accuracy score does.
Armalo's behavioral pact framework lets you define and evaluate your RAG agent's uncertainty handling — not just its accuracy on answerable questions, but its calibration when retrieval fails. armalo.ai/docs/pacts