RAG Solved Knowledge Access. The Harder Problem Is Knowing When the Agent Should Say It Doesn't Know.
Retrieval-Augmented Generation is now the default pattern for giving AI agents access to organizational knowledge. The pattern is well-established, well-tooled, and genuinely useful. An agent retrieves relevant context, grounds its response in the retrieved documents, and produces outputs that are more accurate and contain fewer hallucinations than pure LLM generation. RAG addresses the core hallucination problem, and the teams that built it solved a real one.
RAG answers: how should agents access external knowledge when generating responses? It does not answer: how should agents behave when the retrieved context doesn't contain what they need? The second question is where production RAG deployments accumulate their failure debt — silently, in a way that looks like success.
What RAG Gets Right
RAG replaces confabulation with retrieval. The agent checks the knowledge base before generating. The output is grounded in actual documents. Accuracy on knowledge-base questions improves dramatically over pure generation.
This improvement is real and significant. A support agent that retrieves your actual product documentation before answering is measurably more reliable than one generating from base training data. A legal research agent that retrieves relevant case law is more reliable than one guessing at precedent. The retrieval layer is load-bearing.
The honest question is what happens at the boundary. When retrieval works well, RAG works well. When retrieval fails and that failure is invisible, RAG fails in the worst possible way: confidently.
The Three Retrieval Scenarios
When a user asks a question, the RAG retrieval step produces one of three outcomes:
High-relevance retrieval. The context window fills with documents that directly address the query. The model generates from good context. The answer is accurate and well-grounded. This is the scenario RAG was designed for and handles well.
Partial-relevance retrieval. The retrieved documents are adjacent to the question but not directly answering it. They're topically related, plausible context, but the actual answer isn't in them. The model, reasoning from adjacent context, generates something that sounds right and often isn't. This failure mode is common, difficult to detect, and produces the category of errors that erodes user trust most gradually — consistently plausible-sounding answers that are subtly wrong.
No-relevance retrieval. The knowledge base doesn't contain documents relevant to the question. The retrieval step returns low-relevance results or nothing. The model, with a near-empty context window, generates from its base training distribution. This is pure hallucination with a RAG wrapper. The system logs show a successful retrieval-and-generation. The answer is fabricated.
The first scenario is handled. The second and third are where RAG fails silently, and both look identical to success in the system's logs: tool call completed, response returned, status 200.
Why Cosine Similarity Is an Insufficient Confidence Proxy
The naive solution is to use the retrieval similarity score as a confidence proxy. If the retrieved context has high cosine similarity to the query, proceed. If similarity is low, hedge or decline.
Two problems with this.
Similarity doesn't imply relevance. A query about employee termination procedures retrieves documents about employee benefits with high cosine similarity. Both are in the "employee" semantic neighborhood. Neither answers the question. The model receives high-similarity but irrelevant context and generates a confident answer about termination procedures based on benefits documents. The similarity score looked fine. The answer was wrong.
This failure mode — topically adjacent retrieval that appears relevant but isn't — is exactly the partial-relevance scenario that produces the most insidious errors. It's more common than the no-relevance case and harder to detect because the similarity scores don't flag it.
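The shared-neighborhood effect is easy to demonstrate even without real embeddings. The sketch below uses bag-of-words cosine similarity as a crude stand-in for embedding similarity; the phrases are illustrative, but the mechanism is the same one that bites production bi-encoders:

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity: a crude stand-in for embedding
    similarity, enough to show the shared-neighborhood effect."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

query = "employee termination procedure"
benefits_doc = "employee benefits procedure and enrollment"
unrelated_doc = "quarterly revenue forecast"

# Shared vocabulary ("employee", "procedure") inflates the score even
# though the benefits document cannot answer the termination question.
print(round(cosine(query, benefits_doc), 2))   # → 0.52
print(round(cosine(query, unrelated_doc), 2))  # → 0.0
```

The benefits document scores far above the unrelated one, yet both are equally useless for answering the question. A threshold tuned to reject the unrelated document will happily pass the adjacent one.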
The model doesn't know what it doesn't know. A language model reasoning from plausible-but-incorrect context will often generate confident, well-structured responses. The model has no access to ground truth. It doesn't know the context window contains the wrong documents. The confidence signal in the response comes from the model's internal fluency, which is high regardless of whether the context is actually relevant to the question.
The combination — high-similarity retrieval of wrong context, followed by confident model generation — produces the worst failure mode: answers that are wrong and appear authoritative. The user has no signal to disbelieve.
What Calibrated Uncertainty Requires
This problem is structurally tractable. The ingredients are known:
Retrieval confidence stratification. Don't binary-classify retrieval quality. Stratify it: high confidence (retrieved context directly addresses the query), medium confidence (adjacent context, partial answer possible with disclosed uncertainty), low confidence (tangentially related or absent). Each stratum triggers different agent behavior. At low confidence, the agent defaults to "I don't have reliable information on this" rather than generating from weak retrieval. This requires explicit engineering decisions about the confidence thresholds and the behaviors they trigger.
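Stratification can be sketched as a small mapping from relevance scores to strata to behaviors. The threshold values here are assumptions for illustration; in practice they must be tuned against an evaluation set for your corpus:

```python
from enum import Enum

class Confidence(Enum):
    HIGH = "high"      # retrieved context directly addresses the query
    MEDIUM = "medium"  # adjacent context: answer with disclosed uncertainty
    LOW = "low"        # tangential or absent: decline

# Illustrative thresholds, not recommendations.
HIGH_THRESHOLD = 0.75
LOW_THRESHOLD = 0.40

def stratify(relevance_scores: list[float]) -> Confidence:
    """Map a retrieval pass to a confidence stratum by its best score.

    An empty list (retrieval returned nothing) lands in LOW.
    """
    best = max(relevance_scores, default=0.0)
    if best >= HIGH_THRESHOLD:
        return Confidence.HIGH
    if best >= LOW_THRESHOLD:
        return Confidence.MEDIUM
    return Confidence.LOW

def behavior_for(confidence: Confidence) -> str:
    """Each stratum triggers a different agent behavior."""
    return {
        Confidence.HIGH: "answer",
        Confidence.MEDIUM: "answer_with_disclosed_uncertainty",
        Confidence.LOW: "decline",  # "I don't have reliable information on this"
    }[confidence]
```

The point of the sketch is the explicit mapping: the thresholds and the behaviors they trigger are engineering decisions written down in code, not emergent model behavior.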
Cross-encoding for relevance verification. Cosine similarity via bi-encoder is fast but imprecise. For high-stakes retrievals, rerank the top-k results through a cross-encoder that scores the query-document pair jointly rather than independently. Cross-encoders are much better at detecting the topically-adjacent-but-irrelevant failure mode that bi-encoders miss. The compute cost is higher. The cost is worth it when the downstream decision is consequential.
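A reranking stage can be sketched library-agnostically by injecting the joint scorer as a callable. One real option for `score_pair` is a cross-encoder from the sentence-transformers library, but nothing below depends on it; the function names and interface here are assumptions for illustration:

```python
from typing import Callable

def rerank(query: str,
           candidates: list[str],
           score_pair: Callable[[str, str], float],
           top_n: int = 3) -> list[tuple[str, float]]:
    """Rerank bi-encoder candidates with a joint (query, document) scorer.

    `score_pair` stands in for a cross-encoder's scoring function: it sees
    the query and document together, not as independent vectors, which is
    what lets it catch topically-adjacent-but-irrelevant documents.
    Returns the top_n documents with their joint scores, best first.
    """
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Wiring in a real model is a one-liner against this interface, at the compute cost the text describes: the cross-encoder runs one forward pass per (query, document) pair, so it is applied only to the bi-encoder's top-k, not the whole corpus.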
Explicit no-answer paths. Design the agent's behavioral contract to include explicit refusal conditions: "When retrieval confidence is below threshold X and no high-confidence documents are available, the agent MUST decline rather than generate from base distribution." This is a pact condition — it's testable and evaluatable. You can run your agent against a test suite of questions that have no answer in the knowledge base and verify that the agent consistently declines rather than hallucinating.
This is the evaluation most builders haven't run. They test whether the agent answers known questions correctly. They don't systematically test whether it declines appropriately on unknown questions. The second test reveals the false confidence rate that the first test completely misses.
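The decline evaluation itself is small once the test suite exists. A minimal sketch, assuming a hypothetical agent interface (a callable mapping a question to a dict with a boolean "declined" flag):

```python
def decline_rate(agent, unanswerable_questions: list[str]) -> float:
    """Fraction of known-unanswerable questions the agent declines.

    `agent` is a hypothetical interface for illustration: a callable
    that maps a question to a response dict with a "declined" flag.
    The hard part is curating `unanswerable_questions`, not this loop.
    """
    declined = sum(1 for q in unanswerable_questions if agent(q)["declined"])
    return declined / len(unanswerable_questions)

# The pact condition becomes a plain assertion in CI:
#   assert decline_rate(agent, no_answer_suite) >= 0.95
```

Running this in CI turns "the agent should decline" from an aspiration into a gate that fails the build when uncertainty handling regresses.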
Uncertainty acknowledgment as a first-class output. The agent's output schema should include an uncertainty field alongside the answer: { answer: "...", confidence: "medium", retrieval_basis: "adjacent_context", recommend_verification: true }. Downstream systems can route on this signal. A human review queue picks up medium-confidence answers. High-confidence answers proceed automatically. Low-confidence answers are declined or escalated without generation.
The Behavioral Pact for a RAG Agent
A RAG agent's behavioral commitments should explicitly include its uncertainty behavior, not just its accuracy on answerable questions.
Accuracy on answerable questions is the metric everyone measures. The pact condition: "Accuracy ≥ 90% on questions with high-relevance retrieval." This is standard.
Calibration on uncertain questions is the metric almost no one measures. The pact condition: "When retrieval confidence is below threshold X, agent declines in ≥ 95% of cases rather than generating speculatively." This requires a test suite of questions that have no good answer in the knowledge base — questions that the agent should decline. Building this test suite is non-trivial and often deprioritized.
False confidence rate on out-of-distribution queries is the metric that reveals the failure mode. The pact condition: "Rate of confident incorrect answers on queries with no relevant retrieval ≤ 2%." The 2% is illustrative — the right threshold depends on the stakes of the deployment. But the metric itself is what changes how you build. An agent you've never tested on out-of-distribution queries has an unknown false confidence rate. You're trusting its uncertainty handling based on hope, not evidence.
The third condition is the one most builders haven't written. Writing it forces you to build the evaluation infrastructure to measure it, which forces you to discover your actual false confidence rate before production does.
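The third pact condition can be measured with a few lines once an out-of-distribution query suite exists. A sketch, assuming a hypothetical agent interface (a callable returning a dict with "declined" and "confidence" keys):

```python
def false_confidence_rate(agent, ood_queries: list[str]) -> float:
    """Rate of confident answers on queries with no relevant retrieval.

    By construction no grounded answer exists for these queries, so any
    non-declined, high-confidence response counts as false confidence.
    `agent` is a hypothetical interface for illustration: a callable
    returning a dict with "declined" and "confidence" keys.
    """
    confident = sum(
        1 for q in ood_queries
        if not (r := agent(q))["declined"] and r["confidence"] == "high"
    )
    return confident / len(ood_queries)

# Pact condition (the 2% threshold is illustrative, per the text):
#   assert false_confidence_rate(agent, ood_suite) <= 0.02
```

Until this number is measured, the false confidence rate is unknown, which is the "hope, not evidence" situation the condition is meant to close.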
The Question
How does your RAG agent behave when the answer to a user's question isn't in your knowledge base? Is there a tested, evaluated fallback path — with a test suite of unanswerable questions and verified decline behavior? Or is the out-of-distribution behavior an emergent property of the model that you haven't formally characterized?
The answer to that question tells you more about your production risk than your retrieval accuracy score does.
Armalo's behavioral pact framework lets you define and evaluate your RAG agent's uncertainty handling — not just its accuracy on answerable questions, but its calibration when retrieval fails. armalo.ai/docs/pacts