RAG, Evals & Reliability/

Teaching RAG to say "I don't know"

Two gates — a model-free score floor and an evidence-required reranker that must name what is missing — plus a negative control in the eval set, so a retrieval system can honestly return "nothing here fits."


title: "Teaching RAG to say "I don't know"" series: "RAG, Evals & Reliability" date: "2026-03-10" summary: "Two gates — a model-free score floor and an evidence-required reranker that must name what is missing — plus a negative control in the eval set, so a retrieval system can honestly return "nothing here fits.""

I was building a backend that matches expensive university reading-list citations to free open-textbook replacements. A student's reading list says "chapter 4 of this £60 textbook"; the system looks for an openly-licensed book that actually covers the same material so the course can swap it in.

The failure mode here is specific and nasty: a wrong match is much worse than no match. If the tool confidently recommends a free book that doesn't really cover the topic, a lecturer swaps it in, and students get a worse education for a term. "No suitable replacement found" is a perfectly good, honest answer. The system has to be willing to give it.

The problem is that retrieval-augmented systems are eager to please. Top-k retrieval always returns k results. Ask an LLM "which of these is the best replacement?" and it will pick one, because you asked it to pick one. Nothing in the default pipeline is allowed to say "none of these."

Two gates that make "none" possible

I ended up with two gates between "here are some candidates" and "here is a recommendation."

1. A deterministic score threshold. Retrieval is hybrid: a semantic search over pgvector (768-dim embeddings, HNSW cosine) fused with a Postgres full-text/trigram search, weighted 0.7 semantic / 0.3 lexical. The lexical half matters more than people expect — it catches exact technical terms and author names that embeddings happily smooth over. But fused score is just a similarity; a high similarity to the closest book doesn't mean that book is a real replacement. So the first gate is a hard floor: below it, nothing is eligible, full stop. No model gets to argue.

2. Evidence-required reranking. Candidates that clear the floor go to a reranker (Gemini, forced to return structured JSON), but it isn't asked "rank these." It's asked to classify each candidate as one of replacement / partial / supplement / no_match, and for anything it calls a replacement it has to return the matched topics and the missing topics. The "missing topics" field is the trick — it forces the model to look for absence, not just presence. A book that covers 3 of 8 required topics can't sail through on overall vibe; the gaps are right there in the output, and "too many gaps" downgrades it to partial or no_match.

Neither gate is clever. Together they give the system permission to return nothing.

The test that mattered was a negative control

Most eval sets are lists of inputs with a known right answer. The one that actually told me the gates worked was the opposite: an input with a known no answer.

I fed it a reading-list citation for a UK insolvency-law text — a niche legal topic with no open-textbook equivalent in the corpus. The correct behavior is to refuse. It returned no_match across all 15 candidates, instead of forcing the nearest vaguely-related book into a recommendation.

That negative control did more for my confidence than ten positive matches. Positive cases tell you the system can find things. Negative controls tell you it can stop — which, for a tool whose wrong answers are costly, is the property you're actually shipping. I now think every retrieval eval set is incomplete until it contains inputs whose right answer is "nothing here fits."

Why this generalizes

Swap "free textbook" for "support article," "compliance clause," "prior incident," or "candidate for this role," and the shape is identical: a retrieval system whose confident wrong answers cost real money or trust. The cure is the same three pieces:

  • a hard, model-free floor so similarity alone can't promote junk;
  • a reranker that must cite evidence and name what's missing, not just rank;
  • at least one negative control in your eval set, so you measure the system's willingness to abstain, not just its ability to retrieve.

A RAG system that can't say "I don't know" isn't grounded — it's just confident. Build the off-switch, then test that it fires.

(This was a prototype, not a shipped product — proven end-to-end against live Postgres and a live LLM, on a small gold fixture. I'm not quoting a precision/recall benchmark here because I didn't run one at scale; the negative control is the specific, honest result I'd defend.)