Here's the uncomfortable starting point: most teams shipping RAG systems have no idea what their retrieval quality actually is. They ship, collect vague user feedback, and assume that if the LLM sounds confident, the pipeline is working. It's not a measurement problem — it's a measurement absence problem.
One practitioner building RAG for B2B SaaS companies put it plainly: his first production system had no evals, and 40% of answers were wrong. Not hallucinations in the dramatic sense — just retrieval failures dressed up as confident LLM output. The documents that could answer the question weren't making it into the context window.
The Math That Changes How You Debug
The framing that reorients everything is this:
P(correct answer) ≈ P(correct context retrieved)
If the right chunks aren't retrieved, the LLM cannot answer correctly. Full stop. You can swap models, tune temperature, rewrite system prompts — none of it matters if retrieval is failing upstream. This is why debugging at the generation layer is almost always the wrong move.
The metric that exposes this is Recall@k: of all the documents that should have been retrieved for a given query, what percentage actually made it into the top k results? Practitioners auditing production systems report Recall@10 sitting around 60% on untuned systems. That means 4 out of 10 queries arrive at the LLM missing the document that could have answered them. The LLM hallucinates not because it's broken, but because you handed it an impossible task.
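As a concrete reference, here is a minimal sketch of the metric, assuming each eval query carries the set of gold chunk IDs it should retrieve. The helper name is illustrative, not from any particular library.

```python
def recall_at_k(retrieved_ids, gold_ids, k=10):
    """Fraction of the gold (should-have-been-retrieved) chunk IDs that land in the top k."""
    if not gold_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(gold_ids)) / len(gold_ids)

# Averaged over an eval set, this is the single number to track per retriever configuration.
```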
You don't need production traffic to start measuring. Generate 50-100 synthetic question-answer pairs from your existing corpus — ask the LLM to produce specific questions each chunk can answer, record the chunk ID, run your retriever, and measure whether the right chunk lands in the top 10. That number is your baseline. Without it, you're flying blind and calling it a deployment.
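A sketch of that harness, assuming you supply two callables of your own: generate_question (an LLM call that writes one specific question a chunk can answer) and search (your retriever, returning ranked chunk IDs). Both names are placeholders, not real APIs; recall_at_k is the helper from the sketch above.

```python
import random

def build_eval_set(chunks, generate_question, n=100):
    """chunks: list of (chunk_id, chunk_text) pairs from your existing corpus."""
    sample = random.sample(chunks, min(n, len(chunks)))
    return [(generate_question(text), chunk_id) for chunk_id, text in sample]

def baseline_recall(eval_pairs, search, k=10):
    """eval_pairs: (question, gold_chunk_id); search: question -> ranked chunk IDs."""
    scores = [recall_at_k(search(question), [gold_id], k) for question, gold_id in eval_pairs]
    return sum(scores) / len(scores)
```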
Two Fixes That Actually Move the Needle
Production benchmarks across vector databases show Qdrant at 6ms p50 latency for 1M vectors, Pinecone at 8ms, Weaviate at 12ms, and ChromaDB degrading meaningfully above 5M vectors. But raw latency isn't your bottleneck. Retrieval quality is. And two interventions consistently improve it.
Hybrid search. Dense embeddings are excellent at semantic similarity — "reset my password" correctly matches "steps to recover account access." But they're weak on exact terminology: product names, error codes, version numbers, anything where keyword precision matters. BM25 catches what embeddings miss. Combining them via Reciprocal Rank Fusion (RRF) — merging ranked lists from both retrieval methods — consistently outperforms either approach alone. Weaviate has BM25 + vector hybrid built natively into its query engine. Qdrant supports it without an external keyword search layer. If you're running pure vector search, you're leaving retrieval quality on the table.
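RRF itself is only a few lines. A minimal sketch, independent of any particular vector database; the k=60 constant is the value commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked ID lists (e.g. BM25 results and dense results) into one ranking.

    Each document scores 1 / (k + rank) in every list it appears in; the summed
    scores are sorted descending, so documents ranked well by either method float up.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused_top_10 = reciprocal_rank_fusion([bm25_ids, dense_ids])[:10]
```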
Chunking strategy. The default "split every 512 tokens" approach is the source of more retrieval failures than any other single decision. Context that spans a chunk boundary gets split; the retrieval system returns half an answer. Semantic chunking — splitting on meaning rather than token count — keeps related content together. The tradeoff is more upfront complexity in your ingestion pipeline, but the Recall@k improvement is measurable and consistent across corpus types.
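There are several ways to implement semantic chunking; one simple greedy variant splits wherever adjacent sentence embeddings diverge. A sketch, assuming you pass in your own embed callable (any sentence-level embedding model) and pre-split sentences; the threshold and length cap are illustrative defaults worth tuning per corpus.

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75, max_sentences=12):
    """Greedy semantic chunking: start a new chunk when the next sentence's embedding
    drifts too far from the previous one, or when the current chunk gets long."""
    if not sentences:
        return []
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        # Cosine similarity between consecutive sentences decides where to cut.
        sim = float(prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur)))
        if sim < threshold or len(current) >= max_sentences:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```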
The pattern I'd argue holds across most small-team RAG deployments: teams optimize the model layer because it's visible and interactive, while the retrieval layer operates as a black box. Fixing that inversion — instrument retrieval first, then optimize generation — is the architectural shift that actually changes answer quality.
Eval Patterns
Start with synthetic evals before you touch production traffic. Generate question-chunk pairs from your corpus, measure Recall@10, write down the number. Run this before and after every chunking or retrieval change. If you can't show a Recall@10 delta, you don't know whether your change helped. Ragas and ARES both provide retrieval-specific eval frameworks worth a look for your stack — they separate retrieval quality from generation quality, which is the separation that matters.
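One way to make the before/after discipline stick is to wrap the harness in a simple regression gate, reusing the baseline_recall sketch from earlier; the stored 0.60 baseline is an illustrative value, not a benchmark.

```python
PREVIOUS_RECALL_AT_10 = 0.60  # the last number you wrote down; illustrative value

def check_retrieval_change(eval_pairs, search):
    """Re-run the same eval set after a chunking or retrieval change and compare."""
    new_score = baseline_recall(eval_pairs, search, k=10)
    delta = new_score - PREVIOUS_RECALL_AT_10
    print(f"Recall@10 = {new_score:.2f} (delta {delta:+.2f})")
    if delta < 0:
        raise SystemExit("Retrieval change regressed Recall@10; don't ship it.")
```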
Reliability Notes
Retrieval failures are silent. The LLM doesn't return an error when it gets bad context — it returns a confident wrong answer. Add a retrieval confidence layer: log the similarity scores of returned chunks, set a threshold below which you flag the response as low-confidence, and surface that signal to users rather than presenting uncertain output as authoritative. An "I'm not finding strong sources for this" response is more reliable than a hallucinated one.
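A minimal sketch of that gate, assuming your retriever returns (chunk_text, similarity_score) pairs; the 0.35 threshold is a placeholder you'd calibrate against your own score distribution and eval set.

```python
LOW_CONFIDENCE_THRESHOLD = 0.35  # placeholder; calibrate per embedding model and corpus

def gate_retrieval(results):
    """results: list of (chunk_text, similarity_score), best first.
    Returns (context, fallback_message); exactly one of the two is None."""
    top_score = results[0][1] if results else 0.0
    if top_score < LOW_CONFIDENCE_THRESHOLD:
        # Surface uncertainty instead of handing weak context to the LLM.
        return None, "I'm not finding strong sources for this."
    return "\n\n".join(text for text, _ in results), None
```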
Cost Watch
Hybrid search adds a BM25 query on top of your vector query, which sounds expensive but typically isn't — BM25 is cheap compute. The real cost question is reindexing. If your chunking strategy is wrong and you need to rechunk and reindex a large corpus, you're paying embedding costs again on every document. Get chunking right before you scale the corpus. Reindexing 10M chunks because you changed your splitter is the kind of cost that doesn't show up in your inference bill until it's already happened.
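Rough arithmetic makes the point. Both inputs below are assumptions (average chunk size and per-token price vary by corpus and embedding model), so plug in your own numbers.

```python
chunks = 10_000_000
avg_tokens_per_chunk = 400           # assumption; depends on your splitter
price_per_million_tokens = 0.02      # assumption; check your embedding provider's rate

total_tokens = chunks * avg_tokens_per_chunk
reindex_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"Re-embedding {chunks:,} chunks of ~{avg_tokens_per_chunk} tokens: ${reindex_cost:,.0f}")
```

Whatever the per-token price, the bill repeats every time you rechunk, which is exactly why the splitter decision is worth settling before the corpus grows.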
