"Retrieval failures are the silent killers of RAG systems. The model cannot answer correctly if it never sees the right documents. Debug retrieval first, model second."
A RAG Engineer With Many Failures
Retrieval Failure Modes
Retrieval-augmented generation depends on retrieving relevant context before generation. When retrieval fails, the entire system fails, regardless of how capable the language model is. Understanding retrieval failure modes is essential for debugging RAG systems.
Semantic Mismatch
Semantic mismatch occurs when the query embedding and document embeddings exist in different semantic spaces, preventing relevant documents from being retrieved. This happens when different embedding models are used for indexing and retrieval, creating incompatible vector spaces. It also occurs when query language differs significantly from document language, such as when users ask questions in casual language while documents use formal terminology. Domain-specific terminology interpreted differently across contexts can cause mismatches, as can updating embedding models without re-indexing the corpus, which produces embeddings in a different space than the original indexed documents.
Chunk Boundary Problems
Chunk boundary problems occur when relevant information is split across chunks, making it invisible to retrieval because each chunk is indexed independently. Table rows split across chunks cause retrieval to return incomplete information that lacks the context of neighboring rows. Sentences referring to antecedents in previous chunks become incomprehensible when retrieved in isolation. Lists where item ordering provides essential context lose meaning when items are scattered across different chunks. Code snippets where variable definitions are in other chunks become impossible to understand without the full context.
Index Coverage Gaps
Index coverage gaps occur when information exists in the corpus but was not indexed, making it permanently invisible to retrieval. New documents added without re-indexing means the content exists but cannot be retrieved because the index was not updated. Filtered-out document types that still contain relevant content occur when content filtering rules are too aggressive and exclude documents that should be searchable. Metadata filters that accidentally exclude relevant documents can prevent valid content from appearing in results even when queries are well-formed. Time-based filters that exclude stale but relevant documents can remove important historical information from the retrieval scope.
The Retrieval Quality Assumption
Many teams assume retrieval quality by default. They test generation extensively but give retrieval a pass because "we get results." The dangerous assumption is that retrieved documents are actually providing the right context. Always verify retrieval quality independently from generation quality.
Retrieval Evaluation Metrics
Recall at K
What fraction of relevant documents appear in the top K results? For debugging, evaluate recall at small K values (1, 3, 5) because only a few documents can fit in the context window.
def recall_at_k(
    queries: list[str],
    relevant_docs: dict[str, set[str]],
    retrieved_docs: dict[str, list[str]],
    k: int
) -> float:
    """
    Calculate recall@k for retrieval evaluation.

    Args:
        queries: List of test queries
        relevant_docs: Mapping of query to set of relevant document IDs
        retrieved_docs: Mapping of query to list of retrieved document IDs (ordered)
        k: Number of top results to consider
    """
    recall_scores = []
    for query in queries:
        relevant = relevant_docs.get(query, set())
        retrieved = set(retrieved_docs.get(query, [])[:k])
        if len(relevant) == 0:
            continue  # Skip queries with no relevant docs
        recall = len(relevant & retrieved) / len(relevant)
        recall_scores.append(recall)
    return sum(recall_scores) / len(recall_scores) if recall_scores else 0.0
Mean Reciprocal Rank
At what rank does the first relevant document appear? MRR averages the reciprocal of that rank across queries, so it penalizes systems where the first relevant document appears far down the results.
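A minimal sketch of MRR, using the same input conventions as the recall_at_k function above:

```python
def mean_reciprocal_rank(
    relevant_docs: dict[str, set[str]],
    retrieved_docs: dict[str, list[str]],
) -> float:
    """Average of 1/rank of the first relevant document across queries."""
    reciprocal_ranks = []
    for query, relevant in relevant_docs.items():
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_docs.get(query, []), start=1):
            if doc_id in relevant:
                rr = 1.0 / rank  # first relevant hit determines the score
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0
```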
NDCG (Normalized Discounted Cumulative Gain)
Considers both graded relevance and position: gains are discounted logarithmically by rank, so a relevant document at rank 3 contributes less than the same document at rank 1.
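A single-query sketch using the standard log2 discount; `relevance` lists graded relevance judgments in retrieved order:

```python
import math

def ndcg_at_k(relevance: list[float], k: int) -> float:
    """NDCG@k for one query: DCG of the actual ranking divided by
    the DCG of the ideal (relevance-sorted) ranking."""
    def dcg(rels: list[float]) -> float:
        # rank is 0-based, so the discount is log2(rank + 2)
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    actual = dcg(relevance[:k])
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

A perfectly ordered result list scores 1.0; pushing the only relevant document to rank 3 cuts its contribution in half.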
Choosing the Right Metric
Recall@K matters most when all retrieved documents go into context. MRR matters when only the top result affects output. NDCG matters when you care about graded relevance. For RAG systems, start with recall@3 as your primary metric.
Debugging Techniques
Retrieval Inspection Workflow
When a RAG query fails, systematically inspect retrieval by following a structured workflow. First, capture the query by recording the exact text submitted to retrieval for reproduction and analysis. Second, inspect embeddings by verifying the generated query embedding to ensure it was created correctly. Third, examine top results by reviewing the actual retrieved documents to see what the system considered relevant. Fourth, check relevance scores to determine whether they are consistent with expectations or indicate ranking problems. Fifth, verify index coverage by checking whether the relevant document is in the index at all, which rules out index coverage gaps.
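The five steps above can be sketched as a single inspection helper. The retriever interface here (`embed_fn`, `search_fn` returning (doc_id, score) pairs) is a hypothetical stand-in for whatever your stack exposes:

```python
from dataclasses import dataclass

@dataclass
class InspectionReport:
    query: str
    embedding_ok: bool
    top_results: list[tuple[str, float]]   # (doc_id, score) pairs
    expected_in_index: dict[str, bool]

def inspect_retrieval(query, embed_fn, search_fn, indexed_ids, expected_ids, k=10):
    """Walk the five-step inspection workflow for one failing query."""
    # Step 1: capture the exact query text (passed in unchanged).
    # Step 2: verify the query embedding exists and is non-degenerate.
    embedding = embed_fn(query)
    embedding_ok = embedding is not None and any(abs(x) > 1e-9 for x in embedding)
    # Steps 3-4: examine the top results and their relevance scores.
    top_results = search_fn(embedding, k)
    # Step 5: check index coverage for each expected document.
    expected_in_index = {doc_id: doc_id in indexed_ids for doc_id in expected_ids}
    return InspectionReport(query, embedding_ok, top_results, expected_in_index)
```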
Classifying Retrieval Failures
from enum import Enum
from dataclasses import dataclass


class RetrievalFailureType(Enum):
    SEMANTIC_MISMATCH = "semantic_mismatch"
    CHUNK_BOUNDARY = "chunk_boundary"
    INDEX_COVERAGE = "index_coverage"
    METADATA_FILTER = "metadata_filter"
    RANKING_MODEL = "ranking_model"
    UNKNOWN = "unknown"


@dataclass
class RetrievalDebugResult:
    failure_type: RetrievalFailureType
    query: str
    expected_doc_ids: list[str]
    retrieved_doc_ids: list[str]
    retrieved_scores: list[float]
    root_cause: str
    suggested_fix: str


def diagnose_retrieval_failure(
    query: str,
    expected_docs: list[str],
    retrieved_docs: list[dict],
    indexed_doc_ids: set[str]
) -> RetrievalDebugResult:
    """
    Classify why retrieval failed for a specific query.
    `indexed_doc_ids` is the set of all document IDs present in the index.
    """
    retrieved_ids = [doc["id"] for doc in retrieved_docs]
    retrieved_scores = [doc["score"] for doc in retrieved_docs]
    # Check if relevant docs are in the index at all
    if any(doc_id not in indexed_doc_ids for doc_id in expected_docs):
        return RetrievalDebugResult(
            failure_type=RetrievalFailureType.INDEX_COVERAGE,
            query=query,
            expected_doc_ids=expected_docs,
            retrieved_doc_ids=retrieved_ids,
            retrieved_scores=retrieved_scores,
            root_cause="Expected documents missing from index",
            suggested_fix="Re-index corpus or check indexing pipeline"
        )
    # Check if relevant docs were retrieved but ranked low
    relevant_in_results = set(expected_docs) & set(retrieved_ids)
    if relevant_in_results:
        for doc_id in relevant_in_results:
            rank = retrieved_ids.index(doc_id)  # 0-based position
            score = retrieved_docs[rank]["score"]
            if rank > 3:
                return RetrievalDebugResult(
                    failure_type=RetrievalFailureType.RANKING_MODEL,
                    query=query,
                    expected_doc_ids=expected_docs,
                    retrieved_doc_ids=retrieved_ids,
                    retrieved_scores=retrieved_scores,
                    root_cause=f"Document {doc_id} retrieved at rank {rank + 1} with score {score}",
                    suggested_fix="Adjust embedding model or re-ranking strategy"
                )
        # Relevant docs present at high ranks: retrieval itself looks healthy
        return RetrievalDebugResult(
            failure_type=RetrievalFailureType.UNKNOWN,
            query=query,
            expected_doc_ids=expected_docs,
            retrieved_doc_ids=retrieved_ids,
            retrieved_scores=retrieved_scores,
            root_cause="Relevant documents retrieved at high ranks; failure likely downstream",
            suggested_fix="Inspect chunking, prompt assembly, or generation"
        )
    # If not retrieved at all, likely a semantic mismatch
    return RetrievalDebugResult(
        failure_type=RetrievalFailureType.SEMANTIC_MISMATCH,
        query=query,
        expected_doc_ids=expected_docs,
        retrieved_doc_ids=retrieved_ids,
        retrieved_scores=retrieved_scores,
        root_cause="Relevant documents not retrieved - semantic gap likely",
        suggested_fix="Review query-document embedding alignment"
    )
Practical Example: HealthMetrics Medical Guidelines RAG
The HealthMetrics team was debugging a clinical decision support RAG system after physicians reported that it missed critical drug interaction warnings. The model was confident but wrong, and the team initially could not tell whether they were seeing a model hallucination or a retrieval failure.
The team decided to build a retrieval debug dashboard that shows retrieved documents alongside queries to investigate systematically. For failing queries, they discovered that drug interaction information existed in a "clinical pharmacology" document set that was excluded from indexing due to a metadata filter error. They also found that chunking had split tables of drug interactions across chunks, making the information invisible to retrieval.
After fixing the metadata filter and re-chunking documents using semantic boundaries around tables, Recall@5 improved from 0.62 to 0.94. The lesson is that the model was not hallucinating. It simply never saw the relevant information. Always verify retrieval before blaming the model.
Retrieval Debugging Tooling
Retrieval Debug Dashboard
Build an internal tool that shows query text and embedding alongside retrieval results for any query you want to investigate. The dashboard should display the top ten retrieved documents with their similarity scores so you can see exactly what the system returned. It should show expected documents from your test set and their positions in the results to identify when relevant documents are being missed or poorly ranked. Include recall and MRR metrics for each query to quantify retrieval performance and track improvements over time.
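The per-query view such a dashboard needs can be assembled from the raw retrieval output. A minimal sketch, assuming `retrieved` is an ordered list of (doc_id, score) pairs and `expected_ids` comes from your test set:

```python
def build_dashboard_row(query, retrieved, expected_ids, k=10):
    """Assemble one dashboard row: top-k docs with scores, positions of
    expected docs (None if missing), and per-query recall@k and MRR."""
    top = retrieved[:k]
    top_ids = [doc_id for doc_id, _ in top]
    positions = {
        doc_id: (top_ids.index(doc_id) + 1 if doc_id in top_ids else None)
        for doc_id in expected_ids
    }
    hits = [p for p in positions.values() if p is not None]
    return {
        "query": query,
        "top_results": top,
        "expected_positions": positions,
        f"recall@{k}": len(hits) / len(expected_ids) if expected_ids else 0.0,
        "mrr": 1.0 / min(hits) if hits else 0.0,
    }
```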
Retrieval Regression Testing
Create a benchmark set of queries with known relevant documents. Run before any index or embedding model change. Alert if recall drops below threshold.
class RetrievalRegressionSuite:
    """Runs a benchmark of queries with known relevant documents against a
    retriever. `RetrievalBenchmark`, `Retriever`, `RegressionResult`, and
    `compute_retrieval_metrics` are assumed to be defined elsewhere."""

    def __init__(self, benchmark: RetrievalBenchmark, threshold: float = 0.9):
        self.benchmark = benchmark
        self.threshold = threshold

    def run(self, retriever: Retriever) -> RegressionResult:
        results = []
        for query_data in self.benchmark.test_cases:
            retrieved = retriever.search(query_data.query, k=10)
            metrics = compute_retrieval_metrics(
                query_data.relevant_docs,
                [doc.id for doc in retrieved]
            )
            results.append({
                "query": query_data.query,
                "recall@5": metrics.recall_at_5,
                "mrr": metrics.mrr,
                "passed": metrics.recall_at_5 >= self.threshold
            })
        return RegressionResult(
            total=len(results),
            passed=sum(1 for r in results if r["passed"]),
            failed=[r for r in results if not r["passed"]],
            overall_recall=sum(r["recall@5"] for r in results) / len(results)
            if results else 0.0
        )
Research Frontier
Research on "retrieval uncertainty" explores methods to identify when the retrieval system is uncertain about its results. By detecting low-confidence retrieval states, systems can either expand retrieval or flag uncertainty for human review before generation.
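One simple proxy for this idea, as a sketch: flag a query as uncertain when the best similarity score is low or the score distribution is nearly flat (no clear winner). The thresholds below are illustrative and would need tuning per embedding model and corpus:

```python
def retrieval_uncertain(scores, min_top_score=0.35, min_margin=0.05):
    """Heuristic retrieval-uncertainty check over descending similarity scores.
    Returns True when retrieval should be expanded or flagged for review."""
    if not scores:
        return True  # nothing retrieved at all
    top = scores[0]
    if top < min_top_score:
        return True  # even the best match is weak
    if len(scores) > 1 and (top - scores[-1]) < min_margin:
        return True  # flat distribution: no result stands out
    return False
```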