"Retrieval is a rough sort. Reranking is refinement. The first pass gets you candidates; the second pass gets you the right answer. Skip reranking and you are leaving accuracy on the table."
Head of Search, RetailMind
Introduction
Basic RAG retrieves the top-k most similar chunks from the vector store. But similarity in embedding space is not the same as relevance to the specific query at hand. Reranking uses more sophisticated models to reorder retrieved candidates, dramatically improving retrieval precision. Graph-aware retrieval goes further, leveraging relationships between entities to retrieve context that would be missed by pure content similarity.
This section covers two advanced retrieval techniques: cross-encoder reranking and knowledge graph retrieval. Both require more compute at query time, but the quality gains often justify the cost.
The Two-Stage Retrieval Problem
Vector similarity search is efficient but imprecise. Embedding models optimize for general semantic similarity across training data, not for specific query-document relevance. A bi-encoder architecture that encodes queries and documents independently cannot capture query-document interactions.
Bi-Encoder vs Cross-Encoder
Bi-encoder: Encodes query and documents independently, then computes cosine similarity. Fast (can pre-compute document embeddings) but cannot capture query-document interactions.
Cross-encoder: Encodes query and document together, capturing interactions. Much more accurate but requires computing on-the-fly for every candidate.
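The architectural difference can be sketched in a few lines of plain Python. The bag-of-words "embeddings" and the overlap-based pair scorer below are illustrative stand-ins, not real models: the point is that the bi-encoder path precomputes one vector per document and only compares vectors at query time, while the cross-encoder path must see the query and the document together for every candidate.

```python
from math import sqrt

# Toy "embedding": bag-of-words counts over a tiny vocabulary.
# A real system would use a trained bi-encoder model instead.
VOCAB = ["bank", "river", "money", "account", "water"]

def embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

docs = ["river bank water", "bank account money"]

# Bi-encoder pattern: document vectors are computed once, offline.
doc_vecs = [embed(d) for d in docs]

query = "open a bank account"
q_vec = embed(query)
bi_scores = [cosine(q_vec, v) for v in doc_vecs]

# Cross-encoder pattern (stand-in): score the query-document *pair*
# jointly. A real cross-encoder runs a transformer over the concatenated
# pair; this toy scorer just uses word overlap to show the shape of the
# computation - it cannot be precomputed per document.
def cross_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

cross_scores = [cross_score(query, d) for d in docs]
```

Note where the cost lands: the bi-encoder's per-document work happens at indexing time, while every cross-encoder score is fresh work at query time.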
Cross-Encoder Reranking
Reranking takes the top-N candidates from vector search, where N is larger than the final k you will return, and reorders them using a cross-encoder model that better captures query-document relevance.
How Cross-Encoders Work
A cross-encoder takes the query and document as a single input sequence, with special separators marking the boundary. The transformer architecture attends to all tokens in both sequences simultaneously, capturing interactions that bi-encoders miss.
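The two-stage pattern can be sketched as a generic rerank step. The `overlap_score` stub below is a hypothetical stand-in for the cross-encoder forward pass: in production you would feed each query-document pair through a model such as ms-marco-MiniLM-L-6-v2 and use its relevance score instead.

```python
def rerank(query, candidates, score_pair, final_k):
    """Score each (query, candidate) pair jointly and keep the best final_k.

    `score_pair` stands in for a cross-encoder: production code would run
    the concatenated pair through a model and return its relevance score.
    """
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]

# Hypothetical stand-in scorer: fraction of query tokens found in the doc.
def overlap_score(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

candidates = [
    "Reset your password from the account settings page.",
    "Our password policy requires twelve characters.",
    "Billing questions are handled by the finance team.",
]
top = rerank("how do I reset my password", candidates, overlap_score, final_k=2)
```

Keeping the scorer as a parameter makes the pipeline easy to test with a cheap stub and to swap between cross-encoder models without touching retrieval code.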
Reranking in Production
Cross-Encoder Model Selection
ms-marco-MiniLM-L-6-v2 is very fast with good accuracy and a 512-token maximum input length. It is ideal for latency-critical, high-traffic systems that must rerank many candidates without adding noticeable delay.
ms-marco-MiniLM-L-12-v2 offers better accuracy than the L-6 variant while remaining fast, also capped at 512 tokens. It is a balanced speed/quality trade-off for production systems that need higher accuracy without sacrificing too much latency.
cross-encoder/ms-marco-T5-base is a medium-speed model with excellent accuracy. It suits applications with high accuracy requirements that can afford slightly higher latency in exchange for significantly better relevance ranking.
cross-encoder/ms-marco-T5-large provides the best accuracy of the four but at the slowest speed. Use it when maximum accuracy is the primary requirement and latency is not a constraint, such as offline evaluation pipelines or low-volume, high-stakes queries.
Reranking is Not Free
Reranking adds latency. A cross-encoder must score every candidate from the initial retrieval pass: at 1 ms per candidate, 100 candidates add 100 ms and 1,000 add a full second. Choose initial_k (candidates fetched) and final_k (results kept) based on your latency budget. Often the biggest gains come from reranking top-50 down to top-10, not from expanding top-50 to top-100.
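The budgeting arithmetic above is simple enough to encode directly. The linear estimate below assumes sequential scoring; batched GPU inference will do better in practice, so treat it as an upper bound when sizing initial_k.

```python
def rerank_latency_ms(num_candidates: int, ms_per_candidate: float = 1.0) -> float:
    """Added latency if the cross-encoder scores candidates one at a time.

    An upper-bound estimate: batched inference on a GPU reduces the
    per-candidate cost substantially.
    """
    return num_candidates * ms_per_candidate

def max_initial_k(latency_budget_ms: float, ms_per_candidate: float = 1.0) -> int:
    """Largest initial candidate set that fits the reranking budget."""
    return int(latency_budget_ms // ms_per_candidate)
```

For example, a 150 ms reranking budget at 1 ms per candidate caps initial_k at 150.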
Knowledge Graph Retrieval
Graph-aware retrieval leverages structured relationships between entities. Instead of retrieving chunks by content similarity alone, it also traverses graph connections to find related context. This is especially powerful for questions that require understanding relationships.
When Knowledge Graphs Excel
Knowledge graph retrieval outperforms pure chunk retrieval for queries that require:
Multi-hop reasoning: discovering the CEO of a company that acquired another company.
Relationship traversal: finding all products that share a supplier with a given product.
Hierarchical context: explaining an error within the broader context of the service that generated it.
Entity-centric queries: retrieving a complete history for a specific patient record.
Building a Knowledge Graph
Knowledge graphs represent entities as nodes and relationships as edges. For RAG purposes, you extract entities and relationships from your documents and construct a graph that captures your domain structure.
Hybrid Graph and Vector Retrieval
The most powerful approach combines graph traversal with vector similarity. Use graph traversal to find candidate entities and their neighbors, then use vector search to rank the associated text chunks.
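The hybrid pattern can be sketched as: expand the entity set through the graph, gather each entity's chunks, then rank by similarity. Everything here is a stand-in: `graph_neighbors`, `chunks_for`, and the keyword-overlap `score` are hypothetical callables that a real system would back with a graph store, a chunk index, and a vector or cross-encoder scorer.

```python
def hybrid_retrieve(query_entities, graph_neighbors, chunks_for, score, top_k=3):
    """Graph traversal for candidates, similarity scoring for ranking.

    graph_neighbors(entity) -> connected entities (one hop, for brevity)
    chunks_for(entity)      -> text chunks attached to that entity
    score(chunk)            -> query-chunk similarity (stand-in for a
                               vector-store or cross-encoder score)
    """
    candidates = set(query_entities)
    for entity in query_entities:
        candidates.update(graph_neighbors(entity))
    scored = []
    for entity in candidates:
        for chunk in chunks_for(entity):
            scored.append((score(chunk), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# Toy data: one entity plus its graph neighbor and their chunks.
neighbors = {"customer_table": ["billing_pipeline"]}
chunks = {
    "customer_table": ["Schema of the customer table."],
    "billing_pipeline": ["The billing pipeline reads the customer table nightly."],
}
hits = hybrid_retrieve(
    ["customer_table"],
    lambda e: neighbors.get(e, []),
    lambda e: chunks.get(e, []),
    lambda c: len(set(c.lower().split()) & {"billing", "pipeline"}),
    top_k=1,
)
```

The billing chunk is found even though "customer_table" alone was extracted from the query: the graph hop supplies context that pure content similarity on the query entities would miss.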
Practical Example: DataForge Knowledge Graph
Challenge: DataForge users ask complex questions about data lineage that require tracing through multiple pipeline stages. Vector search on pipeline configurations misses the relationship context.
Solution: DataForge built a knowledge graph capturing:
Entities: pipelines, tables, columns, databases, and services.
Relationships: reads_from, writes_to, transforms, depends_on, and owned_by.
Attributes: schema changes, data types, and update frequencies.
Query flow: When a user asks about "customer data flow," the system extracts "customer" and "data flow," traverses the graph to find all connected pipelines and tables, then retrieves and ranks the relevant documentation chunks.
Result: Complex lineage queries that previously required 45 minutes of manual investigation now complete in under 5 seconds with 94% accuracy.
Graph RAG Implementations
Graph-Augmented Retrieval uses the graph to find candidate documents and then applies vector search within those candidates. This medium complexity approach works well for entity-centric domains where you need to identify relevant documents based on entity relationships before fine-tuning the selection through semantic similarity.
Graph-Augmented Generation retrieves a graph substructure and includes it as context directly in the prompt for the LLM. This medium-high complexity approach suits multi-hop reasoning where the answer depends on traversing multiple relationships in sequence.
Full Graph RAG generates answers from the graph structure alone without relying on retrieved text chunks. This high complexity approach works best in highly structured domains where the relationships themselves contain the answer and textual elaboration is unnecessary.
Ensemble Retrieval Strategies
Production RAG systems rarely rely on a single retrieval method. Ensemble approaches combine multiple strategies, weighting their results based on query characteristics and available compute.
Weighted Ensemble
Assign different weights to different retrieval methods based on their reliability for the given query type. Static weights are simple; learned weights that adapt to query characteristics are more powerful.
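A static-weight combiner can be sketched as follows. Scores are min-max normalized per method before weighting, since raw scores from different retrievers (cosine similarities, graph path counts) are not on a common scale; the method names and weights are illustrative.

```python
def weighted_ensemble(result_lists: dict, weights: dict) -> list:
    """Combine per-method scores into one ranking.

    result_lists maps method -> {doc_id: raw_score}. Each method's scores
    are min-max normalized so the weights compare like with like, then
    the weighted contributions are summed per document.
    """
    combined: dict = {}
    for method, scores in result_lists.items():
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on uniform scores
        for doc_id, s in scores.items():
            norm = (s - lo) / span
            combined[doc_id] = combined.get(doc_id, 0.0) + weights[method] * norm
    return sorted(combined, key=combined.get, reverse=True)

# Hypothetical scores: cosine similarities vs. graph relevance counts.
ranking = weighted_ensemble(
    {"vector": {"d1": 0.9, "d2": 0.7, "d3": 0.2},
     "graph":  {"d2": 3.0, "d3": 1.0}},
    weights={"vector": 0.6, "graph": 0.4},
)
```

Here d2 wins despite a lower vector score because the graph method backs it up; a learned-weight variant would replace the static `weights` dict with a model conditioned on the query.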
Query-Dependent Routing
Different queries benefit from different retrieval strategies. Train a lightweight classifier to route queries to the most appropriate retrieval method.
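A keyword-rule router makes the idea concrete. This is a deliberately naive stand-in for the trained classifier the text describes; the trigger phrases and route names are illustrative assumptions, not a recommended rule set.

```python
def route_query(query: str) -> str:
    """Route a query to a retrieval strategy.

    Stand-in for a lightweight trained classifier: the rules below are
    illustrative only. Returns one of "graph", "hybrid", or "vector".
    """
    q = query.lower()
    if any(w in q for w in ("lineage", "depends on", "connected", "relationship")):
        return "graph"   # relationship-traversal queries
    if any(w in q for w in ("compare", "across", "which of")):
        return "hybrid"  # needs both structure and content
    return "vector"      # default: semantic chunk retrieval
```

Because the router sits in front of every query, it must be fast and cheap; even a small logistic-regression or distilled classifier typically suffices.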
Warning: Reranking Without Enough Candidates
Reranking cannot improve results if the relevant documents are not in the initial candidate set. If the vector search misses the relevant chunk entirely, no amount of reranking will recover it. Ensure your initial retrieval has high recall before optimizing reranking precision.
Section Summary
Two-stage retrieval with cross-encoder reranking dramatically improves precision by reordering vector search candidates using query-document interaction models. Cross-encoders capture relevance signals that bi-encoders miss, at the cost of additional compute. Knowledge graph retrieval excels for multi-hop reasoning and relationship traversal queries. Graph-augmented RAG extracts entities and relationships from documents, enabling traversal-based context discovery. Ensemble retrieval combines multiple strategies with weighted scoring or query-dependent routing. The key insight is that retrieval and ranking serve different purposes: retrieval maximizes recall while reranking optimizes precision. Both stages matter for high-quality RAG.