Part IV: Engineering AI Products
Chapter 17.3

Reranking and Graph-Aware Retrieval

"Retrieval is a rough sort. Reranking is refinement. The first pass gets you candidates; the second pass gets you the right answer. Skip reranking and you are leaving accuracy on the table."

Head of Search, RetailMind

Introduction

Basic RAG retrieves the top-k most similar chunks from the vector store. But similarity in embedding space is not the same as relevance to the specific query at hand. Reranking uses more sophisticated models to reorder retrieved candidates, dramatically improving retrieval precision. Graph-aware retrieval goes further, leveraging relationships between entities to retrieve context that would be missed by pure content similarity.

This section covers two advanced retrieval techniques: cross-encoder reranking and knowledge graph retrieval. Both require more compute at query time, but the quality gains often justify the cost.

The Two-Stage Retrieval Problem

Vector similarity search is efficient but imprecise. Embedding models optimize for general semantic similarity across training data, not for specific query-document relevance. A bi-encoder architecture that encodes queries and documents independently cannot capture query-document interactions.

Bi-Encoder vs Cross-Encoder

Bi-encoder: Encodes query and documents independently, then computes cosine similarity. Fast (can pre-compute document embeddings) but cannot capture query-document interactions.

Cross-encoder: Encodes query and document together, capturing interactions. Much more accurate but requires computing on-the-fly for every candidate.
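To make that architectural difference concrete, here is a deliberately toy sketch (bag-of-words "embeddings", not a real model): the bi-encoder's score is a function of two independently computed vectors, while a pair-aware scorer can use evidence that only exists when both texts are visible at once. The exact-phrase feature below is a hypothetical stand-in for learned interaction signals.

```python
# Toy illustration (not a real model): a bi-encoder commits to one fixed
# vector per text, so its score depends only on the two independent
# encodings; a cross-encoder sees the pair jointly.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "encoder": bag-of-words counts
    return Counter(text.lower().split())

def bi_encoder_score(query: str, doc: str) -> float:
    q, d = embed(query), embed(doc)  # encoded independently
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def pair_aware_score(query: str, doc: str) -> float:
    # Sees both texts at once, so it can reward pair-level evidence
    # (here: the exact query phrase appearing in the document)
    base = bi_encoder_score(query, doc)
    if query.lower() in doc.lower():
        base += 1.0  # interaction feature a bi-encoder cannot express
    return base
```

Two documents with identical word counts get identical bi-encoder scores, but the pair-aware scorer can still separate them, which is the essence of why cross-encoders rank better.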

Cross-Encoder Reranking

Reranking takes the top-N candidates from vector search (where N > k) and reorders them using a cross-encoder model that better captures query-document relevance.

[Query]
  --> Vector Search
  --> [Top-100] candidates from Vector DB
  --> [Cross-Encoder] Rerank
  --> [Top-10] to LLM

How Cross-Encoders Work

A cross-encoder takes the query and document as a single input sequence, with special separators marking the boundary. The transformer architecture attends to all tokens in both sequences simultaneously, capturing interactions that bi-encoders miss.

# Cross-encoder scoring
import numpy as np
from sentence_transformers import CrossEncoder

# Load a cross-encoder model fine-tuned for relevance
model = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2', max_length=512)

# Score query-document pairs
query = "How do I reset my password?"
documents = [
    "To reset your password, click the 'Forgot Password' link on the login page.",
    "Passwords must be at least 8 characters with one uppercase and one number.",
    "The weather today is sunny with a high of 75 degrees."
]

# Get relevance scores (raw logits; higher means more relevant)
scores = model.predict([(query, doc) for doc in documents])

# Rank by score, highest first
ranked_indices = np.argsort(scores)[::-1]
# The password-reset instructions rank first, the weather report last

Reranking in Production

# Production reranking pipeline
class RerankingRetriever:
    def __init__(
        self,
        vector_store: VectorStore,
        reranker: CrossEncoder,
        initial_k: int = 100,
        final_k: int = 10
    ):
        self.vector_store = vector_store
        self.reranker = reranker
        self.initial_k = initial_k
        self.final_k = final_k

    def retrieve(self, query: str, filters: dict = None) -> list[Document]:
        # Stage 1: Fast vector search
        candidates = self.vector_store.similarity_search(
            query, k=self.initial_k, filters=filters
        )

        # Stage 2: Cross-encoder reranking
        doc_texts = [doc.content for doc in candidates]
        scores = self.reranker.predict(
            [(query, doc) for doc in doc_texts]
        )

        # Combine vector similarity with reranker scores as a weighted
        # blend. The two signals live on different scales (cosine
        # similarity vs raw logits), so normalize both to [0, 1] first.
        def normalize(xs):
            lo, hi = min(xs), max(xs)
            return [(x - lo) / (hi - lo + 1e-9) for x in xs]

        vec_scores = normalize([doc.score for doc in candidates])
        rerank_scores = normalize(list(scores))

        combined_scores = []
        for i in range(len(candidates)):
            # Weight: 30% vector similarity, 70% reranker score
            combined = 0.3 * vec_scores[i] + 0.7 * rerank_scores[i]
            combined_scores.append((combined, i))

        # Sort and return top-k
        combined_scores.sort(reverse=True)
        return [candidates[i] for _, i in combined_scores[:self.final_k]]

Cross-Encoder Model Selection

ms-marco-MiniLM-L-6-v2 is a very fast model suitable for general purpose use cases where latency is critical. It provides good accuracy with a maximum length of 512 tokens, making it ideal for high-traffic production systems where you need to rerank many candidates without introducing noticeable delay.

ms-marco-MiniLM-L-12-v2 offers better accuracy than the smaller variant while maintaining fast speeds. It also caps at 512 tokens and provides a balanced speed/quality trade-off for production systems that need higher accuracy without sacrificing too much latency.

cross-encoder/ms-marco-T5-base is a medium-speed model that delivers excellent accuracy. This model suits applications with high accuracy requirements where you can afford slightly higher latency in exchange for significantly better relevance ranking.

cross-encoder/ms-marco-T5-large provides the best accuracy available but operates at slower speeds. Use this model when maximum accuracy is the primary requirement and latency is not a constraint, such as in offline evaluation pipelines or low-volume but high-stakes queries.

Reranking is Not Free

Reranking adds latency. A cross-encoder must score every candidate from the initial retrieval pass. At roughly 1 ms per scored pair, 100 candidates add 100 ms; 1,000 candidates add a full second. Choose initial_k and final_k based on your latency budget. Often the biggest gains come from going from top-50 to top-10, not from top-100 to top-50.
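The arithmetic above can be captured in a quick budget helper. This is a back-of-envelope sketch that assumes a fixed per-pair cost; real throughput depends heavily on batching and hardware.

```python
# Back-of-envelope reranking latency, assuming a constant per-pair
# scoring cost (real systems batch, so this is an upper-bound sketch)
def rerank_latency_ms(num_candidates: int, ms_per_pair: float = 1.0) -> float:
    """Added latency from scoring every candidate once."""
    return num_candidates * ms_per_pair

def max_candidates_for_budget(budget_ms: float, ms_per_pair: float = 1.0) -> int:
    """Largest initial_k that fits the reranking latency budget."""
    return int(budget_ms // ms_per_pair)
```

For example, a 150 ms reranking budget at 1 ms per pair caps initial_k at 150 candidates.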

Knowledge Graph Retrieval

Graph-aware retrieval leverages structured relationships between entities. Instead of retrieving chunks by content similarity alone, it also traverses graph connections to find related context. This is especially powerful for questions that require understanding relationships.

When Knowledge Graphs Excel

Knowledge graph retrieval outperforms pure chunk retrieval for queries that require multi-hop reasoning (finding the CEO of the company that acquired another company), relationship traversal (all products that share a supplier with a given product), hierarchical context (explaining an error within the broader context of the service that generated it), and entity-centric lookups (the complete history of a specific patient record).

Building a Knowledge Graph

Knowledge graphs represent entities as nodes and relationships as edges. For RAG purposes, you extract entities and relationships from your documents and construct a graph that captures your domain structure.

# Knowledge graph extraction and storage
from neo4j import GraphDatabase
import spacy

class KnowledgeGraphExtractor:
    def __init__(self, nlp_model, graph_db: GraphDatabase):
        self.nlp = nlp_model
        self.graph = graph_db

    def extract_from_document(self, doc_id: str, text: str):
        # Parse with spaCy
        parsed = self.nlp(text)

        # Extract entities
        entities = set()
        for ent in parsed.ents:
            entities.add((ent.text, ent.label_))

        # Extract simple subject-verb-object relations from the
        # dependency parse (a lightweight stand-in for a trained
        # relation-extraction model)
        relations = []
        for sent in parsed.sents:
            root = sent.root
            subj = next((t for t in root.children
                         if t.dep_ in ("nsubj", "nsubjpass")), None)
            obj = next((t for t in root.children
                        if t.dep_ in ("dobj", "pobj", "attr")), None)
            if subj is not None and obj is not None:
                relations.append((root.lemma_, subj.text, obj.text))

        # Store in graph
        with self.graph.session() as session:
            # Create entity nodes
            for entity_text, entity_type in entities:
                session.run("""
                    MERGE (e:Entity {name: $name, type: $type})
                    SET e.document_ids = coalesce(e.document_ids, []) + $doc_id
                """, name=entity_text, type=entity_type, doc_id=doc_id)

            # Create relationship edges
            for rel_type, source, target in relations:
                session.run("""
                    MATCH (s:Entity {name: $source})
                    MATCH (t:Entity {name: $target})
                    MERGE (s)-[r:RELATES {type: $rel_type}]->(t)
                """, source=source, target=target, rel_type=rel_type)

Hybrid Graph and Vector Retrieval

The most powerful approach combines graph traversal with vector similarity. Use graph traversal to find candidate entities and their neighbors, then use vector search to rank the associated text chunks.

+------------------------------------------------------------------+
|                       GRAPH-AWARE RAG FLOW                       |
+------------------------------------------------------------------+
|                                                                  |
|  Query: "What pipelines touch the orders database?"              |
|                                                                  |
|  [1] Entity Extraction                                           |
|      "orders database" -> Entity(type: DATABASE, name: orders)   |
|                                                                  |
|  [2] Graph Traversal                                             |
|      orders_db --[WRITTEN_BY]--> pipeline_a                      |
|                --[READS_FROM]--> pipeline_b                      |
|                                                                  |
|  [3] Context Collection                                          |
|      Collect: pipeline_a configs, pipeline_b configs,            |
|               related chunks from vector store                   |
|                                                                  |
|  [4] Ranking                                                     |
|      Combine graph proximity + vector similarity                 |
|                                                                  |
+------------------------------------------------------------------+

Practical Example: DataForge Knowledge Graph

Challenge: DataForge users ask complex questions about data lineage that require tracing through multiple pipeline stages. Vector search on pipeline configurations misses the relationship context.

Solution: DataForge built a knowledge graph capturing entities (pipelines, tables, columns, databases, and services), relationships (reads_from, writes_to, transforms, depends_on, and owned_by), and attributes (schema changes, data types, and update frequencies).

Query flow: When a user asks about "customer data flow," the system extracts "customer" and "data flow," traverses the graph to find all connected pipelines and tables, then retrieves and ranks the relevant documentation chunks.

Result: Complex lineage queries that previously required 45 minutes of manual investigation now complete in under 5 seconds with 94% accuracy.

Graph RAG Implementations

Graph-Augmented Retrieval uses the graph to find candidate documents and then applies vector search within those candidates. This medium complexity approach works well for entity-centric domains where you need to identify relevant documents based on entity relationships before fine-tuning the selection through semantic similarity.

Graph-Augmented Generation retrieves a graph substructure and includes it as context directly in the prompt for the LLM. This medium-high complexity approach suits multi-hop reasoning where the answer depends on traversing multiple relationships in sequence.

Full Graph RAG generates answers from the graph structure alone without relying on retrieved text chunks. This high complexity approach works best in highly structured domains where the relationships themselves contain the answer and textual elaboration is unnecessary.
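For the Graph-Augmented Generation variant, the context-building step amounts to serializing a retrieved subgraph into prompt-ready text. A minimal sketch follows; the triple rendering format is a choice for illustration, not a standard.

```python
# Sketch of Graph-Augmented Generation's context-building step: render
# a retrieved subgraph as plain-text triples that can be placed
# directly in the LLM prompt. The example triples are illustrative.
def serialize_subgraph(triples: list[tuple[str, str, str]]) -> str:
    """Render (source, relation, target) triples as prompt-ready lines."""
    lines = [f"({s}) -[{r}]-> ({t})" for s, r, t in triples]
    return "Known relationships:\n" + "\n".join(lines)
```

The resulting block is prepended to the user's question, letting the model reason over the relationships explicitly instead of inferring them from scattered chunks.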

Ensemble Retrieval Strategies

Production RAG systems rarely rely on a single retrieval method. Ensemble approaches combine multiple strategies, weighting their results based on query characteristics and available compute.

Weighted Ensemble

Assign different weights to different retrieval methods based on their reliability for the given query type. Static weights are simple; learned weights that adapt to query characteristics are more powerful.

# Weighted ensemble retrieval
class EnsembleRetriever:
    def __init__(self, retrievers: dict[str, tuple[Retriever, float]]):
        """
        retrievers: Dict of name -> (retriever_instance, weight)
        """
        self.retrievers = retrievers

    def retrieve(self, query: str, k: int = 10) -> list[Document]:
        all_scores = {}  # doc_id -> weighted score
        all_docs = {}    # doc_id -> document

        for name, (retriever, weight) in self.retrievers.items():
            results = retriever.retrieve(query, k=k * 2)  # Over-retrieve
            for rank, doc in enumerate(results):
                # Score based on rank and method weight
                all_scores[doc.id] = all_scores.get(doc.id, 0) + weight * (1 / (rank + 1))
                all_docs[doc.id] = doc

        # Sort by combined score
        ranked = sorted(all_scores.items(), key=lambda x: x[1], reverse=True)
        return [all_docs[doc_id] for doc_id, _ in ranked[:k]]

Query-Dependent Routing

Different queries benefit from different retrieval strategies. Train a lightweight classifier to route queries to the most appropriate retrieval method.

# Query routing classifier
QUERY_ROUTING_PROMPT = """Classify this query into one of these categories:
- FACTUAL: Simple question with clear answer in documents
- ANALYTICAL: Complex question requiring synthesis
- ENTITY_FOCUSED: Question about specific entities and relationships
- COMPARISON: Question comparing multiple things
- PROCEDURAL: Question about how to do something

Query: {query}
Category:"""

def route_query(query: str) -> str:
    response = llm.complete(QUERY_ROUTING_PROMPT.format(query=query))
    return response.text.strip().lower()

# Route to appropriate retriever
def retrieve_with_routing(query: str, k: int = 10):
    category = route_query(query)
    if category == "entity_focused":
        return graph_retriever.retrieve(query, k)
    elif category == "factual":
        return keyword_retriever.retrieve(query, k)
    elif category == "analytical":
        return hybrid_retriever.retrieve(query, k)
    else:
        return ensemble_retriever.retrieve(query, k)

Warning: Reranking Without Enough Candidates

Reranking cannot improve results if the relevant documents are not in the initial candidate set. If the vector search misses the relevant chunk entirely, no amount of reranking will recover it. Ensure your initial retrieval has high recall before optimizing reranking precision.
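A quick way to check that precondition is to measure first-stage recall against a small labeled set of query-to-relevant-chunk pairs. A minimal sketch:

```python
# Measuring first-stage recall: the fraction of gold-relevant chunk IDs
# that appear anywhere in the candidate set. If this is low, tune
# retrieval (chunking, embeddings, initial_k) before touching the
# reranker, since reranking can only reorder what retrieval surfaces.
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & relevant_ids)
    return hits / len(relevant_ids)
```

Averaging this over an evaluation set at k = initial_k tells you the ceiling on answer quality that any reranker can reach.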

Section Summary

Two-stage retrieval with cross-encoder reranking dramatically improves precision by reordering vector search candidates using query-document interaction models. Cross-encoders capture relevance signals that bi-encoders miss, at the cost of additional compute. Knowledge graph retrieval excels for multi-hop reasoning and relationship traversal queries. Graph-augmented RAG extracts entities and relationships from documents, enabling traversal-based context discovery. Ensemble retrieval combines multiple strategies with weighted scoring or query-dependent routing. The key insight is that retrieval and ranking serve different purposes: retrieval maximizes recall while reranking optimizes precision. Both stages matter for high-quality RAG.