"The difference between a RAG system that works and one that fails is often not the model or the vector database. It is how you split your documents. Chunking is where the art lives."
Senior AI Engineer, HealthMetrics
Introduction
The previous section established RAG fundamentals. This section dives into the critical implementation details that determine whether your RAG system actually works in practice. Chunking strategy affects every downstream decision. Indexing parameters determine retrieval quality. Hybrid retrieval combines the best of semantic and keyword search. Master these three elements and your RAG systems will significantly outperform naive implementations.
Chunking Strategies
Chunking is the process of splitting documents into smaller, retrieval-friendly pieces. The right chunk size balances context completeness against retrieval precision. Too large and you retrieve irrelevant context. Too small and you lose the information needed to answer questions.
The Chunking Spectrum
Small Chunks (64-256 tokens): High precision, low recall. Good for factual Q&A with precise answers.

Medium Chunks (256-512 tokens): Balanced precision/recall. Good for general purpose RAG.

Large Chunks (512-2048 tokens): High recall, lower precision. Good for summarization and complex reasoning.

Fixed-Size Chunking
The simplest approach splits text into chunks of a predetermined size (measured in tokens or characters) with optional overlap between chunks. Overlap ensures that context is not lost at chunk boundaries.
Fixed-size chunking is fast and predictable but ignores document structure. It can split sentences mid-way, separate headers from content, and break code across chunks.
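A minimal sketch of fixed-size chunking with overlap, measured here in characters for simplicity (a production version would count tokens with the embedding model's tokenizer):

```python
def fixed_size_chunks(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means the tail of each chunk reappears at the head of the next, so a sentence split at a boundary is still fully present in at least one chunk.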
For content where semantic coherence matters more than speed, semantic chunking offers a more sophisticated approach.
Semantic Chunking
Semantic chunking uses embeddings to identify natural topic boundaries in text. The algorithm embeds sentences and groups those with high similarity into chunks. This preserves semantic coherence at the cost of variable chunk sizes and slower processing.
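The idea can be sketched as follows. The `embed_fn` parameter stands in for a real sentence-embedding model; the toy bag-of-words embedding in the usage below exists only to make the example self-contained:

```python
import math
import re

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(text: str, embed_fn, threshold: float = 0.3) -> list[str]:
    """Group consecutive sentences; start a new chunk when similarity
    to the previous sentence drops below `threshold`."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vecs = [embed_fn(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

def toy_embed(sentence: str) -> list[float]:
    # Hypothetical stand-in for an embedding model: counts over a tiny vocabulary.
    vocab = ["cat", "dog", "pet", "stock", "market", "price"]
    words = [w.strip(".,").lower() for w in sentence.split()]
    return [float(words.count(v)) for v in vocab]
```

With `toy_embed`, a text that shifts from pets to markets splits at exactly that topic boundary. The threshold is the main tuning knob: lower values produce fewer, larger chunks.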
Document Structure-Aware Chunking
The most effective chunking strategies respect document structure. Headers, paragraphs, sections, and page boundaries provide natural chunk boundaries. This approach is more complex but produces chunks that retain semantic coherence.
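For a document format with explicit headers (markdown is used here as the example; the same idea applies to HTML headings or PDF bookmarks), a structure-aware splitter might look like this sketch:

```python
import re

def structure_aware_chunks(markdown: str) -> list[dict]:
    """Split a markdown document at header boundaries, keeping each
    section's header as metadata on its chunk."""
    chunks = []
    current_header, buf = "", []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            if buf:
                chunks.append({"header": current_header, "text": "\n".join(buf).strip()})
                buf = []
            current_header = m.group(2)
        else:
            buf.append(line)
    if buf:
        chunks.append({"header": current_header, "text": "\n".join(buf).strip()})
    return chunks
```

A fuller version would also enforce a maximum chunk size within long sections and record the full header trail (H1 > H2 > H3) rather than only the nearest header.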
Key Insight: Metadata is as Important as Content
Every chunk should carry rich metadata: document ID, section headers, page numbers, last updated date, author, and any domain-specific tags. This metadata enables filtered retrieval, result attribution, and debugging. A chunk without metadata is an orphan in the retrieval system.
Chunking for Different Content Types
Technical documentation benefits from structure-aware chunking that respects headers and code blocks, with typical sizes of 256-512 tokens. The key consideration is keeping code with its related explanation so that retrieved chunks contain both the code and the context needed to understand it.
Legal documents should use section-based chunking with clause numbering preserved, typically at 128-256 token sizes. The clause hierarchy must be maintained in metadata so that related clauses remain connected even when retrieved separately.
Knowledge base articles work well with paragraph-based chunking at 256-512 tokens. Special care should be taken to keep question-answer pairs together since these are natural retrieval units.
Code repositories should chunk at function or class boundaries using AST parsing for accurate identification of these natural divisions. The entire function should be kept as a single chunk rather than splitting it across chunks.
Conversations and transcripts should use turn-based or topic-segmented approaches, keeping full turns or groups of 5-10 turns together. Speaker identity must be preserved in metadata so that the context of statements remains clear.
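For the code-repository case above, Python's standard-library `ast` module can identify function and class boundaries directly. A sketch (top-level definitions only; decorators and nested definitions would need extra handling):

```python
import ast

def function_chunks(source: str) -> list[dict]:
    """Chunk Python source at top-level function/class boundaries.

    `end_lineno` on AST nodes requires Python 3.8+."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "text": "\n".join(lines[node.lineno - 1:node.end_lineno]),
            })
    return chunks
```

Each definition becomes one chunk with its name as metadata, so the whole function travels together at retrieval time rather than being split mid-body.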
Advanced Indexing Techniques
Beyond basic embedding and storage, advanced indexing techniques dramatically improve retrieval quality. These include parent-document retrieval, multi-representation indexing, and domain-specific preprocessing.
Parent Document Retrieval
Parent document retrieval maintains a hierarchy between small retrieval units and larger parent documents. At query time, you retrieve relevant child chunks but return the parent document context. This balances retrieval precision with context completeness.
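The mechanics reduce to storing a parent id on every child chunk. In this sketch the scoring is naive keyword overlap, standing in for vector search; only the child-to-parent mapping is the point:

```python
def build_parent_index(parents: dict[str, str], chunk_size: int = 200) -> list[dict]:
    """Split each parent document into small child chunks that remember their parent id."""
    children = []
    for doc_id, text in parents.items():
        for i in range(0, len(text), chunk_size):
            children.append({"parent_id": doc_id, "text": text[i:i + chunk_size]})
    return children

def retrieve_with_parent(query: str, children: list[dict], parents: dict[str, str]) -> str:
    """Match against small children for precision, return the full parent for context."""
    q = set(query.lower().split())
    def score(chunk):
        return len(q & set(chunk["text"].lower().split()))
    best = max(children, key=score)
    return parents[best["parent_id"]]
```

The retrieval step scores precise small units, but the generator sees the complete parent document, which is the precision/completeness trade the section describes.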
Multi-Representation Indexing
Different chunk sizes serve different query types. Multi-representation indexing creates multiple embeddings per document at different chunk sizes. This enables the system to choose the optimal representation based on query characteristics.
Domain-Specific Preprocessing
Generic chunking ignores domain-specific structure. For healthcare, preprocessing should extract medical terms and normalize them. For code, extract function signatures and docstrings. For financial documents, extract tables and figures separately.
Practical Example: HealthMetrics Chunking Strategy
Challenge: HealthMetrics indexes clinical guidelines, protocol documents, and research papers. Simple chunking broke apart clinical recommendations from their evidence citations.
Solution: They implemented a domain-specific chunking pipeline that extracts structured sections including Background, Methods, Results, and Conclusions, identifies recommendation blocks which are GRADE evidence statements, links each recommendation to its supporting evidence citations, and stores them as composite chunks containing both the recommendation and citations.
Result: Retrieval precision for clinical queries improved from 62% to 89%. Physicians reported answers felt "more clinically useful."
Hybrid Retrieval Implementation
Hybrid retrieval combines semantic vector search with keyword-based search. This combination captures both conceptual similarity and exact terminology, which is essential for production RAG systems.
The Case for Hybrid Search
Pure vector search excels at understanding meaning but can miss exact matches that matter. A query for "P50 revenue" should match documents containing "P50" even if the embedding model does not perfectly capture the statistical term. Hybrid search ensures these precise matches are not lost.
Reciprocal Rank Fusion
Reciprocal Rank Fusion (RRF) is the gold standard for combining retrieval results. It ranks results by the reciprocal of their rank in each retrieval method, avoiding score normalization issues.
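RRF fits in a few lines: score each document as the sum of 1/(k + rank) over every ranked list it appears in, where ranks are 1-based and k = 60 is the constant from the original RRF paper:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, the raw vector-similarity and BM25 scores never need to be put on a common scale, which is exactly the normalization problem RRF sidesteps.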
Query Expansion and Reformulation
Users often ask questions using different vocabulary than your documents. Query expansion uses the LLM to generate alternative phrasings of the query, then retrieves using all variations.
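The plumbing around the LLM call is simple. In this sketch, `generate_variants` stands in for an LLM paraphrasing call and `retrieve` for any retriever returning ranked document ids; results are merged in order with duplicates dropped:

```python
def expand_and_retrieve(query: str, generate_variants, retrieve, top_k: int = 5) -> list[str]:
    """Retrieve with the original query plus generated paraphrases, deduplicating results."""
    queries = [query] + generate_variants(query)
    seen, merged = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]
```

A document matched only by the paraphrase still makes it into the result set, which is the vocabulary-mismatch fix the section describes. RRF over the per-variant rankings is a reasonable alternative to this simple ordered merge.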
Query Decomposition for Complex Questions
Complex questions often require information from multiple sources. Query decomposition breaks such questions into simpler sub-queries that can be answered independently, then synthesizes the answers.
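The control flow is a three-stage pipeline. All three callables below stand in for LLM or retrieval calls; the stubs in the usage exist only to make the example self-contained:

```python
def decompose_and_answer(question: str, decompose_fn, answer_fn, synthesize_fn) -> str:
    """Break a complex question into sub-queries, answer each independently, then synthesize."""
    sub_queries = decompose_fn(question)
    sub_answers = [(q, answer_fn(q)) for q in sub_queries]
    return synthesize_fn(question, sub_answers)
```

Keeping the stages as separate callables makes each one independently testable and lets the decomposition prompt evolve without touching retrieval.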
Query Complexity Routing
Not all queries need the same retrieval strategy. Simple factual questions benefit from precise, small-chunk retrieval. Complex analytical questions need larger context and potentially decomposition. Implement a query complexity classifier to route different queries to appropriate retrieval strategies.
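A router can start as a keyword heuristic like the sketch below and be replaced later by a trained classifier or an LLM call; the keyword lists and strategy names are illustrative:

```python
def route_query(query: str) -> str:
    """Heuristic complexity router returning the retrieval strategy to use."""
    q = query.lower()
    multi_part = any(m in q for m in (" and ", " versus ", " vs ", "compare"))
    analytical = any(w in q for w in ("why", "how", "explain", "impact", "trend"))
    if multi_part:
        return "decompose"      # break into sub-queries, answer independently
    if analytical or len(q.split()) > 15:
        return "large_chunk"    # wider context for reasoning-heavy questions
    return "small_chunk"        # precise retrieval for factual lookups
```

Even this crude version separates the three cases the section names; logging its decisions alongside retrieval quality metrics shows where the heuristics need refining.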
Indexing Pipeline Architecture
Production RAG systems require robust indexing pipelines that handle extraction, transformation, chunking, embedding, and storage. The pipeline must be reliable, monitorable, and incrementally updatable.
Incremental Indexing
Full re-indexing is expensive and often unnecessary. Incremental indexing tracks document versions and only re-indexes changed documents. This requires storing document hashes or timestamps and comparing them at indexing time.
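The hash-comparison step can be sketched with `hashlib` from the standard library; the return shape is illustrative:

```python
import hashlib

def plan_incremental_index(documents: dict[str, str], stored_hashes: dict[str, str]):
    """Compare content hashes to decide which documents need (re-)indexing.

    Returns (to_index, unchanged, deleted, new_hashes)."""
    to_index, unchanged, new_hashes = [], [], {}
    for doc_id, text in documents.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        new_hashes[doc_id] = digest
        if stored_hashes.get(doc_id) == digest:
            unchanged.append(doc_id)
        else:
            to_index.append(doc_id)
    # Documents present in the old index but missing now should be removed.
    deleted = [d for d in stored_hashes if d not in documents]
    return to_index, unchanged, deleted, new_hashes
```

After indexing succeeds, `new_hashes` replaces the stored map, so a failed run never marks a document as up to date.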
Embedding Cache
Embedding generation is often the most expensive part of indexing. Caching embeddings by content hash avoids recomputing for unchanged documents and enables fast re-indexing.
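A content-hash cache is a thin wrapper around the embedding call; `embed_fn` stands in for the real model client:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by content hash so unchanged text is never re-embedded."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self.embed_fn(text)
        return self._cache[key]
```

An in-memory dict suffices for a single run; persisting the same hash-to-vector map (for example in the vector store's metadata or a key-value store) is what makes full re-indexing fast across runs.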
Section Summary
Chunking strategy fundamentally determines RAG system quality. Fixed-size chunking is simple but ignores document structure. Semantic chunking preserves coherence but is computationally expensive. Structure-aware chunking produces the best results by respecting headers, paragraphs, and domain-specific boundaries. Advanced indexing techniques like parent document retrieval and multi-representation indexing enable both precision and context completeness. Hybrid retrieval combining vector and keyword search outperforms either alone. Query expansion and decomposition address vocabulary mismatch and complex question handling. Production indexing pipelines require incremental updates, embedding caches, and robust monitoring.