Part IV: Engineering AI Products
Chapter 15.3

Retrieval-Augmented Generation (RAG) Architecture

"RAG is the answer to hallucination that does not require fine-tuning. By grounding responses in retrieved documents, you get the flexibility of generative AI with the factual accuracy of retrieval systems. The secret is knowing when to retrieve, what to retrieve, and how to present it."

A Retrieval Engineer Who Reads Logs

Introduction

Retrieval-augmented generation (RAG) combines the generative capabilities of large language models with the factual accuracy of retrieval systems. When an LLM alone cannot answer questions about your proprietary data, or when you need to reduce hallucination risk, RAG provides an architecture that augments the model's context with retrieved documents. This pattern has become one of the most common architectures for enterprise AI products.

This section covers RAG architecture components, decision criteria for when to use RAG, scaling considerations, and the DataForge example that illustrates a production RAG implementation.

RAG Architecture Components

A RAG system consists of several interconnected components, each of which must be designed and optimized for your specific use case.

+------------------------------------------------------------------+
|                    RAG ARCHITECTURE OVERVIEW                      |
+------------------------------------------------------------------+
|                                                                   |
|  INDEXING PIPELINE (asynchronous)                                 |
|                                                                   |
|  +----------+   +----------+   +-----------+   +-----------+      |
|  | Document |-->| Chunking |-->| Embedding |-->|  Vector   |      |
|  | Sources  |   |          |   |           |   |  Index    |      |
|  +----------+   +----------+   +-----------+   +-----+-----+      |
|                                                      |            |
|                                                      v            |
|                                                +-----------+      |
|                                                |  Vector   |      |
|                                                | Database  |      |
|                                                +-----+-----+      |
|                                                      |            |
|  QUERY PIPELINE (synchronous)                        |            |
|                                                      |            |
|  +----------+   +-----------+   +-----------+        |            |
|  |   User   |-->|   Query   |-->| Retrieve  |<-------+            |
|  |  Query   |   |   Embed   |   | Relevant  |                     |
|  +----------+   +-----------+   |  Chunks   |                     |
|                                 +-----+-----+                     |
|                                       |                           |
|                                       v                           |
|                                 +-----------+                     |
|                                 | Generate  |                     |
|                                 | Response  |                     |
|                                 +-----------+                     |
+------------------------------------------------------------------+

Document Ingestion Pipeline

The ingestion pipeline transforms raw documents into a searchable index. This pipeline runs asynchronously when documents are created or updated, not at query time.

Document parsing: Extract text from PDFs, Word documents, HTML, and other formats. This step handles layout analysis, table extraction, and figure handling. Poor parsing at this stage compounds into poor retrieval later.

Chunking: Split documents into manageable pieces that fit within the LLM context window and capture meaningful semantic units. Chunk size depends on your use case: smaller chunks for precise question-answering, larger chunks for maintaining context. Typical chunk sizes range from 256 to 2048 tokens.
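As a minimal sketch of fixed-size chunking with overlap, the function below splits on words as a stand-in for tokens; a real pipeline would count tokens with the embedding model's tokenizer, and the size and overlap values are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size words.

    Word counts stand in for tokens here; a production pipeline would
    use the embedding model's tokenizer instead of str.split().
    """
    words = text.split()
    if not words:
        return []
    # Each chunk starts `overlap` words before the previous one ended,
    # so context spanning a chunk boundary is not lost.
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlap trades index size for recall: boundary-spanning sentences appear in two chunks, so a query matching them can retrieve either.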

Metadata extraction: Extract document metadata (source, date, author, document type) and chunk-level metadata (section headers, page numbers). Metadata enables filtering at retrieval time and provides citation information.

Embedding generation: Transform text chunks into dense vector representations using an embedding model. The choice of embedding model affects retrieval quality. OpenAI's text-embedding-ada-002, Cohere embeddings, and open-source models like sentence-transformers are common choices.

Indexing: Store vectors in a vector database optimized for similarity search. Options include Pinecone, Weaviate, Qdrant, Chroma, and pgvector for PostgreSQL users.

Query Pipeline

The query pipeline handles user questions and generates responses. It runs synchronously at low latency.

Query understanding: Optionally rewrite or expand user queries to improve retrieval. Query expansion, multi-hop question decomposition, and query rewriting techniques can significantly improve retrieval precision.

Retrieval: Convert the query to a vector and find the most similar chunks from the vector database. Similarity metrics include cosine similarity, dot product, and Euclidean distance. Retrieve top-k results, where k typically ranges from 3 to 20 depending on chunk size and context window.
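A self-contained sketch of top-k retrieval with cosine similarity follows. The hash-based `toy_embed` is a deterministic stand-in for a real embedding model, used only so the example runs without an API; everything else (cosine scoring, ranking, truncation to k) mirrors what a vector database does internally:

```python
import hashlib
import math

def toy_embed(text: str, dim: int = 16) -> list[float]:
    """Stand-in for a real embedding model: a deterministic unit vector
    derived from a hash of the text. Identical texts map to identical
    vectors; real embeddings would also place similar texts nearby."""
    h = hashlib.sha256(text.encode()).digest()
    vec = [float(h[i % len(h)]) - 128.0 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Embed the query, score every chunk, return the k best (score, chunk)
    pairs. A vector database replaces this linear scan with an ANN index."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(c)), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]
```

The linear scan here is O(n) per query; production systems use approximate nearest neighbor indexes to make this sublinear at scale.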

Reranking: Optionally rerank retrieved results using a cross-encoder model that considers query-document relevance more carefully than the embedding model's similarity score.
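The reranking step can be sketched as a generic re-scoring pass. In this illustrative version the cross-encoder is abstracted as a pluggable `score_fn`; `overlap_score` is a toy term-overlap scorer standing in for a real cross-encoder model:

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 3) -> list[str]:
    """Re-order retrieved candidates by a more careful relevance score.

    score_fn(query, doc) stands in for a cross-encoder, which scores each
    (query, document) pair jointly rather than comparing cached vectors.
    """
    scored = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return scored[:top_n]

def overlap_score(query: str, doc: str) -> float:
    """Toy scorer: fraction of query terms that appear in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0
```

Because cross-encoders are expensive, the usual pattern is to retrieve a generous candidate set (say, top 50) with the cheap bi-encoder, then rerank down to the few chunks that enter the prompt.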

Context assembly: Assemble the retrieved chunks into a prompt that provides the LLM with relevant context. This includes formatting retrieved information, adding citations, and deciding how much context to include.
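A minimal context-assembly sketch is shown below. The chunk shape (`text` and `source` keys) and the character budget are assumptions for illustration; a real implementation would budget in tokens and match the prompt format to the model in use:

```python
def assemble_prompt(question: str, chunks: list[dict], max_chars: int = 4000) -> str:
    """Format retrieved chunks into a citation-tagged context block.

    Each chunk dict is assumed to carry 'text' and 'source' keys.
    Chunks are added in retrieval order until the budget is exhausted.
    """
    lines = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] ({chunk['source']}) {chunk['text']}"
        if used + len(entry) > max_chars:
            break
        lines.append(entry)
        used += len(entry)
    context = "\n".join(lines)
    return (
        "Answer using ONLY the context below. Cite sources as [n]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Numbering the chunks gives the model a stable handle for citations, which the application layer can later map back to source documents.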

Generation: Pass the assembled prompt to the LLM and return the generated response. The prompt instructs the model to cite retrieved documents and acknowledge uncertainty.

Vector Database selection involves choosing between cloud-hosted managed services and self-hosted solutions. The key decision is whether to trade operational overhead for the convenience and reliability of a managed service. Pinecone, Weaviate, and Qdrant offer managed cloud services with automatic scaling. For teams with existing PostgreSQL infrastructure, pgvector provides vector search within your current database. Chroma is popular for local development and smaller deployments. Each option presents different trade-offs in cost, scalability, and operational complexity.

Embedding Model choice determines how text gets converted to vectors. The decision involves proprietary versus open source models, model size versus performance trade-offs, and multilingual requirements. OpenAI's ada embeddings offer strong performance with minimal configuration. Cohere provides competitive alternatives with excellent multilingual support. Open-source sentence-transformers models give you full control and can be fine-tuned on domain-specific data, though they require more engineering effort to deploy and maintain.

Chunking Strategy governs how documents get split into retrievable units. Decisions include chunk size, overlap between chunks, and whether to use semantic boundaries or fixed-size splitting. Recursive character splitting attempts to respect natural language boundaries like paragraphs and sentences. Semantic chunking uses embedding similarity to group related content together. The right strategy depends on your use case: smaller chunks work well for precise question-answering while larger chunks preserve more context.

Retrieval Algorithm selection covers similarity metrics, the number of results to retrieve, and whether to combine multiple retrieval methods. Cosine similarity is the most common metric, though dot product and Euclidean distance have their uses. Approximate nearest neighbor (ANN) algorithms enable fast retrieval at scale. Many production systems combine vector search with BM25 keyword matching in hybrid retrieval to capture both semantic similarity and exact term matches.

When to Use RAG

RAG is not always the right architecture. Understanding when RAG adds value versus when it adds unnecessary complexity is crucial for architectural decisions.

RAG Adds Value When

You have proprietary knowledge: LLMs trained on public data cannot answer questions about your internal policies, product documentation, or domain-specific information. RAG lets you augment the model's knowledge with your proprietary data.

Hallucination is unacceptable: For factual question-answering, legal research, or medical information, hallucination risks are too high for pure generative approaches. RAG grounds responses in retrieved documents.

Information changes frequently: Unlike fine-tuning, which requires retraining to update knowledge, RAG lets you update the knowledge base without retraining. This is valuable when information changes daily or weekly.

You need citations: RAG naturally provides document citations because retrieval identifies the source documents. This is essential for legal, academic, and enterprise applications.

RAG May Not Be Necessary When

General knowledge suffices: If your use case can be served by the LLM's existing knowledge, RAG adds unnecessary complexity. An AI therapist app using general knowledge does not need RAG.

Latency is critical: RAG adds retrieval and context assembly latency to every query. For sub-second response requirements, a pure generative approach may be necessary.

Simple keyword search would suffice: If users ask factual questions with clear, document-grounded answers, a well-designed search interface might be simpler and more reliable than RAG.

The RAG Decision Framework

Before implementing RAG, ask: (1) Does the LLM already know this? (2) Is hallucination risk acceptable? (3) Does information change frequently? (4) Do users need citations? If the model already knows the material, hallucination risk is acceptable, the information is stable, and no citations are needed, consider simpler approaches. If the model lacks the knowledge, hallucination is unacceptable, information changes frequently, or citations are required, RAG is worth evaluating.

Scaling Considerations

As RAG systems grow in data volume and query throughput, several scaling considerations become important.

Indexing Scale

Horizontal scaling: Vector databases support horizontal scaling through sharding. Distribute vector collections across multiple nodes based on document IDs or embedding space partitioning.

Incremental indexing: Avoid rebuilding the entire index when documents update. Implement incremental indexing that updates only affected chunks.

Index optimization: Periodically optimize indexes for performance. This includes vacuuming deleted vectors, rebuilding indexes, and updating statistics.

Query Scale

Connection pooling: Vector database connections should be pooled to handle concurrent queries efficiently. Most managed vector databases handle this automatically, but self-hosted solutions require explicit configuration.

Caching: Cache frequent queries and their results. Query result caching can reduce vector database load by 30-50% for workloads with repeated queries.
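A query-result cache can be sketched as a small LRU keyed by the normalized query text. This is an illustrative in-process version; a production deployment would typically use a shared cache such as Redis, add a TTL, and invalidate entries when the index updates:

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache for retrieval results, keyed by normalized query.

    Normalizing case and whitespace lets trivially different phrasings
    of the same query share one cache entry.
    """
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._store: OrderedDict[str, list] = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        return " ".join(query.lower().split())

    def get(self, query: str):
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, query: str, results: list) -> None:
        key = self._key(query)
        self._store[key] = results
        self._store.move_to_end(key)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least-recently-used
```

Tracking hits and misses directly supports measuring the load reduction the cache actually delivers on your workload.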

Query routing: For large deployments, route queries to different index replicas based on load. This prevents any single replica from becoming a bottleneck.

Embedding Model Scale

Model selection: Larger embedding models generally provide better retrieval quality but higher latency and cost. Evaluate quality/latency/cost tradeoffs for your specific use case.

Batch encoding: For batch indexing operations, use batch encoding APIs to improve throughput. Most embedding providers offer batch endpoints with discounted pricing.

ONNX optimization: Open-source embedding models can be optimized with ONNX Runtime for faster inference. This is particularly valuable for self-hosted deployments.

Running Product: DataForge Enterprise

Who: DataForge, a B2B data infrastructure company building an internal search system for their engineering documentation

Situation: DataForge had accumulated thousands of internal documents: architecture decision records, runbooks, post-mortems, API documentation, and team wikis. Engineers spent significant time searching for information.

Problem: Existing keyword search returned irrelevant results, engineers did not know which documents to search, and documentation quality varied significantly across teams.

Dilemma: Should they build a simple search enhancement, implement RAG, or invest in a comprehensive knowledge management system?

Decision: They implemented a RAG system with several innovations: hybrid search combining vector similarity with keyword matching, chunk-level metadata for filtering by team and document type, and LLM-generated summaries for retrieved documents.

How: The system uses a bi-encoder embedding model fine-tuned on their internal documentation, BM25 keyword search combined with vector search (hybrid retrieval), and GPT-4o for generating summaries and expanding user queries. Engineers interact via a Slackbot that returns formatted responses with document citations.

Result: 60% reduction in time engineers spend searching for information, 40% improvement in documentation quality scores (teams updated docs so they would rank well in search), and a full audit trail of searches for security reviews.

Lesson: RAG success depends on document quality as much as technical implementation. DataForge invested in documentation guidelines alongside the RAG system, which proved essential for retrieval quality.

Advanced RAG Patterns

Beyond basic RAG, several advanced patterns address specific challenges.

Hybrid Retrieval

Combining vector search with traditional keyword search (BM25) often outperforms either approach alone. Vector search captures semantic similarity; keyword search captures exact matches. Hybrid retrieval gets the best of both.
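One common way to merge the two result lists is reciprocal rank fusion (RRF), sketched below; the constant k = 60 is the value commonly used in the RRF literature, not something tuned for any particular system:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists (e.g. vector search and BM25) with RRF.

    Each document scores sum(1 / (k + rank)) over every list it appears
    in, so documents ranked well by both retrievers rise to the top
    without needing to calibrate the two scoring scales against each other.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF's appeal is that it operates on ranks rather than raw scores, sidestepping the fact that cosine similarities and BM25 scores live on incomparable scales.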

Self-Query Retrieval

Use an LLM to extract metadata filters from natural language queries. If a user asks "what is the status of the Newark deployment from the infrastructure team," the system extracts filters for document type=status, team=infrastructure, and location=Newark, then applies these filters during retrieval.
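The filter-application half of this pattern can be sketched without the LLM. In the version below, `filters` is assumed to have already been extracted by an LLM from the natural-language query; the function simply enforces exact matches against chunk metadata:

```python
def apply_filters(records: list[dict], filters: dict) -> list[dict]:
    """Keep only records whose metadata matches every extracted filter.

    In self-query retrieval the `filters` dict would come from an LLM
    that parses the user's question; here it is passed in directly.
    Each record is assumed to carry a 'metadata' dict.
    """
    return [
        r for r in records
        if all(r.get("metadata", {}).get(k) == v for k, v in filters.items())
    ]
```

In a full implementation, filtering happens inside the vector database (most support metadata predicates alongside similarity search) so that top-k is computed over the filtered set, not applied after the fact.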

Multi-Hop RAG

For complex questions requiring multiple documents, decompose the question into sub-questions, retrieve documents for each sub-question, and synthesize a final answer. This enables questions like "what happened during the Q2 incident and what was the root cause?"
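The control flow of multi-hop RAG reduces to a small skeleton. In this sketch, `decompose`, `retrieve`, and `synthesize` are pluggable stand-ins for LLM and search calls (the toy versions in the usage below are illustrative only):

```python
def multi_hop_answer(question, decompose, retrieve, synthesize):
    """Skeleton of multi-hop RAG: split a complex question into
    sub-questions, gather evidence for each, then synthesize an answer.

    decompose(question)        -> list of sub-questions (an LLM call)
    retrieve(sub_question)     -> list of evidence documents (a search call)
    synthesize(question, dict) -> final answer (an LLM call over evidence)
    """
    sub_questions = decompose(question)
    evidence = {sq: retrieve(sq) for sq in sub_questions}
    return synthesize(question, evidence)
```

A toy run with stand-in functions:

```python
decompose = lambda q: [part.strip() + "?" for part in q.rstrip("?").split(" and ")]
kb = {"what happened during the Q2 incident?": ["timeline.md"],
      "what was the root cause?": ["rca.md"]}
retrieve = lambda sq: kb.get(sq, [])
synthesize = lambda q, ev: sorted(doc for docs in ev.values() for doc in docs)
multi_hop_answer("what happened during the Q2 incident and what was the root cause?",
                 decompose, retrieve, synthesize)
```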

Corrective RAG

Evaluate retrieved document quality and user satisfaction. If retrieval quality is low, trigger alternative retrieval strategies: query expansion, synonym replacement, or manual document tagging.

Section Summary

RAG architectures combine retrieval systems with generative AI to provide factual, citation-grounded responses. Key components include the document ingestion pipeline (parsing, chunking, embedding, indexing) and the query pipeline (query understanding, retrieval, reranking, generation). RAG is appropriate when you have proprietary knowledge, hallucination is unacceptable, information changes frequently, or citations are required. Scaling considerations include index sharding, query caching, and embedding model optimization. Advanced patterns like hybrid retrieval, self-query, and multi-hop RAG address specific use case requirements.