"RAG turns your document store into a conversational knowledge base. But like any powerful tool, it amplifies both your data quality and your data problems. Garbage in, garbage out scales faster with RAG than without it."
Head of AI, DataForge
Introduction
Retrieval-Augmented Generation (RAG) is the architectural pattern that grounds Large Language Model responses in your actual data. Rather than relying solely on knowledge encoded during training, a RAG system retrieves relevant documents at inference time and uses them to inform the model's response. This approach reduces hallucination, enables questions about proprietary information, and allows your AI product to stay current without expensive model retraining.
This section establishes foundational RAG concepts: the retrieval-generation pipeline, the role of embeddings and vector databases, and the critical importance of relevance in retrieved context. Understanding these fundamentals prepares you for the advanced patterns covered in subsequent sections.
The RAG Pipeline
A RAG system operates in two distinct phases: indexing time and query time. Understanding both phases is essential for building effective systems.
Indexing Phase
At indexing time, your documents are processed and stored in a way that enables fast retrieval. The pipeline proceeds as follows:
During indexing, documents are first split into chunks (pieces of text small enough to be relevant to a single query but large enough to contain context). Each chunk is then passed through an embedding model that converts the text into a dense vector representation. These vectors are stored in a vector database alongside metadata that enables filtering and tracking.
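The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real library API: `embed` is a hashed-trigram toy standing in for a real embedding model, and the naive fixed-size `chunk` helper ignores sentence boundaries.

```python
import hashlib

DIM = 64  # toy dimensionality; real embedding models use hundreds to thousands of dims

def embed(text: str) -> list[float]:
    """Toy embedding: hash character trigrams into a fixed-size vector,
    then L2-normalize. A real system would call an embedding model here."""
    vec = [0.0] * DIM
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def chunk(document: str, size: int = 200) -> list[str]:
    """Naive chunking: split into fixed-size character windows."""
    return [document[i:i + size] for i in range(0, len(document), size)]

def index(documents: dict[str, str]) -> list[dict]:
    """Chunk and embed each document, storing vectors with metadata
    (doc id, chunk number) that enables filtering and tracking."""
    store = []
    for doc_id, text in documents.items():
        for n, piece in enumerate(chunk(text)):
            store.append({"doc_id": doc_id, "chunk_no": n,
                          "text": piece, "vector": embed(piece)})
    return store
```

The metadata stored beside each vector is what later enables per-tenant filtering and tracing an answer back to its source document.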
Query Phase
At query time, the system retrieves the most relevant chunks and uses them to augment the LLM's response.
When a user asks a question, the question text is first embedded into a vector. This query vector is compared against all stored chunk vectors using a similarity metric (typically cosine similarity or dot product). The top-k most similar chunks are retrieved and inserted into a prompt along with the original question. The LLM generates a response conditioned on this retrieved context.
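Query-time retrieval reduces to a similarity ranking plus prompt assembly. A minimal sketch over an in-memory store of pre-embedded chunks (the record shape and prompt template here are illustrative choices, not a fixed API; a real system would use ANN search rather than this exact brute-force scan):

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], store: list[dict], k: int = 3) -> list[dict]:
    """Rank every stored chunk by similarity to the query vector (exact,
    brute-force search; vector databases approximate this at scale)."""
    ranked = sorted(store, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Insert the retrieved chunks into the prompt ahead of the question."""
    context = "\n---\n".join(c["text"] for c in chunks)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The top-k parameter is the main lever for the precision/comprehensiveness trade-off discussed next.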
Key Insight: The Two Truths of RAG
Every RAG system embodies a trade-off between two competing truths. The retrieval must be precise (returning only highly relevant chunks) while also being comprehensive (returning enough context to fully answer the question). Optimizing for one typically sacrifices the other, and the right balance depends on your use case.
Embeddings and Vector Representations
Embeddings are the language that RAG systems use to represent meaning. An embedding model maps text to a dense vector in a high-dimensional space where semantically similar texts are positioned near each other. This spatial arrangement enables retrieval by proximity.
How Embeddings Work
Modern embedding models are typically transformer-based neural networks trained on massive text corpora. The training objective is to place texts with similar meanings close together in the embedding space. For example, the vectors for "How do I reset my password?" and "Password reset instructions" would be positioned very close to each other, even though they share few exact words.
Embedding Model Selection
Choosing an embedding model involves trade-offs between performance, cost, and speed.
OpenAI text-embedding-3-large produces 3072-dimensional embeddings and excels at general-purpose use cases with mixed content types. Its high quality and strong generalization make it suitable for diverse document collections where you do not know in advance what kinds of queries users will submit. The trade-off is higher cost and slower embedding generation compared to smaller models.
text-embedding-3-small generates 1536-dimensional embeddings and prioritizes speed and cost efficiency over maximum quality. This model works well for high-volume applications where the cost savings compound significantly and where the slightly lower semantic precision does not meaningfully impact retrieval quality.
Cohere embed-multilingual-v3 produces 1024-dimensional embeddings with strong multilingual support built into the model architecture. This makes it ideal for global applications serving users in multiple languages, where a single model handles queries and documents in diverse linguistic contexts without requiring separate indexes per language.
sentence-transformers is an open-source option with variable dimensionality depending on the specific model variant you choose. This family of models offers full customization potential for enterprise teams with specific domain requirements, though it requires more engineering investment to deploy and maintain compared to managed API options.
Vector Databases
A vector database is specialized storage designed for efficient similarity search on high-dimensional vectors. Unlike traditional databases that excel at exact matches, vector databases enable approximate nearest neighbor (ANN) search, finding vectors that are most similar to a query vector.
Key Vector Database Options
Pinecone is a managed cloud vector database that offers zero operational overhead with automatic scaling handled by the vendor. This convenience comes with the risk of vendor lock-in and potentially high costs at scale, making it best suited for teams that prioritize speed of development over long-term infrastructure flexibility.
Weaviate is an open-source option that includes built-in hybrid search capabilities and GraphQL API support. It can be deployed either as self-hosted infrastructure or as a managed service, giving teams flexibility in how they balance operational control against convenience.
Qdrant is an open-source vector database known for high performance and advanced filtering capabilities. The trade-off is that it requires your team to manage infrastructure, making it suitable for organizations with the engineering capacity to operate their own databases while benefiting from the performance characteristics.
Chroma is an open-source option designed for simplicity and excels at prototyping scenarios where you need to get a vector search system running quickly. It is not hardened for production use at scale, so it is appropriate for early-stage development and exploration rather than high-volume production systems.
pgvector extends PostgreSQL to support vector operations within your existing database infrastructure. This enables a single database to handle both your application data and vector storage, which simplifies architecture for teams already invested in PostgreSQL. The limitation is performance at very high vector counts, where specialized vector databases have advantages.
Indexing Strategies
Vector databases use approximate nearest neighbor (ANN) algorithms to balance search speed against accuracy.

HNSW (Hierarchical Navigable Small World) is a graph-based index offering excellent query speed at the cost of higher memory usage and slower indexing, making it best for scenarios where query latency is critical.

IVF (Inverted File Index) is a cluster-based index that partitions vectors into groups. It indexes faster than HNSW but queries slightly slower, which makes it suitable for frequently updated corpora.

PQ (Product Quantization) compresses vectors for memory efficiency, enabling larger indexes in constrained environments at a slight cost in accuracy.
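The IVF idea can be illustrated with a toy sketch: assign each vector to an inverted list keyed by its nearest centroid, then probe only the closest list(s) at query time. Fixed, hand-chosen centroids stand in here for the learned k-means centroids a real index would use, and dot product assumes unit-normalized vectors.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def nearest_centroid(vec: list[float], centroids: list[list[float]]) -> int:
    """Index of the centroid most similar to vec (dot product on unit vectors)."""
    return max(range(len(centroids)), key=lambda i: dot(vec, centroids[i]))

def build_ivf(vectors: list[list[float]], centroids: list[list[float]]) -> dict:
    """Assign each vector id to its nearest centroid's inverted list."""
    lists: dict[int, list[int]] = {i: [] for i in range(len(centroids))}
    for idx, v in enumerate(vectors):
        lists[nearest_centroid(v, centroids)].append(idx)
    return lists

def search_ivf(query, vectors, centroids, lists, nprobe=1, k=2) -> list[int]:
    """Probe only the nprobe closest clusters: faster than scanning all
    vectors, but approximate -- a match in an unprobed cluster is missed."""
    order = sorted(range(len(centroids)),
                   key=lambda i: dot(query, centroids[i]), reverse=True)
    candidates = [idx for c in order[:nprobe] for idx in lists[c]]
    return sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)[:k]
```

Raising `nprobe` trades speed back for accuracy, which is exactly the ANN balance described above.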
DataForge: RAG Architecture in Practice
DataForge provides a useful illustration of RAG fundamentals. Their system helps enterprise users query their data pipelines using natural language. When a user asks "Which pipelines touched customer data last week?", the system must retrieve the relevant pipeline configurations, execution logs, and data lineage information to generate an accurate answer.
Practical Example: DataForge RAG Indexing
Challenge: DataForge indexes terabytes of pipeline configurations, SQL queries, execution logs, and data lineage graphs across hundreds of enterprise tenants.
Solution: They use a tiered indexing strategy:
Hot tier: Recent execution logs from the last seven days stored in Weaviate with HNSW indexing for sub-50ms queries.
Warm tier: Pipeline configurations and lineage stored in PostgreSQL with pgvector for consistency.
Cold tier: Historical archives in object storage with on-demand embedding generation.
Result: Query latency p95 under 200ms across their entire knowledge base, with perfect tenant isolation.
Evaluating RAG: Retrieval vs Generation
RAG systems have two distinct failure modes: retrieval failures and generation failures. Retrieval failures occur when the system fails to find relevant chunks. Generation failures occur when the LLM fails to correctly use the retrieved context. Evaluating both is essential.
Retrieval Metrics
Recall measures what fraction of relevant documents were retrieved, which you measure by annotating which chunks are actually relevant to a sample of queries. Precision measures what fraction of retrieved documents are relevant, where high precision means less noise in the context provided to the model. MRR (Mean Reciprocal Rank) indicates how highly the first relevant result is ranked, which matters when only the top result is used. NDCG (Normalized Discounted Cumulative Gain) accounts for position in ranking, making it useful when retrieving multiple results where the ordering matters for the final answer quality.
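The first three metrics can be computed directly from a ranked list of retrieved ids and a set of annotated relevant ids. A minimal sketch (NDCG omitted for brevity; function names are illustrative):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved list."""
    hits = sum(1 for d in retrieved[:k] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are relevant (less noise
    in the context means a higher value here)."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result; 0 if none was retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```

In practice these are averaged over a sample of annotated queries; MRR itself is defined as the mean of the per-query reciprocal ranks.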
Generation Metrics
Answer faithfulness evaluates whether the generated answer accurately reflects the retrieved context rather than adding information not present in the documents. Answer relevance measures whether the generated answer actually addresses the user's original question and provides what they were looking for. Context utilization assesses whether the model appropriately uses all relevant information available in the retrieved context, rather than focusing on only part of it or ignoring critical details.
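Faithfulness is usually scored with an LLM-as-judge. As a rough intuition for what "supported by the context" means, here is a deliberately crude lexical proxy (token overlap) that a real evaluation pipeline would replace with a model-based check:

```python
def faithfulness_proxy(answer: str, context: str) -> float:
    """Crude lexical proxy: fraction of answer tokens that also appear in
    the retrieved context. Blind to paraphrase and to misused facts, so a
    real evaluation would use an LLM judge instead of this heuristic."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

A low score flags answers that draw on material absent from the context; a high score does not prove faithfulness, which is why this can only serve as a cheap first filter.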
The Golden Rule of RAG Evaluation
Always evaluate retrieval and generation separately. A perfect retrieval score means nothing if the model ignores the context. Conversely, a brilliant model cannot compensate for irrelevant retrieval. Set retrieval thresholds independently and only measure end-to-end quality once both components meet their baseline.
Common RAG Failure Modes
1. Semantic Drift
Over multiple retrieval cycles or long conversations, the query's semantic intent can drift from the original question. The system retrieves context that is topically related but not actually answering what the user meant to ask.
2. Mid-Context Forgetting
When using a sliding window or iterative retrieval approach, information from earlier in the conversation or document set can be lost. The system forgets critical context that was established earlier.
3. Chunk Boundary Issues
When documents are split into chunks, important information can span chunk boundaries. The retrieval system might retrieve only the part that lacks the full context needed to answer the question.
4. Query-Document Vocabulary Mismatch
Users often ask questions using different vocabulary than the documents. A question about "price" might not retrieve documents that use the term "cost" or "valuation."
Warning: The Chunking Death Spiral
Teams often try to fix retrieval problems by reducing chunk size. Smaller chunks seem to improve precision but reduce context completeness. This leads to retrieval failures on complex questions, prompting even smaller chunks, until the system retrieves fragments that no longer contain enough information for coherent generation. The fix is not smaller chunks but better chunking strategies and potentially query expansion.
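One standard ingredient of those better chunking strategies is overlap between adjacent chunks, which ensures that text near a boundary appears whole in at least one chunk. A minimal character-based sketch (real systems typically overlap on sentence or token boundaries instead):

```python
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Slide a window of `size` chars, stepping by size - overlap, so that
    content near a chunk boundary is fully contained in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap raises storage and embedding cost (each boundary region is indexed twice) but directly attacks the chunk-boundary failure mode without shrinking chunks further.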
RAG does not eliminate hallucination. Some teams assume that because the LLM is grounded in retrieved documents, its outputs will be factual. But RAG only constrains what the model talks about; it does not guarantee the model correctly interprets or reasons about the retrieved content. A RAG system can confidently retrieve relevant documents and then generate an answer that misrepresents them. Evaluation must cover both retrieval accuracy and generation faithfulness.
Hybrid Retrieval: Combining Search Paradigms
Pure vector search excels at semantic similarity but can miss exact keyword matches. Hybrid retrieval combines vector search with traditional keyword-based search (BM25 or TF-IDF) to capture both semantic and lexical matching.
Fusion strategies like Reciprocal Rank Fusion (RRF) combine results from multiple retrieval methods by rank rather than raw score. This approach is robust to scale differences between methods and often outperforms either method alone.
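RRF itself is simple to sketch: each document's fused score is the sum of 1/(k + rank) over every ranking it appears in, and documents are sorted by that fused score. Because only ranks are used, the raw score scales of BM25 and cosine similarity never need to be reconciled.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score each doc by sum of 1/(k + rank) across
    the input rankings, then sort descending. k = 60 is the constant used in
    the original RRF paper; it damps the influence of any single top rank."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Here a document ranked well by both the vector list and the keyword list outscores one ranked highly by only a single method.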
Section Summary
RAG systems retrieve relevant documents from a knowledge base and use them to ground LLM responses. The indexing pipeline transforms documents into embedded vectors stored in specialized databases. At query time, the user's question is embedded and matched against the vector store. Hybrid retrieval combining semantic and keyword search typically outperforms either alone. Evaluation must separately measure retrieval quality (recall, precision, MRR) and generation quality (faithfulness, relevance). Common failure modes include semantic drift, mid-context forgetting, chunk boundary issues, and vocabulary mismatch.