RAG Architecture

Retrieval quality — not the LLM — is the primary failure mode in production RAG systems; bad chunks produce confident, wrong answers.

TL;DR
  • Chunking strategy determines retrieval ceiling: fixed 512-token chunks are a starting point, not an answer; semantic and hierarchical chunking reliably outperform them.
  • Hybrid retrieval (dense vector + BM25 sparse) consistently beats pure vector search, especially for keyword-heavy queries like product names and error codes.
  • Stale embeddings are silent killers — a document updated 6 months ago but never re-indexed poisons every downstream answer that depends on it.
  • RAG vs fine-tuning: RAG wins when the knowledge changes frequently or must be auditable; fine-tuning wins when the task requires new reasoning behavior, not new facts.

The Problem

An LLM-powered support bot answers a billing question using the pricing page it was trained on — which was updated three months ago. The customer gets a confident, wrong answer and escalates. This is RAG's core motivation: LLMs are frozen at training cutoff, have no access to proprietary data, and hallucinate when asked about things they don't know. Fine-tuning for every data update is operationally and economically impractical — a model retrain takes hours to days and costs thousands of dollars. RAG solves this by keeping the LLM static while making the knowledge layer dynamic and auditable.

Core System Idea

RAG decouples the LLM's reasoning capability from the knowledge base. Two pipelines operate independently: one offline, one at query time.

The indexing pipeline (offline) chunks source documents into segments of 256–1024 tokens, embeds each chunk using a text embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or a self-hosted BGE model), and stores the vectors in a vector database (Pinecone, Weaviate, Qdrant, Chroma, or pgvector).

The retrieval pipeline (online, at query time) embeds the user query, retrieves the top-k most similar chunks (typically k=5–10) using approximate nearest neighbor search, optionally reranks them with a cross-encoder (Cohere Rerank, FlashRank), and injects the selected chunks into the LLM prompt as grounding context. The LLM then generates an answer conditioned on the retrieved chunks rather than on its parametric memory alone. Source attribution follows naturally, provided each chunk is stored with its document origin.
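The two pipelines can be sketched end to end in a few lines. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, the "vector database" is a plain list searched by brute force rather than approximate nearest neighbor, and all names are assumptions made for the example.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 1024  # width of the toy embedding (assumption for this sketch)


def toy_embed(text: str) -> list[float]:
    """Hashed bag-of-words vector, L2-normalised. Stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class Chunk:
    doc_id: str  # document origin, kept so answers can cite their source
    text: str
    vector: list[float]


def build_index(docs: dict[str, list[str]]) -> list[Chunk]:
    """Indexing pipeline (offline): embed every pre-split chunk of every document."""
    return [
        Chunk(doc_id, text, toy_embed(text))
        for doc_id, chunks in docs.items()
        for text in chunks
    ]


def retrieve(index: list[Chunk], query: str, k: int = 5) -> list[Chunk]:
    """Retrieval pipeline (online): embed the query, return the k most similar chunks."""
    q = toy_embed(query)

    def cosine(chunk: Chunk) -> float:
        # Vectors are already unit length, so the dot product is the cosine similarity.
        return sum(a * b for a, b in zip(q, chunk.vector))

    return sorted(index, key=cosine, reverse=True)[:k]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Inject retrieved chunks, tagged with their origin, as grounding context."""
    context = "\n".join(f"[{c.doc_id}] {c.text}" for c in chunks)
    return (
        "Answer using only the context below and cite the bracketed sources.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
```

In a real system the string returned by `build_prompt` would be sent to the LLM; a reranking step, if used, would sit between `retrieve` and `build_prompt`.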

System Flow

flowchart TD
    A["User Query"] --> B["Query Embedder"]
    B --> C["Vector DB Search"]
    C --> D["Reranker"]
    D --> E["Context Assembler"]
    E --> F["LLM Inference"]
    F --> G["Response + Citations"]

The query is embedded, the top-k chunks are retrieved and reranked, and the selected chunks are injected as context before the LLM call.

Real-World Examples (Indicative)

Perplexity AI

Retrieves live web content at query time, reranks retrieved documents by relevance using a cross-encoder, injects condensed summaries into the prompt, and cites every source explicitly. The system is essentially RAG at web scale — the "vector DB" is a real-time web index, refreshed continuously. Retrieval latency adds ~200–400ms but answer grounding is verifiable.

GitHub Copilot (workspace context)

At suggestion time, Copilot retrieves semantically related code snippets from the user's open files and recently edited files using lightweight embedding-based retrieval. This is RAG with a per-session, ephemeral knowledge base — the LLM never sees the whole repo, only the top-k relevant snippets. It dramatically improves suggestion relevance for project-specific APIs, naming conventions, and patterns.

Notion AI

Chunks the user's workspace documents at paragraph boundaries, embeds them with per-user vector indexes (stored in Pinecone), and retrieves relevant sections at query time. The key design choice: per-user isolation at the index level, not just at query time, which provides both privacy and retrieval precision by eliminating cross-user noise.

Anti-Patterns

Fixed-size chunking that ignores semantic boundaries

Splitting at a fixed length (for example, every 512 tokens) regardless of semantic structure breaks tables, code blocks, and multi-sentence arguments midway. The resulting fragments often lack the context needed to make sense on their own and can fail retrieval even for directly relevant content.
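One alternative is to pack whole paragraphs into chunks and carry a little overlap forward. The sketch below assumes paragraphs are separated by blank lines and uses whitespace-separated words as a rough proxy for tokens; `chunk_paragraphs` and its parameters are hypothetical names, and a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_paragraphs(
    text: str, max_tokens: int = 512, overlap_paragraphs: int = 1
) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens.

    Paragraphs are never split, so a single oversized paragraph becomes its
    own (oversized) chunk. The last `overlap_paragraphs` paragraphs of a chunk
    are repeated at the start of the next one to preserve context.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for para in paragraphs:
        n = len(para.split())  # crude token estimate (assumption)
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as overlap.
            current = current[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
            current_len = sum(len(p.split()) for p in current)
            # Drop the overlap if it would push the new chunk over budget.
            if current_len + n > max_tokens:
                current, current_len = [], 0
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The same packing idea extends to other boundaries (sections, code blocks) by changing how `paragraphs` is produced.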

Stale embeddings without pipeline monitoring

Documents updated in the source system but not re-indexed silently serve outdated answers. Without freshness tracking, you won't know which chunks are stale until a user catches the error.
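A freshness check can be as simple as comparing two timestamps per document. The sketch below assumes the source system exposes a last-modified time and the indexer records when it last embedded each document; `DocState`, `find_stale`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DocState:
    doc_id: str
    source_updated_at: datetime  # last change in the source system
    indexed_at: datetime         # last time the document was embedded
    slo: timedelta               # how long a change may stay un-indexed


def find_stale(docs: list[DocState], now: datetime) -> list[str]:
    """Return ids of documents that changed after their last index run
    and have stayed un-indexed for longer than their staleness SLO."""
    stale = []
    for d in docs:
        changed_since_index = d.source_updated_at > d.indexed_at
        if changed_since_index and now - d.source_updated_at > d.slo:
            stale.append(d.doc_id)
    return stale
```

Running a check like this on a schedule and alerting on a non-empty result turns silent staleness into a visible metric.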

Pure cosine similarity retrieval

Vector search alone often fails on keyword-heavy queries (product SKUs, error codes, proper nouns), where exact lexical matching with BM25 sparse retrieval tends to do better. Production systems without hybrid retrieval can leave an estimated 15–30% of recall on the table, depending on the corpus and query mix.
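A common way to combine a dense ranking and a BM25 ranking is Reciprocal Rank Fusion, which needs only the ranked ids from each retriever, not comparable scores. The sketch below is a minimal version; the function name is an assumption, and `k=60` is the smoothing constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into one ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by chunk id for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Example: "a" is ranked highly by both retrievers, so it comes first.
dense_hits = ["a", "b", "c"]   # ids from vector search (illustrative)
sparse_hits = ["c", "a", "d"]  # ids from BM25 (illustrative)
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because only ranks are used, the two retrievers' raw scores never need to be normalised against each other.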

Injecting too many chunks

Retrieving top-20 chunks to "be safe" floods the context with noise, triggers the lost-in-the-middle effect (models attend less to material buried in the middle of a long context), and increases cost, often with no quality gain. k=5 with reranking typically beats k=20 without.
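One way to enforce this is to cap the context by both chunk count and token budget after reranking. The sketch below assumes each candidate arrives as a `(score, text)` pair and approximates tokens by whitespace-separated words; `select_context` and its defaults are illustrative choices, not fixed rules.

```python
def select_context(
    scored_chunks: list[tuple[float, str]],
    max_chunks: int = 5,
    token_budget: int = 2000,
) -> list[str]:
    """Keep the highest-scoring chunks that fit the chunk and token limits."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        if len(selected) == max_chunks:
            break
        cost = len(text.split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # this chunk would overflow; a smaller, lower-ranked one may still fit
        selected.append(text)
        used += cost
    return selected
```

Skipping an oversized chunk instead of stopping is a design choice; stopping at the first overflow is stricter about rank order and equally defensible.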

No retrieval observability

Deploying RAG without logging query–chunk pairs makes it impossible to diagnose retrieval failures. When the LLM gives a wrong answer, you can't tell if it failed because the right chunk was missing, retrieved but not selected, or present but ignored.
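A retrieval log does not need much structure to be useful. The sketch below builds one JSON line per retrieval event; the field names and the `retrieval_log_record` helper are assumptions for illustration, and where the line is written (file, queue, tracing system) is left open.

```python
import json
import time
from typing import Optional


def retrieval_log_record(
    query: str,
    hits: list[tuple[str, float]],
    answer_rated_correct: Optional[bool] = None,
    ts: Optional[float] = None,
) -> str:
    """Serialise one retrieval event as a JSON line.

    `hits` is a list of (chunk_id, score) pairs in the order they were
    injected into the prompt. `answer_rated_correct` stays None until
    user feedback arrives.
    """
    record = {
        "ts": ts if ts is not None else time.time(),
        "query": query,
        "chunks": [
            {"chunk_id": chunk_id, "score": round(score, 4)}
            for chunk_id, score in hits
        ],
        "answer_rated_correct": answer_rated_correct,
    }
    return json.dumps(record, sort_keys=True)
```

With these records, a wrong answer can be traced to its cause: the right chunk id is absent from `chunks` (retrieval miss), present with a low rank (selection problem), or present and highly ranked (the model ignored it).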

Design Tradeoffs

| Dimension | Simple | Advanced |
|---|---|---|
| Chunking | Fixed 512-token splits | Semantic / hierarchical with overlap |
| Embedding refresh | Nightly batch re-index | Event-driven on document change |
| Retrieval method | Pure vector (cosine similarity) | Hybrid (vector + BM25) + cross-encoder reranking |
| Query handling | Raw query → embed → retrieve | Query rewriting + HyDE before retrieval |
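HyDE (Hypothetical Document Embeddings), listed under advanced query handling, embeds an LLM-written hypothetical answer instead of the raw query, on the idea that an answer-shaped passage sits closer to real answer chunks in embedding space. The sketch below shows only the control flow: `generate`, `embed`, and `search` are hypothetical callables standing in for an LLM client, an embedding model, and a vector store, and the prompt wording is an assumption.

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list],
    k: int = 5,
) -> list:
    """Retrieve with the embedding of a hypothetical answer, not the raw query."""
    prompt = f"Write a short passage that answers the question.\nQuestion: {query}"
    hypothetical_answer = generate(prompt)  # may be factually wrong; used only for retrieval
    return search(embed(hypothetical_answer), k)
```

The hypothetical passage is discarded after retrieval; only the real chunks it surfaced are passed to the final LLM call. The extra generation step adds latency, which is part of the tradeoff against the simple path.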

Best Practices

  • Chunk at semantic boundaries (paragraph, section, code block) with 10–20% overlap between adjacent chunks to preserve context across splits.
  • Track embedding freshness as a first-class metric: alert when any document's embedding age exceeds your staleness SLO (e.g., 24 hours for a policy doc, 1 hour for a pricing page).
  • Use hybrid retrieval by default (Weaviate, Qdrant, and Elasticsearch all support it natively): dense vector for semantic similarity, BM25 for lexical match, combined with RRF (Reciprocal Rank Fusion) or a learned reranker.
  • Cap retrieved context at 1500–2000 tokens (roughly 5 well-sized chunks) and prefer fewer, higher-quality chunks over more, noisier ones.
  • Log every retrieval event: query text, retrieved chunk IDs, chunk scores, and whether the user rated the answer as correct. This data is essential for iterative improvement of chunk size and retrieval parameters.
  • Always include source document metadata in the injected context (document title, section, last-updated date) so the LLM can cite sources and you can audit answer provenance.
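The last practice, carrying source metadata into the prompt, can be sketched as a small formatting step. The `RetrievedChunk` fields and the header layout below are assumptions for illustration; any consistent layout that the prompt instructions refer to would serve the same purpose.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class RetrievedChunk:
    title: str
    section: str
    last_updated: date
    text: str


def format_context(chunks: list[RetrievedChunk]) -> str:
    """Render chunks as numbered blocks, each prefixed with its provenance."""
    blocks = []
    for i, c in enumerate(chunks, start=1):
        header = (
            f"[{i}] {c.title} > {c.section} "
            f"(last updated {c.last_updated.isoformat()})"
        )
        blocks.append(f"{header}\n{c.text}")
    return "\n\n".join(blocks)
```

The bracketed numbers give the LLM a stable handle to cite, and the same headers can be logged alongside the answer for later provenance audits.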

When to Use / Avoid

| Use When | Avoid When |
|---|---|
| Knowledge changes frequently (pricing, policies, docs) | Knowledge is static and already in the model's training data |
| Responses must cite specific source documents | Context windows are consistently exhausted even with optimal chunking |
| Fine-tuning is too slow or expensive for the update velocity | Source data is low-quality, inconsistent, or unstructured beyond rescue |
| Proprietary data must stay out of model training for compliance | Sub-100ms latency is required: retrieval adds 50–200ms minimum |