RAG Architecture
- Retrieval quality — not the LLM — is the primary failure mode in production RAG systems; bad chunks produce confident, wrong answers.
- Chunking strategy determines retrieval ceiling: fixed 512-token chunks are a starting point, not an answer; semantic and hierarchical chunking reliably outperform them.
- Hybrid retrieval (dense vector + BM25 sparse) consistently beats pure vector search, especially for keyword-heavy queries like product names and error codes.
- Stale embeddings are silent killers — a document updated 6 months ago but never re-indexed poisons every downstream answer that depends on it.
- RAG vs fine-tuning: RAG wins when the knowledge changes frequently or must be auditable; fine-tuning wins when the task requires new reasoning behavior, not new facts.
The Problem
An LLM-powered support bot answers a billing question from the pricing page it was trained on, but that page was updated three months ago. The customer gets a confident, wrong answer and escalates. This is RAG's core motivation: LLMs are frozen at their training cutoff, have no access to proprietary data, and hallucinate when asked about things they don't know. Fine-tuning on every data update is operationally and economically impractical, since a retrain can take hours to days and cost thousands of dollars. RAG addresses this by keeping the LLM static while making the knowledge layer dynamic and auditable.
Core System Idea
RAG decouples the LLM's reasoning capability from the knowledge base. Two pipelines operate independently. The indexing pipeline (offline) chunks source documents into segments of 256–1024 tokens, embeds each chunk using a text embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, or a self-hosted BGE model), and stores the vectors in a vector database (Pinecone, Weaviate, Qdrant, Chroma, or pgvector). The retrieval pipeline (online, at query time) embeds the user query, retrieves the top-k most similar chunks (typically k=5–10) using approximate nearest neighbor search, optionally reranks them with a cross-encoder (Cohere Rerank, FlashRank), and injects the selected chunks into the LLM prompt as grounding context. The LLM then generates an answer conditioned on the retrieved chunks rather than relying only on its parametric memory. Source attribution comes nearly for free, provided each chunk is stored with its document origin.
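The two pipelines can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into a fixed-size vector), the list returned by `build_index` stands in for a vector database, and brute-force cosine similarity replaces approximate nearest neighbor search. All function and field names here are assumptions made for the example.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 256  # toy vector size; real embedding models use far larger learned vectors


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token, count in Counter(re.findall(r"[a-z0-9]+", text.lower())).items():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs: dict[str, str]) -> list[dict]:
    """Offline indexing: split each document on blank lines and embed every chunk.

    Each record keeps its doc_id so answers can be attributed to a source.
    """
    index = []
    for doc_id, text in docs.items():
        for i, chunk in enumerate(text.split("\n\n")):
            chunk = chunk.strip()
            if chunk:
                index.append(
                    {"doc_id": doc_id, "chunk_id": i, "text": chunk, "vector": toy_embed(chunk)}
                )
    return index


def retrieve(index: list[dict], query: str, k: int = 5) -> list[dict]:
    """Online retrieval: embed the query and return the k most similar chunks."""
    q = toy_embed(query)
    # Vectors are unit length, so the dot product equals cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, rec["vector"])), rec) for rec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:k]]
```

Swapping `toy_embed` for a real model and the list for a vector store changes the components but not the shape of the flow: index offline, embed the query online, rank by similarity, keep the source id on every chunk.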
System Flow
The query is embedded, the top-k chunks are retrieved and reranked, and the selected chunks are injected as context before the LLM call.
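The final step of this flow, injecting chunks as context, might look like the sketch below. The prompt wording, the function name `build_grounded_prompt`, and the `doc_id` and `text` fields are all illustrative assumptions; the source does not prescribe a prompt template.

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble an LLM prompt from retrieved chunks, numbering each one with its source."""
    context_lines = []
    for n, chunk in enumerate(chunks, start=1):
        # Keeping the source id next to the text lets the model cite it as [n].
        context_lines.append(f"[{n}] (source: {chunk['doc_id']}) {chunk['text']}")
    context = "\n".join(context_lines)
    return (
        "Answer using only the numbered context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the chunks makes the answer's citations checkable against the exact text that was retrieved.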
Real-World Examples (Indicative)
Retrieves live web content at query time, reranks retrieved documents by relevance using a cross-encoder, injects condensed summaries into the prompt, and cites every source explicitly. The system is essentially RAG at web scale — the "vector DB" is a real-time web index, refreshed continuously. Retrieval latency adds ~200–400ms but answer grounding is verifiable.
At suggestion time, Copilot retrieves semantically related code snippets from the user's open files and recently edited files using lightweight embedding-based retrieval. This is RAG with a per-session, ephemeral knowledge base — the LLM never sees the whole repo, only the top-k relevant snippets. It dramatically improves suggestion relevance for project-specific APIs, naming conventions, and patterns.
Chunks the user's workspace documents at paragraph boundaries, embeds them with per-user vector indexes (stored in Pinecone), and retrieves relevant sections at query time. The key design choice: per-user isolation at the index level, not just at query time, which provides both privacy and retrieval precision by eliminating cross-user noise.
Anti-Patterns
Splitting at exactly 512 characters regardless of semantic structure breaks tables, code blocks, and multi-sentence arguments midway. The resulting chunks are individually meaningless and fail retrieval even for directly relevant content.
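One simple alternative is to split only at paragraph boundaries and pack whole paragraphs into a chunk until a size budget is reached. The sketch below assumes blank lines separate paragraphs and uses a word count as a rough proxy for tokens; `chunk_by_paragraph` and `max_words` are hypothetical names. It does not handle tables or code blocks that contain blank lines, which would need a format-aware splitter.

```python
def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph is never split. A single paragraph longer than max_words
    becomes its own oversized chunk instead of being cut midway.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

For example, `chunk_by_paragraph("a b c\n\nd e\n\nf g h i", max_words=5)` keeps the first two paragraphs together and starts a new chunk for the third.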
Documents updated in the source system but not re-indexed silently serve outdated answers. Without freshness tracking, you won't know which chunks are stale until a user catches the error.
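One way to add freshness tracking is to store a content hash per document at index time and compare it against the source on a schedule or on change events. This is a sketch under assumptions: `find_stale`, the `indexed_hashes` mapping, and the idea that documents are available as plain strings are all hypothetical choices for the example.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare live documents against the hashes recorded when they were indexed.

    source_docs maps doc_id -> current text; indexed_hashes maps doc_id -> hash
    stored at index time. Returns the doc ids that need re-indexing or removal.
    """
    changed = [
        doc_id
        for doc_id, text in source_docs.items()
        if doc_id in indexed_hashes and indexed_hashes[doc_id] != content_hash(text)
    ]
    new = [doc_id for doc_id in source_docs if doc_id not in indexed_hashes]
    deleted = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return {"changed": sorted(changed), "new": sorted(new), "deleted": sorted(deleted)}
```

Documents in `changed` and `new` get re-chunked and re-embedded; chunks belonging to `deleted` documents should be purged so they stop being retrieved.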
Vector search alone fails on keyword-heavy queries (product SKUs, error codes, proper nouns) where BM25 sparse retrieval dominates. Production systems without hybrid retrieval leave 15–30% recall on the table.
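The source does not say how the dense and sparse result lists should be combined. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used by any system named here. It needs only the ranked ids from each retriever, so score scales do not have to be comparable. The function name and the constant `rrf_k=60` (a commonly used default) are illustrative.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each chunk scores 1 / (rrf_k + rank) for every list it appears in, so chunks
    ranked well by both the dense and the sparse retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (rrf_k + rank)
    # Sort by fused score (descending); break ties by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Usage: ranked ids from a vector search and from BM25 (hypothetical values).
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

In this example `a` and `c` appear in both lists and end up ahead of `b` and `d`, each of which only one retriever found.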
Retrieving top-20 chunks to "be safe" floods the context with noise, triggering the lost-in-the-middle effect and increasing cost with no quality gain. k=5 with reranking beats k=20 without.
Deploying RAG without logging query–chunk pairs makes it impossible to diagnose retrieval failures. When the LLM gives a wrong answer, you can't tell if it failed because the right chunk was missing, retrieved but not selected, or present but ignored.
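A minimal form of this logging is one JSON line per request that records what was retrieved, what was actually passed to the LLM, and what came back. The record layout and the names `format_retrieval_log`, `chunk_id`, and `score` are assumptions for illustration; adapt them to whatever the retriever returns.

```python
import json
import time
from typing import Optional


def format_retrieval_log(
    query: str,
    retrieved: list[dict],
    used_chunk_ids: list[str],
    answer: str,
    ts: Optional[float] = None,
) -> str:
    """Serialize one query-chunk record as a single JSON line.

    retrieved holds every candidate chunk with its score; used_chunk_ids lists
    the subset that was injected into the prompt after reranking.
    """
    record = {
        "ts": time.time() if ts is None else ts,
        "query": query,
        "retrieved": [{"chunk_id": c["chunk_id"], "score": c["score"]} for c in retrieved],
        "used_chunk_ids": list(used_chunk_ids),
        "answer": answer,
    }
    return json.dumps(record)
```

With these records, the three failure cases can be told apart: the right chunk is absent from `retrieved`, present in `retrieved` but missing from `used_chunk_ids`, or present in both while the answer still ignores it.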
Design Tradeoffs
| Dimension | Simple | Advanced |
|---|---|---|
| Chunking | Fixed 512-token splits | Semantic / hierarchical with overlap |
| Embedding refresh | Nightly batch re-index | Event-driven on document change |
| Retrieval method | Pure vector (cosine similarity) | Hybrid (vector + BM25) + cross-encoder reranking |
| Query handling | Raw query → embed → retrieve | Query rewriting + HyDE before retrieval |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Knowledge changes frequently (pricing, policies, docs) | Knowledge is static and already in the model's training data |
| Responses must cite specific source documents | Context windows are consistently exhausted even with optimal chunking |
| Fine-tuning is too slow or expensive for the update velocity | Source data is low-quality, inconsistent, or unstructured beyond rescue |
| Proprietary data must stay out of model training for compliance | Sub-100ms latency is required — retrieval adds 50–200ms minimum |