Vector Database Design

TL;DR
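  • HNSW gives you 1–5ms P99 query latency with 95%+ recall, but it keeps full-precision vectors in RAM (about 6KB each at 1536 float32 dimensions) plus a few hundred bytes of graph links per vector; IVF with product quantization drops that to 32–64 bytes per vector at the cost of recall and query speed.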
  • Embedding staleness is operationally silent — a vector index built on yesterday's data answers with yesterday's truth until someone re-indexes.
  • Vector index updates are expensive: HNSW requires relinking graph edges on every insert; plan for background indexing pipelines, not synchronous updates.
  • 90–95% recall is the production sweet spot for most workloads — chasing 99% recall often doubles latency and memory with no user-visible quality gain.
  • PII can be reconstructed from embeddings with inversion attacks — treat vector indexes as sensitive data, not just numeric arrays.

The Problem

Exact nearest-neighbor search over 100M 1536-dimensional vectors (the output dimension of OpenAI's text-embedding-ada-002) requires comparing each query vector against every stored vector: roughly 154 billion multiply-add operations, and about 614GB of float32 data scanned, per query. At a query rate of 1000 QPS, that is roughly 154 trillion operations per second, which no single machine can sustain. Without approximate nearest-neighbor (ANN) indexing, semantic search, recommendation engines, and RAG retrieval all collapse at scale. The engineering challenge is configuring ANN indexes to hit latency SLOs (typically P99 < 50ms) while maintaining acceptable recall (≥90%) and keeping memory costs sane as the dataset grows to hundreds of millions of vectors.
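As a back-of-envelope check on the brute-force cost, the sketch below counts one multiply-add per dimension per stored vector. The function name and its parameters are illustrative, not taken from any library.

```python
def brute_force_cost(num_vectors, dim, qps, bytes_per_float=4):
    """Estimate the work of an exact (exhaustive) nearest-neighbor scan."""
    madds_per_query = num_vectors * dim            # one multiply-add per dimension
    return {
        "madds_per_query": madds_per_query,
        "bytes_scanned_per_query": madds_per_query * bytes_per_float,
        "madds_per_second": madds_per_query * qps,
    }


cost = brute_force_cost(num_vectors=100_000_000, dim=1536, qps=1000)
print(cost["madds_per_query"])          # 153600000000     (~154 billion)
print(cost["bytes_scanned_per_query"])  # 614400000000     (~614 GB)
print(cost["madds_per_second"])         # 153600000000000  (~154 trillion)
```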

Core System Idea

Vector databases (Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector) build specialized ANN indexes over high-dimensional embeddings. The two dominant index families are HNSW (Hierarchical Navigable Small World), a graph-based structure that navigates a multi-layer proximity graph at query time, and IVF (Inverted File Index), which partitions vectors into k clusters around learned centroids and searches only the n_probe clusters nearest to the query. Both trade recall for speed. For memory-constrained deployments, product quantization (PQ) compresses each vector from full float32 (6KB for 1536 dims) down to 32–64 bytes by splitting it into sub-vectors and storing, for each sub-vector, the ID of its nearest codebook entry, with a typical recall drop of 3–8%. In distributed deployments, the index is sharded across nodes; a query coordinator fans out to all shards, collects top-k per shard, and merges results. Writes go through an ingest pipeline that builds or updates index segments asynchronously, because synchronous writes at high QPS are prohibitively expensive for HNSW.
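To make the IVF and PQ ideas concrete, here is a minimal NumPy sketch. It uses a plain Lloyd's k-means and assumes squared L2 distance; every function name here is hypothetical, and production libraries implement both steps far more efficiently.

```python
import numpy as np


def nearest(points, centers):
    """Index of the closest center (squared L2) for each point."""
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)


def kmeans(points, k, n_iters=10, seed=0):
    """Plain Lloyd's k-means; returns the (k, dim) centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].copy()
    for _ in range(n_iters):
        assign = nearest(points, centers)
        for c in range(k):
            members = points[assign == c]
            if len(members):  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return centers


def build_ivf(vectors, n_clusters):
    """IVF: learn centroids, then keep one list of vector ids per cluster."""
    centroids = kmeans(vectors, n_clusters)
    assign = nearest(vectors, centroids)
    lists = [np.flatnonzero(assign == c) for c in range(n_clusters)]
    return centroids, lists


def ivf_search(query, vectors, centroids, lists, k, n_probe):
    """Scan only the n_probe clusters whose centroids are nearest the query."""
    probe = np.argsort(((centroids - query) ** 2).sum(axis=1))[:n_probe]
    cand = np.concatenate([lists[c] for c in probe])
    dist = ((vectors[cand] - query) ** 2).sum(axis=1)
    return cand[np.argsort(dist)[:k]]


def train_pq(vectors, n_sub, n_codes=256):
    """PQ: one codebook per sub-vector; n_codes <= 256 so a code fits a byte."""
    sub_dim = vectors.shape[1] // n_sub
    return np.stack([
        kmeans(vectors[:, s * sub_dim:(s + 1) * sub_dim], n_codes)
        for s in range(n_sub)
    ])


def encode_pq(vectors, codebooks):
    """Replace each sub-vector with the id of its nearest codebook entry."""
    n_sub, _, sub_dim = codebooks.shape
    cols = [
        nearest(vectors[:, s * sub_dim:(s + 1) * sub_dim], codebooks[s])
        for s in range(n_sub)
    ]
    return np.stack(cols, axis=1).astype(np.uint8)
```

With this layout, a 1536-dimensional vector split into 64 sub-vectors is stored as 64 one-byte codes instead of 6144 bytes of float32. Raising n_probe scans more clusters and recovers recall at the cost of latency; probing every cluster is equivalent to an exact scan.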

System Flow

flowchart TD
    A["Client"] --> B["Query Coordinator"]
    B --> C["Vector Shard 1"]
    B --> D["Vector Shard N"]
    E["Ingest Service"] --> F["Index Builder"]
    F --> C
    F --> D

Query coordinator fans out to shards and merges top-k results; ingest pipeline updates indexes asynchronously.
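The fan-out and merge step can be sketched as follows. Shards are modeled as in-memory lists of (id, vector) pairs, which is an assumption for illustration only; in a real deployment `search_shard` would be a network call to a shard node.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def search_shard(shard, query, k):
    """Return the shard's k nearest hits as sorted (distance, id) tuples."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), vid)
        for vid, vec in shard
    )
    return heapq.nsmallest(k, scored)


def coordinator_search(shards, query, k):
    """Fan out to every shard in parallel, then merge the per-shard top-k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda s: search_shard(s, query, k), shards))
    # Each partial list is already sorted, so a k-way merge yields the global order.
    return list(islice(heapq.merge(*partials), k))


shards = [
    [(0, [0.0, 0.0]), (1, [5.0, 5.0])],
    [(2, [1.0, 0.0]), (3, [9.0, 9.0])],
]
print(coordinator_search(shards, [0.0, 0.0], k=2))  # [(0.0, 0), (1.0, 2)]
```

Each shard must return its own full top-k, since the global top-k could in the worst case all live on one shard.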

Real-World Examples (Indicative)

Pinterest Visual Search

Manages 10B+ image embeddings at 2048 dimensions. Uses HNSW sharded across hundreds of nodes with product quantization to compress each vector from 8KB to ~256 bytes, staying within RAM budget while hitting P99 < 100ms. Without PQ, storing 10B vectors at float32 would require 80TB of RAM — economically impossible.
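The memory figures in this example follow from simple arithmetic, sketched below using decimal terabytes. The helper name is illustrative, and the calculation covers vector storage only, not graph links or replicas.

```python
def index_ram_tb(num_vectors, bytes_per_vector):
    """RAM needed for vector storage alone, in decimal terabytes."""
    return num_vectors * bytes_per_vector / 1e12


raw_tb = index_ram_tb(10_000_000_000, 2048 * 4)  # float32: 4 bytes per dimension
pq_tb = index_ram_tb(10_000_000_000, 256)        # ~256 bytes per PQ-compressed vector
print(raw_tb)  # 81.92
print(pq_tb)   # 2.56
```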

Spotify Music Recommendations

Stores ~100M track and user embeddings at 256 dimensions. Uses a two-stage retrieval: ANN to get 500 candidates in ~5ms, then exact re-scoring with richer features to pick the final 20. Switched from Annoy to HNSW and observed a 3× improvement in P99 query latency at the same recall level.

Notion Semantic Search

Uses pgvector (PostgreSQL extension) rather than a dedicated vector database. At their scale (millions of documents per tenant), IVFFlat indexes in Postgres are fast enough (10–30ms), avoiding the operational complexity of running a separate vector DB service. Trade-off: loses HNSW-level recall and has no native horizontal sharding — acceptable for their workload.

Anti-Patterns

Synchronous HNSW writes in the hot path

Inserting vectors into an HNSW index on every API request blocks for 10–100ms per insert as the graph is relinked. At 1000 writes/sec, this saturates the index thread pool. Use a write buffer and background index segment merging (Qdrant and Milvus both support this natively).
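A write buffer of this kind might look like the sketch below. The class and the `index_batch_fn` callback are hypothetical stand-ins for whatever bulk-insert call your index exposes; this is not the Qdrant or Milvus API.

```python
import threading


class BufferedIndexWriter:
    """Accepts writes on the request path and indexes them in batches later."""

    def __init__(self, index_batch_fn, max_batch=1000):
        self._index_batch_fn = index_batch_fn  # hypothetical bulk-insert callback
        self._max_batch = max_batch
        self._pending = []
        self._lock = threading.Lock()

    def add(self, vector_id, vector):
        """Request path: a cheap append, no graph work."""
        with self._lock:
            self._pending.append((vector_id, vector))

    def flush(self):
        """Background path: index at most one batch; returns its size."""
        with self._lock:
            batch = self._pending[:self._max_batch]
            self._pending = self._pending[self._max_batch:]
        if batch:
            self._index_batch_fn(batch)
        return len(batch)


def run_background_flusher(writer, stop_event, interval_s=0.5):
    """Drain the buffer until stop_event is set; sleep when it is empty."""
    while not stop_event.is_set():
        if writer.flush() == 0:
            stop_event.wait(interval_s)
```

The cost of this design is that a vector is not searchable until its batch has been flushed, so the flush interval sets the visibility delay.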

Ignoring index rebuild cost

A full HNSW rebuild on 100M vectors takes 2–6 hours on a 32-core machine. Teams that don't plan for this get caught when they need to change the embedding model — migration requires re-embedding everything and rebuilding from scratch.

Fixed embedding dimension in the schema

Hardcoding VECTOR(1536) without an abstraction layer locks you to OpenAI's ada-002 dimension. Switching to a better model (e.g., 3072-dim text-embedding-3-large) requires a full schema migration and re-index.
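One way to avoid this lock-in is to route every dimension and model reference through a small interface. The sketch below is an assumption about how such a layer could look; `Embedder`, `FakeEmbedder`, and `collection_name` are hypothetical names, and the fake model stands in for a real embedding client.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class Embedder(Protocol):
    """What application code may depend on: a name, a dimension, an embed call."""
    name: str
    dim: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


@dataclass
class FakeEmbedder:
    """Stand-in for a real model client, used here only for illustration."""
    name: str
    dim: int

    def embed(self, texts):
        return [[float(len(t) % 7)] * self.dim for t in texts]


def collection_name(embedder: Embedder, base: str = "docs") -> str:
    """Derive the index name from the model so each model gets its own index."""
    return f"{base}__{embedder.name}__{embedder.dim}"


old = FakeEmbedder(name="model-a", dim=1536)
new = FakeEmbedder(name="model-b", dim=3072)
print(collection_name(old))  # docs__model-a__1536
print(collection_name(new))  # docs__model-b__3072
```

Because the index name is derived from the model, a new model's index can be built alongside the old one and traffic switched over once re-embedding finishes.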

Treating vector indexes as non-sensitive data

Embedding inversion attacks can recover approximate original text from embeddings, especially for short strings. A leaked vector index is a data breach. Apply the same access controls and encryption you'd apply to the original documents.

Chasing 99% recall

Tuning HNSW to hit 99% recall typically doubles query latency compared to 95% recall, and it raises RAM usage as well when it requires a larger M (efSearch alone only adds query-time work). For most recommendation and semantic search workloads, the user cannot perceive the quality difference, so profile before over-tuning.
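Profiling recall means comparing the ANN results against exact results on a sample of queries. A minimal sketch, assuming squared L2 distance and that the ANN index returns an array of ids per query:

```python
import numpy as np


def exact_top_k(queries, vectors, k):
    """Ground truth by brute force, using ||q||^2 - 2 q.v + ||v||^2."""
    d = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ vectors.T
        + (vectors ** 2).sum(axis=1)[None, :]
    )
    return np.argsort(d, axis=1)[:, :k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of true top-k neighbors that the ANN index also returned."""
    approx_ids = np.asarray(approx_ids)
    exact_ids = np.asarray(exact_ids)
    hits = sum(
        len(set(a) & set(e))
        for a, e in zip(approx_ids.tolist(), exact_ids.tolist())
    )
    return hits / exact_ids.size


print(recall_at_k([[1, 2], [3, 9]], [[1, 2], [3, 4]]))  # 0.75
```

Running this on a few hundred sampled queries at several efSearch settings gives the latency and recall curve needed to pick a setting deliberately.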

Design Tradeoffs

| Dimension | HNSW | IVF + PQ |
| --- | --- | --- |
| Query latency | 1–5ms P99 | 5–20ms P99 |
| Memory per vector | ~6KB raw (float32, 1536d) plus a few hundred bytes of graph links | 32–64B (with quantization) |
| Index build time | Slow (hours at 100M vectors) | Fast (parallel k-means) |
| Update cost | High (graph re-link on insert) | Low (assign to nearest centroid) |
| Recall at same speed | Higher | Lower (3–8% drop with PQ) |

Best Practices

  • Separate index build from query serving on different compute resources. HNSW builds are CPU-intensive and will degrade P99 query latency if run on the same nodes.
  • Monitor recall in production, not just latency. Recall degrades silently as the dataset drifts away from the index's training distribution, so add a ground-truth recall check to your weekly ops review.
  • Use product quantization for datasets over 10M vectors where full float32 RAM cost is prohibitive. Accept the 3–8% recall drop explicitly rather than discovering it accidentally.
  • Abstract the embedding model behind an interface from day one. When you switch models (and you will), you want to re-embed and re-index without touching application code.
  • Use a soft-delete pattern: mark vectors as deleted in a filter field rather than physically removing them from the HNSW graph. Physical deletion from HNSW requires costly graph repair; batch the tombstone cleanup into off-peak windows instead.
  • Apply the same IAM policies, encryption at rest, and audit logging to vector indexes as to the source documents they represent.
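The soft-delete practice can be sketched as a thin wrapper around the index. Here `search_fn` and `remove_fn` are hypothetical callbacks standing in for the real index operations, and `search_fn` is assumed to return (id, distance) pairs sorted by distance.

```python
class SoftDeleteIndex:
    """Hides deleted ids at query time; physical removal happens in batches."""

    def __init__(self, search_fn, overfetch=2):
        self._search_fn = search_fn  # hypothetical: (query, k) -> [(id, distance), ...]
        self._overfetch = overfetch  # fetch extra hits to cover filtered-out ones
        self._tombstones = set()

    def delete(self, vector_id):
        """Cheap logical delete: no graph repair on the request path."""
        self._tombstones.add(vector_id)

    def search(self, query, k):
        hits = self._search_fn(query, k * self._overfetch)
        live = [h for h in hits if h[0] not in self._tombstones]
        return live[:k]

    def compact(self, remove_fn):
        """Off-peak: physically remove tombstoned ids, then forget them."""
        doomed = sorted(self._tombstones)
        remove_fn(doomed)
        self._tombstones.clear()
        return len(doomed)
```

If many results are tombstoned, a fixed over-fetch factor can return fewer than k hits, which is one more reason to compact regularly.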

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Semantic similarity search over millions to billions of vectors at low latency | Exact key-value or primary key lookup (use a relational DB) |
| RAG retrieval requires sub-50ms embedding search | Dataset is under 1M vectors (brute-force with pgvector or numpy is simpler) |
| Recommendation or anomaly detection workloads where approximate results are acceptable | Every write must be immediately visible to queries (eventual consistency is not tolerable) |
| Embedding model is expected to evolve and re-indexing must be manageable | Ultra-low memory budget where even PQ-compressed indexes don't fit |
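For the small-dataset case, an exact scan in NumPy can be as short as the sketch below. It assumes cosine similarity and non-zero vectors; the function name is illustrative.

```python
import numpy as np


def brute_force_top_k(query, vectors, k):
    """Exact cosine-similarity search: no index to build, tune, or rebuild."""
    vn = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    qn = query / np.linalg.norm(query)
    sims = vn @ qn
    k = min(k, len(sims))
    top = np.argpartition(-sims, k - 1)[:k]  # unordered top-k in linear time
    return top[np.argsort(-sims[top])]       # order just those k by similarity


vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
print(brute_force_top_k(np.array([1.0, 0.1]), vectors, k=2))  # [0 2]
```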