Vector Database Design
- HNSW gives you 1–5ms P99 query latency with 95%+ recall but keeps full-precision vectors in RAM (about 6 KB each at 1536 float32 dimensions) plus graph-link overhead on the order of a few hundred bytes, depending on M; IVF with product quantization drops that to 32–64 bytes per vector at the cost of recall and query speed.
- Embedding staleness is operationally silent — a vector index built on yesterday's data answers with yesterday's truth until someone re-indexes.
- Vector index updates are expensive: HNSW requires relinking graph edges on every insert; plan for background indexing pipelines, not synchronous updates.
- 90–95% recall is the production sweet spot for most workloads — chasing 99% recall often doubles latency and memory with no user-visible quality gain.
- PII can be reconstructed from embeddings with inversion attacks — treat vector indexes as sensitive data, not just numeric arrays.
The Problem
Exact nearest-neighbor search over 100M 1536-dimensional vectors (the dimension of OpenAI's text-embedding-ada-002) requires comparing each query vector against every stored vector: roughly 150 billion floating-point multiplications per query (100M × 1536). At a realistic query rate of 1000 QPS, that's about 150 trillion multiplications per second, which no single machine can sustain. Without approximate indexing, semantic search, recommendation engines, and RAG retrieval all collapse at scale. The engineering challenge is configuring ANN indexes to hit latency SLOs (typically P99 < 50ms) while maintaining acceptable recall (≥90%) and keeping memory costs sane as the dataset grows to hundreds of millions of vectors.
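The brute-force cost is just the product of vector count and dimension, then multiplied by the query rate. A quick check of that arithmetic, using only the figures quoted above:

```python
# Back-of-the-envelope cost of exact (brute-force) nearest-neighbor search.
NUM_VECTORS = 100_000_000   # 100M stored vectors (from the text)
DIMS = 1536                 # embedding dimension (from the text)
QPS = 1000                  # query rate (from the text)

# One dot product per stored vector, one multiplication per dimension.
mults_per_query = NUM_VECTORS * DIMS
mults_per_second = mults_per_query * QPS

print(f"{mults_per_query:.3e} multiplications per query")
print(f"{mults_per_second:.3e} multiplications per second at {QPS} QPS")
```

This counts multiplications only; additions and memory bandwidth make the real cost higher.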
Core System Idea
Vector databases (Pinecone, Weaviate, Qdrant, Milvus, Chroma, pgvector) build specialized ANN indexes over high-dimensional embeddings. The two dominant index families are HNSW (Hierarchical Navigable Small World), a graph-based structure that navigates a multi-layer proximity graph at query time, and IVF (Inverted File Index), which clusters vectors into k centroids and searches only the nearest n_probe clusters. Both trade recall for speed. For memory-constrained deployments, product quantization (PQ) compresses each vector from full float32 (6KB for 1536 dims) down to 32–64 bytes by encoding sub-vectors with codebooks, with a typical recall drop of 3–8%. In distributed deployments, the index is sharded across nodes; a query coordinator fans out to all shards, collects top-k per shard, and merges results. Writes go through an ingest pipeline that builds or updates index segments asynchronously — synchronous writes at high QPS are prohibitively expensive for HNSW.
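To make the IVF idea concrete, here is a minimal pure-Python sketch of probing only the nearest `n_probe` clusters. It is an illustration under simplifying assumptions, not how any particular database implements it: the function names are hypothetical, centroids are sampled from the data rather than trained with k-means, and vectors are stored uncompressed.

```python
import math
import random

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(vec, centroids[c]))
        lists[nearest].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, n_probe, k):
    """Scan only the n_probe clusters whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))[:n_probe]
    candidates = [vid for c in probe for vid in lists[c]]
    candidates.sort(key=lambda vid: math.dist(query, vectors[vid]))
    return candidates[:k]

random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(2000)]
# Assumption: centroids are sampled from the data; a real index trains them with k-means.
centroids = random.sample(vectors, 16)
lists = build_ivf(vectors, centroids)

query = [random.random() for _ in range(8)]
approx = ivf_search(query, vectors, centroids, lists, n_probe=4, k=10)
# Probing every cluster degenerates to an exact scan.
exact = ivf_search(query, vectors, centroids, lists, n_probe=len(centroids), k=10)
print("recall@10 with n_probe=4:", len(set(approx) & set(exact)) / 10)
```

Raising `n_probe` scans more clusters, which moves recall toward the exact result at the cost of more distance computations.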
System Flow
Query coordinator fans out to shards and merges top-k results; ingest pipeline updates indexes asynchronously.
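The merge step at the coordinator can be sketched as a k-way merge of per-shard results. This assumes each shard returns `(distance, vector_id)` pairs already sorted by ascending distance; `merge_top_k` and the sample data are hypothetical.

```python
import heapq
import itertools

def merge_top_k(shard_results, k):
    """Merge per-shard top-k lists of (distance, vector_id) into a global top-k.

    Each shard list is assumed to be sorted by ascending distance.
    """
    merged = heapq.merge(*shard_results)       # lazy k-way merge of sorted lists
    return list(itertools.islice(merged, k))   # stop after the k best overall

shard_a = [(0.10, "a1"), (0.40, "a2"), (0.90, "a3")]
shard_b = [(0.05, "b1"), (0.50, "b2")]
shard_c = [(0.20, "c1"), (0.30, "c2")]

top3 = merge_top_k([shard_a, shard_b, shard_c], k=3)
print(top3)  # [(0.05, 'b1'), (0.1, 'a1'), (0.2, 'c1')]
```

Because every shard must return its own top-k for the merge to be correct, query latency is bounded by the slowest shard.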
Real-World Examples (Indicative)
Manages 10B+ image embeddings at 2048 dimensions. Uses HNSW sharded across hundreds of nodes with product quantization to compress each vector from 8KB to ~256 bytes, staying within RAM budget while hitting P99 < 100ms. Without PQ, storing 10B vectors at float32 would require 80TB of RAM — economically impossible.
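The memory figures in this example follow from simple arithmetic. The sketch below recomputes them from the stated vector count, dimension, and compressed size, using decimal terabytes:

```python
BYTES_PER_FLOAT32 = 4

num_vectors = 10_000_000_000     # 10B embeddings (from the text)
dims = 2048                      # dimensions per embedding (from the text)
pq_bytes_per_vector = 256        # compressed size per vector (from the text)

raw_bytes_per_vector = dims * BYTES_PER_FLOAT32               # 8192 bytes, i.e. 8 KB
raw_total_tb = num_vectors * raw_bytes_per_vector / 1e12      # uncompressed footprint
pq_total_tb = num_vectors * pq_bytes_per_vector / 1e12        # footprint with PQ codes
compression_ratio = raw_bytes_per_vector / pq_bytes_per_vector

print(f"raw: {raw_total_tb:.2f} TB, PQ: {pq_total_tb:.2f} TB, ratio: {compression_ratio:.0f}x")
```

These totals cover vector payloads only; graph links and codebooks add to the real footprint.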
Stores ~100M track and user embeddings at 256 dimensions. Uses a two-stage retrieval: ANN to get 500 candidates in ~5ms, then exact re-scoring with richer features to pick the final 20. Switched from Annoy to HNSW and observed a 3× improvement in P99 query latency at the same recall level.
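The two-stage pattern described here can be sketched as follows. This is an illustration, not the system's actual code: stage one uses an exact scan as a stand-in for an ANN index, and the `popularity` feature and its 0.1 weight are hypothetical.

```python
import heapq
import math
import random

def retrieve_candidates(query, embeddings, n_candidates):
    """Stage 1 stand-in: exact scan here; a real system would query an ANN index."""
    return heapq.nsmallest(
        n_candidates,
        range(len(embeddings)),
        key=lambda i: math.dist(query, embeddings[i]),
    )

def rescore(query, candidates, embeddings, popularity, final_k):
    """Stage 2: exact re-scoring of the candidates with an extra feature."""
    def score(i):
        similarity = -math.dist(query, embeddings[i])
        return similarity + 0.1 * popularity[i]   # hypothetical feature blend
    return heapq.nlargest(final_k, candidates, key=score)

random.seed(1)
embeddings = [[random.gauss(0, 1) for _ in range(16)] for _ in range(5000)]
popularity = [random.random() for _ in range(5000)]   # hypothetical per-item feature
query = [random.gauss(0, 1) for _ in range(16)]

candidates = retrieve_candidates(query, embeddings, n_candidates=500)
final = rescore(query, candidates, embeddings, popularity, final_k=20)
print(final)
```

The expensive scoring function runs on 500 items rather than the whole catalog, which is what keeps the second stage affordable.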
Uses pgvector (PostgreSQL extension) rather than a dedicated vector database. At their scale (millions of documents per tenant), IVFFlat indexes in Postgres are fast enough (10–30ms), avoiding the operational complexity of running a separate vector DB service. Trade-off: loses HNSW-level recall and has no native horizontal sharding — acceptable for their workload.
Anti-Patterns
Inserting vectors into an HNSW index on every API request blocks for 10–100ms per insert as the graph is relinked. At 1000 writes/sec, this saturates the index thread pool. Use a write buffer and background index segment merging (Qdrant and Milvus both support this natively).
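A minimal sketch of the write-buffer idea, assuming a hypothetical `index_batch_fn` callback that performs the expensive index update. In a real deployment `flush()` would be driven by a background worker or timer; it is called directly here to keep the example self-contained.

```python
import threading

class BufferedIndexWriter:
    """Collects inserts in memory and hands them to the index in batches."""

    def __init__(self, index_batch_fn):
        self._index_batch_fn = index_batch_fn   # hypothetical: does the slow index update
        self._buffer = []
        self._lock = threading.Lock()

    def add(self, vector_id, vector):
        """Request path: append to the buffer and return immediately."""
        with self._lock:
            self._buffer.append((vector_id, vector))

    def pending(self):
        """Number of vectors accepted but not yet indexed."""
        with self._lock:
            return len(self._buffer)

    def flush(self):
        """Background path: swap out the buffer, then index it outside the lock."""
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            self._index_batch_fn(batch)

indexed_batches = []
writer = BufferedIndexWriter(index_batch_fn=indexed_batches.append)
for i in range(3):
    writer.add(i, [0.0, 0.1 * i])
writer.flush()
print(len(indexed_batches), "batch of", len(indexed_batches[0]), "vectors")
```

Buffered vectors are invisible to queries until the next flush, so this design trades write visibility for request latency.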
A full HNSW rebuild on 100M vectors takes 2–6 hours on a 32-core machine. Teams that don't plan for this get caught when they need to change the embedding model — migration requires re-embedding everything and rebuilding from scratch.
Hardcoding VECTOR(1536) without an abstraction layer locks you to OpenAI's ada-002 dimension. Switching to a better model (e.g., 3072-dim text-embedding-3-large) requires a full schema migration and re-index.
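One way to avoid a hardcoded dimension is to derive the column type and index name from a single model descriptor. The `EmbeddingModel` class and its naming scheme below are hypothetical, shown only to illustrate the abstraction:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingModel:
    """Single source of truth for a model's name and output dimension."""
    name: str
    dims: int

    def column_type(self) -> str:
        return f"VECTOR({self.dims})"

    def index_name(self, base: str) -> str:
        # One index per model version, so old and new can coexist during migration.
        return f"{base}_{self.name.replace('-', '_')}_{self.dims}"

ADA_002 = EmbeddingModel("text-embedding-ada-002", 1536)
V3_LARGE = EmbeddingModel("text-embedding-3-large", 3072)

print(ADA_002.column_type())         # VECTOR(1536)
print(V3_LARGE.index_name("docs"))   # docs_text_embedding_3_large_3072
```

Keeping a separate index per model lets queries stay on the old index while the new one is backfilled, then switch over once re-embedding completes.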
Embedding inversion attacks can recover approximate original text from embeddings, especially for short strings. A leaked vector index is a data breach. Apply the same access controls and encryption you'd apply to the original documents.
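Tuning HNSW to hit 99% recall typically doubles query latency compared to 95% recall, because efSearch must be raised substantially. efSearch is a query-time parameter and does not change memory on its own, but reaching 99% often also requires a denser graph (higher M), which raises RAM usage. For most recommendation and semantic search workloads, the user cannot perceive the quality difference, so profile before over-tuning.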
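Profiling starts with measuring recall against exact results on a sample of queries. A minimal sketch of a recall@k helper, with made-up result lists for illustration:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Hypothetical result lists for one query: ids ordered by ascending distance.
exact_ids = [11, 42, 7, 93, 5, 60, 18, 77, 31, 2]
approx_ids = [11, 42, 7, 93, 5, 60, 18, 77, 99, 54]   # two true neighbors missed

print(recall_at_k(approx_ids, exact_ids, k=10))  # 0.8
```

Averaging this over a few hundred representative queries at each index setting, alongside measured latency, shows where additional recall stops being worth its cost.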
Design Tradeoffs
| Dimension | HNSW | IVF + PQ |
|---|---|---|
| Query latency | 1–5ms P99 | 5–20ms P99 |
| Memory per vector | ~6 KB raw (float32, 1536d) plus graph links | 32–64B (with quantization) |
| Index build time | Slow (hours at 100M vectors) | Fast (parallel k-means) |
| Update cost | High (graph re-link on insert) | Low (reassign to nearest centroid) |
| Recall at same speed | Higher | Lower (3–8% drop with PQ) |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Semantic similarity search over millions to billions of vectors at low latency | Exact key-value or primary key lookup — use a relational DB |
| RAG retrieval requires sub-50ms embedding search | Dataset is under 1M vectors — brute-force with pgvector or numpy is simpler |
| Recommendation or anomaly detection workloads where approximate results are acceptable | Every write must be immediately visible to queries — eventual consistency is not tolerable |
| Embedding model is expected to evolve and re-indexing must be manageable | Ultra-low memory budget where even PQ-compressed indexes don't fit |