Distributed Search Architecture
Distribute search workloads horizontally by partitioning inverted indexes into isolated, routable shards.
- Distribute search workloads horizontally by partitioning inverted indexes into isolated, routable shards.
- Prevent search degradation under high write pressure by tuning the index refresh interval.
- Mitigate write amplification during bulk indexing by optimizing segment merge policies and disabling replica shards.
- Ensure consistent relevance scoring across shards by routing related documents to the same shard using custom routing keys.
The Problem
Relational databases are fundamentally unsuited for full-text search. Queries utilizing LIKE '%keyword%' force full table scans, bypass indexes, and cause database CPU exhaustion.
However, scaling a dedicated distributed search engine (like Elasticsearch) introduces severe operational challenges: heavy bulk-indexing operations saturate disk I/O, causing search queries to lag or time out.
Furthermore, if shards are misconfigured, search relevance scores (TF-IDF/BM25) drift across shards, leading to inconsistent and inaccurate search results for the end user.
Core System Idea
Distributed search engines rely on the "Inverted Index" structure, which maps words (tokens) to the documents containing them. To scale, this index is split into multiple physical partitions called "shards." Each shard is a self-contained Lucene index.
The architecture operates on a two-phase query execution model: 1. Query Phase: The client sends a search request to a coordinating node. The coordinator broadcasts the query to a copy (primary or replica) of every shard in the index. Each shard executes the search locally, generating a priority queue of matching document IDs sorted by relevance score, and returns this metadata to the coordinator. 2. Fetch Phase: The coordinator merges the results, performs global re-ranking, and requests the actual document content (the source) only for the top hits (e.g., the top 10 results) from the specific shards holding them.
To balance write throughput and search freshness, the engine uses a "Refresh" mechanism. Newly indexed documents are written to an in-memory buffer. Periodically (e.g., every 1 second), this buffer is flushed to the OS filesystem cache as a new read-only segment, making the documents searchable.
These segments are later merged in the background to control file count and reclaim space from deleted documents.
System Flow
Two-phase (Query and Fetch) distributed search execution across partitioned index shards.
Real-World Examples Indicative
Uber's trip search index is split across 24 primary shards with 1 replica each. During surge pricing evaluation, coordinating nodes broadcast a geo_distance query to all 24 shards simultaneously; each shard returns its top-20 matching driver locations in under 5ms (P99). The coordinator merges 480 candidates and returns the top 10 globally in under 20ms end-to-end. Uber sets index.refresh_interval=30s during bulk historical trip re-indexing after schema migrations, then reverts to 1s for live production indexes to restore real-time search freshness without interrupting the indexing pipeline.
GitHub's code search index routes all files in a repository to the same shard using repository_id as the routing key. A repository-scoped query (repo:torvalds/linux path:kernel) hits 1-3 shards instead of scatter-gathering across all 20 primary shards — reducing average query fan-out by ~10x. During the 2023 migration to their new Zoekt-based engine, GitHub temporarily set number_of_replicas=0 during 6 weeks of bulk re-indexing of 8M+ repositories, then re-enabled replicas post-migration to restore search HA without incurring redundant indexing work during the data load.
Stripe integrates Algolia for instant-search in their Dashboard (payments, customers, products). Algolia's proprietary storage format pre-computes ranking at index time, eliminating the two-phase coordinator fan-out used by Elasticsearch. Stripe indexes 50M+ payment objects and configures searchableAttributes to prioritize payment_intent_id, customer.email, and amount — limiting inverted index depth to reduce per-query I/O. Index freshness SLA is under 30 seconds from Stripe's CDC pipeline to Algolia's index, acceptable for Dashboard lookups where users are not searching for events in the last 30 seconds.
Anti-Patterns
Leaving the refresh interval at 1 second during massive data ingestion creates thousands of tiny Lucene segments, triggering aggressive background merging that starves disk I/O.
Requesting page 1000 of a search query (e.g., from: 10000, size: 10) forces the coordinating node to fetch, merge, and sort 10,010 documents from every single shard, leading to out-of-memory (OOM) crashes.
Failing to disable shard allocation during node maintenance or network blips causes massive data replication traffic across the network, degrading search performance.
Treating Elasticsearch as the source of truth for transactional data is highly risky, as it prioritizes search performance over strict ACID durability.
Design Tradeoffs
| Dimension | High Refresh Frequency (1s) | Low Refresh Frequency (30s) |
|---|---|---|
| Search freshness | Near real-time; documents are searchable within 1 second of being indexed | Delayed; newly indexed documents remain invisible to search queries for up to 30 seconds |
| Write throughput | Low; constant segment creation forces frequent background merges, consuming disk I/O that competes with queries | High; documents are batched into fewer, larger segments — fewer merge cycles free up I/O for bulk indexing |
| Shard routing strategy | Default hash routing distributes documents uniformly, preventing hotspots but requiring all-shard fan-out for every query | Custom routing by tenant or entity ID collapses most queries to 1-3 shards, reducing coordinator overhead 10x |
Best Practices
index.refresh_interval to 30s or disable it entirely (-1) during large data migrations to maximize ingestion speed.routing=tenant_id) so that search queries only need to target a single physical shard.number_of_replicas: 0) during bulk indexing, then enable them afterward to avoid redundant indexing work.from and size for deep pagination; use the search_after API (which uses a cursor based on sort values) to retrieve subsequent pages efficiently.When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Implementing complex full-text search, auto-complete, fuzzy matching, or synonym search. | Performing simple key-value lookups or retrieving records by primary key (use Redis or DynamoDB instead). |
| Aggregating and analyzing massive volumes of semi-structured log data (e.g., ELK stack). | Storing highly relational data that requires frequent multi-table joins and strict ACID transactions. |
| Building multi-faceted e-commerce product catalogs with dynamic filtering and ranking. | The application requires immediate, guaranteed read-your-writes consistency across all updates. |