← System Design Data & Messaging Systems
System Design

Distributed Search Architecture

Distribute search workloads horizontally by partitioning inverted indexes into isolated, routable shards.

TL;DR
  • Distribute search workloads horizontally by partitioning inverted indexes into isolated, routable shards.
  • Prevent search degradation under high write pressure by tuning the index refresh interval.
  • Mitigate write amplification during bulk indexing by optimizing segment merge policies and disabling replica shards.
  • Ensure consistent relevance scoring across shards by routing related documents to the same shard using custom routing keys.

The Problem

Relational databases are fundamentally unsuited for full-text search. Queries utilizing LIKE '%keyword%' force full table scans, bypass indexes, and cause database CPU exhaustion.

However, scaling a dedicated distributed search engine (like Elasticsearch) introduces severe operational challenges: heavy bulk-indexing operations saturate disk I/O, causing search queries to lag or time out.

Furthermore, if shards are misconfigured, search relevance scores (TF-IDF/BM25) drift across shards, leading to inconsistent and inaccurate search results for the end user.

Core System Idea

Distributed search engines rely on the "Inverted Index" structure, which maps words (tokens) to the documents containing them. To scale, this index is split into multiple physical partitions called "shards." Each shard is a self-contained Lucene index.

The architecture operates on a two-phase query execution model: 1. Query Phase: The client sends a search request to a coordinating node. The coordinator broadcasts the query to a copy (primary or replica) of every shard in the index. Each shard executes the search locally, generating a priority queue of matching document IDs sorted by relevance score, and returns this metadata to the coordinator. 2. Fetch Phase: The coordinator merges the results, performs global re-ranking, and requests the actual document content (the source) only for the top hits (e.g., the top 10 results) from the specific shards holding them.

To balance write throughput and search freshness, the engine uses a "Refresh" mechanism. Newly indexed documents are written to an in-memory buffer. Periodically (e.g., every 1 second), this buffer is flushed to the OS filesystem cache as a new read-only segment, making the documents searchable.

These segments are later merged in the background to control file count and reclaim space from deleted documents.

System Flow

flowchart TD A["Search Query"] --> B["Coordinating Node"] B -->|"1. Broadcast Query"| C["Shard 1"] B -->|"1. Broadcast Query"| D["Shard 2"] C -->|"2. Return Doc IDs and Scores"| B D -->|"2. Return Doc IDs and Scores"| B B -->|"3. Merge and Rank"| B B -->|"4. Fetch Top 10 Docs"| C C -->|"5. Return Full Docs"| B B -->|"6. Return Search Results"| A

Two-phase (Query and Fetch) distributed search execution across partitioned index shards.

Real-World Examples Indicative

Elasticsearch at Uber — 24-shard geo search under 20ms

Uber's trip search index is split across 24 primary shards with 1 replica each. During surge pricing evaluation, coordinating nodes broadcast a geo_distance query to all 24 shards simultaneously; each shard returns its top-20 matching driver locations in under 5ms (P99). The coordinator merges 480 candidates and returns the top 10 globally in under 20ms end-to-end. Uber sets index.refresh_interval=30s during bulk historical trip re-indexing after schema migrations, then reverts to 1s for live production indexes to restore real-time search freshness without interrupting the indexing pipeline.

GitHub code search — custom routing by repository_id cuts fan-out 10x

GitHub's code search index routes all files in a repository to the same shard using repository_id as the routing key. A repository-scoped query (repo:torvalds/linux path:kernel) hits 1-3 shards instead of scatter-gathering across all 20 primary shards — reducing average query fan-out by ~10x. During the 2023 migration to their new Zoekt-based engine, GitHub temporarily set number_of_replicas=0 during 6 weeks of bulk re-indexing of 8M+ repositories, then re-enabled replicas post-migration to restore search HA without incurring redundant indexing work during the data load.

Algolia at Stripe — 50M payment objects with 30s index freshness SLA

Stripe integrates Algolia for instant-search in their Dashboard (payments, customers, products). Algolia's proprietary storage format pre-computes ranking at index time, eliminating the two-phase coordinator fan-out used by Elasticsearch. Stripe indexes 50M+ payment objects and configures searchableAttributes to prioritize payment_intent_id, customer.email, and amount — limiting inverted index depth to reduce per-query I/O. Index freshness SLA is under 30 seconds from Stripe's CDC pipeline to Algolia's index, acceptable for Dashboard lookups where users are not searching for events in the last 30 seconds.

Anti-Patterns

Keeping the Default 1-Second Refresh Interval During Bulk Loads

Leaving the refresh interval at 1 second during massive data ingestion creates thousands of tiny Lucene segments, triggering aggressive background merging that starves disk I/O.

Unbounded Deep Pagination

Requesting page 1000 of a search query (e.g., from: 10000, size: 10) forces the coordinating node to fetch, merge, and sort 10,010 documents from every single shard, leading to out-of-memory (OOM) crashes.

Allowing Automatic Shard Rebalancing During Peak Traffic

Failing to disable shard allocation during node maintenance or network blips causes massive data replication traffic across the network, degrading search performance.

Using Search Engines as the Primary System of Record

Treating Elasticsearch as the source of truth for transactional data is highly risky, as it prioritizes search performance over strict ACID durability.

Design Tradeoffs

DimensionHigh Refresh Frequency (1s)Low Refresh Frequency (30s)
Search freshnessNear real-time; documents are searchable within 1 second of being indexedDelayed; newly indexed documents remain invisible to search queries for up to 30 seconds
Write throughputLow; constant segment creation forces frequent background merges, consuming disk I/O that competes with queriesHigh; documents are batched into fewer, larger segments — fewer merge cycles free up I/O for bulk indexing
Shard routing strategyDefault hash routing distributes documents uniformly, preventing hotspots but requiring all-shard fan-out for every queryCustom routing by tenant or entity ID collapses most queries to 1-3 shards, reducing coordinator overhead 10x

Best Practices

Increase Refresh Interval for Bulk IndexingSet index.refresh_interval to 30s or disable it entirely (-1) during large data migrations to maximize ingestion speed.
Use Custom Routing for Multi-Tenant AppsRoute documents using a tenant identifier (e.g., routing=tenant_id) so that search queries only need to target a single physical shard.
Disable Replicas During Initial Data LoadsTurn off replica shards (number_of_replicas: 0) during bulk indexing, then enable them afterward to avoid redundant indexing work.
Implement Search-After for Deep PaginationAvoid using from and size for deep pagination; use the search_after API (which uses a cursor based on sort values) to retrieve subsequent pages efficiently.
Monitor Segment CountsTrack the number of Lucene segments per shard; an excessively high segment count indicates compaction/merge starvation.

When to Use / Avoid

Use WhenAvoid When
Implementing complex full-text search, auto-complete, fuzzy matching, or synonym search.Performing simple key-value lookups or retrieving records by primary key (use Redis or DynamoDB instead).
Aggregating and analyzing massive volumes of semi-structured log data (e.g., ELK stack).Storing highly relational data that requires frequent multi-table joins and strict ACID transactions.
Building multi-faceted e-commerce product catalogs with dynamic filtering and ranking.The application requires immediate, guaranteed read-your-writes consistency across all updates.