← System Design Data & Messaging Systems
System Design

Database Sharding

Scale writes horizontally by partitioning a single logical database across multiple physical database nodes.

TL;DR
  • Scale writes horizontally by partitioning a single logical database across multiple physical database nodes.
  • Select a high-cardinality shard key to prevent hotspot shards and ensure even data distribution.
  • Avoid cross-shard queries by designing schemas that allow most operations to execute on a single shard.
  • Use virtual sharding to minimize the operational cost and data movement during resharding events.

The Problem

When a relational database grows to terabytes of data, vertical scaling (adding CPU, RAM, and SSDs) hits a hard physical and financial ceiling. High write volumes exhaust disk I/O, lock contention degrades transaction performance, and index sizes exceed available memory, forcing slow disk reads.

Furthermore, operations like database backups, schema migrations, and index rebuilds take days to complete, creating massive operational risks and prolonged maintenance windows.

Core System Idea

Database sharding is a horizontal partitioning technique that splits a single logical database into multiple independent physical database instances (shards). Each shard contains a subset of the total data, and no shard has a complete view of the entire dataset.

To route queries, the system uses a deterministic "shard key" (e.g., user_id or tenant_id). An application routing layer or middleware hashes the shard key to determine which physical node holds the requested data.

By ensuring that writes and reads for a specific entity are routed to a single physical node, sharding allows write throughput and storage capacity to scale linearly with the number of nodes.

System Flow

flowchart TD A["Application Client"] -->|"Query: user_id=987"| B["Routing Layer"] B -->|"Hash: 987 mod 3 = 0"| C["Shard Node 0"] B -->|"Hash: 124 mod 3 = 1"| D["Shard Node 1"] B -->|"Hash: 555 mod 3 = 2"| E["Shard Node 2"] C -->|"Execute Local Query"| F[("Physical DB 0")] D -->|"Execute Local Query"| G[("Physical DB 1")] E -->|"Execute Local Query"| H[("Physical DB 2")]

Database routing layer hashing a shard key to distribute read and write queries across isolated physical database instances.

Real-World Examples Indicative

Vitess at YouTube — gh-ost online schema migrations

YouTube runs Vitess to manage horizontal MySQL sharding across thousands of nodes handling billions of video metadata rows. When a schema migration is needed (e.g., adding a column to the videos table), Vitess uses gh-ost to perform online migrations: gh-ost creates a shadow table, copies rows in chunks of 1,000, and applies binlog changes to the shadow table in parallel, then performs an atomic table swap. This allows YouTube to run schema changes on individual shards without any downtime or table locks — a migration that would take 8+ hours on a monolithic MySQL instance completes shard-by-shard in rolling 20-minute windows.

Instagram — shard ID baked into Snowflake ID

Instagram encodes the shard ID directly into each generated entity ID using a Snowflake-style layout: 41 bits timestamp, 13 bits shard ID, 10 bits sequence. This eliminates the need for a routing lookup table entirely — given any photo_id, the routing layer extracts bits 11–23 to determine the target shard in O(1) with zero network hops. Instagram operates 512 logical shards mapped across fewer physical PostgreSQL nodes; adding capacity only requires remapping logical shards to new physical nodes with no ID regeneration.

Shopify shard-by-shop_id — 80K+ checkouts/min on Black Friday

Shopify shards their MySQL cluster by shop_id, ensuring all orders, products, and inventory records for a merchant land on a single shard. This design allows the Black Friday checkout pipeline to sustain 80K+ checkouts/minute without cross-shard joins — an entire merchant's checkout flow, from cart lock to order confirmation, executes within a single physical database node. When a large merchant is projected to create a hot shard (e.g., a flash sale), Shopify pre-migrates that merchant's shard to a dedicated node cluster days before the event.

Anti-Patterns

Choosing a Low-Cardinality Shard Key

Sharding by a key with few unique values (e.g., country_code or status) creates massive hotspot shards that overwhelm individual physical nodes.

Executing Cross-Shard Joins

Performing queries that join tables across different physical shards forces the application or routing layer to pull massive datasets over the network, destroying performance.

Premature Sharding

Implementing sharding before optimizing indexes, introducing read replicas, or partitioning tables locally adds massive architectural complexity without immediate benefit.

Hardcoding Shard-to-Node Mappings

Mapping shard keys directly to physical IP addresses in application code makes resharding or node failover an operational nightmare.

Design Tradeoffs

DimensionHash-Based ShardingRange-Based Sharding
Write distributionDistributes writes uniformly across all shards via hashing, minimizing the risk of hotspot nodesHigh risk of write hotspots on the active shard (e.g., today's date shard receives all new writes)
Range query performanceRequires scatter-gather across all shards for range queries, with O(N shards) network fan-outGroups related data by range onto the same shard, allowing efficient range scans with a single node query
Resharding complexityMoving data requires rehashing a large percentage of keys unless consistent hashing is used to minimize churnSimple; split the active range shard at a boundary into two new shards with no key remapping needed

Best Practices

Select a High-Cardinality Shard KeyUse keys with millions of unique values (e.g., user_id, organization_id) to ensure uniform data and traffic distribution.
Implement Virtual Shards (Logical Shards)Map shard keys to a large number of logical shards (e.g., 1024), and map those logical shards to a smaller number of physical nodes (e.g., 4). This makes scaling out as simple as moving logical shards.
Denormalize Schemas to Avoid JoinsDuplicate lookup data or aggregate tables across shards so that each shard can execute its queries self-containedly.
Use a Distributed ID GeneratorAvoid auto-incrementing database IDs, as they collide across shards; use Snowflake IDs or UUIDs to guarantee global uniqueness.
Monitor Shard Size and I/O SkewSet up alerts to detect when a single shard's storage or CPU utilization deviates significantly from the cluster average.

When to Use / Avoid

Use WhenAvoid When
Write throughput exceeds the physical limits of the largest available single database instance.The dataset fits comfortably in the memory of a single large database instance.
Storage requirements grow beyond the capacity of a single physical disk array.The application relies heavily on complex, multi-table joins and distributed ACID transactions.
Multi-tenant applications where data can be naturally isolated by tenant_id.The query patterns are highly dynamic and cannot be mapped to a single shard key.