← System Design Data & Messaging Systems
System Design

Data Partitioning Strategies

Prevent full table scans on multi-terabyte datasets by partitioning tables into smaller, physical files.

TL;DR
  • Prevent full table scans on multi-terabyte datasets by partitioning tables into smaller, physical files.
  • Select partitioning keys that align with your query patterns to enable partition pruning.
  • Avoid partition skew by monitoring and adjusting boundaries to prevent single disks from running out of space.
  • Keep partition counts balanced to avoid metadata overhead and query planner degradation.

The Problem

As database tables grow into hundreds of millions of rows, query performance degrades exponentially. Even with proper indexing, B-Tree index depths increase, requiring more disk I/O operations per lookup. Maintenance tasks like rebuilding indexes, deleting old historical data, or running analytical queries require massive table locks and full table scans.

This leads to query timeouts, high CPU utilization, and operational gridlock where database administrators cannot safely modify schemas or purge expired data.

Core System Idea

Data partitioning is the process of physically dividing a single logical table into smaller, self-contained physical files (partitions) on disk, while presenting a single unified table interface to the application. The database engine automatically routes writes and reads to the appropriate partition based on a "partition key."

The primary performance benefit of partitioning is "partition pruning." When a query contains a filter on the partition key, the query optimizer analyzes the partition metadata and scans only the specific physical partitions that could contain the matching data, completely ignoring the rest of the table.

This keeps index sizes small, limits disk I/O, and allows operations like purging old data to be executed instantly by dropping an entire physical partition (e.g., DROP PARTITION) rather than running expensive, lock-heavy DELETE queries.

System Flow

flowchart TD A["Query: WHERE created_date = 2023-10-15"] --> B["Query Optimizer"] B -->|"Partition Pruning"| C{"Identify Target Partition"} C -->|"Match: Oct 2023"| D["Physical Partition: p2023_10"] C -->|"Ignore"| E["Physical Partition: p2023_09"] C -->|"Ignore"| F["Physical Partition: p2023_11"] D -->|"Scan Small Index"| G["Return Results"]

Query optimizer executing partition pruning to scan only the relevant physical partition file, bypassing unrelated data.

Real-World Examples Indicative

Uber trips table — monthly range partitions, zero-lock DROP

Uber's trips table (billions of rows) is range-partitioned by created_at month, with each monthly partition at ~200GB. Partition pruning means driver earnings reports (WHERE created_at >= '2023-10-01') scan only 1-2 partitions instead of the full table. Old partitions older than 3 years are dropped with ALTER TABLE trips DROP PARTITION — a metadata-only operation completing in milliseconds, replacing the previous approach of running DELETE queries that locked the table and took hours to complete at billion-row scale.

Shopify product catalog — list partitioning by shop_id bucket

Shopify's products table is list-partitioned by shop_id hash bucket (each bucket groups ~5,000 merchant shops). When a merchant queries their catalog (WHERE shop_id = 12345), the query optimizer resolves the bucket and scans only that partition's local B-Tree index (~1GB) rather than the full 50TB products table. During Black Friday 2022, Shopify partition-pruned 98%+ of disk I/O on catalog reads — a key enabler for sustaining 80K+ checkouts/min without adding horizontal database shards.

AWS Athena on S3 — Hive partition pruning across petabytes

Athena reads data from S3 using Hive-style directory partitioning: /year=2023/month=10/day=15/. A query filtering WHERE year=2023 AND month=10 lists only the October directory (~50 Parquet files) instead of scanning the full multi-petabyte dataset. This reduces query cost from $5/TB scanned to under $0.01 for date-ranged queries. Cloudflare uses this pattern to let customers query 90 days of edge access logs: each query scans only the requested date partitions, not the full 90-day corpus.

Anti-Patterns

Partitioning on High-Cardinality Columns

Partitioning a table by a unique identifier like user_id or transaction_id creates thousands of tiny partitions, bloating database metadata and destroying query planner performance.

Ignoring Partition Pruning in Queries

Writing queries that do not filter on the partition key forces the database engine to perform a "scatter-gather" scan across every single partition, which is slower than scanning an unpartitioned table.

Over-Partitioning

Creating daily partitions for a table that only receives a few thousand rows per day results in excessive file system overhead and inefficient disk space utilization.

Unbalanced Partition Boundaries (Skew)

Partitioning by a key that causes 90% of the data to land in a single partition (e.g., partitioning by status where 'active' is the dominant state) defeats the purpose of partitioning.

Design Tradeoffs

DimensionRange PartitioningHash Partitioning
Query optimizationIdeal for time-series and date-range queries; the optimizer prunes to a small set of partitions matching the filterInefficient for range queries; all partitions must be scanned since hash values are not contiguous
Write distributionHigh risk of write hotspots on the active range partition (e.g., today's date partition receives all new inserts)Distributes writes evenly across a fixed number of partitions using a hash function, minimizing hotspots
Data lifecycle managementSimple; drop an entire time-range partition to purge old data in milliseconds with no WAL overheadComplex; deleting aged data requires row-level DELETE queries across all hash partitions

Best Practices

Align Partition Keys with Query FiltersChoose a partition key that is present in almost all high-volume query WHERE clauses (typically a timestamp or tenant identifier).
Automate Partition CreationUse database extensions (like pg_partman in PostgreSQL) or scheduled cron jobs to pre-create future partitions before writes arrive.
Keep Partition Sizes UniformTarget a physical partition size of 10GB to 50GB for relational databases, and up to 100GB+ for analytical data warehouses.
Drop Partitions for Data RetentionInstead of running DELETE FROM table WHERE created_at < '2022-01-01', use ALTER TABLE table DROP PARTITION p2021 to instantly reclaim disk space without generating WAL overhead or table locks.
Use Local IndexesCreate indexes that are local to each partition rather than global indexes, ensuring that index maintenance operations are isolated and fast.

When to Use / Avoid

Use WhenAvoid When
Single tables exceed 100GB in size or contain tens of millions of rows.The total table size is under 10GB; indexes alone are highly efficient at this scale.
The dataset has a natural temporal lifecycle (e.g., logs, metrics, or transactional history that is archived after N days).The query patterns are highly dynamic and rarely filter on a consistent column.
You need to run heavy analytical queries alongside transactional writes without resource contention.The primary key must be globally unique and enforced across partitions without including the partition key.