Storage Engine Internals
Choose LSM-Trees for write-heavy workloads to convert random disk writes into high-speed sequential I/O.
- Choose LSM-Trees for write-heavy workloads to convert random disk writes into high-speed sequential I/O.
- Select B-Trees for read-heavy workloads to achieve predictable, single-hop read latencies.
- Mitigate LSM-Tree read amplification by utilizing Bloom filters and tuning compaction strategies.
- Monitor write amplification to prevent premature SSD wear and I/O starvation during peak traffic.
The Problem
Using the wrong storage engine for your workload leads to severe performance degradation. If you run a high-throughput logging or telemetry ingestion system on a B-Tree storage engine, the random disk writes required to update leaf nodes will quickly saturate disk I/O, causing write queues to back up.
Conversely, if you run a low-latency, random-read application on an unoptimized LSM-Tree engine, the system must search multiple files on disk for a single key, causing massive read latency spikes and high CPU usage.
Core System Idea
At the lowest level, storage engines manage how data is structured in memory and on disk. The two dominant designs are B-Trees and Log-Structured Merge-Trees (LSM-Trees).
B-Trees organize data into fixed-size pages (typically 4KB to 16KB) structured as a balanced search tree. Updates are performed "in-place" by finding the target page, modifying it in memory, and writing it back to disk. This provides fast, predictable O(log N) reads.
LSM-Trees optimize for writes by avoiding in-place updates. Writes are appended sequentially to an in-memory "MemTable" and a Write-Ahead Log (WAL) for durability. When the MemTable fills up, it is flushed to disk as an immutable, sorted file called an "SSTable" (Sorted String Table).
Because SSTables are immutable, updating a key creates a new entry rather than overwriting the old one. To reclaim space and keep read performance acceptable, a background "compaction" process continuously merges and deduplicates SSTables.
System Flow
LSM-Tree write path (MemTable/WAL) and read path optimization using Bloom filters and background compaction.
Real-World Examples Indicative
Facebook replaced InnoDB with MyRocks (RocksDB) for their UDB (User Database) in 2017. The LSM-Tree structure converts InnoDB's random page writes into sequential SSTable appends — on their 15TB user database this cut write amplification from ~10x (InnoDB) to ~2.5x (RocksDB Leveled Compaction). Bloom filter false-positive rate was tuned to 1% with bloom_bits_per_key=10, eliminating unnecessary SSTable disk reads for missing keys. The migration saved 50% storage space and reduced P99 write latency, while freeing up ~50% of SSD I/O headroom that InnoDB's in-place page rewrites had previously consumed.
GitHub's primary MySQL cluster uses InnoDB B+ Trees for their repositories, commits, and pull_requests tables. InnoDB's buffer pool is configured at 75% of available RAM (~128GB on their database hosts), keeping hot B-Tree pages in memory and achieving P50 read latency under 1ms for repository metadata lookups. In-place B-Tree updates enable concurrent MVCC reads without read locks, supporting GitHub's 100M+ repository read workload without read-replica fan-out for most queries.
Netflix configures Leveled Compaction Strategy (LCS) on their viewing_history Cassandra table rather than the default Size-Tiered Compaction. LCS limits read amplification to ~1 SSTable per level, keeping P99 read latency for history lookups under 5ms even at 500M+ rows. sstable_size_in_mb=160 (4x the default) reduces compaction frequency during peak streaming hours (7–10 PM), preventing compaction I/O from competing with concurrent viewer queries on the same disk.
Anti-Patterns
Failing to configure Bloom filters forces the engine to perform physical disk I/O on multiple SSTables for every single read query that returns a miss.
Writing high-volume time-series or log data to a B-Tree causes massive page fragmentation and random I/O bottlenecks.
Allowing write rates to exceed compaction capacity causes SSTable counts to explode, leading to severe read degradation and eventual write stalls.
Allocating excessive memory to MemTables delays flushes, leading to extremely long recovery times when the database crashes and must replay the WAL.
Design Tradeoffs
| Dimension | LSM-Tree | B-Tree |
|---|---|---|
| Write throughput | High; writes are sequential appends to in-memory MemTable and WAL, avoiding random disk I/O | Lower; writes require random disk I/O to find and update pages in-place on every mutation |
| Read amplification | High; reads must check the MemTable and potentially multiple SSTable levels before finding the key | Low; reads traverse a single balanced tree structure to a specific page in O(log N) hops |
| Write amplification | High; background compaction continuously rewrites data to merge and deduplicate SSTable levels | Lower; pages are written to disk only when dirtied, with no background rewrite process |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| The workload is highly write-intensive (e.g., ingestion of metrics, logs, or user activity streams). | The workload requires predictable, ultra-low latency random reads with minimal variation. |
| Storage hardware consists of high-speed SSDs where sequential writes maximize lifespan. | The database must run on low-cost, high-latency spinning disks (HDDs) where random reads are extremely slow. |
| You need high data compression ratios; immutable SSTables compress much better than fragmented B-Tree pages. | The application requires heavy, complex range queries across highly dynamic datasets. |