← System Design Data & Messaging Systems
System Design

Storage Engine Internals

Choose LSM-Trees for write-heavy workloads to convert random disk writes into high-speed sequential I/O.

TL;DR
  • Choose LSM-Trees for write-heavy workloads to convert random disk writes into high-speed sequential I/O.
  • Select B-Trees for read-heavy workloads to achieve predictable, single-hop read latencies.
  • Mitigate LSM-Tree read amplification by utilizing Bloom filters and tuning compaction strategies.
  • Monitor write amplification to prevent premature SSD wear and I/O starvation during peak traffic.

The Problem

Using the wrong storage engine for your workload leads to severe performance degradation. If you run a high-throughput logging or telemetry ingestion system on a B-Tree storage engine, the random disk writes required to update leaf nodes will quickly saturate disk I/O, causing write queues to back up.

Conversely, if you run a low-latency, random-read application on an unoptimized LSM-Tree engine, the system must search multiple files on disk for a single key, causing massive read latency spikes and high CPU usage.

Core System Idea

At the lowest level, storage engines manage how data is structured in memory and on disk. The two dominant designs are B-Trees and Log-Structured Merge-Trees (LSM-Trees).

B-Trees organize data into fixed-size pages (typically 4KB to 16KB) structured as a balanced search tree. Updates are performed "in-place" by finding the target page, modifying it in memory, and writing it back to disk. This provides fast, predictable O(log N) reads.

LSM-Trees optimize for writes by avoiding in-place updates. Writes are appended sequentially to an in-memory "MemTable" and a Write-Ahead Log (WAL) for durability. When the MemTable fills up, it is flushed to disk as an immutable, sorted file called an "SSTable" (Sorted String Table).

Because SSTables are immutable, updating a key creates a new entry rather than overwriting the old one. To reclaim space and keep read performance acceptable, a background "compaction" process continuously merges and deduplicates SSTables.

System Flow

flowchart TD A["Write Request"] -->|"Append"| B["Write-Ahead Log"] A -->|"Insert"| C["In-Memory MemTable"] C -->|"Flush when full"| D["L0 SSTable on Disk"] D -->|"Compaction Process"| E["L1 SSTable on Disk"] F["Read Request"] -->|"Check"| G{"Bloom Filter"} G -->|"Maybe Present"| H["Search SSTables"] G -->|"Not Present"| I["Return Null"]

LSM-Tree write path (MemTable/WAL) and read path optimization using Bloom filters and background compaction.

Real-World Examples Indicative

Facebook MyRocks — 50% storage reduction vs InnoDB

Facebook replaced InnoDB with MyRocks (RocksDB) for their UDB (User Database) in 2017. The LSM-Tree structure converts InnoDB's random page writes into sequential SSTable appends — on their 15TB user database this cut write amplification from ~10x (InnoDB) to ~2.5x (RocksDB Leveled Compaction). Bloom filter false-positive rate was tuned to 1% with bloom_bits_per_key=10, eliminating unnecessary SSTable disk reads for missing keys. The migration saved 50% storage space and reduced P99 write latency, while freeing up ~50% of SSD I/O headroom that InnoDB's in-place page rewrites had previously consumed.

InnoDB at GitHub — 128GB buffer pool for B-Tree reads

GitHub's primary MySQL cluster uses InnoDB B+ Trees for their repositories, commits, and pull_requests tables. InnoDB's buffer pool is configured at 75% of available RAM (~128GB on their database hosts), keeping hot B-Tree pages in memory and achieving P50 read latency under 1ms for repository metadata lookups. In-place B-Tree updates enable concurrent MVCC reads without read locks, supporting GitHub's 100M+ repository read workload without read-replica fan-out for most queries.

Netflix Cassandra — Leveled Compaction at 500M rows

Netflix configures Leveled Compaction Strategy (LCS) on their viewing_history Cassandra table rather than the default Size-Tiered Compaction. LCS limits read amplification to ~1 SSTable per level, keeping P99 read latency for history lookups under 5ms even at 500M+ rows. sstable_size_in_mb=160 (4x the default) reduces compaction frequency during peak streaming hours (7–10 PM), preventing compaction I/O from competing with concurrent viewer queries on the same disk.

Anti-Patterns

Running LSM-Trees Without Bloom Filters

Failing to configure Bloom filters forces the engine to perform physical disk I/O on multiple SSTables for every single read query that returns a miss.

Using B-Trees for High-Frequency Append-Only Data

Writing high-volume time-series or log data to a B-Tree causes massive page fragmentation and random I/O bottlenecks.

Ignoring Compaction Starvation

Allowing write rates to exceed compaction capacity causes SSTable counts to explode, leading to severe read degradation and eventual write stalls.

Sizing MemTables Too Large

Allocating excessive memory to MemTables delays flushes, leading to extremely long recovery times when the database crashes and must replay the WAL.

Design Tradeoffs

DimensionLSM-TreeB-Tree
Write throughputHigh; writes are sequential appends to in-memory MemTable and WAL, avoiding random disk I/OLower; writes require random disk I/O to find and update pages in-place on every mutation
Read amplificationHigh; reads must check the MemTable and potentially multiple SSTable levels before finding the keyLow; reads traverse a single balanced tree structure to a specific page in O(log N) hops
Write amplificationHigh; background compaction continuously rewrites data to merge and deduplicate SSTable levelsLower; pages are written to disk only when dirtied, with no background rewrite process

Best Practices

Use Bloom Filters for LSM ReadsAlways enable Bloom filters on LSM-Tree engines to bypass SSTable disk reads for non-existent keys.
Tune Compaction StrategiesUse Leveled Compaction for read-heavy workloads to minimize read amplification, and Size-Tiered Compaction for write-heavy workloads to minimize write amplification.
Isolate WAL DisksPlace the Write-Ahead Log on a separate physical SSD from the SSTables to prevent compaction I/O from blocking write-ahead logging.
Align Page Sizes with HardwareFor B-Trees, match the database page size (e.g., 4KB or 8KB) to the underlying file system block size and SSD page size to prevent partial-page write overhead.
Monitor Disk Space OverheadEnsure you have at least 50% free disk space when using Size-Tiered Compaction, as merging large SSTables requires temporary duplicate storage.

When to Use / Avoid

Use WhenAvoid When
The workload is highly write-intensive (e.g., ingestion of metrics, logs, or user activity streams).The workload requires predictable, ultra-low latency random reads with minimal variation.
Storage hardware consists of high-speed SSDs where sequential writes maximize lifespan.The database must run on low-cost, high-latency spinning disks (HDDs) where random reads are extremely slow.
You need high data compression ratios; immutable SSTables compress much better than fragmented B-Tree pages.The application requires heavy, complex range queries across highly dynamic datasets.