LSM Trees: Why Write-Heavy Databases Abandon B-Trees
Core concept
A Log-Structured Merge-Tree (LSM Tree) optimizes for write throughput by never performing in-place updates — all writes land first in a fast, in-memory buffer (the memtable), then are flushed sequentially to immutable disk files called SSTables (Sorted String Tables). Because sequential disk writes are orders of magnitude faster than random I/O, LSM Trees dramatically outperform B-Trees on write-heavy workloads. A background compaction process periodically merges and garbage-collects SSTables, discarding stale versions and keeping the number of files manageable. This architecture is the engine behind Cassandra, RocksDB, LevelDB, and HBase.
Diagram
flowchart LR
A[Write Request] --> B[Memtable\nin-memory]
B -->|flush when full| C[SSTable L0\non disk]
C -->|compaction| D[SSTable L1\nmerged + sorted]
D -->|compaction| E[SSTable L2\nlarger, sorted]
F[Read Request] --> B
F --> C
F --> D
Concrete real-world example
Apache Cassandra uses LSM Trees to power write-heavy use cases like time-series event ingestion or IoT telemetry. When a sensor writes a reading, Cassandra appends it to the memtable instantly and acknowledges success — no disk seek required. During compaction, older SSTables at the same level are merged into a single, larger, sorted file at the next level, and tombstones (delete markers) are finally purged. This is why Cassandra can sustain hundreds of thousands of writes per second on commodity hardware.
One trade-off / gotcha
Reads are the Achilles' heel. To read a single key, the system must check the memtable and potentially every SSTable level (newest to oldest), because the key could live anywhere. Databases mitigate this with Bloom filters (a probabilistic data structure that quickly rules out SSTables that definitely don't contain the key), but reads still have higher and less predictable latency than B-Tree systems. Heavy compaction also competes with live traffic for I/O bandwidth — a phenomenon called write amplification — where one logical write triggers multiple physical writes across levels.
An interview-style question to ponder
A team is building a leaderboard feature that needs both extremely high write throughput (millions of score updates per second) and fast point reads (fetch a specific user's score in under 5ms). You propose using an LSM-Tree-backed store. Your interviewer pushes back: "Won't reads be too slow?" How do you address this concern while keeping the write performance advantage?
Stuck? Show a hint
Separate the two kinds of read: a point read (fetch one user's score) versus a range scan across many users. LSM's slow-read reputation comes mostly from one of them — work out which. Then ask whether a small per-file filter could let a point read skip the files that can't possibly hold the key.
Show answer
Keep the LSM store: its point reads stay fast because Bloom filters and caching cut the read amplification the interviewer is worried about.
- Why reads stay fast: a Bloom filter per SSTable probabilistically rules out the levels that can't contain a key, so a point read touches only the few levels that plausibly hold it instead of scanning them all.
- Extra leverage: a well-tuned tiered compaction keeps the SSTable count small, and an in-memory row cache (as used in Cassandra) absorbs repeated hot-user lookups entirely in RAM.
- But isn't LSM read-slow in general? Only for range scans across many users, where data spread across levels hurts. This workload is point reads (one user at a time), so that weakness doesn't apply.
- The practical close: pair the LSM store with generous Bloom filters and a row cache sized to your hot-user working set, then benchmark p99 read latency against the 5ms SLA before conceding it's too slow.
- Watch out: Bloom filters give false positives but never false negatives, so size them for a low false-positive rate — an undersized filter quietly drags you back into extra disk reads.