Copy-on-Write B-Trees: Crash Safety Without a Write-Ahead Log
Core concept
A B-tree (a self-balancing tree structure that keeps sorted data on disk pages) normally updates pages in place, which means a crash mid-write can leave a page half-overwritten and the tree corrupt. Copy-on-write (CoW) B-trees sidestep this entirely: instead of modifying an existing page, you write a brand-new copy of that page with your changes, then walk up the tree updating parent pointers—all the way to a new root. The old pages are left untouched until they're no longer reachable. The result is that the tree is always in a valid state; either the old root is current or the new one is, and nothing in between is ever partially visible.
Diagram
flowchart TD
OldRoot["Old Root (v1)"] --> OldParent["Old Parent (v1)"]
OldParent --> OldLeaf["Old Leaf (v1)"]
OldParent --> StableLeaf["Unchanged Leaf"]
NewRoot["New Root (v2)"] --> NewParent["New Parent (v2)"]
NewParent --> NewLeaf["New Leaf (v2)"]
NewParent --> StableLeaf
Concrete real-world example
LMDB (Lightning Memory-Mapped Database, an embedded key-value store used inside OpenLDAP and many browsers) uses CoW B-trees as its entire durability strategy. When you commit a write transaction, LMDB atomically swaps a single pointer—the meta page—to point to the new root. Readers holding the old root keep reading their consistent snapshot without any locking. Crash at any point before the meta-page swap? The old tree is completely intact on disk. This is why LMDB can claim zero-copy reads and no write-ahead log at all.
One trade-off / gotcha
Every write now touches multiple pages—the changed leaf, every ancestor up to the root—even if you only changed one byte of data. On a balanced tree with a million entries, that's roughly 20 page writes per logical write (log base 200 of one million, for a typical branching factor). This write amplification is real, and it means CoW B-trees are poorly suited to high-throughput, write-heavy workloads where you'd prefer an LSM tree (a structure that batches writes in memory before flushing sequentially) instead.
An interview-style question to ponder
A candidate proposes using a CoW B-tree to back a leaderboard service that ingests 50,000 score updates per second from a mobile game. You need reads to always see a consistent snapshot and the system must survive crashes without data loss. Is CoW B-tree the right fit? What would you push back on, and what would you suggest instead?
Stuck? Show a hint
Focus on what "50,000 updates per second" means for the number of page writes hitting disk per second given CoW's amplification factor, then ask whether the read-snapshot benefit is actually special here or whether other approaches also provide it.
Show answer
A CoW B-tree is the wrong choice for this workload, and you should push back on it.
- At 50,000 score updates per second with a tree depth of ~4–5 levels, CoW generates roughly 200,000–250,000 page writes per second. That volume will saturate a typical SSD's random-write throughput (often 50,000–100,000 IOPS) before you've handled peak load, let alone left headroom for reads.
- The leaderboard is write-heavy, not read-heavy, so the free consistent snapshots CoW provides are mostly wasted value—a leaderboard reader tolerating a few seconds of staleness doesn't need strict snapshot isolation, just eventual consistency.
- An LSM-based store (batches writes in memory, flushes sorted runs sequentially) absorbs 50,000 writes/second comfortably because sequential I/O is 10–100× cheaper than random I/O; many databases in this category also support snapshot reads via multi-version concurrency control (each row stored with a version timestamp).
- But why not just add a write-ahead log to a standard in-place B-tree? That works too, but it means two writes per update (log entry + page), and you lose the lock-free reader benefit; an LSM tree at this write rate is still the cleaner fit.
- Watch out: if the leaderboard needs real-time exact ranking (not approximate), even an LSM solution requires careful index design—a sorted set structure (a data structure mapping scores to member IDs in sorted order) beats a plain key-value store for range rank queries.