
Write-Ahead Logging (WAL): Durability Without Sacrificing Speed

- Core concept: A write-ahead log (WAL) is a durability mechanism in which every change is appended to a sequential, append-only log on disk before it is applied to the actual data structures in memory or on disk. Because sequential writes are much cheaper than random writes, this gives you crash safety without paying the full cost of flushing modified pages on every operation. On recovery after a crash, the database replays the logged changes of committed transactions that had not yet reached the data files, and discards or rolls back changes from transactions that never committed, which restores a consistent state. PostgreSQL, RocksDB, etcd, and Kafka all rely on variants of this pattern. In this model the log is the source of truth; the main data store is effectively a materialized view of it.

- Diagram

flowchart LR
    C[Client Write] --> W[WAL on Disk\nsequential append]
    W --> A[ACK to Client]
    W --> B[Apply to\nBuffer Pool]
    B --> D[Flush Pages\nto Disk async]
    R[Crash Recovery] -->|replay WAL| B
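
The flow in the diagram can be sketched as a toy key-value store. This is a minimal illustration, not how any of the databases named above implement their logs: the `TinyWAL` class, the JSON-lines record format, and the `demo` helper are all assumptions made for the example, and a real log would also checksum records and truncate a torn tail.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy key-value store: append to the log, fsync, then apply in memory."""

    def __init__(self, path):
        self.path = path
        self.data = {}       # in-memory state, rebuilt from the log on open
        self._replay()       # crash recovery: replay whatever the log holds
        self.log = open(path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break    # torn final record from a crash: stop replaying
                self.data[rec["key"]] = rec["value"]

    def put(self, key, value):
        rec = json.dumps({"key": key, "value": value})
        self.log.write(rec + "\n")    # 1. sequential append to the log
        self.log.flush()              # push Python's buffer to the OS
        os.fsync(self.log.fileno())   # 2. force the record to stable storage
        self.data[key] = value        # 3. only now apply the change
        # 4. once put() returns, the caller can be acknowledged

    def close(self):
        self.log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    store = TinyWAL(path)
    store.put("a", 1)
    store.put("b", 2)
    store.close()                 # in-memory state is dropped, as after a crash
    recovered = TinyWAL(path)     # reopening replays the log
    state = dict(recovered.data)
    recovered.close()
    return state
```

Calling `demo()` returns the same two keys that were written before the simulated restart, because the state is rebuilt purely from the log. The ordering inside `put` is the point of the sketch: the record is durable before the in-memory structure changes.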

- Concrete real-world example: In PostgreSQL, when you UPDATE a row, the change is recorded in a WAL record first, and that record is flushed to the current WAL segment file at commit. The dirty page in the shared buffer pool is written to the heap file later, by the checkpointer or the background writer. If the server crashes before that write, PostgreSQL's startup process replays WAL records from the last checkpoint forward, so no committed transaction is lost. The same log enables point-in-time recovery (PITR): archive WAL segments to storage such as S3, then restore a base backup and replay the archived WAL up to a chosen target time.
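
Replaying from a checkpoint and stopping at a target time can be sketched in a few lines. This is a simplified model, not PostgreSQL's recovery code: the `WalRecord` fields, the `replay` function, and the sample data are hypothetical, and the sketch assumes records are ordered by LSN with non-decreasing timestamps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int          # position in the log, strictly increasing
    timestamp: float  # commit time in seconds (hypothetical field)
    key: str
    value: int


def replay(snapshot, checkpoint_lsn, wal, target_time=None):
    """Rebuild state from a snapshot plus the WAL records that follow it.

    snapshot:    state as of checkpoint_lsn (stands in for a base backup)
    target_time: if given, stop before the first record newer than it
    """
    state = dict(snapshot)
    for rec in wal:
        if rec.lsn <= checkpoint_lsn:
            continue      # already reflected in the snapshot
        if target_time is not None and rec.timestamp > target_time:
            break         # point-in-time recovery stops here
        state[rec.key] = rec.value
    return state


# Sample log: the snapshot below was taken after LSN 2.
wal = [
    WalRecord(1, 10.0, "x", 1),
    WalRecord(2, 20.0, "y", 2),
    WalRecord(3, 30.0, "x", 3),
    WalRecord(4, 40.0, "y", 4),
]
snapshot = {"x": 1, "y": 2}

full = replay(snapshot, 2, wal)                    # crash recovery
pitr = replay(snapshot, 2, wal, target_time=35.0)  # stop at t = 35
```

Here `full` picks up both later records, while `pitr` applies only the record at LSN 3, so `y` keeps its snapshot value. Crash recovery and PITR differ only in where the replay starts and stops.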

- One trade-off / gotcha: WAL creates write amplification, because every byte of user data gets written at least twice: once to the log and once to the main data file. Under high write throughput, WAL segments can be generated faster than archiving or replication can consume them, creating lag that can break replication or fill your disk. Tuning max_wal_size, checkpoint_timeout, checkpoint_completion_target, and wal_buffers in Postgres, or the equivalent log and flush settings in other systems, is critical to balance durability, throughput, and storage costs. Also watch out: the more WAL that accumulates between checkpoints, the slower crash recovery becomes, since recovery must replay everything from the last checkpoint.
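
The disk-fill risk is simple arithmetic, sketched below. The function names and the sample rates are illustrative assumptions, not measurements from any real system.

```python
def write_amplification(log_bytes, data_file_bytes, user_bytes):
    """Ratio of bytes physically written to bytes the user asked to store."""
    return (log_bytes + data_file_bytes) / user_bytes


def seconds_until_wal_disk_full(wal_rate_mb_s, archive_rate_mb_s, free_disk_mb):
    """Seconds until unarchived WAL fills the disk, or None if archiving keeps up."""
    backlog_growth = wal_rate_mb_s - archive_rate_mb_s
    if backlog_growth <= 0:
        return None       # the backlog is not growing
    return free_disk_mb / backlog_growth


# Hypothetical workload: WAL generated at 50 MB/s, archived at 30 MB/s,
# with 100,000 MB of free space on the WAL volume.
amplification = write_amplification(100, 100, 100)
time_left = seconds_until_wal_disk_full(50, 30, 100_000)
```

With these numbers the backlog grows by 20 MB/s, so `time_left` is 5000 seconds, a little under an hour and a half. The best-case `amplification` of 2.0 assumes the log and the data file each receive exactly one copy of the user data; extra metadata and full-page images push it higher in practice.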

- An interview-style question to ponder: A write-heavy service is seeing high p99 latency spikes every few minutes. Monitoring shows the spikes correlate with checkpoint events in its database. What's happening, and what levers would you pull to smooth this out without sacrificing durability?

Hint

The expensive part of a checkpoint is writing out every dirty page accumulated since the previous one. Picture that as a sudden burst of random disk writes competing with your live traffic: that is your spike. Now think about two independent knobs: how often checkpoints fire, and whether their writes can be spread out rather than dumped all at once.

Answer

The spikes are checkpoint I/O storms. Smooth them by spreading the flush over time and firing checkpoints less often; neither change weakens durability.

  • What's actually happening: when a checkpoint is not paced, it writes every dirty page accumulated since the previous checkpoint to the data files in a short window. That burst of random write I/O contends with foreground traffic, and the periodic storm is your p99 spike.
  • Lever 1, pace the flush: raise checkpoint_completion_target toward 0.9 (already the default in PostgreSQL 14 and later) so the writes spread across most of the interval instead of being dumped in one go.
  • Lever 2, checkpoint less often: lengthen the interval via checkpoint_timeout and max_wal_size so that fewer checkpoints fire.
  • Supporting levers: tune the background writer so dirty pages trickle out continuously, and give WAL and data files separate devices with IOPS headroom so the flush never starves the log.
  • But doesn't smoothing the flush risk durability? No. With the default synchronous_commit = on, every commit still flushes its WAL record to durable storage before the client is acknowledged. You are only changing when data pages get written back, never whether a committed transaction is safe.
  • Watch out: don't stretch the interval so far that crash recovery — which replays from the last checkpoint forward — becomes unacceptably slow; it's a balance, not a free win.
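
The effect of pacing can be estimated with a back-of-the-envelope model. This is a deliberate simplification and not PostgreSQL's actual scheduler, which also paces against WAL volume; the function name and the sample numbers are assumptions for illustration.

```python
def checkpoint_write_rate(dirty_pages, checkpoint_timeout_s, completion_target):
    """Average pages per second a checkpoint must write when its flush is
    spread over completion_target * checkpoint_timeout_s seconds."""
    if not 0 < completion_target <= 1:
        raise ValueError("completion_target must be in (0, 1]")
    return dirty_pages / (checkpoint_timeout_s * completion_target)


# Hypothetical checkpoint: 90,000 dirty pages, 300-second interval.
bursty = checkpoint_write_rate(90_000, 300, 0.1)   # flush squeezed into ~30 s
paced = checkpoint_write_rate(90_000, 300, 0.9)    # flush spread over ~270 s
```

The same 90,000 pages cost roughly 3000 pages per second when squeezed into a tenth of the interval and roughly 333 pages per second when spread over nine tenths of it. The total work is unchanged; only the peak rate competing with foreground traffic drops.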