System Design · Consistency & Replication · 2026-07-04

Chain Replication: Linearizable Reads Without Overloading Your Leader

Core concept

In most replication setups, every read and write hits the same leader node, which becomes a bottleneck. Chain replication arranges replicas in a linear sequence — head, middle nodes, tail — where writes enter at the head and flow forward, and reads are served exclusively by the tail. Because a read only succeeds after a write has reached the tail, you automatically get linearizability (every read sees the most recently completed write, globally) without any extra coordination. The head and tail have clearly separated jobs, which makes the protocol surprisingly simple to reason about and audit.

flowchart LR
    Client -->|write| Head
    Head -->|forward| Mid
    Mid -->|forward| Tail
    Tail -->|ack to client| Client
    Client2[Client] -->|read| Tail

Concrete real-world example

Amazon's internal storage system for S3 (a large-scale cloud object store) used chain replication as the foundation for a variant called CRAQ (Chain Replication with Apportioned Queries — a version that also lets middle nodes serve reads). When you upload an object to S3, the write propagates down the chain; only after the tail acknowledges does the upload return success. Because every read goes to the tail — the node that received every committed write — you never serve a stale object to a reader, even under heavy traffic.

One trade-off / gotcha

Write latency is proportional to the length of the chain: a write must travel through every replica before the client gets an ack. A chain of five nodes adds four extra network hops to every write. This is the inverse of quorum systems (where writes touch a majority and return), so chain replication trades write latency for read simplicity — reads are free of coordination overhead because the tail always has the truth.

An interview-style question to ponder

Your system uses a three-node chain. The middle node crashes mid-write — the head has applied the write but the tail has not yet received it. How should the system recover, and which writes, if any, might be lost?

Stuck? Show a hint

Think about what the tail knows versus what the head knows, and which of those two sources a recovering system should trust as the authority on "committed."

Show answer

No writes that the tail has not yet acknowledged are considered committed; they can be safely dropped or replayed, so nothing the client was told succeeded is ever lost.

Commitment is defined by the tail, not the head. A write is only "done" when the tail receives it and sends the ack to the client. If the tail never got the write, the client never got an ack, so the client knows to retry.
On middle-node failure, the system removes that node from the chain, linking the head directly to the tail. The head then replays any writes it sent to the dead middle node but hasn't yet been acknowledged down the chain — the tail deduplicates them by sequence number.
The new shorter chain (head → tail) resumes immediately; in-flight writes are replayed, not lost, because the head keeps a small pending-writes buffer until the downstream node confirms receipt.
But why not just let the head serve reads during recovery, since it always has the freshest data? The head sees writes before the tail, meaning it holds writes that may not be committed yet. Reading from the head could expose a write that later gets rolled back if a downstream node fails, breaking linearizability.
Watch out: if the head itself crashes with pending writes, those writes were never acked to the client, so they're safe to discard — but you must promote the next node to head carefully, ensuring it doesn't serve half-applied state.