
Hinted Handoff: Remembering Writes for Nodes That Were Away

Core concept
When a target node is temporarily unreachable, instead of failing the write, a different healthy node volunteers to hold the data as a short-lived placeholder — a "hint" — until the original node recovers and can receive it. This keeps writes succeeding during partial outages without permanently rerouting data away from its owner. Once the downed node comes back, the hint-holder forwards the buffered writes and then discards its copy. It's a temporary post-office redirect, not a permanent change of address.
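
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not any particular database's implementation: the `Node` class and the `write_with_handoff` and `replay_hints` functions are hypothetical names, and the hint lives in memory only for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    up: bool = True
    data: dict = field(default_factory=dict)
    # Hints held on behalf of other nodes: (intended_owner, key, value).
    # Kept separate from `data` so the holder never serves them as its own.
    hints: list = field(default_factory=list)

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def write_with_handoff(target, fallback, key, value):
    """Try the owner first; if it is down, park a hint on the fallback node."""
    try:
        target.write(key, value)
        return "stored"
    except ConnectionError:
        if not fallback.up:
            raise  # nobody can take the hint, so the write really fails
        fallback.hints.append((target.name, key, value))
        return "hinted"


def replay_hints(holder, recovered):
    """Forward hints meant for `recovered`, then drop the holder's copies."""
    remaining = []
    delivered = 0
    for owner, key, value in holder.hints:
        if owner == recovered.name and recovered.up:
            recovered.write(key, value)
            delivered += 1
        else:
            remaining.append((owner, key, value))
    holder.hints = remaining
    return delivered


# Usage: A is down during the write, then comes back.
a = Node("A", up=False)
b = Node("B")
status = write_with_handoff(a, b, "cart:42", ["book"])
a.up = True
delivered = replay_hints(b, a)
```

Keeping hints in a list separate from `data` mirrors the "redirect, not change of address" idea: B stores the write for A but never treats it as its own.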

Diagram

flowchart LR
    C["Client Write"]
    A["Node A (target, down)"]
    B["Node B (hint-holder)"]
    S["Hint Store on B"]
    R["Node A Recovers"]

    C -->|"write fails to A"| B
    B --> S
    S -->|"A is back"| R
    R -->|"hint delivered"| A

Concrete real-world example
Dynamo (Amazon's internal key-value store) popularized this pattern. Imagine a shopping cart write during Black Friday when one of the three replica nodes briefly drops off the network. Rather than returning an error to the customer, Dynamo routes the write to a fourth node that stores it with a metadata tag saying "this really belongs to Node 3." When Node 3 rejoins — often in seconds or minutes — the hint-holder pushes the buffered write over and deletes its local copy. The customer never saw a failure, and Node 3 eventually becomes fully consistent.

One trade-off / gotcha
Hints are only as durable as the node holding them. If that node is lost before it can deliver a hint, the hint is gone, and if the hint was the only copy of the write (for example, when the system acknowledges a write that reached no real replica), the write itself is silently lost. A success acknowledgment therefore does not by itself mean the write is durable on its owners. Hints also consume disk on the holder, so most systems cap both how long hints are kept (e.g., no longer than a few hours) and how many hints a single node will buffer, which prevents unbounded disk growth during a prolonged outage.
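
The two limits mentioned above (a maximum hint age and a maximum hint count) might look like the following sketch. The `HintStore` class, its parameters, and the default values are assumptions chosen for illustration; real systems typically persist hints to disk rather than holding them in memory.

```python
import time
from collections import deque


class HintStore:
    """Bounded, expiring hint buffer (illustrative; in-memory only)."""

    def __init__(self, max_hints=1000, ttl_seconds=3 * 3600, clock=time.monotonic):
        self.max_hints = max_hints
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._hints = deque()  # (stored_at, owner, key, value), oldest first

    def _expire(self):
        # Drop hints from the front of the queue once they exceed the TTL.
        cutoff = self.clock() - self.ttl_seconds
        while self._hints and self._hints[0][0] <= cutoff:
            self._hints.popleft()

    def add(self, owner, key, value):
        """Return True if buffered, False if rejected because the store is full."""
        self._expire()
        if len(self._hints) >= self.max_hints:
            return False
        self._hints.append((self.clock(), owner, key, value))
        return True

    def drain(self, owner):
        """Remove and return the unexpired (key, value) hints for `owner`."""
        self._expire()
        ready = [(k, v) for _, o, k, v in self._hints if o == owner]
        self._hints = deque(h for h in self._hints if h[1] != owner)
        return ready


# Usage with a fake clock so the time limits are easy to observe.
now = [0.0]
store = HintStore(max_hints=2, ttl_seconds=10, clock=lambda: now[0])
accepted = [store.add("A", f"k{i}", i) for i in range(3)]  # third exceeds the cap
now[0] = 5.0
fresh = store.drain("A")  # both buffered hints are still within the TTL
store.add("A", "k9", 9)
now[0] = 20.0
stale = store.drain("A")  # the hint stored at t=5 has expired by t=20
```

A rejected or expired hint is exactly the case where the recovering node ends up missing a write and something other than hinted handoff has to repair it.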

An interview-style question to ponder
A system uses hinted handoff and accepts a write successfully. The original target node is down for 36 hours — far beyond the hint expiry window — and the hint is discarded. When the node finally comes back, how does the system detect and repair the now-missing write?

Hint

Think about what mechanisms continuously reconcile replica state rather than relying on a one-time delivery event — the key tension is "push when the event happens" vs. "pull by comparing state later."

Answer

The system falls back to anti-entropy repair (a background process that compares and syncs replica contents) to detect and fill the gap.

  • Hinted handoff is a best-effort, time-bounded shortcut for short outages; it is not a durability guarantee. Once the hint expires, the system has no record that a write was ever attempted for the recovering node.
  • Many Dynamo-style systems pair hinted handoff with an anti-entropy process built on Merkle trees (hash trees that let two nodes quickly find which data ranges differ). Replicas periodically compare root hashes; on a mismatch they descend into the child hashes, narrowing the search to the key ranges that diverged without comparing every record.
  • With a 36-hour outage and, say, 1 million missed writes, a full key-by-key comparison of the entire dataset would be prohibitively slow. A Merkle tree locates each divergent range in O(log N) hash comparisons, so the work scales roughly with the number of divergent ranges rather than with the dataset size, and the nodes then sync only the ranges containing the ~1 million affected keys.
  • But why not just replay the write-ahead log (append-only record of every write) from the time the node went down? Logs work well for short gaps but require the source node to have retained every write for 36 hours, which is expensive; anti-entropy works even when no log exists, by comparing current state rather than historical events.
  • Watch out: anti-entropy runs in the background on its own schedule (anywhere from minutes to days apart, depending on the system and how it is operated), so a just-recovered node can serve stale reads until a reconciliation pass catches up.