← all lessons
System Design · Consistency & Replication ·

Sloppy Quorums: Staying Available When Your Cluster Partially Fails

Core concept
A normal quorum (minimum number of nodes that must agree before a read or write succeeds) guarantees strong consistency, but it also means your system becomes unavailable the moment too many of the designated nodes for a key go offline. A sloppy quorum relaxes this: instead of refusing a write, the system lets any available node in the cluster temporarily accept it, even if that node isn't in the key's normal "home" node set. Once the failed home nodes come back, those temporarily parked writes are forwarded to them in a process called hinted handoff (a node holds a write with a sticky note saying "deliver this when the real owner returns"). This trades strict consistency for continued availability during partial failures.

Diagram

flowchart LR
    Client -->|write key X| Coordinator
    Coordinator -->|home node down| NodeA
    Coordinator -->|home node down| NodeB
    Coordinator -->|available stand-in| NodeC
    NodeC -->|hinted handoff on recovery| NodeD
    NodeD -->|now holds key X| Done

Concrete real-world example
Amazon's Dynamo (a highly available key-value store built for shopping cart data) pioneered this pattern. If you add an item to your cart and one of the three home nodes for your user ID is down, Dynamo doesn't reject the write. A stand-in node accepts it, stores a hint, and pushes it to the correct node when it recovers — so the customer never sees an error, even during a partial data-center outage. The cost: if you read from a home node before the handoff completes, you may not see that cart item yet. Eventual consistency, not instant consistency.

One trade-off / gotcha
Hinted handoff can silently fail. If the stand-in node itself crashes before the home node recovers, the hint is lost and the write is gone — with no error ever surfaced to the client. This is why systems using sloppy quorums usually pair them with anti-entropy (a background process that compares and reconciles data across nodes using a Merkle tree — a hash tree where each parent hashes its children — to efficiently find divergence). Without anti-entropy, data loss during cascading failures is a real risk, not a theoretical one.

An interview-style question to ponder
You're building a ride-sharing service. Writes go to driver location updates (thousands per second, extremely write-heavy) and reads come from the dispatch system matching riders to nearby drivers. You're considering sloppy quorums. Under what failure conditions does a sloppy quorum help you, and under what conditions does it hurt you, specifically for this use case?

Stuck? Show a hint

Think about what the dispatch system needs to be correct about versus what it merely needs to be responsive about — and ask whether a slightly stale driver location is catastrophic or just mildly suboptimal.

Show answer

Sloppy quorums help you here in write availability but introduce meaningful read risk that you must explicitly design around.

  • Driver location updates are fire-and-forget at high volume: a write rejection under node failure is far more damaging than a briefly stale location. Sloppy quorums let writes land on stand-ins, keeping the firehose flowing.
  • On the read side, dispatch reading a driver location that's 2–3 seconds stale is almost always acceptable — a driver moves ~30 meters in 3 seconds at city speed, which doesn't break a match. The system is tolerant of bounded staleness, which is exactly the regime where sloppy quorums are safe.
  • Where it hurts: if you also store driver availability status (online/offline) in the same store and a hinted handoff lags, dispatch could route a ride to a driver who just went offline. That's a hard correctness failure, not a soft one. Availability-sensitive data (location) and correctness-sensitive data (status) should be in separate stores or handled with different consistency settings.
  • But why not just use strict quorums everywhere and accept the latency? At thousands of writes per second across a geo-distributed fleet, waiting for all home nodes to acknowledge adds tail latency that compounds — a single slow node can stall the entire write path. The operational cost of strict quorums at this write volume is high enough that you engineer around the consistency gap rather than paying for it on every write.
  • Watch out: hinted handoff without anti-entropy means a crashed stand-in node silently drops your driver's last known location — test your failure scenarios explicitly.