Fencing Tokens: Preventing Split-Brain Writes When a Lock Holder Comes Back Late

Core concept
When a distributed system grants a lock (exclusive permission to write) to a node, that node might pause because of garbage collection, a network hiccup, or a slow disk. By the time it resumes, the lock may have expired and been granted to someone else. Without a mechanism to detect this, two nodes can both believe they hold the lock and write to shared state at the same time, corrupting it. This situation is called split-brain. A fencing token solves this: every time a lock is granted, the lock service issues a monotonically increasing integer (a number that only ever goes up) alongside it, and the holder attaches that token to every write. The storage system rejects any write whose token is lower than the highest token it has already seen, making stale lock holders harmless rather than dangerous.

flowchart LR
    A[Lock Service] -->|token=42| B[Node A gets lock]
    A -->|token=43| C[Node B gets lock]
    B -->|write + token=42| D[Storage]
    C -->|write + token=43| D
    D -->|reject token=42| B
    D -->|accept token=43| C
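
The storage-side check can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `FencedStore` and `StaleTokenError` are hypothetical names, and a real store would have to perform the comparison and the write atomically.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStore:
    """Toy in-memory store that checks a fencing token on every write."""

    def __init__(self):
        self.highest_token = -1  # highest token seen so far
        self.data = {}

    def write(self, key, value, token):
        # Reject anything older than the newest token already seen.
        if token < self.highest_token:
            raise StaleTokenError(
                f"token {token} < highest seen {self.highest_token}"
            )
        self.highest_token = token
        self.data[key] = value


store = FencedStore()
store.write("record", "from node B", token=43)  # accepted

try:
    store.write("record", "from node A", token=42)  # stale holder
except StaleTokenError as err:
    print("rejected:", err)  # rejected: token 42 < highest seen 43

print(store.data["record"])  # from node B
```

Note that the store never needs to know who holds the lock; comparing integers is enough.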

Concrete real-world example
Imagine a payment processor where exactly one worker node should write a "charge approved" record. Worker 1 acquires the lock and gets token 100, then pauses for 35 seconds during a Java garbage collection (automatic memory cleanup that can freeze a process). The lock expires after 30 seconds; Worker 2 acquires the lock, gets token 101, and writes its record. Worker 1 wakes up and tries to write with token 100. The database has already seen 101, so it rejects Worker 1's write. The customer is charged exactly once.
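
The timeline can be replayed with a toy lease service driven by explicit timestamps. Everything here is an assumption made for illustration: `LockService`, `Database`, and the `now` parameter are hypothetical, and the times are simulated values rather than real clock readings.

```python
import itertools


class LockService:
    """Toy lease-based lock service driven by an explicit clock value."""

    def __init__(self, ttl_seconds, first_token=100):
        self.ttl = ttl_seconds
        self.tokens = itertools.count(first_token)  # monotonically increasing
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, worker, now):
        # Grant only if nobody holds the lock or the lease has expired.
        if self.holder is not None and now < self.expires_at:
            return None
        self.holder = worker
        self.expires_at = now + self.ttl
        return next(self.tokens)


class Database:
    """Toy database that enforces the fencing check."""

    def __init__(self):
        self.highest_token = -1
        self.records = []

    def write(self, record, token):
        if token < self.highest_token:
            return False  # stale token: rejected
        self.highest_token = token
        self.records.append(record)
        return True


locks = LockService(ttl_seconds=30)
db = Database()

token_1 = locks.acquire("worker-1", now=0)    # 100
blocked = locks.acquire("worker-2", now=10)   # None, lease still valid
# worker-1 is frozen in GC from t=0 to t=35; its lease lapses at t=30.
token_2 = locks.acquire("worker-2", now=31)   # 101
ok_2 = db.write("charge approved (worker-2)", token_2)  # True
ok_1 = db.write("charge approved (worker-1)", token_1)  # False, at t=35

print(token_1, token_2, blocked, ok_2, ok_1)  # 100 101 None True False
print(db.records)  # ['charge approved (worker-2)']
```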

One trade-off / gotcha
The storage layer itself must enforce the token check — the client cannot be trusted to self-police. If you're writing to a third-party service (like a cloud blob store) that has no concept of fencing tokens, this pattern cannot protect you. Leases plus idempotency keys (unique IDs per operation that deduplicate retries) are your fallback, but they guard against duplicates, not against a stale write that carries genuinely different data.
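
The limit of the fallback is easy to see in a small sketch. `IdempotentStore` and the key names are hypothetical; the store deduplicates by idempotency key but performs no fencing check.

```python
class IdempotentStore:
    """Toy store that deduplicates by idempotency key but has no fencing."""

    def __init__(self):
        self.seen_keys = set()
        self.value = None

    def write(self, idempotency_key, value):
        if idempotency_key in self.seen_keys:
            return "duplicate ignored"
        self.seen_keys.add(idempotency_key)
        self.value = value
        return "applied"


store = IdempotentStore()

# The new lock holder writes fresh data.
first = store.write("op-new-holder", "balance=50")
# A retry of the same operation is safely deduplicated.
retry = store.write("op-new-holder", "balance=50")
# A stale holder wakes up with its own key and older data: it still lands.
stale = store.write("op-stale-holder", "balance=80")

print(first, "|", retry, "|", stale)  # applied | duplicate ignored | applied
print(store.value)  # balance=80
```

The retry is caught, but the stale write overwrites newer data because nothing tells the store which writer is current.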

An interview-style question to ponder
A team argues: "We don't need fencing tokens. We use a 10-second lock TTL (time-to-live, the lock's expiration timeout) and our GC pauses never exceed 2 seconds, so a holder can never be paused long enough for its lock to expire underneath it." Is this reasoning sound, and would you accept this system in production?

Hint

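Think about what a TTL actually guarantees: it bounds how long the lock service honors the lock, not how long the holder can stall. The 2-second figure describes the pauses the team has observed so far, but what about the tail? Consider what happens if the guarantee you're relying on is probabilistic rather than enforced.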

Answer

No, this reasoning is not sound, and you should not accept it in production without a fencing mechanism.

  • A TTL shrinks the probability of overlap; it does not eliminate it. "GC pauses never exceed 2 s" is an observation about past behavior, not an absolute ceiling. A single JVM (Java Virtual Machine, the runtime for Java programs) under memory pressure can pause for 10–20 s. One event in a billion requests is still a real outage at scale.
  • Even if GC were bounded, clock drift between the lock server and the node means the node cannot precisely know when its own lock expired. A node may believe it has 1 s remaining while the lock service already revoked it 500 ms ago.
  • The core flaw is conflating "usually won't happen" with "cannot happen." Correctness properties in distributed systems must hold unconditionally, not statistically; a payment double-charge that occurs once a month is still a broken system.
  • But why not just make the TTL much longer, say 60 seconds, to create a huge safety margin? Longer TTLs mean a genuinely crashed node holds up the system for longer before a new holder can take over. You trade a smaller chance of overlap for slower failover, and you still haven't eliminated the race, only made it rarer and harder to notice.
  • Watch out: fencing tokens stop stale writes from landing, but they do not undo anything a stale holder has already done elsewhere. If the application assumes it still holds the lock and triggers a side effect (like sending an email) before the storage layer confirms the write, that side effect stands even when the write is rejected. Confirm that the write succeeded first, then act.
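
The ordering in the last point can be sketched as follows. The names `approve_charge`, `send_receipt`, and `FencedStore` are hypothetical, and the email is modelled as a list append purely for illustration.

```python
class FencedStore:
    """Toy store that returns False for writes with a stale token."""

    def __init__(self):
        self.highest_token = -1
        self.rows = []

    def write(self, row, token):
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.rows.append(row)
        return True


sent_emails = []


def send_receipt(customer):
    # Stand-in for an external side effect that cannot be undone.
    sent_emails.append(customer)


def approve_charge(store, customer, token):
    # Write first; only touch the outside world once the store confirms.
    if not store.write(f"charge approved: {customer}", token):
        return "aborted: lock was lost"
    send_receipt(customer)
    return "approved"


store = FencedStore()
current = approve_charge(store, "alice", token=101)
stale = approve_charge(store, "alice", token=100)

print(current, "|", stale, "|", sent_emails)
# approved | aborted: lock was lost | ['alice']
```

Had `send_receipt` been called before `store.write`, the stale holder would have sent a second email even though its write was rejected.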