← all lessons
System Design · Consistency & Replication ·

Two-Phase Commit: Coordinating Atomic Writes Across Multiple Nodes

Core concept
When a write must land on several independent nodes simultaneously — either all succeed or none do — you need a protocol that enforces that all-or-nothing promise. Two-phase commit (2PC) does this by splitting the commit into two distinct steps: first ask every participant "can you promise to commit?", then, only if all say yes, tell every participant "now actually commit." The coordinator node (the single orchestrator running the protocol) collects votes in phase one and issues the final decree in phase two. The critical insight is that a participant who votes "yes" in phase one is making a durable promise: it must not abandon that promise even if it crashes and restarts.

Diagram

flowchart LR
    C[Coordinator]
    A[Node A]
    B[Node B]
    D[Node C]
    C -->|Phase 1: Prepare| A
    C -->|Phase 1: Prepare| B
    C -->|Phase 1: Prepare| D
    A -->|Vote: Yes| C
    B -->|Vote: Yes| C
    D -->|Vote: Yes| C
    C -->|Phase 2: Commit| A
    C -->|Phase 2: Commit| B
    C -->|Phase 2: Commit| D

Concrete real-world example
Imagine a flight booking system where reserving a seat requires debiting a payment database and writing a booking record to a reservations database — two separate services, two separate machines. Without 2PC you could debit the card but crash before writing the reservation, or vice versa. With 2PC, the coordinator sends "Prepare" to both services in phase one. Payment locks the funds and logs its promise to disk; Reservations locks the seat row and logs its promise. Both vote yes. The coordinator logs the "commit" decision durably, then sends "Commit" to both. If the coordinator crashes after logging but before sending, when it restarts it reads its own log and re-sends the commit — both participants are still waiting and will accept it.

One trade-off / gotcha
The coordinator is a single point of failure during the window between collecting all votes and delivering the phase-two verdict. If the coordinator crashes right after participants have voted yes but before they receive "Commit" or "Abort," those participants are stuck in an uncertain state — they hold locks and cannot safely decide on their own, because the missing message could be either outcome. This blocking behavior is 2PC's defining weakness: a crashed coordinator can hold your entire system hostage until it recovers.

An interview-style question to ponder
Your team suggests deploying 2PC across three data centers in different geographic regions to keep global inventory consistent. A staff engineer pushes back hard. What specific failure scenarios make cross-datacenter 2PC dangerous, and what architectural alternatives would you consider?

Stuck? Show a hint

Think about what the word "blocking" means when the coordinator is not just a crashed process but an entire unreachable datacenter — and consider how long participants must hold their locks while waiting for a verdict that may not arrive for minutes or hours.

Show answer

Cross-datacenter 2PC is dangerous because network partitions (when nodes can't reach each other) turn the coordinator's unavailability from a brief crash-recovery scenario into a potentially multi-minute or permanent blockage, during which all participants hold locks and refuse new writes.

  • In a single datacenter a crashed coordinator restarts in seconds; across regions, a network partition can last minutes or longer. Every participant sitting in the "uncertain" state holds row-level locks for that entire duration, grinding writes on those rows to a halt globally.
  • The blast radius compounds with participant count: with three datacenters each housing multiple database shards, a single coordinator outage can freeze dozens of lock holders simultaneously. At even modest transaction rates, lock queues pile up fast.
  • The fundamental issue is that 2PC sacrifices availability for atomicity. In a cross-region setup you are making that trade under the worst possible latency and failure conditions, which is precisely what CAP theorem (the impossibility of perfect consistency, availability, and partition-tolerance simultaneously) predicts will hurt you most.
  • Practical alternatives: saga pattern (break the transaction into independent steps each with a compensating undo action, tolerating brief inconsistency), or routing all writes for a global entity through a single "home region" to keep 2PC local while using async replication outward.
  • But why not just use a stronger protocol like Paxos (algorithm for reaching consensus across unreliable nodes) instead? Paxos and its relatives (Raft, Multi-Paxos) solve the coordinator-crash problem by electing a new leader without blocking, but they add significant latency per commit round and implementation complexity; they are better suited to replicating a single log than to coordinating heterogeneous services that each own their own state.
  • Watch out: even a "successful" 2PC round across regions adds one full cross-region round-trip per phase — easily 100–300 ms per commit — which is often unacceptable for high-throughput transactional workloads regardless of correctness concerns.