System Design · Consistency & Replication · 2026-06-27

Saga Pattern: Coordinating Multi-Service Transactions Without Two-Phase Commit

Core concept
When a single business operation spans multiple independent services (say, booking a flight, charging a card, and reserving a hotel), you can't wrap the steps in one atomic database transaction because each service owns its own database. A Saga breaks the operation into a sequence of local transactions, each committed immediately, and pairs each one with a compensating transaction (a pre-written undo operation) that runs if a later step fails. No lock is held across services, so the system stays available and each service stays autonomous. The key insight is that you trade atomicity (all-or-nothing at once) for eventual correctness (all-or-nothing over time, via compensation rather than a database rollback); the cost is that other requests can observe intermediate states while the saga is in flight.

flowchart LR
    A[Order Service\ncommit order] --> B[Payment Service\ncharge card]
    B --> C[Inventory Service\nreserve item]
    C --> D[Shipping Service\nschedule delivery]
    D --> E[Done ✓]
    C -->|step fails| F[Compensate:\nrefund card]
    F --> G[Compensate:\ncancel order]

Concrete real-world example
An e-commerce checkout creates an order, charges the card, and reserves warehouse stock — three services, three databases. If stock reservation fails after the card is already charged, the Saga triggers compensating transactions in reverse order: it issues a refund to the card, then marks the order as cancelled. The customer sees a clean failure ("item unavailable") rather than a hung transaction or a charge with no order.
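The checkout flow above can be sketched as a tiny in-process orchestrator. This is a minimal illustration, not a production design: `Step`, `run_saga`, and the three stand-in service functions are hypothetical names, and a real saga would make network calls and persist its progress between steps.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # local transaction, commits immediately
    compensate: Callable[[], None]   # pre-written undo for that transaction


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            # The failed step committed nothing, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
    return True


# Hypothetical stand-ins for the order, payment, and inventory services.
events: List[str] = []


def reserve_stock() -> None:
    raise RuntimeError("item unavailable")


checkout = [
    Step("order", lambda: events.append("order created"),
         lambda: events.append("order cancelled")),
    Step("payment", lambda: events.append("card charged"),
         lambda: events.append("card refunded")),
    Step("inventory", reserve_stock,
         lambda: events.append("stock released")),
]

ok = run_saga(checkout)
print(ok, events)
# False ['order created', 'card charged', 'card refunded', 'order cancelled']
```

The inventory step's own compensation never runs because that step never committed; only the charge and the order are undone, in reverse order.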

One trade-off / gotcha
Compensating transactions must be idempotent (safe to run more than once with the same result) and must eventually succeed: if a compensation fails permanently, the system is stuck in an inconsistent state with no further automatic fallback, so failed compensations are retried until they succeed or are escalated for manual intervention. Compensations also cannot always truly undo side effects: a sent confirmation email cannot be unsent. Real systems handle this with a durable "saga log" (a persistent record of each step's outcome) so a crashed coordinator can resume exactly where it left off, and by identifying non-compensable effects (like emails) upfront and designing around them, for example by sending the email only after the last step that can fail.
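One way to picture an idempotent, retried compensation is the sketch below. It assumes a hypothetical in-memory `saga_log` dictionary standing in for a durable table, and a hypothetical helper `compensate_once`; the retry count and the "needs_manual_review" status are illustrative choices, not a standard.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory saga log; a real one would be a durable table.
saga_log: Dict[str, str] = {}


def compensate_once(saga_id: str, step: str, undo: Callable[[], None],
                    max_attempts: int = 5) -> bool:
    """Retry `undo` until it succeeds; skip it if already recorded as done."""
    key = f"{saga_id}:{step}:compensated"
    if saga_log.get(key) == "done":
        return True  # already compensated; a repeat call is a no-op
    for _ in range(max_attempts):
        try:
            undo()
        except Exception:
            continue  # treat as transient and try again
        saga_log[key] = "done"
        return True
    saga_log[key] = "needs_manual_review"  # escalate instead of failing silently
    return False


# Demo: a refund that times out twice before succeeding.
refunds: List[str] = []
attempts = {"n": 0}


def flaky_refund() -> None:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("gateway timeout")
    refunds.append("refund-issued")


first = compensate_once("saga-42", "payment", flaky_refund)
second = compensate_once("saga-42", "payment", flaky_refund)  # replay after a restart
print(first, second, refunds)
# True True ['refund-issued']
```

The log check only covers replays that happen after success was recorded. A crash between `undo()` returning and the log write would still re-run the undo, which is why the undo operation itself also needs to tolerate duplicates.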

An interview-style question to ponder
You're designing a ride-sharing app. A "request ride" saga involves: (1) matching a driver, (2) locking the driver as unavailable, (3) billing the rider's saved card. The billing step fails 5% of the time due to a flaky payment gateway. How do you decide whether to retry the billing step in place, or immediately trigger compensations to release the driver?

Hint

Think about the cost asymmetry between the two failure paths: what does it cost the system to hold the driver locked for a few extra seconds versus what it costs to release the driver and restart the whole match? That asymmetry should drive your retry budget.

Answer

Retry billing in place for a short, bounded window (e.g., 2–3 attempts over ~4 seconds) before triggering compensations.

Releasing the driver immediately on a transient billing failure throws away expensive work: the match algorithm may take several seconds and consume significant compute, and the driver is sitting idle anyway during the retry window — so holding the lock costs almost nothing real.
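With a flaky gateway, a 5% failure rate most likely reflects transient faults (network blip, timeout) rather than permanent ones (card declined). A single immediate retry typically resolves most of these cases, and two further retries with a short backoff (e.g., 1 s, then 3 s) resolve nearly all of them while capping the added delay at about 4 seconds. As a rough guide, if failures were independent at 5% each, three attempts would all fail only about 0.01% of the time (0.05^3); real gateway failures are often correlated, so treat that figure as an optimistic bound.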
If all retries fail, then compensate: release the driver and notify the rider. At that point you have evidence it's a persistent failure (declined card, gateway outage), and holding the driver longer actively harms supply availability.
But why not retry indefinitely to maximize billing success? An unbounded retry loop turns a transient fault into an indefinite resource lock; the driver can never be reassigned, supply shrinks, and wait times rise for all riders — so your retry budget must be strictly capped.
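Watch out: if your saga log doesn't record that billing was attempted, a crash mid-retry can cause the payment step to run again after recovery, potentially double-charging. Write "billing started" to the log before calling the payment gateway, not after, and store an idempotency key alongside it; reusing that key on every retry lets a gateway that supports idempotency keys deduplicate the charge.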
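The answer's retry budget and logging order can be sketched as follows. Everything here is an assumption made for illustration: `bill_with_budget`, `TransientError`, the in-memory `saga_log`, and the idea that the gateway accepts an idempotency key are not taken from any particular payment API.

```python
import time
import uuid
from typing import Callable, Dict, List, Sequence

saga_log: Dict[str, str] = {}  # hypothetical stand-in for a durable saga log


class TransientError(Exception):
    """Timeouts and network blips: worth retrying."""


def bill_with_budget(saga_id: str,
                     charge: Callable[[str], None],
                     release_driver: Callable[[], None],
                     backoffs: Sequence[float] = (1.0, 3.0),
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    # Record the attempt and its idempotency key BEFORE calling the gateway,
    # and reuse the stored key after a restart so a replay cannot charge twice.
    key = saga_log.setdefault(f"{saga_id}:billing_key", str(uuid.uuid4()))
    saga_log[f"{saga_id}:billing"] = "started"

    # One immediate attempt, then one retry per backoff delay (bounded budget).
    for delay in (0.0, *backoffs):
        sleep(delay)
        try:
            charge(key)
        except TransientError:
            continue
        except Exception:
            break  # permanent failure such as a declined card: stop retrying
        saga_log[f"{saga_id}:billing"] = "done"
        return True

    # Budget exhausted or permanent failure: compensate by freeing the driver.
    saga_log[f"{saga_id}:billing"] = "failed"
    release_driver()
    return False


# Demo: the first call times out, the retry succeeds with the same key.
calls: List[str] = []
released: List[str] = []


def flaky_charge(key: str) -> None:
    calls.append(key)
    if len(calls) < 2:
        raise TransientError("gateway timeout")


ok = bill_with_budget("ride-7", flaky_charge,
                      lambda: released.append("driver"),
                      sleep=lambda s: None)  # skip real waiting in the demo
print(ok, len(calls), released)
# True 2 []
```

Passing `sleep` as a parameter keeps the demo instant; with the default `time.sleep` the worst case waits about 4 seconds before the driver is released.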