Saga Pattern
Sagas replace two-phase commit with a sequence of local transactions and compensating rollbacks — eventual consistency in exchange for no distributed locking.
- Every saga step must have a defined compensating transaction written before the happy path ships. Discovering you can't roll back step 3 after step 5 fails is a production incident.
- Compensating transactions are not undos — they are new forward-moving actions that neutralize a previous step's effect. A refund is not the inverse of a charge; it's a new credit.
- Orchestration (a central coordinator directs each step) is easier to debug; choreography (services react to each other's events) is more resilient. Most teams underestimate choreography's operational complexity.
- Idempotency is non-negotiable: saga steps and compensations are retried on failure — a non-idempotent compensation causes double refunds, double releases, and data corruption.
The Problem
Consider an e-commerce checkout flow: charge the card → reserve inventory → create shipment. The card is charged successfully, but the item is out of stock, so the inventory reservation fails. Now what? With 2PC, all three operations would be rolled back atomically. Without it (the normal case in microservices with separate databases), the charge is already committed and nothing rolls it back automatically. Without a saga, the engineering team discovers this edge case in production when a customer is charged for an item that will never ship.
Core System Idea
A saga models a distributed business process as a sequence of local ACID transactions, one per service. Each step commits locally and triggers the next step via an event or command. If any step fails, the saga executes compensating transactions in reverse order for all previously completed steps. Compensation is not a 2PC rollback: it is a new forward transaction that neutralizes the effect (refund the charge, release the reservation).

Two coordination topologies:
- Orchestration: a central saga coordinator (Temporal, Netflix Conductor, AWS Step Functions) maintains saga state, directs each participant, and handles retries and compensation. Best for complex, long-running flows with branching logic.
- Choreography: services react to each other's events via a message broker (Kafka, SQS) with no central coordinator. Best for simple, reactive flows.

Both require every step to be idempotent and every compensation to be defined upfront.
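The orchestrated variant of this idea can be sketched in a few lines of Python. This is a minimal, in-process illustration rather than how Temporal, Conductor, or Step Functions work internally: `SagaStep`, `run_saga`, `SagaFailed`, and the checkout functions are all hypothetical names, and each "service call" just appends to a list.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]        # local transaction for this step
    compensation: Callable[[dict], None]  # forward action that neutralizes it


class SagaFailed(Exception):
    pass


def run_saga(steps: List[SagaStep], ctx: dict) -> None:
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            # Compensate only the steps that committed, newest first.
            for done in reversed(completed):
                done.compensation(ctx)
            raise SagaFailed(f"{step.name} failed: {exc}") from exc


# Hypothetical checkout participants; each appends to a log instead of
# calling a real service.
log: List[str] = []

def charge_card(ctx): log.append("charge")
def refund_card(ctx): log.append("refund")
def reserve_inventory(ctx): raise RuntimeError("out of stock")
def release_inventory(ctx): log.append("release")
def create_shipment(ctx): log.append("ship")
def cancel_shipment(ctx): log.append("cancel_shipment")

checkout = [
    SagaStep("charge_card", charge_card, refund_card),
    SagaStep("reserve_inventory", reserve_inventory, release_inventory),
    SagaStep("create_shipment", create_shipment, cancel_shipment),
]

try:
    run_saga(checkout, {"order_id": "o-1"})
except SagaFailed as err:
    outcome = str(err)
```

After this run, `log` holds `["charge", "refund"]`: the failed reservation never committed, so only the charge is compensated. A production orchestrator would also persist its progress and retry compensations, which this sketch deliberately leaves out.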
System Flow
Orchestrated saga: each step triggers the next on success; failure triggers compensating transactions in reverse.
Real-World Examples (Indicative)
At Black Friday scale (millions of orders/hour), Shopify's checkout saga runs: inventory reservation → payment authorization → order confirmation → fulfillment trigger. If payment authorization fails after inventory is reserved, a compensating transaction releases the reservation immediately. Shopify uses a centralized workflow engine to track saga state — at this volume, any untracked partial failure creates a reconciliation nightmare.
A ride request saga: driver matching → driver acceptance → pickup confirmation → trip completion → payment processing. If payment fails after trip completion, Uber's compensation logs the debt and schedules a retry — they don't reverse the trip, they forward-compensate. This is the key insight: compensation is a business decision, not a technical rollback, and it must be designed per-step.
Netflix Conductor, Netflix's open-source workflow orchestration engine, manages saga-like flows for media encoding, account provisioning, and content delivery. Each workflow step is a microservice task; Conductor tracks state, handles retries with configurable backoff, and triggers compensation workflows on failure. Conductor persists workflow state in a durable backing store (Cassandra is one of the supported options), so orchestrator restarts don't lose in-flight workflows.
Anti-Patterns
Implementing the happy path and adding compensations "later." Later never comes until a customer is charged for an out-of-stock item. Define and test compensations before shipping the saga.
A refund compensation that issues a new charge-reversal on every retry will refund the customer multiple times if the first attempt times out. Every compensation must be safe to execute multiple times.
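One common way to make a refund safe to retry is an idempotency key derived from the saga and step. The sketch below assumes a hypothetical in-memory `RefundService`; the key format and method names are illustrative, not a real payment provider's API.

```python
from typing import Dict, List


class RefundService:
    """In-memory stand-in for a payment service (hypothetical)."""

    def __init__(self) -> None:
        self.issued: Dict[str, int] = {}   # idempotency key -> amount in cents
        self.ledger: List[int] = []        # every credit actually issued

    def refund(self, idempotency_key: str, amount_cents: int) -> int:
        # A retry with the same key returns the recorded result instead of
        # issuing a second credit.
        if idempotency_key in self.issued:
            return self.issued[idempotency_key]
        self.ledger.append(amount_cents)
        self.issued[idempotency_key] = amount_cents
        return amount_cents


service = RefundService()
key = "saga-42:refund_card"   # assumed convention: saga id + step name

service.refund(key, 1999)     # first attempt; assume the reply is lost
service.refund(key, 1999)     # retry after the timeout
total_refunded = sum(service.ledger)
```

Both calls succeed, but only one credit lands in the ledger. In a real service the key lookup and the credit would need to commit in the same local transaction (for example, behind a unique constraint on the key), otherwise two concurrent retries could both pass the check.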
A 7-step saga implemented as pure event choreography — each service emits an event that triggers the next. When step 4 fails, determining what compensations have already run requires correlating 6 different event logs. Orchestration makes this visible in one place.
Calling another service synchronously within a saga step to "complete the step atomically." This couples the step's outcome to a remote call and can create distributed deadlocks when two sagas wait on each other's services. Each saga step must be a single local transaction.
An in-memory orchestrator that loses its state on restart leaves all in-flight sagas in an unknown state — some steps completed, compensations not triggered. Saga state must be persisted to a durable store before any step is executed.
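A minimal sketch of durable saga state, using SQLite from the Python standard library as a stand-in for whatever durable store the orchestrator uses. The table layout, the status values, and the helper names (`open_store`, `record`, `pending_compensations`) are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
from typing import List


def open_store(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS saga_log ("
        " saga_id TEXT, position INTEGER, step TEXT, status TEXT,"
        " PRIMARY KEY (saga_id, position))"
    )
    conn.commit()
    return conn


def record(conn: sqlite3.Connection, saga_id: str, position: int,
           step: str, status: str) -> None:
    # `with conn` commits on success, so the row is durable before we return.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO saga_log VALUES (?, ?, ?, ?)",
            (saga_id, position, step, status),
        )


def pending_compensations(conn: sqlite3.Connection, saga_id: str) -> List[str]:
    # A STARTED step may or may not have committed, so it is compensated too.
    rows = conn.execute(
        "SELECT step FROM saga_log WHERE saga_id = ?"
        " AND status IN ('STARTED', 'DONE') ORDER BY position DESC",
        (saga_id,),
    ).fetchall()
    return [row[0] for row in rows]


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "saga.db")

conn = open_store(db_path)
record(conn, "saga-42", 0, "charge_card", "STARTED")
# ... charge_card would run here ...
record(conn, "saga-42", 0, "charge_card", "DONE")
record(conn, "saga-42", 1, "reserve_inventory", "STARTED")
conn.close()  # simulate the orchestrator crashing mid-step

conn = open_store(db_path)  # restart: state is read back from disk
to_compensate = pending_compensations(conn, "saga-42")
conn.close()
```

After the simulated restart, `to_compensate` is `["reserve_inventory", "charge_card"]`, newest first. Because the orchestrator cannot tell whether the `STARTED` step actually committed, its compensation has to tolerate being run against a step that never took effect, which is another reason compensations must be idempotent.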
Design Tradeoffs
| Dimension | Orchestration | Choreography |
|---|---|---|
| Flow visibility | Single trace in orchestrator | Distributed, event correlation required |
| Single point of failure | Yes (orchestrator) | No |
| Compensation tracking | Explicit in orchestrator | Implicit, harder to verify |
| Best for | Complex branching, long-running flows | Simple, reactive event chains |
| Tooling | Temporal, Conductor, Step Functions | Kafka + event-driven consumers |
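For contrast, the choreography column can be sketched with a tiny synchronous event bus. `EventBus`, the topic names, and the handlers are hypothetical stand-ins for a real broker and real services; the point is only that no component holds the whole flow, so the shared `saga_id` is the only way to reconstruct what happened.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class EventBus:
    """Synchronous in-memory stand-in for a broker such as Kafka."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []

    def subscribe(self, topic: str, handler: Handler) -> None:
        self.handlers[topic].append(handler)

    def publish(self, topic: str, event: Dict) -> None:
        # Record the correlation id with every event for later tracing.
        self.history.append(f"{event['saga_id']}:{topic}")
        for handler in self.handlers[topic]:
            handler(event)


bus = EventBus()

# Each "service" only knows which event it reacts to and which it emits.
def payment_on_order_placed(event: Dict) -> None:
    bus.publish("PaymentCharged", event)

def inventory_on_payment_charged(event: Dict) -> None:
    topic = "InventoryReserved" if event["in_stock"] else "InventoryRejected"
    bus.publish(topic, event)

def payment_on_inventory_rejected(event: Dict) -> None:
    bus.publish("PaymentRefunded", event)  # compensation as a reaction

bus.subscribe("OrderPlaced", payment_on_order_placed)
bus.subscribe("PaymentCharged", inventory_on_payment_charged)
bus.subscribe("InventoryRejected", payment_on_inventory_rejected)

bus.publish("OrderPlaced", {"saga_id": "saga-42", "in_stock": False})
```

Here `bus.history` ends up listing `OrderPlaced`, `PaymentCharged`, `InventoryRejected`, and `PaymentRefunded` for `saga-42`. With a real broker those events live in separate topics and service logs, which is the correlation cost the table refers to.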
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Business process spans multiple services with separate databases | Single service with one database — use a local ACID transaction |
| 2PC is impractical due to service heterogeneity or scale | Immediate strong consistency is required — saga's eventual consistency is unacceptable |
| Steps can be compensated with meaningful business actions | Compensation is impossible (irreversible external side effects like sending an SMS) |
| Long-running workflows (seconds to minutes) span multiple services | Simple 2-service coordination where a shared DB transaction is feasible |