Saga Pattern

Sagas replace two-phase commit with a sequence of local transactions and compensating rollbacks — eventual consistency in exchange for no distributed locking.

TL;DR
  • Every saga step must have a defined compensating transaction written before the happy path ships. Discovering you can't roll back step 3 after step 5 fails is a production incident.
  • Compensating transactions are not undos — they are new forward-moving actions that neutralize a previous step's effect. A refund is not the inverse of a charge; it's a new credit.
  • Orchestration (a central coordinator directs each step) is easier to debug; choreography (services react to each other's events) avoids a central coordinator as a single point of failure. Most teams underestimate choreography's operational complexity.
  • Idempotency is non-negotiable: saga steps and compensations are retried on failure — a non-idempotent compensation causes double refunds, double releases, and data corruption.

The Problem

An e-commerce checkout flow: charge the card → reserve inventory → create shipment. The card is charged successfully. Inventory is out of stock — reservation fails. Now what? With 2PC, all three operations would be rolled back atomically. Without it (the normal case in microservices with separate databases), the charge is committed and there is no automatic rollback. Without a saga, the engineering team discovers this edge case in production when a customer is charged for an item that will never ship.

Core System Idea

A saga models a distributed business process as a sequence of local ACID transactions, one per service. Each step commits locally and triggers the next step via an event or command. If any step fails, the saga executes compensating transactions in reverse order for all previously completed steps. Compensation is not a 2PC rollback; it is a new forward transaction that neutralizes the effect (refund the charge, release the reservation).

There are two coordination topologies:
  • Orchestration: a central saga coordinator (Temporal, Netflix Conductor, AWS Step Functions) maintains saga state, directs each participant, and handles retries and compensation. Best for complex, long-running flows with branching logic.
  • Choreography: services react to each other's events via a message broker (Kafka, SQS) with no central coordinator. Best for simple, reactive flows.

Both require every step to be idempotent and every compensation to be defined upfront.
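The orchestrated variant can be sketched in a few lines. This is a minimal in-memory illustration, not a production coordinator: `SagaStep`, `run_saga`, and the checkout functions are hypothetical names, each function stands in for a call to a real service, and persistence, retries, and timeouts are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[dict], None]        # local transaction for this step
    compensation: Callable[[dict], None]  # forward action that neutralizes it


class SagaFailed(Exception):
    pass


def run_saga(steps: List[SagaStep], ctx: dict) -> None:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action(ctx)
            completed.append(step)
        except Exception as exc:
            # Only steps that committed are compensated, newest first.
            for done in reversed(completed):
                done.compensation(ctx)
            raise SagaFailed(f"step '{step.name}' failed: {exc}") from exc


# Hypothetical checkout flow; `log` records what each "service" did.
log: List[str] = []

def charge_card(ctx): log.append("charge")
def refund_card(ctx): log.append("refund")
def reserve_inventory(ctx): raise RuntimeError("out of stock")
def release_inventory(ctx): log.append("release")
def create_shipment(ctx): log.append("ship")
def cancel_shipment(ctx): log.append("cancel_shipment")

checkout = [
    SagaStep("payment", charge_card, refund_card),
    SagaStep("inventory", reserve_inventory, release_inventory),
    SagaStep("shipping", create_shipment, cancel_shipment),
]

try:
    run_saga(checkout, {"order_id": "o-1"})
except SagaFailed as err:
    print(err)   # step 'inventory' failed: out of stock
print(log)       # ['charge', 'refund']
```

The failed inventory step is not compensated because it never committed; only the charge is neutralized, by a refund.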

System Flow

flowchart TD
    A["Client"] --> B["Saga Orchestrator"]
    B --> C["Payment Service"]
    C -- "OK" --> D["Inventory Service"]
    C -- "Fail" --> G["End: Failed"]
    D -- "OK" --> E["Shipping Service"]
    D -- "Fail" --> F["Compensate: Refund"]
    E -- "OK" --> H["End: Success"]
    E -- "Fail" --> I["Compensate: Release Inventory"]
    I --> F
    F --> G

Orchestrated saga: each step triggers the next on success; failure triggers compensating transactions in reverse.

Real-World Examples (Indicative)

Shopify's checkout flow

At Black Friday scale (millions of orders/hour), Shopify's checkout saga runs: inventory reservation → payment authorization → order confirmation → fulfillment trigger. If payment authorization fails after inventory is reserved, a compensating transaction releases the reservation immediately. Shopify uses a centralized workflow engine to track saga state — at this volume, any untracked partial failure creates a reconciliation nightmare.

Uber's trip lifecycle

A ride request saga: driver matching → driver acceptance → pickup confirmation → trip completion → payment processing. If payment fails after trip completion, Uber's compensation logs the debt and schedules a retry — they don't reverse the trip, they forward-compensate. This is the key insight: compensation is a business decision, not a technical rollback, and it must be designed per-step.

Netflix Conductor

Netflix's open-source workflow orchestration engine manages saga-like flows for media encoding, account provisioning, and content delivery. Each workflow step is a microservice task; Conductor tracks state, handles retries with configurable backoff, and can trigger a compensation workflow on failure. Conductor persists workflow state in a durable backing store (Cassandra is one of the supported options), so orchestrator restarts don't lose in-flight workflows.

Anti-Patterns

Missing compensating transactions

Implementing the happy path and adding compensations "later." Later never comes until a customer is charged for an out-of-stock item. Define and test compensations before shipping the saga.
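One way to keep compensations from slipping to "later" is a test that injects a failure at every step and checks that the system ends up clean. The sketch below assumes a toy three-step saga; `run_saga`, `build_steps`, and `check_all_failure_points` are hypothetical helpers, and the `state` set stands in for the combined state of the participating services.

```python
from typing import Callable, Dict, List, Set, Tuple

# (name, action, compensation)
Step = Tuple[str, Callable[[], None], Callable[[], None]]
STEP_NAMES = ("payment", "inventory", "shipping")


def run_saga(steps: List[Step]) -> bool:
    """Return True on success; on failure compensate in reverse and return False."""
    done: List[Step] = []
    for step in steps:
        try:
            step[1]()
            done.append(step)
        except Exception:
            for _, _, compensate in reversed(done):
                compensate()
            return False
    return True


def build_steps(fail_at: str, state: Set[str], trace: List[str]) -> List[Step]:
    """Build the saga with a failure injected into the step named `fail_at`."""
    def make(name: str) -> Step:
        def action() -> None:
            if name == fail_at:
                raise RuntimeError(f"injected failure in {name}")
            state.add(name)

        def compensate() -> None:
            state.discard(name)
            trace.append(f"comp:{name}")

        return (name, action, compensate)

    return [make(name) for name in STEP_NAMES]


def check_all_failure_points() -> Dict[str, List[str]]:
    """Fail each step in turn and verify no partial effects are left behind."""
    results: Dict[str, List[str]] = {}
    for fail_at in STEP_NAMES:
        state: Set[str] = set()
        trace: List[str] = []
        succeeded = run_saga(build_steps(fail_at, state, trace))
        assert not succeeded
        assert state == set(), f"leftover effects after failure in {fail_at}"
        results[fail_at] = trace
    return results


results = check_all_failure_points()
```

Here `results["shipping"]` records that inventory is compensated before payment, which is the reverse order the saga is expected to follow.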

Non-idempotent compensation

A refund compensation that issues a new charge-reversal on every retry will refund the customer multiple times if the first attempt times out. Every compensation must be safe to execute multiple times.
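A common way to make a refund safe to retry is an idempotency key derived from the saga ID and step name. The following is a minimal in-memory sketch; `PaymentLedger` and the key format are assumptions for illustration, not any payment provider's API.

```python
from typing import Dict


class PaymentLedger:
    """In-memory stand-in for a payment service's local database."""

    def __init__(self) -> None:
        self.balance_cents: Dict[str, int] = {}
        self._refunds_seen: Dict[str, int] = {}  # idempotency key -> amount

    def refund(self, idempotency_key: str, customer: str, amount_cents: int) -> bool:
        """Apply the refund once per key; return True only if applied by this call."""
        if idempotency_key in self._refunds_seen:
            return False  # retry of a refund that already went through
        self._refunds_seen[idempotency_key] = amount_cents
        self.balance_cents[customer] = self.balance_cents.get(customer, 0) + amount_cents
        return True


ledger = PaymentLedger()
key = "saga-42:refund-payment"  # hypothetical format: saga ID + step name

first = ledger.refund(key, "alice", 1999)
retry = ledger.refund(key, "alice", 1999)  # e.g. the first response timed out
```

After both calls the customer has been credited once. In a real service the key check and the balance update would need to happen in the same local transaction, typically backed by a unique constraint on the key.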

Deep choreography chains

A 7-step saga implemented as pure event choreography — each service emits an event that triggers the next. When step 4 fails, determining what compensations have already run requires correlating 6 different event logs. Orchestration makes this visible in one place.
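To make the correlation cost concrete, the sketch below rebuilds saga status by merging per-service event logs on a shared saga ID. The `Event` shape and the event kinds ("completed", "failed", "compensated") are assumptions for illustration; real logs would also need ordering and deduplication across services.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Event(NamedTuple):
    saga_id: str
    service: str
    kind: str  # "completed", "failed", or "compensated"


def saga_status(logs: Iterable[List[Event]]) -> Dict[str, Dict[str, str]]:
    """Merge per-service logs into {saga_id: {service: last_event_kind}}."""
    status: Dict[str, Dict[str, str]] = defaultdict(dict)
    for service_log in logs:
        for event in service_log:  # each log is assumed to be in emit order
            status[event.saga_id][event.service] = event.kind
    return dict(status)


def pending_compensations(state: Dict[str, str]) -> List[str]:
    """Services whose step completed but has not been compensated after a failure."""
    if "failed" not in state.values():
        return []
    return sorted(svc for svc, kind in state.items() if kind == "completed")


payment_log = [
    Event("s1", "payment", "completed"),
    Event("s1", "payment", "compensated"),
    Event("s2", "payment", "completed"),
]
inventory_log = [
    Event("s1", "inventory", "completed"),
    Event("s2", "inventory", "completed"),
]
shipping_log = [Event("s1", "shipping", "failed")]

status = saga_status([payment_log, inventory_log, shipping_log])
stuck = pending_compensations(status["s1"])
```

For saga `s1`, payment has been compensated but inventory has not, so `stuck` names the inventory service. An orchestrator holds this view directly instead of deriving it from several logs.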

Synchronous calls inside saga steps

Calling another service synchronously within a saga step to "complete the step atomically." This couples the step's outcome to the other service's availability and can leave two sagas stalled waiting on each other's services. Each saga step must be a single local transaction.

No saga state persistence

An in-memory orchestrator that loses its state on restart leaves all in-flight sagas in an unknown state — some steps completed, compensations not triggered. Saga state must be persisted to a durable store before any step is executed.
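A minimal sketch of durable saga state, using SQLite from the Python standard library as a stand-in for whatever store the orchestrator actually uses. The `saga_state` table, the step list, and the function names are hypothetical. Progress is committed after each step, so a restarted orchestrator resumes from the recorded position.

```python
import sqlite3
from typing import Callable, Dict, List

STEPS: List[str] = ["payment", "inventory", "shipping"]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS saga_state ("
        " saga_id TEXT PRIMARY KEY,"
        " next_step INTEGER NOT NULL,"
        " status TEXT NOT NULL)"
    )
    db.commit()
    return db


def start_saga(db: sqlite3.Connection, saga_id: str) -> None:
    # The saga is recorded before any step runs.
    db.execute("INSERT OR IGNORE INTO saga_state VALUES (?, 0, 'running')", (saga_id,))
    db.commit()


def resume_saga(
    db: sqlite3.Connection, saga_id: str, handlers: Dict[str, Callable[[str], None]]
) -> None:
    """Run the remaining steps, committing progress after each one."""
    (next_step,) = db.execute(
        "SELECT next_step FROM saga_state WHERE saga_id = ?", (saga_id,)
    ).fetchone()
    for index in range(next_step, len(STEPS)):
        handlers[STEPS[index]](saga_id)  # must be idempotent: may re-run after a crash
        db.execute(
            "UPDATE saga_state SET next_step = ? WHERE saga_id = ?", (index + 1, saga_id)
        )
        db.commit()
    db.execute("UPDATE saga_state SET status = 'done' WHERE saga_id = ?", (saga_id,))
    db.commit()


# Simulate a crash during the inventory step, then a restart.
db = open_store()
calls: List[str] = []
crash_once = {"armed": True}


def inventory(saga_id: str) -> None:
    if crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("simulated orchestrator crash")
    calls.append("inventory")


handlers = {
    "payment": lambda saga_id: calls.append("payment"),
    "inventory": inventory,
    "shipping": lambda saga_id: calls.append("shipping"),
}

start_saga(db, "saga-1")
try:
    resume_saga(db, "saga-1", handlers)
except RuntimeError:
    pass  # the process "restarts" here; the stored row still has next_step = 1
resume_saga(db, "saga-1", handlers)
```

Payment is not re-run after the restart because its completion was committed. A step that crashes after its side effect but before the progress update would run again, which is why the handlers must be idempotent.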

Design Tradeoffs

| Dimension | Orchestration | Choreography |
| --- | --- | --- |
| Flow visibility | Single trace in orchestrator | Distributed, event correlation required |
| Single point of failure | Yes (orchestrator) | No |
| Compensation tracking | Explicit in orchestrator | Implicit, harder to verify |
| Best for | Complex branching, long-running flows | Simple, reactive event chains |
| Tooling | Temporal, Conductor, Step Functions | Kafka + event-driven consumers |

Best Practices

  • Write the compensation for each step before writing the step itself. If you can't define the compensation, you can't safely implement the step.
  • Assign a unique saga ID at creation and include it in every step's local transaction and every emitted event. This is your correlation handle for debugging and compensation tracking.
  • Persist saga state to a durable store (for example a relational database or DynamoDB) before executing each step. The orchestrator must survive restarts without losing in-flight saga progress.
  • Test compensation paths explicitly: inject failures at each step in your test environment and verify that the correct compensations run and that the system reaches a consistent state.
  • Set timeouts on every saga step. A step waiting indefinitely for a downstream service blocks the saga forever. Define explicitly what happens when a step times out: compensation or retry.
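The timeout practice above can be sketched with `concurrent.futures` from the standard library. `OnTimeout` and `run_step` are hypothetical names, and the policy shown (retry up to a limit, otherwise hand over to compensation) is one possible choice rather than a prescribed one.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as StepTimeout
from enum import Enum
from typing import Callable


class OnTimeout(Enum):
    RETRY = "retry"
    COMPENSATE = "compensate"


def run_step(
    action: Callable[[], None],
    timeout_s: float,
    on_timeout: OnTimeout,
    max_attempts: int = 3,
) -> str:
    """Return 'ok', or 'compensate' when the timeout policy gives up.

    Exceptions raised by the action itself propagate to the caller.
    """
    pool = ThreadPoolExecutor(max_workers=max_attempts)
    try:
        for _ in range(max_attempts):
            future = pool.submit(action)
            try:
                future.result(timeout=timeout_s)
                return "ok"
            except StepTimeout:
                if on_timeout is OnTimeout.COMPENSATE:
                    return "compensate"
                # RETRY: try again. The timed-out call may still finish in the
                # background, so the action must be idempotent.
        return "compensate"  # retries exhausted
    finally:
        pool.shutdown(wait=False)  # do not block the saga on a stuck call


def fast() -> None:
    return None


def slow() -> None:
    time.sleep(0.2)  # stands in for a downstream service that hangs


a = run_step(fast, timeout_s=0.1, on_timeout=OnTimeout.COMPENSATE)
b = run_step(slow, timeout_s=0.05, on_timeout=OnTimeout.COMPENSATE)
c = run_step(slow, timeout_s=0.05, on_timeout=OnTimeout.RETRY, max_attempts=2)
```

A timeout only tells the orchestrator that it stopped waiting, not that the downstream call failed, so the compensation that follows has to tolerate a step that did complete in the end.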

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Business process spans multiple services with separate databases | Single service with one database (use a local ACID transaction) |
| 2PC is impractical due to service heterogeneity or scale | Immediate strong consistency is required and a saga's eventual consistency is unacceptable |
| Steps can be compensated with meaningful business actions | Compensation is impossible (irreversible external side effects like sending an SMS) |
| Long-running workflows (seconds to minutes) span multiple services | Simple 2-service coordination where a shared DB transaction is feasible |