Workflow Orchestration vs Choreography

Orchestration centralizes workflow control for visibility and compensation; choreography decentralizes it for scale and loose coupling — the choice determines where failure handling lives and how hard post-incident debugging is.

TL;DR
  • Orchestration centralizes workflow state in one place — a single Temporal workflow or Step Functions execution shows you exactly where a business process is. Choreography distributes it across event logs — reconstructing state requires correlating events across 5 services.
  • Netflix Conductor persists workflow state in an external datastore (Cassandra is one supported backend), so orchestrator restarts don't lose in-flight workflows. An in-memory orchestrator that loses state on restart is not orchestration; it's a fragile coordinator.
  • Choreography's fan-out advantage is real: adding a new consumer to OrderPlaced requires zero changes to the order service. In orchestration, adding a new step requires an orchestrator code change and deploy.
  • The God Orchestrator anti-pattern is the most common failure: all business logic in one service, downstream services as dumb CRUD endpoints. Orchestration should own workflow coordination, not business rules.
  • Most production systems use both: orchestration within a bounded context (payment flow), choreography across domains (order placed notifying marketing, analytics, shipping independently).

The Problem

A travel booking system calls flight, hotel, and payment services in sequence via direct HTTP calls. The flight service calls the hotel service, which calls payment. When the payment service is slow, the hotel service thread is blocked, which blocks the flight service thread. All three services degrade together — tight coupling through a synchronous call chain. The team refactors to async messaging, but now nobody knows the overall state of a booking: was it the hotel or the payment step that failed at 2am? Without explicit workflow state, debugging distributed failures requires correlating timestamps across three separate log streams.

Core System Idea

Orchestration and choreography are two approaches to coordinating multi-service workflows. Orchestration uses a central coordinator (Temporal, Netflix Conductor, AWS Step Functions) that owns the workflow definition and directs each service step-by-step. The orchestrator knows the full state at all times, handles retries and timeouts, and executes compensating transactions on failure. Choreography uses a shared event broker (Kafka, SQS, EventBridge) where services react to events without a central coordinator. Each service does its local work and emits a new event; downstream services subscribe and react. No single service has a complete view of the workflow. The practical choice: orchestration within a bounded context where complex branching, compensation, and observability matter most; choreography across domain boundaries where fan-out and independent scaling matter most. Most production systems at scale use both — Temporal orchestrates the payment flow, while PaymentCompleted events fan out to analytics, notifications, and fraud review via Kafka.
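The orchestration half of this idea can be sketched in a few lines of plain Python. This is a minimal illustration, not how Temporal, Conductor, or Step Functions are implemented: the `Step`, `SagaResult`, `run_saga`, and `book_trip` names are hypothetical, state is held in memory only, and retries and timeouts are left out. It shows the one property that matters here: the coordinator knows which steps completed and runs their compensations in reverse order when a later step fails.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[Dict], None]      # performs the step; raises on failure
    compensate: Callable[[Dict], None]  # undoes the step if a later one fails


@dataclass
class SagaResult:
    completed: List[str] = field(default_factory=list)
    compensated: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None


def run_saga(steps: List[Step], ctx: Dict) -> SagaResult:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    result = SagaResult()
    done: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception:
            result.failed_step = step.name
            for prev in reversed(done):
                prev.compensate(ctx)
                result.compensated.append(prev.name)
            return result
        done.append(step)
        result.completed.append(step.name)
    return result


def book_trip() -> SagaResult:
    """Hypothetical travel booking in which the payment step fails."""
    def declined(ctx: Dict) -> None:
        raise RuntimeError("card declined")

    steps = [
        Step("reserve_flight",
             lambda ctx: ctx.update(flight="held"),
             lambda ctx: ctx.update(flight="released")),
        Step("reserve_hotel",
             lambda ctx: ctx.update(hotel="held"),
             lambda ctx: ctx.update(hotel="released")),
        Step("charge_payment", declined, lambda ctx: None),
    ]
    return run_saga(steps, ctx={})
```

Calling `book_trip()` returns a result whose `failed_step` is `charge_payment` and whose `compensated` list is the hotel followed by the flight, which is the workflow state a choreographed system would have to reconstruct from separate event logs.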

System Flow

flowchart TD
  subgraph Orchestration
    A["Orchestrator"] -->|"1. Charge Card"| B["Payment Service"]
    B -->|"Success"| A
    A -->|"2. Reserve Stock"| C["Inventory Service"]
  end
  subgraph Choreography
    D["Order Service"] -->|"OrderPlaced event"| E["Event Broker"]
    E -->|"consume"| F["Payment Service"]
    F -->|"PaymentSucceeded event"| E
    E -->|"consume"| G["Inventory Service"]
  end

Orchestration: central coordinator directs each step and tracks state. Choreography: services react to events with no coordinator — workflow is implicit in the event chain.
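The choreography half of the flow can be sketched with a toy in-memory broker. Everything here is an assumption made for illustration: the `Broker` class, the `wire_services` helper, and the event payload shape are hypothetical, and a real broker such as Kafka or SQS adds persistence, partitioning, and redelivery. The point is that no function in the sketch knows the whole workflow; the sequence emerges from who subscribes to what.

```python
from collections import defaultdict, deque


class Broker:
    """Toy event broker: a FIFO queue plus per-event-type subscriber lists."""

    def __init__(self):
        self._subs = defaultdict(list)
        self._queue = deque()
        self.log = []  # event types in delivery order, for inspection

    def subscribe(self, event_type, handler):
        self._subs[event_type].append(handler)

    def publish(self, event_type, payload):
        self._queue.append((event_type, payload))

    def drain(self):
        """Deliver queued events until no handler publishes anything new."""
        while self._queue:
            event_type, payload = self._queue.popleft()
            self.log.append(event_type)
            for handler in list(self._subs[event_type]):
                handler(payload)


def wire_services(broker, stock):
    """Register the payment and inventory services as event consumers."""
    def payment_service(order):
        # Local work (charging the card) would happen here, then a new event.
        broker.publish("PaymentSucceeded", order)

    def inventory_service(order):
        stock[order["sku"]] -= order["qty"]

    broker.subscribe("OrderPlaced", payment_service)
    broker.subscribe("PaymentSucceeded", inventory_service)


def place_order():
    broker = Broker()
    stock = {"sku-1": 5}
    wire_services(broker, stock)
    # The order service only publishes; it does not know who consumes.
    broker.publish("OrderPlaced", {"order_id": 1, "sku": "sku-1", "qty": 2})
    broker.drain()
    return broker.log, stock
```

The only record of the workflow is `broker.log`, the chain of events that happened to be delivered, which is why visibility in choreography depends on correlating events after the fact.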

Real-World Examples (Indicative)

Netflix Conductor (Orchestration)

Netflix open-sourced Conductor to manage media encoding pipelines where every title upload triggers 15+ quality variants, DRM packaging, CDN distribution, and metadata indexing as orchestrated workflow steps. Conductor persists workflow state in an external datastore (Cassandra is one supported backend), so orchestrator restarts don't lose in-flight workflows. At 10,000+ concurrent encoding jobs, a Conductor operator can query the exact state of any job from a single dashboard: which step completed, which failed, what the retry count is. This observability is hard to replicate with pure choreography without building a separate event correlation service.
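Why durable state matters can be shown with a small sketch that records each step's status before and after running it. This is not Conductor's storage model: it uses SQLite from the Python standard library as a stand-in datastore, and the `workflow_steps` table, `open_store`, `run_workflow`, and `encoding_demo` names are hypothetical. A crash is simulated with an exception, and the second call plays the role of the restarted orchestrator.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the stand-in workflow store (in-memory here; a file path in practice)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS workflow_steps ("
        "workflow_id TEXT, step TEXT, status TEXT, "
        "PRIMARY KEY (workflow_id, step))"
    )
    conn.commit()
    return conn


def run_workflow(conn, workflow_id, steps):
    """Run (name, callable) steps, skipping any already recorded as completed."""
    executed = []
    for name, action in steps:
        row = conn.execute(
            "SELECT status FROM workflow_steps WHERE workflow_id = ? AND step = ?",
            (workflow_id, name),
        ).fetchone()
        if row and row[0] == "completed":
            continue  # finished before the restart; do not re-run
        # Record intent before doing the work, so a crash leaves a trace.
        conn.execute(
            "INSERT OR REPLACE INTO workflow_steps VALUES (?, ?, ?)",
            (workflow_id, name, "started"),
        )
        conn.commit()
        action()
        conn.execute(
            "UPDATE workflow_steps SET status = 'completed' "
            "WHERE workflow_id = ? AND step = ?",
            (workflow_id, name),
        )
        conn.commit()
        executed.append(name)
    return executed


def encoding_demo():
    conn = open_store()
    calls = []
    crash_once = {"pending": True}

    def encode():
        calls.append("encode")

    def package():
        if crash_once["pending"]:
            crash_once["pending"] = False
            raise RuntimeError("simulated orchestrator crash")
        calls.append("package")

    steps = [("encode", encode), ("package", package)]
    try:
        run_workflow(conn, "title-42", steps)
    except RuntimeError:
        pass  # the process "dies" here
    resumed = run_workflow(conn, "title-42", steps)  # the "restarted" orchestrator
    return calls, resumed
```

After the simulated crash, the second run executes only `package`; `encode` is not repeated. A step left in the `started` state is re-run, so in this sketch each step's action still needs to be safe to execute more than once.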

Temporal at Stripe (Orchestration)

Stripe uses Temporal for payment flow orchestration. The workflow code reads like sequential business logic — charge card, reserve inventory, create shipment — but Temporal makes it durable: if the shipment service times out, Temporal retries the step transparently without re-executing the already-committed card charge. Compensation (refund the card, release inventory) is co-located with the workflow code, not distributed across event handlers in separate services. This is the key advantage over saga-via-choreography: compensation logic lives in one file, auditable and testable in isolation.

Shopify's cross-domain choreography

When OrderPlaced is published to Kafka, the fraud detection, inventory, notification, and analytics services each consume independently — no orchestrator. Shopify added a carbon offset calculation feature (subscribing to OrderPlaced) with zero changes to the order service. This is choreography's primary advantage: fan-out to new consumers is operationally free. During Black Friday, each consumer scales independently — the analytics consumer lags hours behind without affecting payment processing. A centralized orchestrator calling analytics synchronously would have brought down the checkout flow.
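The independence of consumers described here comes from each one tracking its own position in a shared log. The sketch below models that idea only; it is not the Kafka client API, and `EventLog`, `register`, `poll`, `lag`, and `black_friday_demo` are hypothetical names. It shows a slow analytics consumer falling behind without affecting payment, and a new consumer being added without touching the producer.

```python
class EventLog:
    """Append-only log with an independent read offset per consumer."""

    def __init__(self):
        self._events = []
        self._offsets = {}

    def append(self, event):
        # The producer never references consumers.
        self._events.append(event)

    def register(self, consumer):
        # New consumers start from the beginning of the log in this sketch.
        self._offsets.setdefault(consumer, 0)

    def poll(self, consumer, max_events):
        """Return up to max_events unread events and advance this consumer only."""
        start = self._offsets[consumer]
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer):
        return len(self._events) - self._offsets[consumer]


def black_friday_demo():
    log = EventLog()
    for name in ("payment", "analytics"):
        log.register(name)
    for order_id in range(10):
        log.append({"type": "OrderPlaced", "order_id": order_id})

    log.poll("payment", max_events=10)   # payment keeps up
    log.poll("analytics", max_events=2)  # analytics falls behind
    log.register("carbon_offset")        # added later; producer code unchanged
    return {name: log.lag(name)
            for name in ("payment", "analytics", "carbon_offset")}
```

In `black_friday_demo()`, payment ends with zero lag while analytics is eight events behind, and registering `carbon_offset` required no change to the code that appends `OrderPlaced`.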

Anti-Patterns

The God Orchestrator

All business logic in the orchestrator service; downstream services are dumb CRUD endpoints. This recreates a monolith with distributed overhead. Orchestrators should coordinate steps, not contain business rules.

Choreographed circular dependencies

Service A emits an event that triggers Service B, which triggers Service C, which triggers Service A again. This creates an infinite event loop that generates unbounded message volume. Map your event graph before shipping choreography.
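Mapping the event graph can be automated with an ordinary depth-first search for cycles. The sketch below assumes the graph is maintained by hand as a dictionary from each service to the services its events trigger; the `find_event_cycle` name and the graph format are illustrative, not part of any broker's tooling.

```python
def find_event_cycle(graph):
    """Return one cycle as a list of services (first == last), or None.

    graph maps a service name to the services triggered by its events.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path: that is a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None
```

For the loop described above, `find_event_cycle({"A": ["B"], "B": ["C"], "C": ["A"]})` returns `["A", "B", "C", "A"]`. Running a check like this in CI against a declared event graph is one way to catch loops before they reach production.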

Synchronous calls inside orchestration

Calling downstream services synchronously (blocking HTTP) from the orchestrator without timeouts. If the payment service takes 10 seconds, the orchestrator thread is blocked for 10 seconds. Orchestrators must use async step execution with explicit timeouts — not synchronous blocking calls.
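A minimal way to put an explicit timeout around each step is `asyncio.wait_for` from the Python standard library. The step functions, the `StepTimeout` exception, and the 50 ms budget below are hypothetical; a real orchestrator would also persist the outcome and decide whether to retry or compensate.

```python
import asyncio


class StepTimeout(Exception):
    def __init__(self, step, timeout_s):
        super().__init__(f"step {step!r} exceeded {timeout_s}s")
        self.step = step


async def run_step(name, coro_fn, timeout_s):
    """Await one step, cancelling it if it exceeds its time budget."""
    try:
        return await asyncio.wait_for(coro_fn(), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise StepTimeout(name, timeout_s)


async def fast_inventory():
    await asyncio.sleep(0)    # stands in for a quick downstream call
    return "reserved"


async def slow_payment():
    await asyncio.sleep(0.5)  # stands in for a degraded payment service
    return "charged"


async def demo():
    outcomes = {}
    for name, fn in [("reserve_stock", fast_inventory),
                     ("charge_card", slow_payment)]:
        try:
            outcomes[name] = await run_step(name, fn, timeout_s=0.05)
        except StepTimeout as exc:
            outcomes[name] = f"timed out: {exc.step}"
    return outcomes
```

Running `asyncio.run(demo())` records `reserve_stock` as reserved and `charge_card` as timed out after roughly 50 ms instead of waiting the full half second, and no thread sits blocked while the slow call is pending.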

Choreography without distributed tracing

Event-driven choreography without W3C Trace Context propagation. When an order disappears mid-workflow, debugging requires correlating timestamps across 5 log streams. Pass a correlation ID through every event header from the moment the workflow starts.
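Propagation can be sketched with the W3C `traceparent` header, whose version 00 format is `00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>`. The event envelope (`type`, `headers`, `payload`) and the helper names below are assumptions for illustration; in practice a tracing library such as OpenTelemetry would usually manage these headers.

```python
import re
import secrets

# Version 00 traceparent: version, trace-id, parent-id, trace-flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace; called once, when the workflow starts."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent):
    """Keep the trace-id and flags, but mint a new parent-id for this hop."""
    match = TRACEPARENT_RE.match(parent)
    if not match:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, _parent_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def emit(event_type, payload, incoming_headers=None):
    """Build an outgoing event, continuing the incoming trace if there is one."""
    if incoming_headers and "traceparent" in incoming_headers:
        traceparent = child_traceparent(incoming_headers["traceparent"])
    else:
        traceparent = new_traceparent()
    return {"type": event_type,
            "headers": {"traceparent": traceparent},
            "payload": payload}


def trace_id(event):
    """Extract the trace-id, the field to search on across log streams."""
    return TRACEPARENT_RE.match(event["headers"]["traceparent"]).group(1)
```

If the order service calls `emit("OrderPlaced", ...)` and the payment service calls `emit("PaymentSucceeded", ..., incoming_headers=order["headers"])`, both events share one trace-id, so a single query on that value finds every hop of the workflow.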

Design Tradeoffs

Dimension | Orchestration | Choreography
--- | --- | ---
Workflow visibility | Full: single trace in orchestrator dashboard | Partial: must correlate events across service logs
Failure handling | Centralized compensation in workflow code | Each service emits failure events; hard to coordinate rollback
Service coupling | Orchestrator knows all downstream APIs | Services know only event schema and broker
Single point of failure | Yes (orchestrator service) | No
Best for | Complex branching, compensation, audit requirements | High-throughput fan-out, independent team scaling

Best Practices

  • Use orchestration within a bounded context (payment, KYC, order fulfillment) and choreography across domain boundaries (order placed notifying marketing, analytics, shipping). Mixing both is the production standard at scale.
  • Persist orchestrator state to a durable store (Temporal's event history, Conductor's configured datastore, Step Functions' execution history) before executing any step. An in-memory orchestrator is a coordinator that will lose state during a deploy.
  • Propagate a correlation ID through every orchestration step and every choreographed event. This single field can reduce post-incident debugging from manual cross-service log correlation to a single log query.
  • In choreography, every event handler must be idempotent. Kafka consumers and SQS standard queues typically give at-least-once delivery, so duplicate events are expected behavior, not an edge case. Test duplicate delivery explicitly in your integration tests.
  • For orchestration compensation (saga rollback), define compensating steps before implementing the happy path. If you can't define how to undo a step, you can't safely include it in the workflow.
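The idempotency practice above can be sketched as a handler that remembers which event IDs it has already applied. The `InventoryHandler` class and the `event_id`, `sku`, and `qty` fields are hypothetical, and the in-memory set is a simplification: in a real service the processed-ID record and the stock update would need to be committed in the same transaction.

```python
class InventoryHandler:
    """Applies stock reservations at most once per event_id."""

    def __init__(self, stock):
        self.stock = dict(stock)
        self._seen = set()  # assumption: would live in the service's database

    def handle(self, event):
        """Return True if the event was applied, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # redelivery: acknowledge without side effects
        self.stock[event["sku"]] -= event["qty"]
        self._seen.add(event_id)
        return True


def duplicate_delivery_demo():
    handler = InventoryHandler({"sku-1": 5})
    event = {"event_id": "evt-1", "sku": "sku-1", "qty": 2}
    first = handler.handle(event)
    second = handler.handle(event)  # the broker redelivers the same event
    return first, second, handler.stock["sku-1"]
```

`duplicate_delivery_demo()` is the shape of the integration test the list recommends: deliver the same event twice and assert that stock is decremented once.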

When to Use / Avoid

Choose Orchestration When | Choose Choreography When
--- | ---
Complex conditional branches, loops, or SLA timeouts require central coordination | Services must deploy and scale completely independently across different teams
Centralized audit trail and real-time workflow state visibility are required | Adding new consumers to events must not require changes to the producing service
Compensation (saga rollback) across multiple services must be reliable and testable | High-throughput event fan-out where any central coordinator would become a bottleneck