Workflow Orchestration vs Choreography
Orchestration centralizes workflow control for visibility and compensation; choreography decentralizes it for scale and loose coupling — the choice determines where failure handling lives and how hard post-incident debugging is.
- Orchestration centralizes workflow state in one place: a single Temporal workflow or Step Functions execution shows exactly where a business process is. Choreography distributes that state across event logs, so reconstructing it requires correlating events from every participating service.
- Durable orchestrators persist workflow state: Netflix Conductor can store it in a backend such as Cassandra, so orchestrator restarts don't lose in-flight workflows. An in-memory orchestrator that loses state on restart is not orchestration; it's a fragile coordinator.
- Choreography's fan-out advantage is real: adding a new consumer to OrderPlaced requires zero changes to the order service. In orchestration, adding a new step requires an orchestrator code change and deploy.
- The God Orchestrator anti-pattern is the most common failure: all business logic in one service, downstream services as dumb CRUD endpoints. Orchestration should own workflow coordination, not business rules.
- Most production systems use both: orchestration within a bounded context (payment flow), choreography across domains (order placed notifying marketing, analytics, shipping independently).
The Problem
A travel booking system calls flight, hotel, and payment services in sequence via direct HTTP calls. The flight service calls the hotel service, which calls payment. When the payment service is slow, the hotel service thread is blocked, which blocks the flight service thread. All three services degrade together — tight coupling through a synchronous call chain. The team refactors to async messaging, but now nobody knows the overall state of a booking: was it the hotel or the payment step that failed at 2am? Without explicit workflow state, debugging distributed failures requires correlating timestamps across three separate log streams.
Core System Idea
Orchestration and choreography are two approaches to coordinating multi-service workflows. Orchestration uses a central coordinator (Temporal, Netflix Conductor, AWS Step Functions) that owns the workflow definition and directs each service step-by-step. The orchestrator knows the full state at all times, handles retries and timeouts, and executes compensating transactions on failure. Choreography uses a shared event broker (Kafka, SQS, EventBridge) where services react to events without a central coordinator. Each service does its local work and emits a new event; downstream services subscribe and react. No single service has a complete view of the workflow. The practical choice: orchestration within a bounded context where complex branching, compensation, and observability matter most; choreography across domain boundaries where fan-out and independent scaling matter most. Most production systems at scale use both — Temporal orchestrates the payment flow, while PaymentCompleted events fan out to analytics, notifications, and fraud review via Kafka.
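The orchestration half of this idea can be sketched in a few lines. The example below is a minimal in-process illustration, not the API of Temporal, Conductor, or Step Functions: `SagaOrchestrator`, the step functions, and the simulated payment failure are all hypothetical names chosen for the travel booking scenario. It shows the coordinator owning the step order and running compensations in reverse when a step fails.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A step pairs a forward action with the compensation that undoes it.
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


@dataclass
class SagaOrchestrator:
    """Hypothetical in-process coordinator; real engines persist this state."""
    steps: List[Step]
    log: List[str] = field(default_factory=list)

    def run(self, ctx: dict) -> bool:
        completed: List[Step] = []
        for step in self.steps:
            name, action, _ = step
            try:
                action(ctx)
                self.log.append(f"done:{name}")
                completed.append(step)
            except Exception as exc:
                self.log.append(f"failed:{name}:{exc}")
                # Undo the steps that already succeeded, newest first.
                for done_name, _, compensate in reversed(completed):
                    compensate(ctx)
                    self.log.append(f"compensated:{done_name}")
                return False
        return True


def book_flight(ctx): ctx["flight"] = "booked"
def cancel_flight(ctx): ctx["flight"] = "cancelled"
def book_hotel(ctx): ctx["hotel"] = "booked"
def cancel_hotel(ctx): ctx["hotel"] = "cancelled"
def charge_payment(ctx): raise RuntimeError("card declined")  # simulated failure
def refund_payment(ctx): ctx["payment"] = "refunded"


saga = SagaOrchestrator(steps=[
    ("flight", book_flight, cancel_flight),
    ("hotel", book_hotel, cancel_hotel),
    ("payment", charge_payment, refund_payment),
])
ctx: dict = {}
ok = saga.run(ctx)
print(ok, ctx, saga.log)
```

Because the payment step never completed, only the hotel and flight compensations run. The `log` list is the kind of single-place workflow history that choreography has to reconstruct from separate event streams.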
System Flow
Orchestration: central coordinator directs each step and tracks state. Choreography: services react to events with no coordinator — workflow is implicit in the event chain.
Real-World Examples (Indicative)
Netflix open-sourced Conductor to manage media encoding pipelines where every title upload triggers 15+ quality variants, DRM packaging, CDN distribution, and metadata indexing as orchestrated workflow steps. Conductor persists workflow state in a durable store (Cassandra is one supported backend), so orchestrator restarts don't lose in-flight workflows. At 10,000+ concurrent encoding jobs, a Conductor operator can query the exact state of any job from a single dashboard: which step completed, which failed, what the retry count is. This level of observability is impractical to replicate with pure choreography without building a separate event correlation service.
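To make the "state survives a restart and can be queried per job" point concrete, here is a rough sketch that uses SQLite as a stand-in for the durable store. It is not Conductor's schema or API; `WorkflowStateStore`, the table layout, and the job and step names are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


class WorkflowStateStore:
    """Hypothetical durable step-status store; SQLite stands in for Cassandra."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            " workflow_id TEXT, step TEXT, status TEXT, retries INTEGER,"
            " PRIMARY KEY (workflow_id, step))"
        )

    def record(self, workflow_id: str, step: str, status: str, retries: int = 0) -> None:
        # Upsert so that a retried step overwrites its previous status.
        self.db.execute(
            "INSERT OR REPLACE INTO steps VALUES (?, ?, ?, ?)",
            (workflow_id, step, status, retries),
        )
        self.db.commit()

    def status(self, workflow_id: str) -> dict:
        rows = self.db.execute(
            "SELECT step, status, retries FROM steps WHERE workflow_id = ?",
            (workflow_id,),
        ).fetchall()
        return {step: (status, retries) for step, status, retries in rows}


path = os.path.join(tempfile.mkdtemp(), "workflows.db")

store = WorkflowStateStore(path)
store.record("title-123", "encode_1080p", "COMPLETED")
store.record("title-123", "drm_package", "FAILED", retries=2)
store.db.close()  # simulate the orchestrator process exiting

restarted = WorkflowStateStore(path)  # a fresh process reopening the same store
state = restarted.status("title-123")
print(state)
```

The second `WorkflowStateStore` instance sees the same step statuses and retry counts, which is the property an in-memory coordinator lacks.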
Stripe uses Temporal for payment flow orchestration. The workflow code reads like sequential business logic — charge card, reserve inventory, create shipment — but Temporal makes it durable: if the shipment service times out, Temporal retries the step transparently without re-executing the already-committed card charge. Compensation (refund the card, release inventory) is co-located with the workflow code, not distributed across event handlers in separate services. This is the key advantage over saga-via-choreography: compensation logic lives in one file, auditable and testable in isolation.
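The "retry the failed step without re-running the committed one" behavior can be approximated with a checkpoint of completed step results. The sketch below is deliberately not the Temporal SDK (Temporal achieves this by recording results in a workflow event history); `run_durably`, the step functions, and the returned IDs are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def run_durably(
    steps: List[Tuple[str, Callable[[], str]]],
    completed: Dict[str, str],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Run steps in order, skipping any whose result is already checkpointed."""
    for name, fn in steps:
        if name in completed:
            continue  # result already recorded; never re-execute this step
        for attempt in range(1, max_attempts + 1):
            try:
                completed[name] = fn()
                break
            except TimeoutError:
                if attempt == max_attempts:
                    raise  # out of attempts; surface the failure to the caller
    return completed


calls = {"charge_card": 0, "create_shipment": 0}


def charge_card() -> str:
    calls["charge_card"] += 1
    return "ch_001"


def reserve_inventory() -> str:
    return "res_001"


def create_shipment() -> str:
    calls["create_shipment"] += 1
    if calls["create_shipment"] < 3:
        raise TimeoutError("shipment service timed out")  # simulated timeouts
    return "shp_001"


checkpoint: Dict[str, str] = {}
result = run_durably(
    [
        ("charge_card", charge_card),
        ("reserve_inventory", reserve_inventory),
        ("create_shipment", create_shipment),
    ],
    checkpoint,
)
print(result, calls)
```

The shipment step is attempted three times while the card is charged exactly once. In a real engine the `checkpoint` dictionary would live in durable storage so the same guarantee holds across process restarts.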
When OrderPlaced is published to Kafka, the fraud detection, inventory, notification, and analytics services each consume it independently, with no orchestrator involved. Shopify added a carbon offset calculation feature (subscribing to OrderPlaced) with zero changes to the order service. This is choreography's primary advantage: fan-out to new consumers requires no change to the producer. During Black Friday, each consumer scales independently, so the analytics consumer can lag hours behind without affecting payment processing. A centralized orchestrator calling analytics synchronously would have tied checkout latency to the analytics backlog.
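A small sketch of that fan-out property follows. `EventBus` is a hypothetical in-memory stand-in for a broker such as Kafka, and unlike a real broker it delivers synchronously in the publisher's thread; the point illustrated is only that the producer code does not change when a consumer is added.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class EventBus:
    """Hypothetical in-memory broker; delivery here is synchronous."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        for handler in self._subscribers[topic]:
            handler(event)


bus = EventBus()
seen: List[Tuple[str, str]] = []


def place_order(order_id: str, total: float) -> None:
    # The producer knows only the topic name and the event schema.
    bus.publish("OrderPlaced", {"order_id": order_id, "total": total})


bus.subscribe("OrderPlaced", lambda e: seen.append(("fraud", e["order_id"])))
bus.subscribe("OrderPlaced", lambda e: seen.append(("inventory", e["order_id"])))

# A consumer added later: place_order() above is untouched.
bus.subscribe("OrderPlaced", lambda e: seen.append(("carbon_offset", e["order_id"])))

place_order("o-1", 42.0)
print(seen)
```

With a real broker each consumer would also read at its own pace, which is what lets a slow analytics consumer fall behind without delaying the others.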
Anti-Patterns
All business logic in the orchestrator service; downstream services are dumb CRUD endpoints. This recreates a monolith with distributed overhead. Orchestrators should coordinate steps, not contain business rules.
Service A emits an event that triggers Service B, which triggers Service C, which triggers Service A again. This creates an infinite event loop that generates unbounded message volume. Map your event graph before shipping choreography.
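One way to "map the event graph" is to describe which service's events trigger which other services and run a cycle check before deploying. The sketch below uses a standard depth-first search; the graph format (service name mapped to the services it triggers) and `find_event_cycle` are assumptions for illustration.

```python
from typing import Dict, List, Optional


def find_event_cycle(graph: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of services, or None if the graph is acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


looping = {"A": ["B"], "B": ["C"], "C": ["A"]}
safe = {"A": ["B", "C"], "B": ["C"], "C": []}

print(find_event_cycle(looping))  # ['A', 'B', 'C', 'A']
print(find_event_cycle(safe))     # None
```

A check like this could run in CI against a declared list of subscriptions, assuming the team keeps such a list up to date.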
Calling downstream services synchronously (blocking HTTP) from the orchestrator without timeouts. If the payment service takes 10 seconds, the orchestrator thread is blocked for 10 seconds. Orchestrators must use async step execution with explicit timeouts — not synchronous blocking calls.
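As a rough illustration of async step execution with an explicit timeout, the sketch below uses Python's `asyncio.wait_for`. `call_payment` simulates a downstream call with `asyncio.sleep`, and the delays and timeout values are arbitrary; a real orchestrator would issue a non-blocking network request instead.

```python
import asyncio


async def call_payment(delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stands in for a non-blocking network call
    return "charged"


async def run_step(coro, timeout_s: float) -> str:
    """Await one step, giving up after timeout_s seconds."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying call before raising.
        return "timed_out"


async def main():
    # Both steps run concurrently; the slow one cannot block the fast one.
    fast, slow = await asyncio.gather(
        run_step(call_payment(0.01), timeout_s=0.5),
        run_step(call_payment(10.0), timeout_s=0.05),
    )
    return fast, slow


results = asyncio.run(main())
print(results)
```

The 10-second call is abandoned after 50 milliseconds, so the coordinator can decide to retry or compensate instead of holding a thread for the full duration.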
Event-driven choreography without W3C Trace Context propagation. When an order disappears mid-workflow, debugging requires correlating timestamps across 5 log streams. Pass a correlation ID through every event header from the moment the workflow starts.
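Propagating trace context through event headers might look like the sketch below. The header value follows the W3C Trace Context `traceparent` layout (version, 32-hex trace-id, 16-hex parent-id, 2-hex flags); the `emit` helper and the event envelope shape are hypothetical, and a production system would normally rely on a tracing library such as OpenTelemetry rather than hand-rolled helpers.

```python
import re
import secrets
from typing import Optional

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a trace: version 00, random trace-id and parent-id, sampled flag."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """Keep the trace-id and flags, mint a new parent-id for the next hop."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        return new_traceparent()  # malformed header: start a fresh trace
    trace_id, _, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def emit(event_type: str, payload: dict, incoming_headers: Optional[dict] = None) -> dict:
    """Build an event envelope that carries the trace forward."""
    headers = dict(incoming_headers or {})
    parent = headers.get("traceparent")
    headers["traceparent"] = child_traceparent(parent) if parent else new_traceparent()
    return {"type": event_type, "payload": payload, "headers": headers}


# The first event starts the trace; each consumer passes its incoming headers on.
order_placed = emit("OrderPlaced", {"order_id": "o-1"})
payment_completed = emit(
    "PaymentCompleted", {"order_id": "o-1"}, order_placed["headers"]
)

print(order_placed["headers"]["traceparent"])
print(payment_completed["headers"]["traceparent"])
```

Both events share the same trace-id, so a log search on that one value returns every hop of the workflow regardless of which service handled it.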
Design Tradeoffs
| Dimension | Orchestration | Choreography |
|---|---|---|
| Workflow visibility | Full — single trace in orchestrator dashboard | Partial — must correlate events across service logs |
| Failure handling | Centralized compensation in workflow code | Each service emits failure events; hard to coordinate rollback |
| Service coupling | Orchestrator knows all downstream APIs | Services know only event schema and broker |
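| Single point of failure | Yes, unless the orchestrator is replicated and its state is durable | No central coordinator, though the event broker remains a shared dependency |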
| Best for | Complex branching, compensation, audit requirements | High-throughput fan-out, independent team scaling |
Best Practices
When to Use / Avoid
| Choose Orchestration When | Choose Choreography When |
|---|---|
| Complex conditional branches, loops, or SLA timeouts require central coordination | Services must deploy and scale completely independently across different teams |
| Centralized audit trail and real-time workflow state visibility are required | Adding new consumers to events must not require changes to the producing service |
| Compensation (saga rollback) across multiple services must be reliable and testable | High-throughput event fan-out where any central coordinator would become a bottleneck |