
State Machine Design

Model complex business workflows as explicit, deterministic state machines — a single status column with enforced transition rules eliminates the race conditions and invalid states that arise from scattered boolean flags.

TL;DR
  • Replace scattered boolean flags (is_approved, has_paid, is_shipped) with a single status enum column and an explicit transition table — the flags approach makes it impossible to guarantee mutually exclusive states.
  • Stripe's Payment Intents API enforces state machine transitions at the API layer: attempting to confirm a PaymentIntent that hasn't collected a payment method returns a 400 invalid-request error (Stripe documents the code payment_intent_unexpected_state for operations that do not fit the current status), not a server error. The constraint lives in the service, not just the database.
  • Side effects (email sends, external API calls) must execute after the state transition commits, not inside the transaction. A hung external call inside a DB transaction holds a connection open until timeout.
  • The transactional outbox pattern ties side-effect dispatch to the state write atomically — write the new state and the outbox event in one transaction; a separate dispatcher reads and delivers the event.
  • Temporal stores workflow state as an append-only event history. A crashed workflow replays its history on recovery and resumes from the last recorded step without re-executing activities that already completed.

The Problem

An e-commerce system tracks orders using five boolean columns: payment_confirmed, inventory_reserved, label_printed, shipped, delivered. A race condition during a refund flow sets payment_confirmed = false without checking shipped, so the system attempts to un-reserve inventory that has already left the warehouse. Another race condition lets an order reach shipped = true while payment_confirmed = false because the two flags are updated in separate transactions. With five boolean flags there are 2^5 = 32 possible combinations, but only 5 or 6 are valid business states. The remaining 26 or 27 are invalid combinations, and the system has no mechanism to reject them.
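The arithmetic can be checked by brute force. The sketch below assumes one plausible validity rule that the text does not state explicitly: a flag may only be true once every earlier flag in the pipeline is true. Under that assumption, 6 combinations are valid and 26 are not.

```python
from itertools import product

FLAGS = ["payment_confirmed", "inventory_reserved", "label_printed", "shipped", "delivered"]

def is_valid(combo):
    # Assumed rule (not from the source): each flag implies all earlier flags.
    return all(earlier or not later for earlier, later in zip(combo, combo[1:]))

all_combos = list(product([False, True], repeat=len(FLAGS)))
valid = [c for c in all_combos if is_valid(c)]
invalid = [c for c in all_combos if not is_valid(c)]

print(len(all_combos), len(valid), len(invalid))  # 32 6 26

# The "shipped but unpaid" state from the race condition is representable:
shipped_unpaid = (False, True, True, True, False)
print(shipped_unpaid in invalid)  # True
```

Nothing in a five-column schema prevents a writer from producing any of the 26 invalid rows, which is the gap a single status column closes.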

Core System Idea

A state machine models a workflow as a formal set of named states, a set of events that trigger transitions, and a deterministic transition table mapping (current_state, event) → next_state. When an event arrives, the engine loads the current state, validates the transition is legal, writes the new state atomically, and only then dispatches side effects. Invalid transitions are rejected at the boundary: no amount of concurrent requests or racing updates can produce an illegal state because the transition table is the only path to state change.

Two implementation approaches:

(1) Persisted state machine (database-driven): the status column is the state; transitions are validated in application code or a state machine library (XState, transitions, python-statemachine); state writes use optimistic locking (WHERE version = $current_version) to prevent race conditions.

(2) Durable workflow engine (Temporal, AWS Step Functions): state is stored as an append-only event history; a crashed workflow replays its history on restart, skipping already-completed steps and resuming from the exact point of failure.

The workflow engine approach is ideal for long-running multi-service workflows; the database approach is ideal for entity-level state (order status, payment status) managed within a single service.
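A minimal sketch of the transition table as plain data. It uses no state machine library, and the event names (confirm, fulfill, cancel) and the InvalidTransition class are hypothetical choices for illustration.

```python
from enum import Enum

class OrderStatus(str, Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    FULFILLED = "fulfilled"
    CANCELLED = "cancelled"

class InvalidTransition(Exception):
    """Raised when (current_state, event) has no entry in the table."""
    def __init__(self, current, event):
        super().__init__(f"event {event!r} is not allowed in state {current.value!r}")
        self.current = current
        self.event = event

# (current_state, event) -> next_state. Anything not listed is illegal.
TRANSITIONS = {
    (OrderStatus.PENDING, "confirm"): OrderStatus.CONFIRMED,
    (OrderStatus.PENDING, "cancel"): OrderStatus.CANCELLED,
    (OrderStatus.CONFIRMED, "fulfill"): OrderStatus.FULFILLED,
    (OrderStatus.CONFIRMED, "cancel"): OrderStatus.CANCELLED,
}

def next_state(current, event):
    # Pure lookup: no I/O, so it is deterministic and easy to test exhaustively.
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise InvalidTransition(current, event) from None
```

Keeping the table as data rather than branching logic means every legal path can be listed, reviewed, and tested in one place.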

System Flow

flowchart TD
  A["Incoming Event"] --> B["State Machine Engine"]
  B --> C["Load Current State"]
  C --> D{"Valid Transition?"}
  D -- "No" --> E["Reject: Invalid Transition"]
  D -- "Yes" --> F["Write New State + Version"]
  F --> H["Write Outbox Event (same transaction)"]
  H --> G["Commit Transaction"]
  G --> I["Async Side Effect Dispatcher"]

Transition validated before state write; side effects dispatched after commit via outbox — no external calls inside the transaction.

Real-World Examples (Indicative)

Stripe's Payment Intents API

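A PaymentIntent has 7 statuses: requires_payment_method, requires_confirmation, requires_action, processing, requires_capture, succeeded, and canceled. The typical path runs requires_payment_method → requires_confirmation → requires_action → processing → requires_capture → succeeded, and steps such as requires_action and requires_capture are skipped when no customer authentication or manual capture is needed. An operation that is not legal for the current status is rejected with an invalid-request error (Stripe documents the code payment_intent_unexpected_state for this case). Stripe enforces this at the API service layer, not just the database. A client that tries to capture a PaymentIntent still in processing receives an immediate 400 instead of a partial capture, which prevents race conditions that would corrupt ledger entries.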

GitHub Actions workflow jobs

Each workflow job progresses through: pending → queued → in_progress → completed (with a terminal conclusion such as success, failure, cancelled, or skipped). Concurrent matrix job entries are separate state machines, but the strategy's fail-fast setting defaults to true, which cancels in-progress and queued matrix jobs as soon as one entry fails. Setting fail-fast: false keeps a failure in one entry from force-transitioning the others to cancelled. The GitHub Actions API records start and completion timestamps for jobs and steps, and the workflow run records the triggering actor, which together give a usable trail for compliance and post-incident analysis.

Temporal at Coinbase

Coinbase uses Temporal for multi-day KYC (Know Your Customer) verification workflows with states: identity_submitted → documents_uploaded → risk_assessed → compliance_reviewed → approved | rejected. Temporal stores each state transition as an event in an append-only history log. If the compliance review service crashes mid-transition, Temporal replays the history on recovery: the identity_submitted and documents_uploaded activities are not re-executed, their results are read back from the recorded history, and the workflow resumes at risk_assessed. Completed activities are therefore never run a second time, but an activity that fails or times out mid-flight can be retried, so activities should be idempotent. The application does not need to implement its own checkpointing.
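The replay idea can be illustrated with a toy model. This is not the Temporal SDK or its API: the Workflow class, run_step, and the in-memory history list are all hypothetical stand-ins, with the list playing the role of durable storage.

```python
class Workflow:
    def __init__(self, history):
        self.history = history  # append-only list of (step_name, result)
        self._cursor = 0        # how far replay has progressed

    def run_step(self, name, fn):
        if self._cursor < len(self.history):
            recorded_name, result = self.history[self._cursor]
            if recorded_name != name:
                raise RuntimeError(f"non-deterministic replay at step {name!r}")
            self._cursor += 1
            return result       # replayed from history; fn is not called
        result = fn()           # first execution of this step
        self.history.append((name, result))
        self._cursor += 1
        return result

def kyc_workflow(wf, activities):
    wf.run_step("identity_submitted", activities["identity"])
    wf.run_step("documents_uploaded", activities["documents"])
    return wf.run_step("risk_assessed", activities["risk"])

calls = []

def activity(name, fail=False):
    def run():
        calls.append(name)
        if fail:
            raise RuntimeError(f"{name} crashed")
        return f"{name}:done"
    return run

def make_activities(risk_fails):
    return {"identity": activity("identity"), "documents": activity("documents"),
            "risk": activity("risk", fail=risk_fails)}

history = []
try:
    kyc_workflow(Workflow(history), make_activities(risk_fails=True))
except RuntimeError:
    pass  # simulated crash during the third step
kyc_workflow(Workflow(history), make_activities(risk_fails=False))
print(calls)  # ['identity', 'documents', 'risk', 'risk']
```

The two completed steps run once, while the step that crashed mid-flight runs again on recovery, which is why the interrupted activity needs to be safe to retry.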

Anti-Patterns

Implicit state flags

Multiple independent boolean columns (is_verified, has_paid, is_shipped) allow invalid combinations — an order can be is_shipped=true and has_paid=false simultaneously with no constraint to prevent it. Replace with a single status enum and an explicit transition table.

In-memory state machines for long-running workflows

Keeping workflow state in application memory. A server restart, deploy, or crash loses all in-flight states — orders stuck mid-checkout, KYC reviews lost, subscriptions left in processing forever.

Side effects inside the database transaction

Executing a credit card charge, email send, or external API call inside the transaction that writes the new state. If the external call hangs for 30 seconds, the DB connection is held for 30 seconds. With 100 concurrent transitions, 100 connections are held at once, which is enough to exhaust a typical connection pool.

State explosion from one monolithic machine

A single state machine trying to model every sub-process (payment, shipping, returns, disputes) produces a transition matrix with hundreds of entries. Decompose into nested machines: an Order machine that delegates to a separate Payment machine and a separate Fulfillment machine.
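One way to sketch the decomposition: a small generic Machine class, two sub-machines, and an Order that reacts to their states. The state and event names for Payment and Fulfillment, and the rule that fulfillment requires a confirmed order, are assumptions for illustration.

```python
class Machine:
    def __init__(self, initial, transitions):
        self.state = initial
        self.transitions = transitions  # (state, event) -> next state

    def fire(self, event):
        key = (self.state, event)
        if key not in self.transitions:
            raise ValueError(f"invalid transition: {self.state} + {event}")
        self.state = self.transitions[key]
        return self.state

def new_payment():
    return Machine("unpaid", {("unpaid", "authorize"): "authorized",
                              ("authorized", "capture"): "captured"})

def new_fulfillment():
    return Machine("unstarted", {("unstarted", "pack"): "packed",
                                 ("packed", "ship"): "shipped"})

class Order:
    TRANSITIONS = {("pending", "confirm"): "confirmed",
                   ("confirmed", "fulfill"): "fulfilled",
                   ("pending", "cancel"): "cancelled",
                   ("confirmed", "cancel"): "cancelled"}

    def __init__(self):
        self.machine = Machine("pending", self.TRANSITIONS)
        self.subs = {"payment": new_payment(), "fulfillment": new_fulfillment()}

    def handle(self, target, event):
        # Assumed guard: fulfillment events are only accepted on a confirmed order.
        if target == "fulfillment" and self.machine.state != "confirmed":
            raise ValueError("fulfillment requires a confirmed order")
        self.subs[target].fire(event)
        # The parent machine advances only when a sub-machine reaches its goal state.
        if self.subs["payment"].state == "captured" and self.machine.state == "pending":
            self.machine.fire("confirm")
        if self.subs["fulfillment"].state == "shipped" and self.machine.state == "confirmed":
            self.machine.fire("fulfill")
        return self.machine.state
```

Each table here has at most four entries, so every (state, event) pair can be covered by a test, which is not practical for a single matrix with hundreds of entries.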

Design Tradeoffs

| Dimension | Persisted State (Database) | Durable Workflow Engine |
| --- | --- | --- |
| Crash resilience | Full: state survives restarts | Full: replays from event history |
| Transition latency | 2–10ms (DB write per transition) | 10–50ms (history append + worker poll) |
| Long-running workflow support | Limited (process must stay alive or poll) | Native: workflows span days or weeks |
| Audit trail | Each DB write is a state record | Append-only event history, fully replayable |
| Best for | Entity-level state in one service | Multi-service, multi-day business processes |

Best Practices

  • Store state in a single status string or enum column. Use optimistic locking: UPDATE orders SET status = $new, version = version + 1 WHERE id = $id AND status = $expected AND version = $current_version. If the UPDATE affects zero rows, a concurrent transition won: reload the row and re-validate against the current state before retrying.
  • Use the transactional outbox pattern for side effects: write the new state and an outbox event record in a single database transaction. A separate outbox processor reads and dispatches the event. No event is recorded unless the transition commits, and every committed transition has an event waiting for dispatch. Delivery is at-least-once, so consumers should be idempotent.
  • Write an entry to a state transition audit log on every transition: (entity_id, old_state, new_state, event, timestamp, actor_id). This log is your compliance record and your primary debugging tool for production incidents.
  • Reject invalid transitions at the API boundary with a structured error body: {"code": "invalid_transition", "current_state": "shipped", "requested_event": "reserve_inventory"}. Surface the specific constraint violation rather than a generic 400.
  • Decompose complex workflows into nested state machines rather than one monolithic machine. An Order machine at states pending | confirmed | fulfilled | cancelled delegates to a Payment sub-machine and a Fulfillment sub-machine. Each sub-machine has a bounded set of states that is easy to test exhaustively.
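Several of these practices fit in one short sketch using sqlite3: the guarded UPDATE, the audit log row, and the structured rejection. The schema, the "conflict" and "ok" codes, and the function names are assumptions for illustration; only the invalid_transition body mirrors the example above.

```python
import sqlite3
from datetime import datetime, timezone

TRANSITIONS = {("confirmed", "ship"): "shipped", ("shipped", "deliver"): "delivered"}

SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL);
CREATE TABLE transition_log (entity_id INTEGER, old_state TEXT, new_state TEXT,
                             event TEXT, timestamp TEXT, actor_id TEXT);
"""

def read_order(conn, order_id):
    return conn.execute(
        "SELECT status, version FROM orders WHERE id = ?", (order_id,)).fetchone()

def transition(conn, order_id, event, seen_status, seen_version, actor_id):
    new_status = TRANSITIONS.get((seen_status, event))
    if new_status is None:
        return {"code": "invalid_transition", "current_state": seen_status,
                "requested_event": event}
    with conn:  # state write and audit row commit together
        cur = conn.execute(
            "UPDATE orders SET status = ?, version = version + 1 "
            "WHERE id = ? AND status = ? AND version = ?",
            (new_status, order_id, seen_status, seen_version))
        if cur.rowcount == 0:
            # Someone else changed the row since it was read; caller should reload.
            return {"code": "conflict"}
        conn.execute(
            "INSERT INTO transition_log VALUES (?, ?, ?, ?, ?, ?)",
            (order_id, seen_status, new_status, event,
             datetime.now(timezone.utc).isoformat(), actor_id))
    return {"code": "ok", "status": new_status, "version": seen_version + 1}

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
with conn:
    conn.execute("INSERT INTO orders VALUES (1, 'confirmed', 0)")

first = read_order(conn, 1)   # two actors read the same version
second = read_order(conn, 1)
print(transition(conn, 1, "ship", *first, "alice")["code"])   # ok
print(transition(conn, 1, "ship", *second, "bob")["code"])    # conflict
print(transition(conn, 1, "reserve_inventory", *read_order(conn, 1), "bob"))
```

The last call returns the invalid_transition body with current_state "shipped", so the caller learns which rule was violated instead of receiving a bare 400.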

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Business process has sequential phases with strict rules about valid transitions | Workflow is highly dynamic and unstructured, with users jumping between steps arbitrarily |
| Multiple concurrent actors can modify the same entity, creating race condition risk | System is simple CRUD with no multi-step processes or concurrency concerns |
| Audit trail of how an entity reached its current state is required for compliance | High-frequency, low-latency pipelines where DB transactions introduce unacceptable overhead |
| Invalid state combinations must be prevented at the system boundary | Workflow is trivially linear with one possible path, so a state machine adds no value |