Event Sourcing

Event sourcing stores state as an immutable, append-only sequence of domain events rather than overwriting rows — current state is derived by replaying events from the last snapshot; the event log is both the audit trail and the authoritative data source.

TL;DR
  • Event sourcing stores state as an immutable, append-only sequence of domain events. Current state is derived by replaying events from the last snapshot — the log is the source of truth, not overwritten rows.
  • Jet.com used EventStoreDB for their pricing engine: each product's price history was an event stream. At checkout, pricing was computed by replaying the last N events from the snapshot — average 8 events per product, not a full history scan.
  • Uber uses Kafka topics as an event store for trip processing: TripStarted, LocationUpdated, and TripEnded events are written to topics with indefinite retention. Log compaction has to stay off for an event store, because compaction keeps only the latest record per key and would discard history. In-progress trip state is cached in Redis (keyed by trip ID + last event offset) to avoid full replay.
  • Snapshots become necessary once replay time matters, typically above ~50 events per aggregate. Without snapshots, each state load replays the full history: an aggregate with 10,000 events must read and apply all 10,000 events in order to reconstruct current state.
  • Schema evolution is the most operationally dangerous aspect: renaming a field in an event struct breaks replay of all historical events in that aggregate. Use upcasters — transformation functions applied at read time to convert old schemas to the current schema.

The Problem

An e-commerce company needs to answer a customer dispute: "What was the price of this item when I added it to my cart 3 weeks ago?" The answer is unknowable — the product table stores the current price, and the order table stores the final checkout price, but neither captures what the price was at cart-add time. Engineers attempt to reconstruct the answer from application logs — which rotate after 14 days. The compliance team requests a full audit of every price change for regulatory review; the engineering team discovers that UPDATE product SET price=... WHERE id=... leaves no trace. Every meaningful question about historical state requires either a forensic log reconstruction or manual audit tables that were never built.

Core System Idea

Event sourcing makes events the source of truth. Instead of overwriting a Product row's price column, the system appends immutable events to an event store: ProductListed{price: 99}, PriceReduced{new_price: 79, reason: 'flash_sale'}, PriceRestored{new_price: 99}. To determine the current price, replay the three events in order and apply each one to an empty product struct. Any historical question is answerable by replaying events up to a specific timestamp. The pattern has three components:

(1) Event store: an append-only, ordered log of events per aggregate ID. EventStoreDB is purpose-built for this; Kafka (with log compaction disabled) is often used as one; PostgreSQL with an events table and a SERIAL sequence column also works at smaller scale.

(2) Snapshots: periodic materializations of aggregate state, taken every N events (50–1,000 depending on event size and replay frequency). Replay starts from the latest snapshot plus the events recorded after it, not from the beginning. Without snapshots, replay cost grows linearly with event count: an aggregate with 100,000 events must read and apply 100,000 events per state reconstruction.

(3) Projections: background workers that consume the event log and build read models (Elasticsearch, Redis, PostgreSQL). Projections are derived views and can be rebuilt from scratch by replaying the full event log, which is also the mechanism for adding new features retroactively.
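The replay rule above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `Event` and `Product` classes, the integer `timestamp` field, and the `replay` helper are assumptions made for the example; only the three event names and their fields come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str        # event name, e.g. "ProductListed"
    data: dict       # event payload
    timestamp: int   # illustrative: seconds since an arbitrary epoch


@dataclass
class Product:
    price: Optional[int] = None  # None until the product is listed


def apply(state: Product, event: Event) -> Product:
    """Return the state that results from applying one event."""
    if event.kind == "ProductListed":
        return Product(price=event.data["price"])
    if event.kind in ("PriceReduced", "PriceRestored"):
        return Product(price=event.data["new_price"])
    return state  # unknown event types leave the state unchanged


def replay(events, until: Optional[int] = None) -> Product:
    """Fold events in order; stop at `until` to answer a historical query."""
    state = Product()
    for event in events:
        if until is not None and event.timestamp > until:
            break
        state = apply(state, event)
    return state


stream = [
    Event("ProductListed", {"price": 99}, timestamp=100),
    Event("PriceReduced", {"new_price": 79, "reason": "flash_sale"}, timestamp=200),
    Event("PriceRestored", {"new_price": 99}, timestamp=300),
]

current = replay(stream)                 # state after all three events
at_cart_add = replay(stream, until=250)  # state as of timestamp 250
```

Here `current.price` is 99 and `at_cart_add.price` is 79: the same stream answers both "what is the price now" and "what was the price then".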

System Flow

flowchart TD
    A["Client Command"] --> B["Command Service"]
    B -- "1. Load Snapshot" --> C[("Event Store")]
    B -- "2. Append Event" --> C
    C -- "3. Trigger Snapshot" --> D[("Snapshot Store")]
    C -- "4. Publish" --> E["Event Bus"]
    E --> F["Read Model Projector"]
    F --> G[("Read DB")]

The event store receives appended events, triggers periodic snapshots, and publishes events to projection workers that build queryable read models.
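As a rough sketch of this flow, the following in-memory Python model wires a command handler, an event store, and a projector together. All names (`EventStore`, `handle_set_price`, `price_projector`, the `PriceSet` event) are hypothetical, publishing is synchronous for brevity where a real event bus would be asynchronous, and the snapshot step is omitted.

```python
from collections import defaultdict


class EventStore:
    """Append-only streams keyed by aggregate ID, with naive publishing."""

    def __init__(self):
        self._streams = defaultdict(list)
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def load(self, aggregate_id):
        return list(self._streams[aggregate_id])

    def append(self, aggregate_id, event):
        self._streams[aggregate_id].append(event)
        # Stand-in for the event bus: notify projectors immediately.
        for handler in self._subscribers:
            handler(aggregate_id, event)


read_db = {}  # stand-in for the read database


def price_projector(aggregate_id, event):
    """Read model projector: keep only the latest price per product."""
    if event["type"] == "PriceSet":
        read_db[aggregate_id] = event["price"]


def handle_set_price(store, product_id, price):
    """Command service: load history, validate, then append an event."""
    history = store.load(product_id)
    if history and history[-1]["price"] == price:
        return False  # no state change, so no event is recorded
    store.append(product_id, {"type": "PriceSet", "price": price})
    return True


store = EventStore()
store.subscribe(price_projector)
handle_set_price(store, "sku-1", 99)
handle_set_price(store, "sku-1", 79)
accepted = handle_set_price(store, "sku-1", 79)  # duplicate command
```

Queries read `read_db` and never touch the event streams, which is the separation the diagram expresses.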

Real-World Examples (Indicative)

Jet.com pricing engine with EventStoreDB

Jet.com (acquired by Walmart for $3.3B in 2016) used EventStoreDB, a purpose-built event store created by Greg Young (who coined the term CQRS and popularized event sourcing), for their real-time pricing engine. Each product's pricing history was a separate event stream: ProductListed, PriceAdjusted, PromotionApplied, BundleDiscountApplied. Jet's pricing algorithm was cart-aware: the final price depended on other items in the cart, requiring fresh computation at checkout time. Snapshots were taken every 100 events; the average checkout replay read the snapshot plus about 8 subsequent price adjustments. EventStoreDB's persistent subscriptions pushed events to Elasticsearch (product catalog) and Redis (real-time pricing cache). The full event history made every customer dispute resolvable: the exact price at any moment in time was a deterministic replay query.

Uber trip processing with Kafka as event store

Uber uses Apache Kafka topics as a practical event store for trip processing. Events: TripStarted{driver_id, rider_id, origin}, LocationUpdated{lat, lng, timestamp}, TripEnded{destination, duration, fare}. Each driver is assigned to a fixed Kafka partition, so all events for that driver land in the same partition, preserving per-driver event ordering. In-progress trip state is cached in Redis (keyed by trip_id + last_offset) to avoid full event replay for ongoing trips. At trip completion, the Redis cache is evicted. Setting the topic-level retention.ms=-1 (indefinite retention) on the trips topic, with log compaction left off so that older events for a key are never discarded, enables historical replay for analytics and dispute resolution. The key difference from EventStoreDB: Kafka supports millions of events per second (streaming throughput) but does not natively enforce aggregate-level optimistic concurrency, so two writers can append to the same aggregate stream simultaneously without conflict detection.
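To make the concurrency point concrete, here is a minimal sketch of the expected-version check that a store with aggregate-level optimistic concurrency performs on append, and that a plain Kafka producer does not. The `InMemoryEventStore` and `ConcurrencyError` names are hypothetical; this is not the API of EventStoreDB or Kafka.

```python
class ConcurrencyError(Exception):
    """Raised when a writer's view of the stream is stale."""


class InMemoryEventStore:
    def __init__(self):
        self._streams = {}

    def version(self, stream_id):
        """Current version = number of events already in the stream."""
        return len(self._streams.get(stream_id, []))

    def append(self, stream_id, event, expected_version):
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"{stream_id}: expected v{expected_version}, found v{len(stream)}"
            )
        stream.append(event)
        return len(stream)


store = InMemoryEventStore()
store.append("trip-42", {"type": "TripStarted"}, expected_version=0)

# Two writers both read the stream at version 1, then both try to append.
seen_by_both = store.version("trip-42")
store.append("trip-42", {"type": "LocationUpdated"}, expected_version=seen_by_both)
try:
    store.append("trip-42", {"type": "TripEnded"}, expected_version=seen_by_both)
    conflict = False
except ConcurrencyError:
    conflict = True  # the second writer must reload and retry
```

The second writer is rejected instead of silently interleaving its event. On Kafka, the usual workaround is to route all writes for one aggregate through a single writer.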

ABN AMRO mortgage processing with Axon Server

ABN AMRO (Dutch bank) migrated mortgage application processing to event sourcing using Axon Server. Domain events: MortgageApplicationSubmitted, DocumentsVerified, AppraisalOrdered, AppraisalCompleted, UnderwritingApproved, MortgageDisbursed. Each application generates 12–20 events over its lifecycle (2–6 weeks). Regulatory requirement: 7-year retention of complete application history for audit. With event sourcing, audit compliance required zero additional code — the event log was the audit trail. ABN AMRO's compliance team queries historical state directly via event replay APIs, without requiring engineers to write custom audit queries. Snapshots every 20 events ensure state reconstruction in <5ms for the maximum event count per application.

Anti-Patterns

Modifying historical events

Running UPDATE or DELETE on the event log to fix a data bug. Mutable events corrupt the audit trail retroactively — every projection built from the event log is now inconsistent with the "corrected" history. Fix bugs by appending a compensating event (PriceCorrection, QuantityAdjustment) that explicitly records what was wrong and what the corrected value is.
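A small sketch of the compensating-event approach, assuming a hypothetical dictionary-based event shape and a `PriceCorrection` event that carries both the wrong and the corrected value (the field names are illustrative):

```python
def current_price(events):
    """Fold the log into the current price, honoring corrections."""
    price = None
    for event in events:
        if event["type"] == "PriceAdjusted":
            price = event["price_cents"]
        elif event["type"] == "PriceCorrection":
            price = event["corrected_price_cents"]
    return price


log = [
    {"type": "PriceAdjusted", "price_cents": 7900},
    {"type": "PriceAdjusted", "price_cents": 790},  # bug: a digit was dropped
]

before = current_price(log)

# Do not edit or delete the bad event; append a correction instead.
log.append({
    "type": "PriceCorrection",
    "wrong_price_cents": 790,
    "corrected_price_cents": 7900,
    "reason": "data entry error",
})

after = current_price(log)
```

The faulty event stays in the log, so the audit trail still shows that 790 was in effect for a while, and projections converge on the corrected value once they process the new event.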

Querying the event store directly for reads

Writing SQL joins or aggregations directly against the events table for reporting. The event store is append-optimized, not query-optimized. Direct queries scan raw events in sequence; projections pre-compute the aggregated view. Always read from projections.
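As an illustration of pre-computing the aggregated view, the sketch below maintains a small read model (current price and number of adjustments per product) as events arrive, so a report becomes a dictionary lookup rather than a scan of raw events. The `PriceStatsProjection` class and the event shapes are assumptions made for the example.

```python
from collections import Counter


class PriceStatsProjection:
    """Read model: current price and adjustment count per product."""

    def __init__(self):
        self.current_price = {}
        self.adjustment_count = Counter()

    def handle(self, product_id, event):
        if event["type"] == "ProductListed":
            self.current_price[product_id] = event["price"]
        elif event["type"] == "PriceAdjusted":
            self.current_price[product_id] = event["price"]
            self.adjustment_count[product_id] += 1


event_log = [
    ("sku-1", {"type": "ProductListed", "price": 99}),
    ("sku-2", {"type": "ProductListed", "price": 20}),
    ("sku-1", {"type": "PriceAdjusted", "price": 79}),
    ("sku-1", {"type": "PriceAdjusted", "price": 99}),
]

projection = PriceStatsProjection()
for product_id, event in event_log:
    projection.handle(product_id, event)

# Reporting reads the projection; it never touches event_log.
report = {
    pid: (projection.current_price[pid], projection.adjustment_count[pid])
    for pid in projection.current_price
}
```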

No snapshots on high-event-count aggregates

An order aggregate with 10,000 events requires 10,000 sequential reads and 10,000 event handler invocations to reconstruct current state. Without snapshots, this becomes the dominant cost for frequently accessed aggregates — taking 50–500ms per state load.
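The difference in work can be shown with a minimal sketch. It assumes a hypothetical order aggregate whose state is just a running item count, events stored as a contiguous list, and a snapshot shaped as `{"version": ..., "state": ...}`, where `version` is the number of events already folded into `state`.

```python
def apply(total_items, event):
    """Apply one ItemAdded event to the running item count."""
    return total_items + event["quantity"]


def load(events, snapshot=None):
    """Rebuild state; return (state, number_of_events_applied)."""
    if snapshot is None:
        state, version = 0, 0
    else:
        state, version = snapshot["state"], snapshot["version"]
    applied = 0
    for event in events[version:]:  # only events after the snapshot
        state = apply(state, event)
        applied += 1
    return state, applied


events = [{"type": "ItemAdded", "quantity": 1} for _ in range(10_000)]
snapshot = {"version": 9_950, "state": 9_950}  # taken after event 9,950

full_state, full_applied = load(events)
snap_state, snap_applied = load(events, snapshot)
```

Both loads produce the same state, but the snapshot path applies 50 events instead of 10,000.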

Ignoring event versioning during schema evolution

Renaming a field from amount to price_cents in the PriceAdjusted event struct breaks deserialization of all historical PriceAdjusted events that used the old field name. Every replay fails. Use upcasters: transformation functions registered for old event schema versions that convert them to the current schema during read time.
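A minimal upcaster sketch for this rename follows. It assumes events are stored as JSON with explicit `type` and `version` fields, and that the old `amount` value was already expressed in cents; the `UPCASTERS` registry and the `read_event` helper are hypothetical names, not part of any particular framework.

```python
import json


def upcast_price_adjusted_v1(event):
    """v1 -> v2: rename `amount` to `price_cents` (assumes amount was in cents)."""
    upgraded = dict(event)
    upgraded["price_cents"] = upgraded.pop("amount")
    upgraded["version"] = 2
    return upgraded


# Registry keyed by (event type, stored schema version).
UPCASTERS = {("PriceAdjusted", 1): upcast_price_adjusted_v1}


def read_event(raw):
    """Deserialize and upcast step by step until no upcaster applies."""
    event = json.loads(raw)
    event.setdefault("version", 1)  # unversioned events are treated as v1
    while (event["type"], event["version"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["version"])](event)
    return event


old_raw = '{"type": "PriceAdjusted", "version": 1, "amount": 7900}'
new_raw = '{"type": "PriceAdjusted", "version": 2, "price_cents": 7900}'

old_event = read_event(old_raw)
new_event = read_event(new_raw)
```

The stored bytes are never rewritten: the old event is converted in memory on every read, so event handlers only need to understand the current schema.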

Design Tradeoffs

| Dimension | Event Sourcing | State Sourcing (Traditional CRUD) |
| --- | --- | --- |
| Audit trail | Complete and automatic: every state change is an immutable event in the log | Requires explicitly built audit tables; often incomplete or added after the fact |
| Write pattern | Append-only: no in-place UPDATE/DELETE, so little lock contention; conflicts surface only as version checks on the same aggregate | Row locks and index updates on every mutation; contention increases with concurrent writers |
| Read latency | Requires replay from snapshot + N events; adds 1–50ms depending on snapshot frequency | Direct single-row lookup: <1ms regardless of entity history length |
| Schema evolution | Dangerous: old events must remain deserializable; requires upcasters for every field change | Safe: ALTER TABLE adds columns; existing rows remain valid without transformation |

Best Practices

  • Define events as past-tense domain facts, not technical operations: OrderShipped, not OrderUpdated; InventoryReserved, not InventoryUpdate. Events that name business occurrences are self-documenting and stable, whereas technical operation names change with the implementation.
  • Implement snapshotting before the average event count per aggregate exceeds 50. Define the threshold based on measured replay time: snapshot when replay exceeds 10ms. Store snapshots in a separate table or store keyed by aggregate_id + version so stale snapshots are never accidentally served.
  • Version every event type from day one: PriceAdjusted_v1, PriceAdjusted_v2. When the schema changes, write an upcaster for the old version. Never rename fields in place; always add a new version. Event versioning is backward compatibility for data, not just for APIs.
  • Ensure all projection workers are idempotent: processing the same event twice produces the same read model state. Track the last processed event sequence number per projection. If an incoming event's sequence number is less than or equal to the stored last-processed number, skip it without error.
  • Test full event replay of your entire event store quarterly. A replay that fails due to schema drift, missing upcasters, or corrupted events is a data integrity incident, not a schema migration.
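The idempotency practice in this list can be sketched as a projection that remembers the last sequence number it applied. The `ShippedOrdersProjection` name and the event shape are hypothetical. In a real system the read model update and the stored sequence number would need to be written atomically (for example in one transaction), which this in-memory version does not show.

```python
class ShippedOrdersProjection:
    """Counts shipped orders; safe against redelivered events."""

    def __init__(self):
        self.last_seq = 0   # highest sequence number already applied
        self.shipped = 0

    def handle(self, seq, event):
        if seq <= self.last_seq:
            return False    # duplicate or stale delivery: skip silently
        if event["type"] == "OrderShipped":
            self.shipped += 1
        self.last_seq = seq
        return True


projection = ShippedOrdersProjection()
deliveries = [
    (1, {"type": "OrderPlaced"}),
    (2, {"type": "OrderShipped"}),
    (2, {"type": "OrderShipped"}),  # redelivered after a consumer restart
    (3, {"type": "OrderShipped"}),
]
results = [projection.handle(seq, event) for seq, event in deliveries]
```

The redelivered event with sequence number 2 is skipped, so `projection.shipped` ends at 2 rather than 3.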

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| A legally compliant, complete audit trail is a hard business requirement: finance, healthcare, regulated industries | The application is a simple CRUD system with no historical tracking requirements and no compliance mandates |
| Temporal queries are required: reconstruct system state at any point in the past for debugging, dispute resolution, or analytics | Ultra-low-latency read requirements where the replay overhead of even snapshot + N events is unacceptable |
| The domain model is rich and event-driven: order lifecycles, financial transactions, state machines with many transitions | Team is unfamiliar with asynchronous patterns, eventual consistency, and schema evolution, so the operational complexity will dominate |