← System Design Workflow Engineering
System Design

Event-Driven Architecture

Events decouple who does something from who needs to know about it — the producer never waits for consumers, and adding a new consumer requires zero changes to the producer.

TL;DR
  • Events decouple who does something from who needs to know about it — the producer never waits for consumers, and adding a new consumer requires zero changes to the producer.
  • Event schema breaking changes are silent production failures: consumers crash or silently corrupt data until someone notices. Use a schema registry (Confluent, AWS Glue) and enforce compatibility at publish time.
  • Strict ordering across partitions is impossible without a single partition — and a single partition is a throughput ceiling. Design consumers to be order-tolerant where possible.
  • At-least-once delivery is the default for Kafka, SQS, and most brokers. Your consumers must be idempotent — receiving the same event twice must produce the same result.
  • Consumer lag is your primary operational signal: growing lag means consumers can't keep up, and unbounded lag means eventual data loss when broker retention expires.

The Problem

A payment service calls the notification service synchronously after processing a transaction. The notification service is slow during a marketing campaign — every payment now takes 800ms instead of 50ms. Then the notification service goes down, taking payments down with it. These two services have no business being coupled: a payment completing is a fact; who gets told about it and when is a separate concern. Synchronous coupling transforms one service's availability into another's dependency — and at scale, every tight coupling is a future incident.

Core System Idea

Event-Driven Architecture (EDA) replaces synchronous service-to-service calls with immutable events published to a durable broker. A producer emits a fact (OrderPlaced, PaymentProcessed) to a named topic in Kafka, SQS, or Pub/Sub — it does not know who consumes it or when. Consumers subscribe independently, processing events at their own pace. The broker provides durability (events survive consumer restarts), fan-out (one event reaches N consumers), and replay (consumers can reprocess historical events). Key operational properties: producers are never blocked by consumer slowness; consumers fail without affecting producers; new consumers can be added without producer changes; events can be replayed to rebuild read models or recover from bugs. Kafka retains events for days or weeks; SQS retains for up to 14 days.

System Flow

flowchart TD A["Producer Service"] --> B["Event Broker (Kafka)"] B --> C["Consumer A"] B --> D["Consumer B"] B --> E["Consumer C"] C --> F["Dead Letter Queue"]

Producer publishes once; broker fans out to all subscribers independently. Failed consumer messages route to DLQ.

Real-World Examples Indicative

Uber's ride platform

Every state transition in a trip (RideRequested, DriverAccepted, PickupConfirmed, TripCompleted) is published to Kafka. The dispatch service, payment service, ETA service, and surge pricing service each consume independently. When Uber processes 15M+ trips per day, this fan-out architecture means adding a new feature (e.g., a carbon offset service) requires zero changes to the trip service — just a new consumer subscribing to TripCompleted.

LinkedIn's activity feed

Every profile view, connection request, and post like is an event. The feed aggregation service, notification service, and analytics service all consume the same event streams independently. LinkedIn processes over 5 trillion events per day through Kafka — a volume that would be impossible with synchronous RPC between services. Consumer lag on the feed aggregation topic is a primary SLO metric; sustained lag above 30 seconds triggers paging.

Shopify's order pipeline

An OrderPlaced event fans out to inventory reservation, fraud detection, payment authorization, fulfillment, and customer notification consumers. Each runs independently — if fraud detection is slow, it doesn't block payment or fulfillment. Events are stored in Kafka with 7-day retention, enabling replay when a downstream service has a bug that corrupts data: fix the bug, reset the consumer offset, and reprocess.

Anti-Patterns

Events as RPC with reply-to

Expecting a consumer to publish a response event that the producer then waits for. This is synchronous coupling with extra steps — you get all the latency and failure coupling of RPC plus the complexity of async messaging.

Breaking schema changes without versioning

Adding a required field to an event schema and deploying producers before consumers are updated. Consumers crash or silently ignore new events until updated. Use Confluent Schema Registry with BACKWARD compatibility mode — new schemas must be readable by old consumers.

Relying on global event ordering

Kafka guarantees order within a partition, not across partitions. Routing the same entity's events to different partitions breaks ordering. Design consumers to handle out-of-order events using event timestamps, not arrival order.

Fat events with full object snapshots

Embedding the entire order object in every OrderUpdated event. When the order schema changes, all consumers need updates. Prefer thin events (entity ID + changed fields) or at least version the payload explicitly.

No consumer lag monitoring

A consumer that falls 10 minutes behind is indistinguishable from a healthy consumer unless you're tracking lag. When Kafka retention expires and the consumer hasn't caught up, events are permanently lost.

Design Tradeoffs

DimensionFat EventsThin Events
PayloadFull object snapshotEntity ID + delta only
Consumer queries neededNone (self-contained)Yes (fetch current state)
Schema couplingHigh (producer schema embedded)Low (stable ID contract)
Best forRead-heavy, many consumersWrite-heavy, schema-volatile producers

Best Practices

Use a schema registry (Confluent, AWS Glue, Apicurio) and enforce BACKWARD or FULL compatibility on every schema change. Reject breaking changes at CI time, not at consumer runtime.
Design every consumer as idempotent from day one. Include an event ID in every payload and check for duplicates before processing. At-least-once delivery is a guarantee, not an edge case.
Monitor consumer group lag per topic-partition. Alert when lag exceeds your processing SLO (e.g., >5 minutes for a real-time feed). Unbounded lag eventually crosses broker retention and causes data loss.
Configure dead letter queues for every consumer. Events that fail after N retries must go somewhere inspectable — silent drops are the worst outcome in event-driven systems.
Partition by entity ID (user ID, order ID) to preserve per-entity ordering while enabling parallel processing across entities. Never partition randomly if ordering matters for any consumer.
Include event metadata in every message: event type, schema version, producer service, timestamp, and correlation ID. This is your distributed trace for asynchronous flows.

When to Use / Avoid

Use WhenAvoid When
1 event must reach N independent consumersImmediate synchronous response is required
Services must scale and deploy independentlyStrong cross-service transactional consistency is needed
Producer availability must not depend on consumer availabilityTeam lacks experience debugging distributed, async flows
Event replay is needed for recovery or new feature backfillSimple point-to-point call between 2 services suffices