← System Design Data & Messaging Systems
System Design

Stream Processing Architecture

Maintain local, high-performance state using embedded key-value stores to avoid database round-trip bottlenecks.

TL;DR
  • Maintain local, high-performance state using embedded key-value stores to avoid database round-trip bottlenecks.
  • Handle out-of-order and late-arriving events deterministically using event-time watermarks.
  • Match windowing strategies (tumbling, sliding, session) to the temporal patterns of your business metrics.
  • Achieve exactly-once processing semantics by combining transactional state checkpoints with idempotent sinks.

The Problem

Batch processing systems introduce hours of latency, making them useless for real-time fraud detection, system monitoring, or dynamic pricing. However, processing continuous data streams in real time introduces massive challenges: network delays cause events to arrive out of order, network partitions cause duplicate deliveries, and querying external databases for state enrichment chokes throughput.

Without a structured way to handle time and state, real-time dashboards display inaccurate, non-deterministic metrics that drift every time a node restarts.

Core System Idea

Modern stream processing architectures decouple processing logic from physical arrival time by using "event-time" (the timestamp embedded in the event itself) rather than "processing-time" (the clock time of the machine running the code).

To handle late-arriving data, the system emits "watermarks"—monotonically increasing timestamps that represent an assertion that no further records with timestamps prior to the watermark will arrive.

To perform stateful operations (like aggregations or joins) without hitting external databases, processing nodes maintain local state in high-performance, embedded key-value stores (e.g., RocksDB).

This state is periodically and asynchronously checkpointed to durable object storage using distributed snapshot algorithms (like Chandy-Lamport). If a node crashes, it recovers its local state from the latest checkpoint and replays the input stream from the corresponding offset.

System Flow

flowchart TD A["Event Stream"] -->|"Ingest with Event-Time"| B["Watermark Generator"] B -->|"Emit Watermark t=10"| C["Window Operator"] C -->|"Aggregate Local State"| D["RocksDB State Store"] D -->|"Async Checkpoint"| E["Durable Object Storage"] C -->|"Trigger Window Close"| F["Idempotent Sink"]

Stateful stream processing pipeline managing event-time watermarks, local state storage, checkpointing, and output sinks.

Real-World Examples Indicative

Alibaba Flink at Double 11 — 4.5T events/day

Alibaba's Flink deployment processes 4.5T+ events/day during Double 11 (Singles Day), sustaining 1B+ events in the first few minutes of the sale window. Flink's event-time watermarks with a 5-second allowedLateness buffer handle mobile clients that transmit click events in burst batches from poor cellular connections. End-to-end latency from event emission to real-time GMV dashboard stays under 3 seconds. Checkpoints are written to HDFS every 60 seconds; during a 2019 node failure mid-event, the pipeline recovered from the last checkpoint in under 90 seconds without reprocessing more than a 60-second event window.

Uber surge pricing with 30-second tumbling windows

Uber's surge pricing pipeline uses Flink with 30-second tumbling windows partitioned by city_id. Each window operator maintains a local RocksDB state store keyed by city, tracking ride request counts versus available driver positions. By co-partitioning the ride_requests and driver_positions Kafka topics by city_id, Flink avoids cross-task joins entirely — each task manager holds one city's complete state locally. This design sustains 500K+ events/sec with <200ms window close-to-pricing-update latency, directly feeding the surge multiplier shown to riders at the booking screen.

Twitter Heron at 2022 World Cup — 40M tweets/hr

Twitter ran Heron during the 2022 World Cup, handling 40M+ tweets/hr during peak match moments. Heron's process-topology model isolates container failures to individual spouts and bolts without cascading to other stages. During Argentina vs France (the most-tweeted match), Heron's backpressure mechanism automatically throttled upstream Kafka consumer rates when downstream sentiment scoring containers saturated CPU, maintaining consistent processing without dropping events — even as tweet volume spiked 8× in the 30 seconds following the final penalty kick.

Anti-Patterns

Querying External Databases Inside Stream Operators

Making synchronous HTTP or RPC calls to enrich events inside a high-throughput stream operator blocks the processing thread and destroys throughput.

Using Processing-Time for Financial Auditing

Relying on the processing machine's local clock leads to non-deterministic results when replaying historical data or during network delays.

Unbounded State Accumulation

Creating session windows or aggregations without configuring state Time-To-Live (TTL) or allowed lateness causes the local RocksDB instance to run out of disk space.

Ignoring Watermark Skew

Allowing a single idle partition to hold back the global watermark halts window evaluation across the entire pipeline, causing massive latency.

Design Tradeoffs

DimensionStateful Event-Time Processing (Flink)Stateless Processing-Time Processing (Lambda)
ThroughputHigh; local RocksDB avoids external DB round-trips, sustaining millions of events/sec per task managerLimited by external database queries; each event may require a network round-trip for state lookup
ConsistencyDeterministic; event-time watermarks guarantee the same output when replaying historical dataNon-deterministic; results vary with network latency, system load, and execution timing
Operational complexityHigh; requires cluster management, state migration planning, and checkpoint storage configurationLow; no state to manage or recover; scales automatically via FaaS runtime with no cluster to operate
Use case fitWindowed aggregations, stateful joins, fraud detection requiring historical context across eventsSimple event routing, transformation, or filtering with no cross-event state dependency

Best Practices

Use Idempotent SinksTo achieve true end-to-end exactly-once semantics, ensure your output sink supports idempotent writes (e.g., upserts into a database) or participates in a two-phase commit protocol.
Configure State TTLAlways set a reasonable Time-To-Live on stateful operations to automatically purge expired keys and prevent disk exhaustion.
Handle Late Data with Side OutputsInstead of setting an infinitely long watermark delay, route extremely late-arriving events to a dedicated "side output" stream for manual intervention or separate processing.
Tune RocksDB Memory AllocationAllocate sufficient off-heap memory for RocksDB block caches to prevent expensive reads from local SSDs during state lookups.
Implement Idle Source DetectionConfigure your watermark generator to mark inactive input partitions as "idle" so they do not block the global watermark from advancing.

When to Use / Avoid

Use WhenAvoid When
Building real-time alerting systems, fraud detection engines, or live analytics dashboards.Processing batch jobs where latency is not a factor and resource efficiency is the primary goal.
Joining multiple high-volume event streams within specific time windows.Performing complex, ad-hoc exploratory queries across historical terabytes of data.
Calculating rolling metrics (e.g., 5-minute moving averages) continuously.The business logic requires synchronous, immediate request-response validation.