Kafka Architecture

Scale horizontally by partitioning topics, allowing concurrent writes and parallel consumer execution.

TL;DR
  • Scale horizontally by partitioning topics, allowing concurrent writes and parallel consumer execution.
  • Prevent rebalance storms during rolling deployments by configuring static group membership and tuning timeouts.
  • Balance durability and latency by selecting appropriate producer acknowledgment levels (acks=all vs acks=1).
  • Manage disk space efficiently by choosing between time-based retention and log compaction for state recovery.

The Problem

Traditional message queues struggle to scale when throughput reaches hundreds of thousands of events per second, as they require complex state tracking per message (acknowledgments, locks, and queues). When scaling consumer groups, traditional brokers often suffer from high coordination overhead.

Additionally, during rolling deployments, frequent consumer joins and leaves trigger constant partition reassignments (rebalance storms), halting message consumption and causing massive latency spikes across the entire data pipeline.

Core System Idea

Kafka structures each topic as an append-only, distributed commit log partitioned across a cluster of brokers. Each partition is an ordered, immutable sequence of records. Instead of the broker tracking which messages each consumer has read, consumers track their own position using an "offset." This design shifts the state-tracking burden from the broker to the clients: writes are sequential appends and reads are sequential scans from a given offset, so the cost per record stays effectively constant (O(1)) regardless of how much data the log already holds.
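The offset model can be illustrated with a toy in-memory log. This is a minimal sketch, not Kafka's storage engine: `PartitionLog` and `OffsetTrackingConsumer` are hypothetical names, and a Python list stands in for on-disk segments.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of records."""
    records: list = field(default_factory=list)

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        self.records.append(value)
        return len(self.records) - 1

    def read(self, offset, max_records=100):
        """Return up to max_records starting at offset; the log keeps no reader state."""
        return self.records[offset:offset + max_records]


@dataclass
class OffsetTrackingConsumer:
    """Each consumer remembers its own position instead of the log doing so."""
    log: PartitionLog
    position: int = 0

    def poll(self, max_records=100):
        """Fetch the next batch and advance this consumer's own offset."""
        batch = self.log.read(self.position, max_records)
        self.position += len(batch)
        return batch
```

Two consumers reading the same `PartitionLog` advance independently, because the position lives in each consumer object rather than in the log.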

To scale consumption, Kafka groups consumers into "Consumer Groups." Each partition in a topic is assigned to exactly one consumer instance within a group, guaranteeing ordered processing per partition.
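The one-consumer-per-partition rule can be sketched as a simplified round-robin assignment. Kafka ships several assignment strategies of its own; `assign_round_robin` below is a hypothetical helper for illustration, not any of them.

```python
def assign_round_robin(num_partitions, consumer_ids):
    """Give every partition to exactly one consumer in the group.

    Returns a dict mapping consumer id -> list of partition numbers.
    """
    members = sorted(consumer_ids)
    if not members:
        raise ValueError("a consumer group needs at least one member")
    assignment = {member: [] for member in members}
    for partition in range(num_partitions):
        # Each partition has a single owner, which preserves per-partition order.
        owner = members[partition % len(members)]
        assignment[owner].append(partition)
    return assignment
```

With more consumers than partitions, the extra consumers receive an empty list. In this model, parallelism within a group is capped by the partition count.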

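Durability is managed via partition replication: a single broker acts as the partition leader and handles all writes and, by default, all reads, while follower brokers continuously replicate the leader's log. The producer's acknowledgment configuration determines whether the leader waits for that replication before confirming a write (acks=all) or confirms as soon as its own copy is written (acks=1).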
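The acknowledgment levels can be summarized in a small decision model. This is a sketch of the semantics only, assuming `in_sync_replicas` counts the leader itself; the function names are hypothetical and no broker is involved.

```python
def replicas_required_for_ack(acks, in_sync_replicas):
    """How many replicas must hold the record before the producer is acknowledged."""
    if acks == "0":
        return 0  # fire-and-forget: the producer does not wait at all
    if acks == "1":
        return 1  # only the leader's own copy
    if acks == "all":
        return in_sync_replicas  # every replica currently in the ISR
    raise ValueError(f"unknown acks value: {acks!r}")


def leader_accepts_write(acks, in_sync_replicas, min_insync_replicas):
    """With acks=all the write is rejected when the ISR is smaller than the minimum."""
    if acks == "all":
        return in_sync_replicas >= min_insync_replicas
    return True  # acks=0 and acks=1 do not consult min.insync.replicas
```

For example, with `min.insync.replicas=2` and only the leader left in the ISR, an `acks=all` write is refused rather than accepted with a single copy.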

System Flow

flowchart TD
    A[Producer] -->|"Write: Key=User123"| B["Partition 0 Leader"]
    A -->|"Write: Key=User456"| C["Partition 1 Leader"]
    B -->|"Sync Replication"| D["Partition 0 Follower"]
    C -->|"Sync Replication"| E["Partition 1 Follower"]
    B -->|"Pull Offset 42"| F["Consumer Group: Instance 1"]
    C -->|"Pull Offset 109"| G["Consumer Group: Instance 2"]

Partition-based message distribution, replication, and consumer group offset tracking.
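The key-to-partition routing in the diagram can be sketched as a hash of the key modulo the partition count. Kafka's Java client uses a murmur2 hash of the serialized key; the sketch below substitutes `zlib.crc32` purely for illustration, so the partition numbers it produces will not match a real cluster.

```python
import zlib


def partition_for_key(key, num_partitions):
    """Map a record key to a partition so equal keys always land together."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # crc32 is a stand-in hash (an assumption of this sketch), not Kafka's murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```

Because the result depends on `num_partitions`, adding partitions to an existing topic changes where a key maps, which is why per-key ordering holds only while the partition count stays fixed.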

Real-World Examples (Indicative)

LinkedIn activity_tracking at 7M events/sec

LinkedIn's activity_tracking topic is split across 1,400 partitions sustaining 7M+ events/sec from member activity (profile views, post likes, connection requests). Each partition is sized to ~10 MB/sec write throughput. Producers run acks=all with min.insync.replicas=2. During a 2014 broker failure, no activity events were lost: with acks=all the leader acknowledges a write only after the in-sync replicas have it, so every acknowledged event was already on the follower that took over as leader. This pattern now underpins 4.5T+ messages/day across LinkedIn's full Kafka cluster.

Uber city_id partitioning with static group membership

Uber partitions their marketplace_events topic by city_id, co-locating all ride requests, driver locations, and surge pricing events for a single city in one partition. This enables stateful Flink stream processing without cross-partition joins. Consumer pods are deployed with group.instance.id set to each pod's Kubernetes hostname. During a rolling restart of 300+ consumer pods, static membership lets each pod rejoin under its previous identity, so the group coordinator does not trigger a rebalance as long as the pod returns within session.timeout.ms. This avoids a 5-minute consumption pause that would have cascaded into stale surge pricing calculations across active cities.
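A static-membership consumer configuration might look like the sketch below. The dictionary keys are standard Kafka consumer property names; the helper name, the `HOSTNAME` lookup, the fallback id, and the 60-second session timeout are assumptions for illustration rather than values from Uber's deployment.

```python
import os


def static_member_config(group_id, bootstrap_servers, instance_id=None,
                         session_timeout_ms=60_000):
    """Build a consumer config dict that opts in to static group membership."""
    # Assumption: the pod hostname is stable across restarts (e.g. a StatefulSet pod).
    instance_id = instance_id or os.environ.get("HOSTNAME", "local-consumer")
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        # A stable id lets the coordinator recognize a restarted member.
        "group.instance.id": instance_id,
        # The restart must finish within this window to avoid a rebalance.
        "session.timeout.ms": session_timeout_ms,
    }
```

The session timeout should comfortably exceed the time a pod takes to restart; otherwise the coordinator treats the member as gone and rebalances anyway.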

Netflix log-compacted user-profile-updates

Netflix applies log compaction to the user-profile-updates topic, which carries the latest streaming preferences, language settings, and parental controls for 200M+ subscribers. Because compaction retains only the most recent record per user_id key, the full subscriber state fits in ~2 TB of compressed storage instead of the ~40 TB that 7-day time-based retention would require. New consumer service instances bootstrap their local RocksDB state store by replaying the compacted topic from offset 0, consuming only one record per user rather than weeks of update history.
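The effect of compaction on replay can be sketched as keeping the highest-offset record per key. This simplified model ignores tombstones (null-value deletes) and the not-yet-compacted tail of a real log; `compact` and `bootstrap_state` are hypothetical helper names.

```python
def compact(records):
    """Keep only the latest record per key.

    records: iterable of (offset, key, value) tuples in offset order.
    Returns the surviving records, still ordered by offset.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    return sorted(latest.values(), key=lambda record: record[0])


def bootstrap_state(records):
    """Rebuild a key -> value store by replaying a compacted log from offset 0."""
    return {key: value for _, key, value in compact(records)}
```

Replaying three updates where `u1` appears twice yields two surviving records, so a new instance reads one record per user instead of the full update history.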

Anti-Patterns

Using a Single Partition for High Throughput

Creating topics with only one partition limits consumption to a single consumer thread, creating a hard bottleneck that cannot be scaled out.

Over-Partitioning the Cluster

Running many thousands of partitions per broker increases metadata overhead, slows down leader election during broker failures, and increases end-to-end latency.

Configuring `acks=1` for Critical Financial Data

Setting acks=1 means the producer considers a write successful once the leader has appended the record to its own log, without waiting for followers. If the leader crashes before the followers have replicated the record, the acknowledged write is lost.

Ignoring Consumer Poll Timeouts

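If a consumer takes longer than max.poll.interval.ms to process a batch before calling poll() again, it is considered failed and removed from the group, which triggers a rebalance. When the reassigned partitions contain the same slow batch, the next owner is evicted as well, and the repeated evictions escalate into a rebalance storm.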
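One way to stay inside the poll interval is to cap the batch size from the worst-case per-record processing time. The sketch below is a back-of-the-envelope calculation; the helper name and the 50% safety factor are assumptions, while `max.poll.records` and `max.poll.interval.ms` are the real consumer settings it relates.

```python
def max_safe_poll_records(max_poll_interval_ms, worst_case_ms_per_record,
                          safety_factor=0.5):
    """Largest max.poll.records that keeps worst-case batch time under the limit.

    safety_factor reserves headroom for GC pauses, retries, and slow downstream calls.
    """
    if worst_case_ms_per_record <= 0:
        raise ValueError("worst_case_ms_per_record must be positive")
    budget_ms = max_poll_interval_ms * safety_factor
    # Never return less than 1: a consumer must be able to fetch something.
    return max(1, int(budget_ms // worst_case_ms_per_record))
```

With a 300,000 ms interval and records that can take 200 ms each, the sketch suggests at most 750 records per poll. The alternative is to raise `max.poll.interval.ms` to fit the batch size already in use.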

Design Tradeoffs

| Dimension | Log Compaction | Time-Based Retention |
| --- | --- | --- |
| Disk efficiency | Retains only the latest value per key; total disk usage bounded by the size of current state, not message history | Disk usage grows linearly with ingestion rate until the retention window expires; suitable for event streams with no key-based state |
| Broker I/O overhead | High; background cleaner threads continuously merge and rewrite log segments, consuming significant CPU and disk I/O | Low; the broker deletes aged segments with a single O(1) file system operation, no read-modify-write cycle required |
| State recovery | Ideal for rebuilding application state (e.g., user preferences, account balances) by replaying only the latest record per key | Not suitable for state recovery; older intermediate states are permanently purged when segments expire |
| Recommended acks | acks=all with min.insync.replicas=2; state updates must survive a leader crash without data loss | acks=1 acceptable for high-throughput event streams (e.g., clickstreams, logs) where occasional loss is tolerable |
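The disk-efficiency row can be turned into a rough capacity estimate. Both functions are simplified sketches with hypothetical names: they ignore compression, index files, and the uncompacted portion of a compacted log, and the default replication factor of 3 is an assumption.

```python
def time_retention_bytes(ingest_bytes_per_sec, retention_seconds,
                         replication_factor=3):
    """Disk used under time-based retention: grows with ingest rate and window."""
    return ingest_bytes_per_sec * retention_seconds * replication_factor


def compacted_bytes(distinct_keys, avg_record_bytes, replication_factor=3):
    """Approximate steady-state disk for a compacted topic: one record per key."""
    return distinct_keys * avg_record_bytes * replication_factor
```

The first estimate scales with traffic, the second with the number of distinct keys, which is why a compacted topic can stay small even when its keys are updated frequently.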

Best Practices

Implement Static Group Membership

Use group.instance.id for consumers in containerized environments to prevent rebalances during rolling deployments when pods restart with the same identity.

Size Partitions Based on Throughput

Aim for a target throughput of 10 MB/sec for writes and 20 MB/sec for reads per partition, and keep partition sizes under 50 GB.

Tune `max.poll.interval.ms`

Set this configuration high enough to accommodate the worst-case processing time of a single batch of messages, preventing accidental consumer evictions.

Monitor Consumer Lag

Alert on consumer lag (the difference between the latest log offset and the consumer's committed offset) in terms of time, not just message count, to detect slow processing.

Configure `min.insync.replicas`

Pair acks=all with a min.insync.replicas configuration of at least 2 to ensure writes are rejected if the cluster cannot guarantee replication durability.
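Two of these practices reduce to small calculations, sketched below under stated assumptions. `partitions_needed` applies the 10 MB/sec write and 20 MB/sec read targets quoted above, and `time_lag_seconds` assumes the timestamps of the newest record and of the record at the committed offset are both available; both function names are hypothetical.

```python
import math


def partitions_needed(write_mb_per_sec, read_mb_per_sec,
                      write_target_mb=10.0, read_target_mb=20.0):
    """Partition count that keeps each partition within both throughput targets."""
    by_writes = math.ceil(write_mb_per_sec / write_target_mb)
    by_reads = math.ceil(read_mb_per_sec / read_target_mb)
    return max(1, by_writes, by_reads)


def time_lag_seconds(latest_record_ts_ms, committed_record_ts_ms):
    """Consumer lag expressed as time: how far behind the newest record we are."""
    return max(0, latest_record_ts_ms - committed_record_ts_ms) / 1000.0
```

For a topic receiving 120 MB/sec of writes and serving 300 MB/sec of reads, the read side dominates and the sketch returns 15 partitions. A time-based lag of 60 seconds means the same thing on a quiet topic as on a busy one, which a raw message count does not.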

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Processing high-throughput, real-time event streams (e.g., clickstreams, telemetry, log aggregation). | You need a lightweight, simple message broker with minimal operational overhead. |
| Building event-sourced architectures where replayability of historical data is required. | You require fine-grained, message-level routing and individual message acknowledgments. |
| Implementing distributed stream processing pipelines with stateful transformations. | The system requires immediate, strict global ordering across all messages regardless of key. |