Kafka Architecture
Scale horizontally by partitioning topics, allowing concurrent writes and parallel consumer execution.
- Scale horizontally by partitioning topics, allowing concurrent writes and parallel consumer execution.
- Prevent rebalance storms during rolling deployments by configuring static group membership and tuning timeouts.
- Balance durability and latency by selecting appropriate producer acknowledgment levels (
acks=allvsacks=1). - Manage disk space efficiently by choosing between time-based retention and log compaction for state recovery.
The Problem
Traditional message queues struggle to scale when throughput reaches hundreds of thousands of events per second, as they require complex state tracking per message (acknowledgments, locks, and queues). When scaling consumer groups, traditional brokers often suffer from high coordination overhead.
Additionally, during rolling deployments, frequent consumer joins and leaves trigger constant partition reassignments (rebalance storms), halting message consumption and causing massive latency spikes across the entire data pipeline.
Core System Idea
Kafka structures topics as append-only, distributed commit logs partitioned across a cluster of brokers. Each partition is an ordered, immutable sequence of records. Instead of the broker tracking which messages have been read by which consumers, consumers track their own position using an "offset." This design shifts the state-tracking burden from the broker to the clients, enabling O(1) disk reads and writes.
To scale consumption, Kafka groups consumers into "Consumer Groups." Each partition in a topic is assigned to exactly one consumer instance within a group, guaranteeing ordered processing per partition.
Durability is managed via partition replication: a single broker acts as the partition leader handling all reads and writes, while follower brokers replicate the log asynchronously or synchronously based on the producer's acknowledgment configuration.
System Flow
Partition-based message distribution, replication, and consumer group offset tracking.
Real-World Examples Indicative
LinkedIn's activity_tracking topic is split across 1,400 partitions sustaining 7M+ events/sec from member activity (profile views, post likes, connection requests). Each partition is sized to ~10 MB/sec write throughput. Producers run acks=all with min.insync.replicas=2 — during a 2014 broker failure, no activity events were lost because the ISR failover completed before the leader returned an acknowledgment to the producer. This pattern now underpins 4.5T+ messages/day across LinkedIn's full Kafka cluster.
Uber partitions their marketplace_events topic by city_id, co-locating all ride requests, driver locations, and surge pricing events for a single city in one partition. This enables stateful Flink stream processing without cross-partition joins. Consumer pods are deployed with group.instance.id set to each pod's Kubernetes hostname. During a rolling restart of 300+ consumer pods, static membership prevents the group coordinator from triggering eager rebalances — avoiding a 5-minute consumption pause that would have cascaded into stale surge pricing calculations across active cities.
Netflix applies log compaction to the user-profile-updates topic, which carries the latest streaming preferences, language settings, and parental controls for 200M+ subscribers. Because compaction retains only the most recent record per user_id key, the full subscriber state fits in ~2 TB of compressed storage instead of the ~40 TB that 7-day time-based retention would require. New consumer service instances bootstrap their local RocksDB state store by replaying the compacted topic from offset 0, consuming only one record per user rather than weeks of update history.
Anti-Patterns
Creating topics with only one partition limits consumption to a single consumer thread, creating a hard bottleneck that cannot be scaled out.
Having hundreds of thousands of partitions per broker increases metadata overhead, slows down leader election during broker failures, and increases end-to-end latency.
Setting acknowledgments to 1 means the producer considers a write successful once the leader writes to local disk, risking data loss if the leader crashes before replicating to followers.
If a consumer takes longer to process a batch of messages than max.poll.interval.ms, the coordinator assumes the consumer is dead, kicks it out of the group, and triggers a rebalance storm.
Design Tradeoffs
| Dimension | Log Compaction | Time-Based Retention |
|---|---|---|
| Disk efficiency | Retains only the latest value per key; total disk usage bounded by the size of current state, not message history | Disk usage grows linearly with ingestion rate until the retention window expires; suitable for event streams with no key-based state |
| Broker I/O overhead | High; background cleaner threads continuously merge and rewrite log segments, consuming significant CPU and disk I/O | Low; the broker deletes aged segments with a single O(1) file system operation, no read-modify-write cycle required |
| State recovery | Ideal for rebuilding application state (e.g., user preferences, account balances) by replaying only the latest record per key | Not suitable for state recovery; older intermediate states are permanently purged when segments expire |
| Recommended acks | acks=all with min.insync.replicas=2; state updates must survive a leader crash without data loss | acks=1 acceptable for high-throughput event streams (e.g., clickstreams, logs) where occasional loss is tolerable |
Best Practices
group.instance.id for consumers in containerized environments to prevent rebalances during rolling deployments when pods restart with the same identity.acks=all with a min.insync.replicas configuration of at least 2 to ensure writes are rejected if the cluster cannot guarantee replication durability.When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Processing high-throughput, real-time event streams (e.g., clickstreams, telemetry, log aggregation). | You need a lightweight, simple message broker with minimal operational overhead. |
| Building event-sourced architectures where replayability of historical data is required. | You require fine-grained, message-level routing and individual message acknowledgments. |
| Implementing distributed stream processing pipelines with stateful transformations. | The system requires immediate, strict global ordering across all messages regardless of key. |