Log Aggregation Architecture
Log aggregation pipelines collect, parse, buffer, index, and store logs from distributed application nodes.
- Log aggregation pipelines collect, parse, buffer, index, and store logs from distributed application nodes.
- A persistent message broker (e.g., Kafka) is required to act as a buffer and prevent backpressure from crashing upstream applications.
- Parsing logs at ingestion optimizes search performance but increases write latency and ingestion pipeline complexity.
- Tiered storage (hot, warm, cold) is critical to control the massive financial costs of long-term log retention.
The Problem
During a major production outage, application nodes typically experience a massive spike in error rates, causing them to generate logs at 10x to 100x their normal volume. If the log aggregation pipeline is built as a direct, synchronous path from the application to the indexing engine, this sudden surge will overwhelm the indexer. The indexing cluster will slow down, experience out-of-memory errors, or crash entirely. Consequently, at the exact moment engineers desperately need logs to debug the outage, the logging system is completely unavailable, leaving them blind.
Core System Idea
A resilient log aggregation architecture decouples log generation from log indexing using an asynchronous, buffered pipeline.
The architecture consists of four main tiers: Shippers, Brokers, Parsers, and Indexers.
Lightweight log shippers (e.g., Fluent Bit, Vector) run on the application hosts, reading logs from stdout or local files with minimal CPU overhead. These shippers immediately forward the logs to a highly available, distributed message broker (e.g., Apache Kafka or Pulsar). The broker acts as a shock absorber, buffering massive spikes in log volume without passing the load downstream.
Parser/Transformer workers pull logs from the broker, parse them into structured formats, enrich them with metadata, and write them to the indexing engine (e.g., OpenSearch, Elasticsearch) in batches.
Finally, a tiered storage strategy moves old logs from expensive NVMe-backed indexers to cheap object storage (e.g., AWS S3) over time.
System Flow
Log data is buffered through a message broker to absorb spikes, parsed asynchronously, and indexed before transitioning through tiered storage to minimize costs.
Real-World Examples Indicative
LinkedIn created Apache Kafka in 2011 to handle 1B+ activity events/day from their internal tracking systems, replacing a custom AMQP broker that couldn't absorb log bursts during traffic spikes. Kafka's append-only log model allowed LinkedIn's pipeline to replay any window of application logs for incident reconstruction. Today LinkedIn processes 4.5T+ messages/day through Kafka, with log data flowing asynchronously to HDFS for cold retention and OpenSearch for hot indexing—the broker decoupling ensures indexer slowdowns never backpressure into application hosts.
Okta processes 50B+ authentication events/month through Logstash → Elasticsearch. Their Logstash pipeline enriches each auth event with geolocation (MaxMind GeoIP), user risk score, and device fingerprint before indexing. Their ILM policy runs three tiers: hot (NVMe SSD, 7 days, full index), warm (HDD, 30 days, read-only), cold (S3 Searchable Snapshots, 1 year). Moving from warm to S3 Searchable Snapshots cut storage costs by ~60% while keeping logs queryable from Kibana without a full restore.
Cloudflare processes 50M+ log events/second across their global edge network using Grafana Loki backed by S3 + Parquet instead of Elasticsearch. Loki stores logs as raw compressed chunks with only label dimensions indexed (no full-text inverted index), reducing index storage by ~90% versus Elasticsearch at the cost of slower ad-hoc full-text queries. Engineers use LogQL for operational debugging and Parquet-columnar queries via Thanos for long-retention compliance analysis—the same raw bytes, two query interfaces.
Anti-Patterns
Configuring applications to write logs directly to Elasticsearch or an external API without a buffer. A network hiccup or indexer slowdown will immediately cause application threads to block or crash.
Keeping weeks of raw logs in expensive, highly indexed hot storage. This leads to astronomical cloud bills and severely degrades search performance as shard count explodes.
Running complex regular expression parsing inside the application process or local shipper. This steals CPU cycles from the primary application logic under load.
Allowing log shippers to continue reading logs when the downstream broker is full, leading to host disk exhaustion or silent data loss.
Design Tradeoffs
| Dimension | Parse-on-Ingest (Structured) | Parse-on-Query (Raw Storage) |
|---|---|---|
| Write throughput | Lower; each event requires JSON parsing and field extraction before indexing, adding latency to the pipeline | Higher; raw bytes written directly to S3 or Loki chunks with minimal transformation overhead |
| Query speed | Instant; pre-indexed fields support O(1) key lookups across billions of events in Elasticsearch | Slow; full-scan parsing at query time; Loki LogQL scans compressed chunks, taking 10-30s for complex filters |
| Schema flexibility | Rigid; schema changes require updating the Logstash/Vector parser config and potentially re-indexing historical data | Flexible; any log format is stored as-is; parsing logic is updated at query time without data migration |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Operating large-scale, distributed microservices where centralized search, auditing, and compliance are mandatory. | Running a small, monolithic application on a single server where local log rotation and grep are sufficient. |
| You need to build real-time security monitoring, intrusion detection, or operational alerting on log data. | Operating in extremely low-bandwidth environments (e.g., remote IoT devices) where sending raw logs is cost-prohibitive. |
| Multiple teams need to query and correlate logs from different parts of a shared infrastructure. | High-performance systems where the cost of transmitting and storing logs exceeds the value of the telemetry. |