← System Design Observability
System Design

Log Aggregation Architecture

Log aggregation pipelines collect, parse, buffer, index, and store logs from distributed application nodes.

TL;DR
  • Log aggregation pipelines collect, parse, buffer, index, and store logs from distributed application nodes.
  • A persistent message broker (e.g., Kafka) is required to act as a buffer and prevent backpressure from crashing upstream applications.
  • Parsing logs at ingestion optimizes search performance but increases write latency and ingestion pipeline complexity.
  • Tiered storage (hot, warm, cold) is critical to control the massive financial costs of long-term log retention.

The Problem

During a major production outage, application nodes typically experience a massive spike in error rates, causing them to generate logs at 10x to 100x their normal volume. If the log aggregation pipeline is built as a direct, synchronous path from the application to the indexing engine, this sudden surge will overwhelm the indexer. The indexing cluster will slow down, experience out-of-memory errors, or crash entirely. Consequently, at the exact moment engineers desperately need logs to debug the outage, the logging system is completely unavailable, leaving them blind.

Core System Idea

A resilient log aggregation architecture decouples log generation from log indexing using an asynchronous, buffered pipeline.

The architecture consists of four main tiers: Shippers, Brokers, Parsers, and Indexers.

Lightweight log shippers (e.g., Fluent Bit, Vector) run on the application hosts, reading logs from stdout or local files with minimal CPU overhead. These shippers immediately forward the logs to a highly available, distributed message broker (e.g., Apache Kafka or Pulsar). The broker acts as a shock absorber, buffering massive spikes in log volume without passing the load downstream.

Parser/Transformer workers pull logs from the broker, parse them into structured formats, enrich them with metadata, and write them to the indexing engine (e.g., OpenSearch, Elasticsearch) in batches.

Finally, a tiered storage strategy moves old logs from expensive NVMe-backed indexers to cheap object storage (e.g., AWS S3) over time.

System Flow

flowchart TD A[App Node] --> B["Local Shipper: Vector/Fluent Bit"] B --> C["Message Broker: Kafka Buffer"] C --> D["Parser/Transformer Workers"] D --> E["Indexing Engine: OpenSearch"] E --> F["Hot Storage: NVMe"] F -- "Age over 3 Days" --> G["Warm Storage: Attached Disk"] G -- "Age over 14 Days" --> H["Cold Storage: S3 Object"]

Log data is buffered through a message broker to absorb spikes, parsed asynchronously, and indexed before transitioning through tiered storage to minimize costs.

Real-World Examples Indicative

LinkedIn Kafka Origin

LinkedIn created Apache Kafka in 2011 to handle 1B+ activity events/day from their internal tracking systems, replacing a custom AMQP broker that couldn't absorb log bursts during traffic spikes. Kafka's append-only log model allowed LinkedIn's pipeline to replay any window of application logs for incident reconstruction. Today LinkedIn processes 4.5T+ messages/day through Kafka, with log data flowing asynchronously to HDFS for cold retention and OpenSearch for hot indexing—the broker decoupling ensures indexer slowdowns never backpressure into application hosts.

Okta ELK with Index Lifecycle Management

Okta processes 50B+ authentication events/month through Logstash → Elasticsearch. Their Logstash pipeline enriches each auth event with geolocation (MaxMind GeoIP), user risk score, and device fingerprint before indexing. Their ILM policy runs three tiers: hot (NVMe SSD, 7 days, full index), warm (HDD, 30 days, read-only), cold (S3 Searchable Snapshots, 1 year). Moving from warm to S3 Searchable Snapshots cut storage costs by ~60% while keeping logs queryable from Kibana without a full restore.

Cloudflare Grafana Loki

Cloudflare processes 50M+ log events/second across their global edge network using Grafana Loki backed by S3 + Parquet instead of Elasticsearch. Loki stores logs as raw compressed chunks with only label dimensions indexed (no full-text inverted index), reducing index storage by ~90% versus Elasticsearch at the cost of slower ad-hoc full-text queries. Engineers use LogQL for operational debugging and Parquet-columnar queries via Thanos for long-retention compliance analysis—the same raw bytes, two query interfaces.

Anti-Patterns

Direct-to-Indexer Writing

Configuring applications to write logs directly to Elasticsearch or an external API without a buffer. A network hiccup or indexer slowdown will immediately cause application threads to block or crash.

Infinite Hot Retention

Keeping weeks of raw logs in expensive, highly indexed hot storage. This leads to astronomical cloud bills and severely degrades search performance as shard count explodes.

Heavy Regex Parsing on the App Host

Running complex regular expression parsing inside the application process or local shipper. This steals CPU cycles from the primary application logic under load.

Ignoring Backpressure Signals

Allowing log shippers to continue reading logs when the downstream broker is full, leading to host disk exhaustion or silent data loss.

Design Tradeoffs

DimensionParse-on-Ingest (Structured)Parse-on-Query (Raw Storage)
Write throughputLower; each event requires JSON parsing and field extraction before indexing, adding latency to the pipelineHigher; raw bytes written directly to S3 or Loki chunks with minimal transformation overhead
Query speedInstant; pre-indexed fields support O(1) key lookups across billions of events in ElasticsearchSlow; full-scan parsing at query time; Loki LogQL scans compressed chunks, taking 10-30s for complex filters
Schema flexibilityRigid; schema changes require updating the Logstash/Vector parser config and potentially re-indexing historical dataFlexible; any log format is stored as-is; parsing logic is updated at query time without data migration

Best Practices

Implement Local Disk BufferingConfigure local log shippers to write to a local disk ring-buffer if they lose connection to the central broker, preventing application thread blockage.
Enforce Ingestion Rate-LimitingApply rate limits per service or container at the ingestion gateway to prevent a single misconfigured application from exhausting the entire pipeline's capacity.
Use Bulk Indexing APIsAlways write logs to the indexing engine in batches (e.g., every 1-5MB or every 5 seconds) rather than sending individual HTTP requests for every log line.
Automate Index Lifecycle Management (ILM)Set up strict ILM policies to automatically roll over indexes, transition them to warm/cold storage, and delete them after a defined retention period.

When to Use / Avoid

Use WhenAvoid When
Operating large-scale, distributed microservices where centralized search, auditing, and compliance are mandatory.Running a small, monolithic application on a single server where local log rotation and grep are sufficient.
You need to build real-time security monitoring, intrusion detection, or operational alerting on log data.Operating in extremely low-bandwidth environments (e.g., remote IoT devices) where sending raw logs is cost-prohibitive.
Multiple teams need to query and correlate logs from different parts of a shared infrastructure.High-performance systems where the cost of transmitting and storing logs exceeds the value of the telemetry.