OpenTelemetry Architecture
OpenTelemetry (OTel) provides a unified, vendor-neutral standard for collecting traces, metrics, and logs.
- The OTel Collector uses a pipeline architecture consisting of receivers, processors, and exporters.
- Offloading telemetry processing to a local collector sidecar minimizes application CPU and memory overhead.
- Memory-limiter and batch processors are critical to prevent the collector from crashing under high telemetry loads.
The Problem
Historically, adopting observability meant installing proprietary agent binaries from specific monitoring vendors on every host. Each vendor used its own closed-source SDKs, custom transport protocols, and unique data formats. If an organization wanted to switch vendors or use multiple tools, developers had to completely rewrite their instrumentation code, recompile their applications, and deploy new agents. This vendor lock-in stifled innovation, while running multiple competing agents on the same production hosts wasted valuable CPU and memory resources.
Core System Idea
OpenTelemetry solves this by providing a single, open-source, vendor-neutral standard for telemetry collection.
The architecture is split into two primary components: the OTel SDKs (run inside the application) and the OTel Collector (runs as an independent service).
The SDKs provide a unified API to instrument code for traces, metrics, and logs, exporting data using the standardized OpenTelemetry Protocol (OTLP).
The OTel Collector acts as a high-performance proxy designed around a modular pipeline: Receivers ingest data in various formats (OTLP, Jaeger, Prometheus), Processors modify the data (batching, filtering, PII redaction, metadata enrichment), and Exporters translate and send data to any backend (Honeycomb, Datadog, Prometheus, Jaeger).
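The receiver, processor, exporter chain described above can be pictured with a toy model. This is a minimal sketch in plain Python, not the Collector's actual implementation (the real Collector is a Go service configured through YAML). The `Span` class, the `user.email` attribute, and the `deployment.environment` value are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Span:
    """Hypothetical, heavily simplified span record."""
    trace_id: str
    name: str
    attributes: dict = field(default_factory=dict)


# A processor takes a batch of spans and returns a (possibly modified) batch.
Processor = Callable[[List[Span]], List[Span]]
# An exporter delivers a batch to exactly one backend.
Exporter = Callable[[List[Span]], None]


def redact_pii(spans: List[Span]) -> List[Span]:
    """Processor: strip an assumed PII attribute before data leaves the host."""
    for span in spans:
        span.attributes.pop("user.email", None)
    return spans


def enrich(spans: List[Span]) -> List[Span]:
    """Processor: attach metadata the application did not set itself."""
    for span in spans:
        span.attributes.setdefault("deployment.environment", "prod")
    return spans


class Pipeline:
    """Runs every processor in order, then fans out to every exporter."""

    def __init__(self, processors: List[Processor], exporters: List[Exporter]):
        self.processors = list(processors)
        self.exporters = list(exporters)

    def receive(self, spans: List[Span]) -> None:
        for process in self.processors:
            spans = process(spans)
        for export in self.exporters:
            export(spans)


# Two in-memory lists stand in for two different backends.
backend_a: List[Span] = []
backend_b: List[Span] = []
pipeline = Pipeline([redact_pii, enrich], [backend_a.extend, backend_b.extend])
pipeline.receive([Span("t1", "GET /cart", {"user.email": "a@example.com"})])
```

Adding a backend in this model means appending one more exporter to the list; the code that produces spans never changes, which is the property the pipeline design is after.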
By running the Collector as a local sidecar or daemon, applications can offload the expensive work of serializing and transmitting telemetry, protecting primary application performance.
System Flow
Telemetry flows from the application SDK to a local sidecar, then to a gateway collector cluster where it is processed, batched, and exported to specialized backends.
Real-World Examples (Indicative)
Skyscanner migrated their entire microservices observability stack from Jaeger's client SDKs to OTel Java and Python agents across 50+ services in 2021. The migration took 3 months and required zero application code changes; only collector configuration changed. Post-migration, they route traces simultaneously to both their existing Jaeger backend and a new Grafana Tempo instance by adding a second otlp exporter block to the gateway collector config, with no service redeployment required.
Honeycomb's open-source Refinery acts as an OTel-compatible gateway implementing tail-based sampling. Its sampling rules retain 100% of error traces (status_code = ERROR) and probabilistically sample successful traces at 1%, a policy the Collector's own tail_sampling processor can also express. Refinery buffers all spans for a trace in memory on the node that owns that trace, using consistent hashing on Trace ID to ensure all spans for the same trace arrive at the same Refinery node. It processes 10B+ spans/day for customers like Slack and LaunchDarkly without proprietary SDK requirements.
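The two mechanisms in this example, routing by Trace ID and deciding after the whole trace is buffered, can be sketched as follows. This is an illustrative model under stated assumptions, not Refinery's or the Collector's actual code: the `HashRing` class, the node names, the virtual-node count, and the use of SHA-256 are all choices made for the sketch.

```python
import hashlib
from bisect import bisect_right
from typing import Iterable, List


def _hash64(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64):
        self._ring = sorted(
            (_hash64(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def node_for(self, trace_id: str) -> str:
        # First ring position clockwise from the trace's hash, wrapping around.
        idx = bisect_right(self._keys, _hash64(trace_id)) % len(self._ring)
        return self._ring[idx][1]


def keep_trace(trace_id: str, span_statuses: List[str],
               success_rate: float = 0.01) -> bool:
    """Tail decision, made only once every span of the trace is buffered."""
    if any(status == "ERROR" for status in span_statuses):
        return True  # retain 100% of traces that contain an error
    # Deterministic sampling: map the trace hash onto [0, 1).
    return _hash64(trace_id) / 2**64 < success_rate


ring = HashRing(["sampler-0", "sampler-1", "sampler-2"])
owner = ring.node_for("4bf92f3577b34da6a3ce929d0e0e4736")
```

Because every span carries the same Trace ID, every sender computes the same owner node, so the complete trace ends up in one buffer before `keep_trace` runs. Hashing the Trace ID for the probabilistic branch also keeps the decision reproducible across retries.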
eBay runs OTel Collectors as a Kubernetes DaemonSet on every node across their global fleet. Each node-level collector handles ~50K spans/second and configures memory_limiter with a 4GB hard limit and a 3.5GB soft limit (a 512MB spike allowance) to prevent OOM kills during traffic surges. Batches are sent every 200ms or when send_batch_size: 10000 is reached. eBay routes metrics to Prometheus via the prometheusremotewrite exporter and traces to an internal Jaeger-compatible backend: the same pipeline, two exporters, zero changes to application code.
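The "every 200ms or 10000 items, whichever comes first" rule can be sketched as a small size-or-timeout batcher. This is a simplified model, not the Collector's batch processor: the `Batcher` class and its `tick` method are hypothetical, and the timeout here is measured from the first buffered item, which may differ from how the real processor schedules its timer.

```python
import time
from typing import Any, Callable, List


class Batcher:
    """Flushes when the buffer is full or the oldest item has waited too long."""

    def __init__(self, export: Callable[[List[Any]], None],
                 send_batch_size: int = 10_000, timeout_s: float = 0.2,
                 clock: Callable[[], float] = time.monotonic):
        self._export = export
        self._size = send_batch_size
        self._timeout = timeout_s
        self._clock = clock          # injectable so the logic is testable
        self._buffer: List[Any] = []
        self._first_item_at = 0.0

    def add(self, item: Any) -> None:
        if not self._buffer:
            self._first_item_at = self._clock()
        self._buffer.append(item)
        if len(self._buffer) >= self._size:
            self.flush()             # size trigger

    def tick(self) -> None:
        """Call periodically from a timer loop; applies the timeout trigger."""
        if self._buffer and self._clock() - self._first_item_at >= self._timeout:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)
```

The size trigger bounds memory and request size under heavy load, while the timeout trigger bounds how stale telemetry can get when traffic is light.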
Anti-Patterns
Configuring complex filtering, sampling, or tag-enrichment logic inside the application process. This consumes CPU cycles reserved for serving user requests.
Running the OTel Collector without configuring the memory_limiter processor. Under high load, the collector will consume unbounded memory and be OOM-killed by the host OS.
Using vendor-specific exporters inside the application SDK. This reintroduces the vendor lock-in that OTel is meant to remove; applications should export OTLP to a local collector and let the collector translate to vendor formats.
Running a single, unscaled collector instance for an entire Kubernetes cluster. The collector must be scaled horizontally with a load balancer to handle cluster-wide telemetry.
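To make the memory_limiter anti-pattern concrete, the sketch below models the decision that processor makes. It assumes the documented relationship in which the soft limit equals `limit_mib` minus `spike_limit_mib`, with data refused above the soft limit and a forced garbage collection above the hard limit. The `MemoryLimiter` class, its return strings, and the 4096/512 MiB figures are illustrative assumptions, not the Collector's code.

```python
class MemoryLimiter:
    """Toy model of the memory_limiter decision (names are hypothetical)."""

    def __init__(self, limit_mib: int, spike_limit_mib: int):
        if spike_limit_mib >= limit_mib:
            raise ValueError("spike_limit_mib must be smaller than limit_mib")
        self.hard_limit_mib = limit_mib
        # Headroom reserved for a sudden spike between two memory checks.
        self.soft_limit_mib = limit_mib - spike_limit_mib

    def check(self, used_mib: float) -> str:
        if used_mib >= self.hard_limit_mib:
            return "refuse_and_gc"   # refuse new data and force a GC
        if used_mib >= self.soft_limit_mib:
            return "refuse"          # push back on receivers, keep running
        return "accept"


limiter = MemoryLimiter(limit_mib=4096, spike_limit_mib=512)
decisions = [limiter.check(mib) for mib in (3000, 3600, 4096)]
```

Refusing data at the soft limit is what turns an out-of-memory kill into controlled, observable data loss: the collector stays up and resumes accepting telemetry once usage falls back under the threshold.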
Design Tradeoffs
| Dimension | Sidecar/Agent Pattern | Gateway/Cluster Pattern |
|---|---|---|
| Export latency | Near-zero; telemetry sent via localhost socket eliminates network RTT between app and collector | Small additional hop; spans travel app → sidecar → gateway before processing, adding ~1ms per batch |
| Resource footprint | High per-node; a dedicated collector process in every pod consumes 100-200MB RAM, multiplied across the full fleet | Low per app-node; collector resources consolidated in a shared gateway cluster that scales independently |
| Failure isolation | Strong; if the gateway goes down, the sidecar buffers locally with sending_queue and retries when the gateway recovers | Risk; a network partition between apps and the gateway causes data loss without an intermediate local buffer |
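The failure-isolation row depends on the sidecar holding batches while the gateway is unreachable. A minimal sketch of that behavior, assuming a bounded in-memory queue that rejects new batches when full and retries in order, is shown below. The `SendingQueue` class and the `send` callback are hypothetical stand-ins, and backoff timing is left out for brevity.

```python
from collections import deque
from typing import Any, Callable, Deque


class SendingQueue:
    """Bounded buffer in front of an unreliable downstream (toy model)."""

    def __init__(self, send: Callable[[Any], bool], queue_size: int = 1000):
        self._send = send            # returns True on success, False on failure
        self._queue: Deque[Any] = deque()
        self._queue_size = queue_size
        self.dropped = 0

    def enqueue(self, batch: Any) -> bool:
        if len(self._queue) >= self._queue_size:
            self.dropped += 1        # queue full: the new batch is lost
            return False
        self._queue.append(batch)
        return True

    def drain(self) -> int:
        """Send queued batches in order; stop at the first failure."""
        sent = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break                # leave it queued for the next retry
            self._queue.popleft()
            sent += 1
        return sent
```

The queue size is the real tradeoff: a larger queue rides out longer gateway outages but holds more memory on every node, and anything beyond the bound is still dropped.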
Best Practices
- Use the batch processor in your collector pipelines. Batching dramatically reduces outgoing network requests and improves downstream database ingestion performance.
- Place memory_limiter at the beginning of your collector pipelines to drop data gracefully rather than OOM-crash under spike load.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Building modern, multi-language microservice architectures where you want to avoid vendor lock-in and standardize telemetry collection. | Operating a simple, single-language monolith with no plans to ever switch or add monitoring vendors. |
| You need to correlate traces, metrics, and logs under a single, unified data model and shared Trace ID. | Working in extremely resource-constrained environments where the collector memory footprint is prohibitive. |
| You require advanced telemetry routing, filtering, or PII redaction before data leaves your network boundary. | Building quick prototypes or proof-of-concept applications where basic console logging is sufficient. |