Observability · #46

OpenTelemetry Architecture

OpenTelemetry (OTel) provides a unified, vendor-neutral standard for collecting traces, metrics, and logs.

Published May 29, 2026 · By MortalApps · 4 min read · 867 words

TL;DR

OpenTelemetry (OTel) provides a unified, vendor-neutral standard for collecting traces, metrics, and logs.
The OTel Collector uses a pipeline architecture consisting of receivers, processors, and exporters.
Offloading telemetry processing to a local collector sidecar minimizes application CPU and memory overhead.
Memory-limiter and batch processors are critical to prevent the collector from crashing under high telemetry loads.


The Problem

Historically, adopting observability meant installing proprietary agent binaries from specific monitoring vendors on every host. Each vendor used its own closed-source SDKs, custom transport protocols, and unique data formats. If an organization wanted to switch vendors or use multiple tools, developers had to completely rewrite their instrumentation code, recompile their applications, and deploy new agents. This vendor lock-in stifled innovation, while running multiple competing agents on the same production hosts wasted valuable CPU and memory resources.

Core System Idea

OpenTelemetry solves this by providing a single, open-source, vendor-neutral standard for telemetry collection.

The architecture is split into two primary components: the OTel SDKs (run inside the application) and the OTel Collector (runs as an independent service).

The SDKs provide a unified API to instrument code for traces, metrics, and logs, exporting data using the standardized OpenTelemetry Protocol (OTLP).

The OTel Collector acts as a high-performance proxy designed around a modular pipeline: Receivers ingest data in various formats (OTLP, Jaeger, Prometheus), Processors modify the data (batching, filtering, PII redaction, metadata enrichment), and Exporters translate and send data to any backend (Honeycomb, Datadog, Prometheus, Jaeger).
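
The receiver, processor, exporter split can be illustrated with a toy model. This is a minimal sketch in plain Python, not the Collector's actual implementation (the Collector is configured rather than coded this way); the `Pipeline` class, the `user.email` attribute, and the `/healthz` route are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Span = Dict[str, str]                           # toy stand-in for a telemetry record
Processor = Callable[[List[Span]], List[Span]]
Exporter = Callable[[List[Span]], None]


def drop_health_checks(spans: List[Span]) -> List[Span]:
    """Processor: filter out spans for a hypothetical /healthz route."""
    return [s for s in spans if s.get("http.route") != "/healthz"]


def redact_email(spans: List[Span]) -> List[Span]:
    """Processor: mask a hypothetical 'user.email' attribute."""
    return [{**s, "user.email": "[REDACTED]"} if "user.email" in s else s
            for s in spans]


@dataclass
class Pipeline:
    processors: List[Processor]
    exporters: List[Exporter]

    def receive(self, spans: List[Span]) -> None:
        """Receiver entry point: run processors in order, then fan out."""
        for process in self.processors:
            spans = process(spans)
        for export in self.exporters:
            export(list(spans))                 # each exporter gets its own list


# Two in-memory lists stand in for two different backends.
backend_a: List[Span] = []
backend_b: List[Span] = []
pipeline = Pipeline(
    processors=[drop_health_checks, redact_email],
    exporters=[backend_a.extend, backend_b.extend],
)
pipeline.receive([
    {"name": "GET /cart", "http.route": "/cart", "user.email": "a@example.com"},
    {"name": "GET /healthz", "http.route": "/healthz"},
])
```

Adding a backend in this model means appending one more exporter to the list; the code that produced the spans is untouched, which is the property the architecture is built around.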

By running the Collector as a local sidecar or daemon, applications can offload the expensive work of serializing and transmitting telemetry, protecting primary application performance.

System Flow

flowchart TD
    A["App Code: OTel SDK"] -- "OTLP over gRPC" --> B[Local Collector Sidecar]
    B -- "OTLP" --> C[Gateway Collector Cluster]
    C --> D["Receiver: Ingest OTLP"]
    D --> E["Processor: Batch and Memory Limit"]
    E --> F["Processor: Redact PII"]
    F --> G["Exporter: Translate Format"]
    G --> H["Backend A: Metrics"]
    G --> I["Backend B: Traces"]

Telemetry flows from the application SDK to a local sidecar, then to a gateway collector cluster where it is processed, batched, and exported to specialized backends.

Real-World Examples (Indicative)

Skyscanner OTel Migration

Skyscanner migrated their entire microservices observability stack from Jaeger's client SDKs to OTel Java and Python agents across 50+ services in 2021. The migration took 3 months and required zero application code changes; only collector configuration changed. Post-migration, they route traces simultaneously to both their existing Jaeger backend and a new Grafana Tempo instance by adding a second otlp exporter to the gateway collector config, with no service redeployment required.

Honeycomb Refinery (Tail-Based Sampling Gateway)

Honeycomb's open-source Refinery is a trace-aware sampling proxy that accepts OTLP and sits where a gateway collector would, implementing tail-based sampling. In this example its sampling rules retain 100% of error traces (status_code = ERROR) and probabilistically sample successful traces at 1%, a policy the OTel Collector's tail_sampling processor can also express. Refinery buffers all spans for a trace in memory, using consistent hashing on Trace ID to ensure all spans for the same trace arrive at the same Refinery node. It processes 10B+ spans/day for customers like Slack and LaunchDarkly without proprietary SDK requirements.
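
The two ideas in this example, routing every span of a trace to one node and deciding after the whole trace is buffered, can be sketched as follows. This is not Refinery's code: the node names are hypothetical, and rendezvous (highest-random-weight) hashing is swapped in as a compact stand-in for a consistent-hash ring, since it gives the same "same trace, same node" guarantee in a few lines.

```python
import hashlib
import random
from typing import Dict, List, Optional

NODES = ["refinery-0", "refinery-1", "refinery-2"]   # hypothetical node names


def owner_node(trace_id: str, nodes: List[str] = NODES) -> str:
    """Pick the node that buffers this trace, using rendezvous hashing."""
    def weight(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{trace_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(nodes, key=weight)


def keep_trace(spans: List[Dict[str, str]], success_rate: float = 0.01,
               rng: Optional[random.Random] = None) -> bool:
    """Tail decision, made once all spans of the trace are buffered."""
    if any(s.get("status_code") == "ERROR" for s in spans):
        return True                                   # keep every error trace
    return (rng or random).random() < success_rate    # sample the rest


# Every span carrying the same trace ID maps to the same node.
node_for_first_span = owner_node("4bf92f3577b34da6a3ce929d0e0e4736")
node_for_later_span = owner_node("4bf92f3577b34da6a3ce929d0e0e4736")
```

The decision has to wait for the full trace because an error may appear only in the last span to arrive, which is why the buffering node must see all of them.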

eBay OTel DaemonSet at Scale

eBay runs OTel Collectors as a Kubernetes DaemonSet on every node across their global fleet. Each node-level collector handles ~50K spans/second and configures memory_limiter with a 4GB hard limit and 3.5GB spike limit to prevent OOM kills during traffic surges. Batches are sent every 200ms or when send_batch_size: 10000 is reached. eBay routes metrics to Prometheus via prometheusremotewrite exporter and traces to an internal Jaeger-compatible backend—the same pipeline, two exporters, zero changes to application code.
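
The "every 200ms or at 10000 items" rule is a size-or-timeout flush. A minimal sketch of that logic, assuming a hypothetical `Batcher` class and an injectable clock so the behavior can be checked deterministically (the Collector's batch processor is configured, not written like this):

```python
import time
from typing import Callable, List, Optional


class Batcher:
    """Flush when the batch is full or the timeout elapses, whichever is first."""

    def __init__(self, export: Callable[[List[dict]], None],
                 send_batch_size: int = 10_000, timeout_s: float = 0.2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.export = export
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.clock = clock
        self.items: List[dict] = []
        self.started: Optional[float] = None    # when the first item was buffered

    def add(self, item: dict) -> None:
        if not self.items:
            self.started = self.clock()
        self.items.append(item)
        if len(self.items) >= self.send_batch_size:
            self.flush()                        # size trigger

    def tick(self) -> None:
        """Call periodically; flushes a partial batch once the timeout passes."""
        if self.items and self.clock() - self.started >= self.timeout_s:
            self.flush()                        # timeout trigger

    def flush(self) -> None:
        batch, self.items, self.started = self.items, [], None
        self.export(batch)


# Small numbers and a fake clock keep the demonstration deterministic.
sent: List[List[dict]] = []
now = [0.0]
batcher = Batcher(sent.append, send_batch_size=3, timeout_s=0.2,
                  clock=lambda: now[0])
for i in range(3):
    batcher.add({"span": i})                    # third add fills the batch
batcher.add({"span": 3})
now[0] = 0.25                                   # pretend 250 ms have passed
batcher.tick()                                  # partial batch goes out
```

The size trigger bounds request size under heavy load, while the timeout bounds how stale a span can become when traffic is light.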

Anti-Patterns

Heavy Processing in the App SDK

Configuring complex filtering, sampling, or tag-enrichment logic inside the application process. This consumes CPU cycles reserved for serving user requests.

Neglecting the Memory Limiter

Running the OTel Collector without configuring the memory_limiter processor. Under high load, the collector will consume unbounded memory and be OOM-killed by the host OS.
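
What the limiter buys can be shown with a toy model. This sketch assumes the commonly documented relationship in which the processor starts refusing data at a soft limit equal to `limit_mib` minus `spike_limit_mib`; that relationship, the numbers, and the `read_usage_mib` probe are assumptions for illustration and should be checked against the processor's own documentation.

```python
from typing import Callable


class MemoryLimiter:
    """Toy model: refuse new data once usage crosses the soft limit."""

    def __init__(self, limit_mib: int, spike_limit_mib: int,
                 read_usage_mib: Callable[[], float]) -> None:
        if not 0 < spike_limit_mib < limit_mib:
            raise ValueError("spike_limit_mib must be between 0 and limit_mib")
        self.limit_mib = limit_mib
        # Headroom for a spike is reserved below the hard limit.
        self.soft_limit_mib = limit_mib - spike_limit_mib
        self.read_usage_mib = read_usage_mib    # hypothetical usage probe

    def accept(self) -> bool:
        """True if the pipeline may take more data right now."""
        return self.read_usage_mib() < self.soft_limit_mib


usage = {"mib": 1000.0}
limiter = MemoryLimiter(limit_mib=4096, spike_limit_mib=512,
                        read_usage_mib=lambda: usage["mib"])
ok_before = limiter.accept()         # 1000 MiB is under the 3584 MiB soft limit
usage["mib"] = 3700.0
ok_during_spike = limiter.accept()   # 3700 MiB is over it, so data is refused
```

Refusing data early is a deliberate trade: some telemetry is dropped or pushed back to the sender, but the process survives the spike instead of losing everything it had buffered.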

Hardcoding Vendor Exporters in Code

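Using vendor-specific exporters inside the application SDK. Applications should always export OTLP to a local collector; embedding a vendor exporter in code reintroduces the lock-in that OTel is meant to remove, because switching backends again requires a code change and redeploy.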

Ignoring Collector Scaling

Running a single, unscaled collector instance for an entire Kubernetes cluster. The collector must be scaled horizontally with a load balancer to handle cluster-wide telemetry.

Design Tradeoffs

Dimension	Sidecar/Agent Pattern	Gateway/Cluster Pattern
Export latency	Near-zero; telemetry sent via localhost socket eliminates network RTT between app and collector	Small additional hop; spans travel app → sidecar → gateway before processing, adding ~1ms per batch
Resource footprint	High in aggregate; a dedicated collector process alongside every pod consumes 100-200MB RAM each, multiplied across the full fleet	Low per app node; collector resources are consolidated in a shared gateway cluster that scales independently
Failure isolation	Strong; if the gateway goes down, the sidecar buffers locally with `sending_queue` and retries when the gateway recovers	Risk; a network partition between apps and the gateway causes data loss without an intermediate local buffer
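
The failure-isolation row rests on the sidecar holding batches while the gateway is unreachable. A minimal sketch of that buffer-and-retry behavior, assuming a hypothetical `SendingQueue` class and a `send` callable that raises `ConnectionError` on failure; the Collector's real `sending_queue` has more options than this toy shows.

```python
from collections import deque
from typing import Callable, Deque, List


class SendingQueue:
    """Bounded buffer that holds batches until the downstream accepts them."""

    def __init__(self, send: Callable[[List[str]], None],
                 queue_size: int = 1000) -> None:
        self.send = send
        self.queue_size = queue_size
        self.pending: Deque[List[str]] = deque()

    def enqueue(self, batch: List[str]) -> bool:
        if len(self.pending) >= self.queue_size:
            return False                 # queue full: this batch is not buffered
        self.pending.append(batch)
        return True

    def drain(self) -> int:
        """Try to send buffered batches in order; stop at the first failure."""
        sent = 0
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                break                    # gateway still down; retry later
            self.pending.popleft()
            sent += 1
        return sent


state = {"gateway_up": False}
delivered: List[List[str]] = []


def send(batch: List[str]) -> None:
    if not state["gateway_up"]:
        raise ConnectionError("gateway unreachable")
    delivered.append(batch)


queue = SendingQueue(send, queue_size=2)
accepted = [queue.enqueue(["a"]), queue.enqueue(["b"]), queue.enqueue(["c"])]
sent_while_down = queue.drain()          # nothing leaves during the outage
state["gateway_up"] = True
sent_after_recovery = queue.drain()      # buffered batches go out in order
```

The bounded size matters as much as the retry: an outage longer than the queue can absorb still loses data, so the queue length sets how long a gateway failure the sidecar can ride out.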

Best Practices

Deploy Both Sidecars and Gateways

Use local sidecars for fast, non-blocking application exports, and route them to a centralized gateway cluster for heavy processing, aggregation, and multi-backend routing.

Always Enable Batching

Configure the batch processor in your collector pipelines. Batching reduces outgoing network requests dramatically and improves downstream database ingestion performance.

Configure the Memory Limiter First

Place memory_limiter at the beginning of your collector pipelines to drop data gracefully rather than OOM-crash under spike load.

Leverage Auto-Instrumentation

Use OTel's runtime auto-instrumentation agents (Java, Node.js, Python) to instantly capture framework-level telemetry without writing manual instrumentation code.

When to Use / Avoid

Use When	Avoid When
Building modern, multi-language microservice architectures where you want to avoid vendor lock-in and standardize telemetry collection.	Operating a simple, single-language monolith with no plans to ever switch or add monitoring vendors.
You need to correlate traces, metrics, and logs under a single, unified data model and shared Trace ID.	Working in extremely resource-constrained environments where the collector memory footprint is prohibitive.
You require advanced telemetry routing, filtering, or PII redaction before data leaves your network boundary.	Building quick prototypes or proof-of-concept applications where basic console logging is sufficient.