Observability · #41

Metrics and Alerting Design

Published May 29, 2026 · By MortalApps · 4 min read · 862 words

TL;DR

Metrics must be categorized into counters (monotonic increases), gauges (variable values), and histograms (distribution of values).
Cardinality explosion occurs when high-cardinality values like user IDs or IP addresses are used as metric labels, which can exhaust TSDB memory and crash it.
Alerting should focus on customer-impacting symptoms (e.g., high error rates) rather than underlying causes (e.g., high CPU).
Alert fatigue is mitigated by routing non-actionable alerts to ticket queues or chat channels instead of paging on-call engineers.

The Problem

When production systems fail, engineers are often buried under a mountain of noisy, redundant alerts. Traditional alerting setups rely on static, cause-based thresholds (e.g., "CPU usage > 80% on host X"). However, high CPU usage is often normal behavior during batch jobs and does not mean users are experiencing errors. Conversely, when a critical database connection pool is exhausted, CPU might remain low while user-facing error rates spike to 100%—yet no alert triggers because no one set a threshold for connection pool exhaustion. Furthermore, developers often add high-cardinality labels to metrics, causing the Time Series Database (TSDB) to run out of memory and crash during an active incident.

Core System Idea

An effective metrics and alerting system relies on structured metric types and symptom-based alerting. Metrics are categorized into three primary types: Counters (which only go up, used to calculate rates), Gauges (which go up and down, used for current state like memory usage), and Histograms (which measure the statistical distribution of values, critical for latencies and percentiles like P99).
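The three metric types can be illustrated with a minimal, standard-library sketch. The class names, the `per_second_rate` helper, and the bucket layout are illustrative assumptions and do not reproduce any particular client library's API.

```python
import bisect


class Counter:
    """Monotonic value: it only increases (a process restart resets it)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in either direction."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations into buckets keyed by upper bound."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot catches observations above the largest bound.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.count = 0
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[index] += 1
        self.count += 1
        self.total += value


def per_second_rate(earlier: float, later: float, elapsed_seconds: float) -> float:
    """Rate derived from two counter samples taken elapsed_seconds apart."""
    return (later - earlier) / elapsed_seconds
```

A counter is rarely read directly; the useful number is the rate between two samples, for example `per_second_rate(100, 160, 60)` for 60 requests over one minute.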

To prevent TSDB performance degradation, labels (dimensions) must be strictly controlled. High-cardinality data (user IDs, raw request paths, trace IDs) must be kept out of metric labels entirely and recorded in logs or traces instead.
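The reason a single label can be so damaging is that the worst-case number of time series for a metric is the product of the distinct values of each label. The sketch below makes that arithmetic concrete; the function name and the label cardinalities are made-up illustrative values.

```python
from math import prod


def projected_series(label_cardinalities: dict[str, int]) -> int:
    """Worst-case series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for one metric name.
bounded = projected_series({"method": 5, "status_class": 5, "region": 10})
unbounded = projected_series({"method": 5, "status_class": 5, "user_id": 1_000_000})

print(bounded)    # 250
print(unbounded)  # 25000000
```

Swapping one bounded label for a user ID moves the same metric from hundreds of series to tens of millions.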

Alerting rules are then built on top of these metrics, prioritizing symptoms over causes. Instead of alerting on disk space or CPU, alerts trigger when a Service Level Indicator (SLI) such as the user-facing error rate or latency breaches its acceptable threshold. Cause-based metrics are reserved for post-alert triage and debugging, not for initial paging.
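A symptom-based rule can be sketched as a check on the error ratio over a window, computed from two readings of request and error counters. The `Sample` structure, the function names, and the 1% default threshold are assumptions for illustration; note that CPU and disk never enter the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Counter readings at one point in time (hypothetical structure)."""

    requests_total: int
    errors_total: int


def error_ratio(window_start: Sample, window_end: Sample) -> float:
    """Fraction of requests in the window that failed."""
    requests = window_end.requests_total - window_start.requests_total
    errors = window_end.errors_total - window_start.errors_total
    if requests <= 0:
        return 0.0  # no traffic in the window, so there is nothing to judge
    return errors / requests


def should_page(window_start: Sample, window_end: Sample, threshold: float = 0.01) -> bool:
    """Page only when the user-facing error ratio exceeds the threshold."""
    return error_ratio(window_start, window_end) > threshold
```

For example, 20 errors across 1,000 requests in the window gives a ratio of 0.02, which exceeds the assumed 1% threshold and would page.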

System Flow

flowchart TD
    A["App Code: Emit Metric"] --> B["Metric Registry: Aggregate"]
    B --> C["TSDB: Pull/Push Ingestion"]
    C --> D{"Alert Rules Engine"}
    D -- "Symptom Violated" --> E["PagerDuty: High Priority"]
    D -- "Cause/Warning" --> F["Slack/Ticket: Low Priority"]
    C --> G["Grafana: Triage Dashboard"]

Metrics flow from applications to a TSDB, where a rules engine evaluates symptom-based alerts for paging, while cause-based metrics are routed to dashboards and low-priority channels.
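The routing decision made by the rules engine could be sketched as follows. The `Signal` enum, the `route` function, and the destination strings are hypothetical names chosen for illustration, not part of any alerting product.

```python
from enum import Enum


class Signal(Enum):
    SYMPTOM = "symptom"  # user-facing: error rate, latency
    CAUSE = "cause"      # resource-level: CPU, disk, memory


def route(signal: Signal, actionable: bool) -> str:
    """Return the destination for an alert: 'page' or 'ticket'."""
    # Only actionable, user-impacting alerts interrupt a human.
    if signal is Signal.SYMPTOM and actionable:
        return "page"
    # Everything else goes to a low-priority queue or chat channel.
    return "ticket"
```

Keeping this decision in one place makes it easy to audit which alerts are allowed to page.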

Real-World Examples (Indicative)

SoundCloud and Prometheus

SoundCloud engineers built Prometheus in 2012 after finding that Graphite + StatsD couldn't support label-based filtering across microservices. Prometheus's pull model scrapes /metrics endpoints every 15 seconds. SoundCloud standardized on RED (Rate, Errors, Duration) for all HTTP services. The PromQL query sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01 pages when the error rate exceeds 1%; the sum() aggregations let the 5xx series be divided by all requests rather than matched label-for-label against themselves. This replaced 200+ cause-based CPU and disk alerts with 12 symptom-based ones.

Uber M3DB

Uber built M3DB to handle 500M+ metric data points/second at peak across their global fleet. The cardinality problem: naive Prometheus labeling with driver_id or trip_id labels would create billions of unique time series per metric name. M3DB uses a three-tier aggregation pipeline: raw 10-second metrics downsampled to 1-minute aggregates after 2 hours, then 10-minute aggregates after 24 hours, while raw data is dropped—reducing storage by 90% without losing trend visibility.
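The downsampling step can be illustrated with a small sketch that rolls fine-grained points into coarser time buckets. This is not M3DB's implementation: the `downsample` function is hypothetical, and averaging is an assumed aggregation that suits gauge-like values (a counter would more likely keep the last value or a sum).

```python
from collections import defaultdict


def downsample(
    points: list[tuple[int, float]], resolution_seconds: int
) -> list[tuple[int, float]]:
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of its bucket.
        bucket_start = timestamp - (timestamp % resolution_seconds)
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Twelve 10-second points (two minutes of data) become two 1-minute points.
raw = [(second, float(index + 1)) for index, second in enumerate(range(0, 120, 10))]
print(downsample(raw, 60))  # [(0, 3.5), (60, 9.5)]
```

Six raw points collapse into one per minute, which is where the storage saving comes from; the cost is that short spikes inside a bucket are smoothed away.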

Datadog at Shopify

Shopify uses Datadog with strict label rules enforced in CI. In 2019, a developer added request_path as a label on an HTTP duration histogram. The path included URL-encoded query strings, creating 10M+ unique time series in under 5 minutes, saturating Datadog's ingest pipeline and triggering a partial metrics outage during Cyber Monday. Shopify now runs datadog-ci metric cardinality checks in PRs, rejecting any metric whose projected label combinations exceed 10,000.

Anti-Patterns

Cardinality Explosion

Adding dynamic values like user_id, email, uuid, or raw url paths as metric labels. This creates millions of unique time series, exhausting TSDB memory.

Paging on CPU/Memory Spikes

Setting high-priority pages for transient CPU or memory utilization. This trains engineers to ignore alerts, leading to missed real outages.

Alerting on Single Failures

Triggering a page the moment a single request fails. Networks are inherently lossy; alerts must be based on sustained rates or percentages over time.
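One way to avoid paging on isolated failures is to require the threshold to be breached for several consecutive evaluations before firing. The sketch below assumes a list of per-interval error ratios; the function name and parameters are illustrative.

```python
def sustained_breach(
    error_ratios: list[float], threshold: float, required_consecutive: int
) -> bool:
    """True once the threshold is exceeded for N consecutive evaluations."""
    streak = 0
    for ratio in error_ratios:
        # Any healthy interval resets the streak.
        streak = streak + 1 if ratio > threshold else 0
        if streak >= required_consecutive:
            return True
    return False


# A single bad interval does not fire; three in a row does.
print(sustained_breach([0.0, 0.5, 0.0, 0.0], threshold=0.01, required_consecutive=3))    # False
print(sustained_breach([0.0, 0.02, 0.03, 0.05], threshold=0.01, required_consecutive=3))  # True
```

The tradeoff is detection delay: with a one-minute evaluation interval and three required breaches, a real outage pages roughly three minutes after it starts.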

Hardcoded Thresholds in Code

Embedding alert thresholds directly in application code rather than configuring them declaratively in the metrics/alerting platform. Tuning a threshold then requires a code change and a redeploy.

Design Tradeoffs

Dimension	Symptom-Based Alerting	Cause-Based Alerting
Alert quality	Pages fire when users are actually impacted; low false-positive rate keeps on-call engineers responsive	Pages fire on resource thresholds (CPU, disk) that may not affect users; high false-positive rate causes alert fatigue
Triage speed	Alert title communicates user impact directly; engineer goes straight to a runbook for that symptom	Alert names a resource; engineer must first determine whether it is causing user-visible impact before acting
Detection timing	Detects impact as it happens, but because symptoms lag their causes it cannot warn of an impending failure before users notice	Detects infrastructure anomalies early, before user impact, which is useful for capacity planning and proactive remediation

Best Practices

Use the RED or USE Method

Standardize on RED (Rate, Errors, Duration) for request-driven services and USE (Utilization, Saturation, Errors) for infrastructure resources like disks and memory.

Enforce Label Whitelisting

Implement CI/CD linting or gateway filters to block metrics that contain high-cardinality labels before they reach the TSDB.
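A lint step of this kind can be as simple as comparing a metric's label names against an approved set. The allowlist contents and the `disallowed_labels` helper below are hypothetical; a real check would load the list from shared configuration.

```python
# Hypothetical allowlist of bounded, low-cardinality label names.
ALLOWED_LABELS = frozenset({"service", "method", "status_class", "region"})


def disallowed_labels(
    metric_labels: dict[str, str], allowed: frozenset[str] = ALLOWED_LABELS
) -> list[str]:
    """Return the label names that are not on the allowlist, sorted."""
    return sorted(name for name in metric_labels if name not in allowed)


violations = disallowed_labels({"service": "checkout", "user_id": "42"})
print(violations)  # ['user_id']
```

A CI job could fail the build whenever the returned list is non-empty, so the offending label never reaches production.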

Calculate Percentiles Correctly

Never average latencies: an average hides the slowest requests. Use histograms to estimate P95 or P99 latency, which exposes the experience of frustrated users.
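A percentile can be estimated from histogram buckets by finding the bucket that contains the target rank and interpolating linearly inside it, which is similar in spirit to what Prometheus's histogram_quantile does. The sketch below is a simplified illustration, not that function's exact behavior; it assumes cumulative counts, buckets sorted by upper bound, a lowest bucket starting at 0, and values spread evenly within each bucket.

```python
def estimate_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative_count in buckets:
        if cumulative_count >= rank:
            in_bucket = cumulative_count - lower_count
            if in_bucket == 0:
                return upper_bound
            # Linear interpolation: assume values are uniform in the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative_count
    return buckets[-1][0]


# Hypothetical latency histogram in seconds: 100 requests in total.
latency_buckets = [(0.1, 90), (0.5, 95), (1.0, 99), (5.0, 100)]
p50 = estimate_quantile(0.50, latency_buckets)
p99 = estimate_quantile(0.99, latency_buckets)
```

In this made-up data the median estimate lands well under 0.1 seconds while the P99 estimate is about 1 second, a gap that a single averaged latency figure would not reveal. The estimate is only as precise as the bucket boundaries, so buckets should be placed around the latencies that matter for the SLO.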

Make Alerts Actionable

Every page must include a link to a specific runbook detailing exactly how to diagnose and resolve the issue. Pages without runbooks produce on-call thrashing.

When to Use / Avoid

Use When	Avoid When
You need real-time, low-overhead visibility into system health, throughput, and performance trends.	You need to reconstruct the exact sequence of events for a specific user transaction (use tracing and logs instead).
Designing automated auto-scaling rules based on system load or queue depth.	Storing audit trails or compliance data where every single event must be preserved.
Establishing high-level service level objectives (SLOs) for business stakeholders.	Debugging deep, complex code-level bugs that require inspecting local variable states.