Metrics and Alerting Design
- Metrics must be categorized into counters (monotonic increases), gauges (variable values), and histograms (distribution of values).
- Cardinality explosion occurs when high-cardinality values such as user IDs or IP addresses are used as metric labels: each distinct value creates a new time series, which can exhaust TSDB memory and crash it.
- Alerting should focus on customer-impacting symptoms (e.g., high error rates) rather than underlying causes (e.g., high CPU).
- Alert fatigue is mitigated by routing non-actionable alerts to ticket queues or chat channels instead of paging on-call engineers.
The Problem
When production systems fail, engineers are often buried under a mountain of noisy, redundant alerts. Traditional alerting setups rely on static, cause-based thresholds (e.g., "CPU usage > 80% on host X"). However, high CPU usage is often normal behavior during batch jobs and does not mean users are experiencing errors. Conversely, when a critical database connection pool is exhausted, CPU might remain low while user-facing error rates spike to 100%—yet no alert triggers because no one set a threshold for connection pool exhaustion. Furthermore, developers often add high-cardinality labels to metrics, causing the Time Series Database (TSDB) to run out of memory and crash during an active incident.
Core System Idea
An effective metrics and alerting system relies on structured metric types and symptom-based alerting. Metrics are categorized into three primary types: Counters (which only go up, used to calculate rates), Gauges (which go up and down, used for current state like memory usage), and Histograms (which measure the statistical distribution of values, critical for latencies and percentiles like P99).
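To make the three types concrete, here is a minimal standard-library sketch. It is not any real client library's API: the class names, the bucket bounds, and the `quantile_upper_bound` and `per_second_rate` helpers are all illustrative assumptions.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonic total: it only ever increases."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can move in either direction."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations into buckets defined by upper bounds."""
    bounds: tuple = (0.1, 0.5, 1.0, 5.0)  # assumed bucket bounds, in seconds
    counts: list = field(default_factory=list)

    def __post_init__(self) -> None:
        # One extra slot catches observations above the largest bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left picks the first bucket whose bound is >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1

    def quantile_upper_bound(self, q: float) -> float:
        """Smallest bucket bound covering at least fraction q of observations."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("no observations recorded")
        running = 0
        for bound, count in zip(self.bounds + (float("inf"),), self.counts):
            running += count
            if running >= q * total:
                return bound
        return float("inf")


def per_second_rate(earlier: float, later: float, seconds: float) -> float:
    """Rate of increase between two counter readings taken `seconds` apart."""
    return (later - earlier) / seconds
```

The histogram only reports which bucket a percentile falls into, so the precision of a P99 estimate depends on how well the bucket bounds match the latency range of interest.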
To prevent TSDB performance degradation, labels (dimensions) must be strictly controlled. High-cardinality data (user IDs, request paths, trace IDs) must be kept out of the metrics pipeline entirely.
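The arithmetic behind label control is simple: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below illustrates that check. The budget of 10,000 and the deny-list of label names are assumptions chosen for illustration, and `check_metric` is a hypothetical helper rather than part of any tool.

```python
from math import prod

# Assumed per-metric budget; the right number depends on the TSDB and its memory.
MAX_SERIES_PER_METRIC = 10_000

# Assumed deny-list of label names whose values are unbounded by nature.
UNBOUNDED_LABELS = {"user_id", "trace_id", "request_path", "email", "ip"}


def projected_series(label_values: dict[str, int]) -> int:
    """Worst-case series count: product of distinct values per label."""
    return prod(label_values.values())


def check_metric(name: str, label_values: dict[str, int]) -> list[str]:
    """Return human-readable violations; an empty list means the metric passes."""
    problems = []
    for label in sorted(label_values.keys() & UNBOUNDED_LABELS):
        problems.append(f"{name}: label '{label}' is unbounded")
    series = projected_series(label_values)
    if series > MAX_SERIES_PER_METRIC:
        problems.append(
            f"{name}: {series} projected series exceeds {MAX_SERIES_PER_METRIC}"
        )
    return problems


if __name__ == "__main__":
    # 5 methods x 6 status classes x 4 regions = 120 series: well within budget.
    print(check_metric("http_requests_total", {"method": 5, "status": 6, "region": 4}))
    # A single unbounded label is enough to blow the budget.
    print(check_metric("http_requests_total", {"method": 5, "user_id": 1_000_000}))
```

Because the count is a product, adding one more label with even a few dozen values multiplies the existing series count rather than adding to it.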
Alerting rules are then built on top of these metrics, prioritizing symptoms over causes. Instead of alerting on disk space or CPU, alerts trigger when a Service Level Indicator (SLI), such as the user-facing error rate or latency, breaches its acceptable threshold. Cause-based metrics are reserved for post-alert triage and debugging, not for initial paging.
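A symptom-based rule can be sketched as a ratio of two counter deltas over a window. The example below assumes cumulative request and failure counters sampled at two points in time. The `Sample` type, the `should_page` helper, and the 1% default threshold are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Cumulative counter readings at one point in time."""
    timestamp: float  # seconds
    total_requests: int
    failed_requests: int


def error_ratio(start: Sample, end: Sample) -> float:
    """Fraction of requests that failed between two samples."""
    requests = end.total_requests - start.total_requests
    if requests <= 0:
        # No traffic in the window (or a counter reset): nothing to judge.
        return 0.0
    return (end.failed_requests - start.failed_requests) / requests


def should_page(start: Sample, end: Sample, threshold: float = 0.01) -> bool:
    """Page only when the user-facing error ratio exceeds the threshold."""
    return error_ratio(start, end) > threshold


if __name__ == "__main__":
    window_start = Sample(timestamp=0, total_requests=1000, failed_requests=5)
    window_end = Sample(timestamp=300, total_requests=3000, failed_requests=45)
    print(error_ratio(window_start, window_end), should_page(window_start, window_end))
```

Note that CPU and disk never appear in the rule: those values remain available for triage once the page has fired.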
System Flow
Metrics flow from applications to a TSDB, where a rules engine evaluates symptom-based alerts for paging, while cause-based metrics are routed to dashboards and low-priority channels.
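The routing step at the end of this flow can be sketched as a small decision function. The `Signal` type, the destination names, and the `actionable` flag are assumptions made for illustration; a real alert manager would express this as routing configuration rather than code.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SYMPTOM = "symptom"  # user-facing: error rate, latency
    CAUSE = "cause"      # resource-level: CPU, disk, pool size


class Destination(Enum):
    PAGER = "pager"
    TICKET_QUEUE = "ticket_queue"
    DASHBOARD = "dashboard"


@dataclass(frozen=True)
class Signal:
    name: str
    kind: Kind
    breached: bool           # did the rules engine see a threshold breach?
    actionable: bool = True  # can the on-call engineer do something about it now?


def route(signal: Signal) -> Destination:
    """Page only for breached, actionable symptoms; everything else is low priority."""
    if not signal.breached:
        return Destination.DASHBOARD
    if signal.kind is Kind.SYMPTOM and signal.actionable:
        return Destination.PAGER
    return Destination.TICKET_QUEUE


if __name__ == "__main__":
    print(route(Signal("HighErrorRate", Kind.SYMPTOM, breached=True)))
    print(route(Signal("HighCPU", Kind.CAUSE, breached=True)))
```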
Real-World Examples (Indicative)
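SoundCloud engineers started building Prometheus in 2012 after finding that Graphite + StatsD couldn't support label-based filtering across microservices. Prometheus's pull model scrapes /metrics endpoints at a configurable interval (15 seconds in this example). SoundCloud standardized on RED (Rate, Errors, Duration) for all HTTP services: the PromQL expression sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01 pages when more than 1% of requests fail, replacing 200+ cause-based CPU and disk alerts with 12 symptom-based ones. The sum() on each side matters: without it, the division matches series with identical label sets, so each 5xx series is divided by itself and the ratio is always 1.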
Uber built M3DB to handle 500M+ metric data points/second at peak across their global fleet. The cardinality problem: naive Prometheus labeling with driver_id or trip_id labels would create billions of unique time series per metric name. M3DB uses a three-tier aggregation pipeline: raw 10-second metrics downsampled to 1-minute aggregates after 2 hours, then 10-minute aggregates after 24 hours, while raw data is dropped—reducing storage by 90% without losing trend visibility.
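The tiered downsampling idea can be sketched in a few lines. This is not M3DB's implementation: it averages values per time bucket, which suits gauge-like data only, and the `resolution_for_age` and `downsample` helpers are hypothetical names. The tier boundaries are taken from the description above.

```python
from collections import defaultdict
from statistics import fmean


def resolution_for_age(age_seconds: float) -> int:
    """Resolution (in seconds) at which data of a given age is kept."""
    if age_seconds < 2 * 3600:
        return 10    # raw 10-second points for the first 2 hours
    if age_seconds < 24 * 3600:
        return 60    # 1-minute aggregates up to 24 hours
    return 600       # 10-minute aggregates after that


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        # Align each point to the start of the bucket it falls into.
        buckets[timestamp - timestamp % bucket_seconds].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    raw = [(t, float(t)) for t in range(0, 120, 10)]  # twelve 10-second points
    print(downsample(raw, resolution_for_age(3 * 3600)))
```

Twelve raw points collapse into two 1-minute aggregates here. A production pipeline would typically keep several aggregates per bucket (for example min, max, sum, and count) instead of a single mean, so that spikes remain visible after downsampling.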
Shopify uses Datadog with strict label rules enforced in CI. In 2019, a developer added request_path as a label on an HTTP duration histogram. The path included URL-encoded query strings, creating 10M+ unique time series in under 5 minutes, saturating Datadog's ingest pipeline and triggering a partial metrics outage during Cyber Monday. Shopify now runs datadog-ci metric cardinality checks in PRs, rejecting any metric whose projected label combinations exceed 10,000.
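When a path-like dimension is still wanted, one possible mitigation (an assumption here, not something the incident description prescribes) is to collapse raw URLs into a bounded route template before they are used as a label value. The sketch below drops the query string and replaces numeric and UUID segments with placeholders. `route_label` is a hypothetical helper.

```python
import re
from urllib.parse import urlsplit

# Matches the canonical 8-4-4-4-12 hexadecimal UUID layout.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def route_label(raw_url: str) -> str:
    """Reduce a raw URL to a low-cardinality route template."""
    path = urlsplit(raw_url).path  # drops the query string and fragment
    segments = []
    for segment in path.split("/"):
        if segment.isdigit():
            segments.append(":id")
        elif _UUID.match(segment):
            segments.append(":uuid")
        else:
            segments.append(segment)
    return "/".join(segments) or "/"


if __name__ == "__main__":
    print(route_label("/orders/12345/items?page=2&sort=asc"))
    print(route_label("/users/123e4567-e89b-12d3-a456-426614174000"))
```

This heuristic still leaks cardinality when segments are free-form slugs, so taking the route template directly from the web framework's router is usually the safer source for such a label.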
Anti-Patterns
- Adding dynamic values like user_id, email, uuid, or raw url paths as metric labels. This creates millions of unique time series, exhausting TSDB memory.
- Setting high-priority pages for transient CPU or memory utilization. This trains engineers to ignore alerts, leading to missed real outages.
- Triggering a page the moment a single request fails. Networks are inherently lossy; alerts must be based on sustained rates or percentages over time.
- Embedding alert thresholds directly in application code rather than configuring them declaratively in the metrics/alerting platform.
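The last two anti-patterns point at the same remedy: keep the rule as declarative data, and require a breach to persist before paging. The sketch below is loosely modeled on the pending/firing states used by rule engines such as Prometheus, but the `Rule` and `RuleState` classes and their fields are illustrative assumptions, not a real platform's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Declarative rule: plain data that lives in the alerting platform's config."""
    name: str
    threshold: float    # error ratio that counts as a breach
    for_seconds: float  # how long the breach must persist before paging


class RuleState:
    """Tracks how long a single rule has been continuously breached."""

    def __init__(self, rule: Rule) -> None:
        self.rule = rule
        self.breach_started: Optional[float] = None

    def evaluate(self, now: float, error_ratio: float) -> str:
        """Return 'ok', 'pending', or 'firing' for this evaluation cycle."""
        if error_ratio <= self.rule.threshold:
            self.breach_started = None  # recovery resets the timer
            return "ok"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.rule.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    state = RuleState(Rule(name="HighErrorRate", threshold=0.01, for_seconds=300))
    for now, ratio in [(0, 0.5), (60, 0.0), (120, 0.05), (420, 0.05)]:
        print(now, state.evaluate(now, ratio))
```

A single failed evaluation only moves the rule to pending. It pages only after the error ratio has stayed above the threshold for the full `for_seconds` window, which filters out transient network blips.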
Design Tradeoffs
| Dimension | Symptom-Based Alerting | Cause-Based Alerting |
|---|---|---|
| Alert quality | Pages fire when users are actually impacted; low false-positive rate keeps on-call engineers responsive | Pages fire on resource thresholds (CPU, disk) that may not affect users; high false-positive rate causes alert fatigue |
| Triage speed | Alert title communicates user impact directly; engineer goes straight to a runbook for that symptom | Alert names a resource; engineer must first determine whether it is causing user-visible impact before acting |
| Detection timing | Detects impact as it happens, but symptoms lag their causes, so it cannot warn of an impending failure before users notice | Detects infrastructure anomalies before user impact, which is useful for capacity planning and proactive remediation |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| You need real-time, low-overhead visibility into system health, throughput, and performance trends. | You need to reconstruct the exact sequence of events for a specific user transaction (use tracing and logs instead). |
| You are designing auto-scaling rules based on system load or queue depth. | You need to store audit trails or compliance data where every single event must be preserved. |
| You are establishing high-level service level objectives (SLOs) for business stakeholders. | You are debugging complex code-level bugs that require inspecting local variable state. |