Anomaly Detection Systems

Anomaly detection replaces static thresholds with dynamic baselines that adapt to natural system seasonality.

TL;DR
  • Anomaly detection replaces static thresholds with dynamic baselines that adapt to natural system seasonality.
  • Standard deviation and rolling Z-scores are used to identify statistical outliers in high-volume metric streams.
  • Overly sensitive models generate massive alert noise, destroying engineer trust in the detection system.
  • Anomaly detection is highly effective for seasonal business metrics but fails on volatile, low-volume infrastructure metrics.

The Problem

Static thresholds are fundamentally incapable of monitoring systems with strong seasonal patterns. For example, a food delivery app experiences massive traffic spikes at 12 PM and 6 PM, and near-zero traffic at 3 AM. If an engineer sets a static low-traffic alert to catch database dropouts, it will trigger every night at 3 AM (false positive). If they lower the threshold to accommodate the night cycle, the system will fail to alert if traffic drops by 80% during the peak Friday night rush (false negative). Engineers are left constantly adjusting thresholds to match the day of the week or holidays.
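The dilemma above can be made concrete with a small sketch. The hourly order counts and both thresholds are invented for illustration; only the shape of the failure matters.

```python
# Hypothetical "normal" orders per hour for a food delivery app (illustrative numbers).
NORMAL_BY_HOUR = {3: 40, 12: 5000, 18: 6000}


def static_low_traffic_alert(orders: int, threshold: int) -> bool:
    """Fire when traffic falls below a fixed floor."""
    return orders < threshold


# Threshold tuned for daytime traffic: it fires on a perfectly normal 3 AM trough.
daytime_threshold = 1000
night_false_positive = static_low_traffic_alert(NORMAL_BY_HOUR[3], daytime_threshold)

# Threshold lowered to tolerate the night trough: an 80% drop at 6 PM is missed.
night_threshold = 20
peak_during_incident = NORMAL_BY_HOUR[18] // 5  # 80% below the normal 6 PM volume
peak_false_negative = not static_low_traffic_alert(peak_during_incident, night_threshold)

print(night_false_positive, peak_false_negative)
```

No single value of the threshold avoids both outcomes, because the normal range at 3 AM and at 6 PM do not overlap.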

Core System Idea

Anomaly detection systems solve this by calculating dynamic, seasonality-aware baselines. Instead of comparing a metric to a fixed number, the system compares the current value to historical behavior for that specific time window (e.g., comparing this Tuesday at 2 PM to the average of the last three Tuesdays at 2 PM).
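A minimal sketch of the "same slot in prior weeks" comparison follows. The `seasonal_baseline` function, the in-memory `history` dictionary, and the sample values are assumptions made for illustration; a real system would read from a metrics store.

```python
from datetime import datetime, timedelta
from statistics import mean


def seasonal_baseline(history: dict[datetime, float], now: datetime, weeks: int = 3) -> float:
    """Average the values seen at the same weekday and hour over the prior `weeks` weeks."""
    samples = []
    for k in range(1, weeks + 1):
        ts = now - timedelta(weeks=k)
        if ts in history:
            samples.append(history[ts])
    if not samples:
        raise ValueError("no history for this time slot")
    return mean(samples)


now = datetime(2024, 1, 23, 14)  # a Tuesday at 2 PM
history = {
    now - timedelta(weeks=1): 980.0,
    now - timedelta(weeks=2): 1010.0,
    now - timedelta(weeks=3): 1040.0,
}
baseline = seasonal_baseline(history, now)
print(baseline)
```

The current value is then judged against `baseline` rather than against a fixed number.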

The core architecture uses statistical models, such as rolling Z-scores, Holt-Winters exponential smoothing, or seasonal decomposition, to estimate an expected value for the metric (the baseline) and the typical spread around it (the standard deviation).

The system defines an "envelope" or band of normal behavior (typically 2 to 3 standard deviations from the baseline). Any data point that falls outside this band is flagged as an anomaly.
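As a rough sketch, the envelope can be computed from recent samples as the mean plus or minus `k` standard deviations. The function names and sample values are illustrative, and the sample standard deviation is one reasonable choice among several.

```python
from statistics import mean, stdev


def envelope(samples: list[float], k: float = 3.0) -> tuple[float, float]:
    """Return the (lower, upper) band: mean +/- k sample standard deviations."""
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value: float, samples: list[float], k: float = 3.0) -> bool:
    """Flag a value that falls outside the band."""
    lower, upper = envelope(samples, k)
    return value < lower or value > upper


recent = [100.0, 102.0, 98.0, 101.0, 99.0]
lower, upper = envelope(recent)
print(round(lower, 2), round(upper, 2))
print(is_anomalous(104.0, recent), is_anomalous(110.0, recent))
```

Lowering `k` toward 2 narrows the band and raises sensitivity; raising it trades missed detections for fewer false alerts.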

To prevent alert storms, the system applies dampening filters, requiring a metric to remain anomalous for a minimum duration or across multiple related indicators before triggering an alert.
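One simple form of dampening is a consecutive-count filter: alert only after the metric has been out of band for `required` evaluations in a row. The class below is a hypothetical sketch of that idea and does not cover the multi-indicator case.

```python
class SustainedFilter:
    """Suppress alerts until `required` consecutive evaluations are anomalous."""

    def __init__(self, required: int = 3):
        self.required = required
        self.streak = 0

    def update(self, is_outlier: bool) -> bool:
        # Any in-band point resets the streak.
        self.streak = self.streak + 1 if is_outlier else 0
        return self.streak >= self.required


sustained = SustainedFilter(required=3)
outlier_flags = [True, True, False, True, True, True, True]
alerts = [sustained.update(flag) for flag in outlier_flags]
print(alerts)
```

The two isolated outliers at the start never page anyone; only the sustained run at the end does.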

System Flow

flowchart TD
    A[Metric Stream] --> B["Data Pre-Processor: Clean and Align"]
    B --> C["Baseline Predictor: Historical Model"]
    C --> D["Calculate Dynamic Envelope: Std Dev"]
    A --> E[Anomaly Evaluator]
    D --> E
    E --> F{"Is Outlier?"}
    F -- "Yes and Sustained" --> G["Noise Filter: Dampening"]
    F -- "No" --> H[Update Model Weights]
    G --> I[Trigger Alert]

Real-time metrics are compared against a dynamically calculated historical baseline envelope, passing through noise filters before generating alerts.

Real-World Examples (Indicative)

Datadog Agile Algorithm at Shopify

Datadog documents its Agile anomaly algorithm as a robust variant of SARIMA. Configured with weekly seasonality, it compares the current metric value to the same weekday and same hour across the 4 prior weeks. Sensitivity is configurable in standard deviations (default 3σ). Shopify uses this on checkout_completion_rate to catch Sunday-evening drops that static thresholds miss: a 12% dip at 8 PM Sunday is invisible against a global threshold but is flagged because the same metric ran at 92% every prior Sunday at that hour.

Netflix Atlas with Kayvee

Netflix's Kayvee anomaly engine runs on top of their Atlas metrics platform, processing 1B+ metric data points/day. Kayvee uses DBSCAN clustering to group microservices with similar traffic signatures and evaluates each against a 4-week rolling baseline. When stream-start rates in a specific AWS region drop 15% below baseline for 3 consecutive minutes, Kayvee fires a geo-localized alert. This detected a CDN degradation in ap-southeast-1 that had no impact on global error rate—a cross-region static threshold would have stayed silent.

Meta Prophet for Business Metrics

Meta open-sourced Prophet in 2017, their production forecasting library for business time-series. Prophet uses additive decomposition (trend + weekly seasonality + yearly seasonality + holiday effects). Meta applies it to daily active users and ad impression volume where Black Friday creates step-changes that confuse naive Z-score models. Engineers add custom "regressor" variables (e.g., is_campaign=1) to explain expected step-changes in the baseline, preventing anomaly alerts from firing on planned marketing events.
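The effect of a regressor on the baseline can be illustrated with a hand-rolled additive model. This sketch does not call Prophet; the component values (`WEEKLY_EFFECT`, `CAMPAIGN_LIFT`), the tolerance, and the observed figure are all hypothetical stand-ins for what a fitted model would estimate.

```python
# Hypothetical fitted components; a forecasting library would estimate these from data.
WEEKLY_EFFECT = {4: 1500.0}  # weekday index 4 = Friday
CAMPAIGN_LIFT = 4000.0       # expected uplift while a planned campaign is running


def expected_value(trend: float, weekday: int, is_campaign: int,
                   holiday_effect: float = 0.0) -> float:
    """Additive baseline: trend + weekly seasonality + holiday effect + regressor term."""
    return trend + WEEKLY_EFFECT.get(weekday, 0.0) + holiday_effect + CAMPAIGN_LIFT * is_campaign


observed = 15200.0
tolerance = 1000.0  # illustrative band half-width

without_regressor = expected_value(10000.0, weekday=4, is_campaign=0)
with_regressor = expected_value(10000.0, weekday=4, is_campaign=1)

alert_without_regressor = abs(observed - without_regressor) > tolerance
alert_with_regressor = abs(observed - with_regressor) > tolerance
print(alert_without_regressor, alert_with_regressor)
```

With `is_campaign=1` the planned uplift is part of the expected value, so the same observation no longer falls outside the band.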

Anti-Patterns

Applying to Low-Volume Metrics

Running anomaly detection on highly volatile, low-volume metrics (e.g., error counts that hover between 0 and 2). This leads to massive statistical noise and constant false alerts.
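The sketch below shows why this goes wrong: with counts hovering between 0 and 2, a single extra error already scores above 3 standard deviations. The sample counts are invented, and the volume gate (`should_evaluate` with a floor of 100 events per interval) is one assumed mitigation, not a prescribed rule.

```python
from statistics import mean, stdev


def z_score(value: float, history: list[int]) -> float:
    """Distance of `value` from the historical mean, in sample standard deviations."""
    return (value - mean(history)) / stdev(history)


def should_evaluate(history: list[int], min_mean_volume: float = 100.0) -> bool:
    """Skip statistical detection when the metric's average volume is too low."""
    return mean(history) >= min_mean_volume


error_counts = [0, 1, 0, 2, 1, 0, 1, 0, 0, 1]  # errors per minute, hovering between 0 and 2
z = z_score(3, error_counts)
print(round(z, 2), should_evaluate(error_counts))
```

Three errors in a minute is rarely an incident, yet a 3σ rule would flag it; gating on volume routes such metrics to a static threshold instead.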

Ignoring Planned Events

Failing to account for scheduled events like marketing campaigns, load tests, or system maintenance. The system will flag these expected surges as critical anomalies.

Black-Box ML Obsession

Implementing complex, uninterpretable deep learning models for basic time-series alerting. When the model triggers an alert, engineers cannot understand why, leading them to ignore it.

No Model Feedback Loop

Failing to provide a mechanism for engineers to mark an anomaly as a false positive, causing the system to continue alerting on the same normal behavior indefinitely.
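A feedback loop can be as simple as counting false-positive labels per metric and widening that metric's band in response. The `FeedbackStore` class and its step and cap values are hypothetical; other policies, such as retraining with the labeled points treated as normal, are equally plausible.

```python
from collections import defaultdict


class FeedbackStore:
    """Track false-positive labels and derive a per-metric sensitivity (band width in sigmas)."""

    def __init__(self, base_k: float = 3.0, step: float = 0.5, max_k: float = 5.0):
        self.base_k = base_k
        self.step = step
        self.max_k = max_k
        self.false_positives: defaultdict[str, int] = defaultdict(int)

    def mark_false_positive(self, metric: str) -> None:
        self.false_positives[metric] += 1

    def sensitivity(self, metric: str) -> float:
        # Each label widens the band by `step`, capped so real incidents still alert.
        count = self.false_positives.get(metric, 0)
        return min(self.base_k + self.step * count, self.max_k)


store = FeedbackStore()
initial_k = store.sensitivity("checkout_rate")
store.mark_false_positive("checkout_rate")
store.mark_false_positive("checkout_rate")
adjusted_k = store.sensitivity("checkout_rate")
print(initial_k, adjusted_k)
```

The cap matters: without it, repeated dismissals could widen the band until the detector never fires.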

Design Tradeoffs

| Dimension | Dynamic Anomaly Detection | Static Threshold |
| --- | --- | --- |
| Seasonal adaptability | Automatically accounts for daily, weekly, and holiday seasonality; no manual re-tuning required as traffic patterns shift | Cannot adapt; requires constant manual threshold adjustment when business patterns change (nights, weekends, holidays) |
| Alert comprehensibility | Can be opaque; engineers need a baseline envelope visualization to understand why the model fired | Immediately understandable; checkout_rate < 100 for 5 min requires no model knowledge to interpret |
| Low-volume suitability | Poor; statistical models produce high variance and false positives on metrics with fewer than ~100 events/min | Works for any volume, including binary (0 or 1 per interval) metrics where statistical baselines are meaningless |

Best Practices

Use for Business Metrics First

Apply anomaly detection to high-level business KPIs (e.g., signups per minute, checkout volume), where seasonality is highly predictable, before applying it to infrastructure metrics.

Combine with Static Safety Bounds

Always pair anomaly detection with absolute static bounds (e.g., "Alert if disk usage > 90% regardless of what the anomaly model predicts").

Implement Holiday Calendars

Feed holiday and special event calendars into your baseline models to prevent false alerts during known non-standard business days like Black Friday or major product launches.

Enforce a Training Window

Ensure your models have at least 14-28 days of historical data before enabling alerts, allowing the system to fully learn weekly and bi-weekly cycles.
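Two of these practices, static safety bounds and a minimum training window, combine naturally in the alert decision. The sketch below assumes a 14-day minimum and a hypothetical `should_alert` function; the band values would come from whatever baseline model is in use.

```python
from datetime import timedelta

MIN_TRAINING = timedelta(days=14)  # lower end of the 14-28 day range


def should_alert(value: float, lower: float, upper: float,
                 hard_limit: float, history_span: timedelta) -> bool:
    """Static bound always applies; the dynamic band applies only once the model is trained."""
    if value > hard_limit:
        return True  # absolute safety bound wins regardless of the model
    if history_span < MIN_TRAINING:
        return False  # too little history to trust the envelope yet
    return value < lower or value > upper


# Disk usage at 93% alerts even though the model has only 3 days of history.
over_hard_limit = should_alert(93.0, 40.0, 95.0, hard_limit=90.0, history_span=timedelta(days=3))
# An out-of-band value is ignored while the model is still in training.
untrained = should_alert(30.0, 40.0, 80.0, hard_limit=90.0, history_span=timedelta(days=3))
# The same value alerts once enough history exists.
trained = should_alert(30.0, 40.0, 80.0, hard_limit=90.0, history_span=timedelta(days=21))
print(over_hard_limit, untrained, trained)
```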

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Monitoring high-volume, highly seasonal metrics where "normal" changes drastically by time of day or day of week. | Monitoring critical system resources with hard physical limits (e.g., disk capacity, memory exhaustion). |
| You want to detect subtle, slow-drifting degradations that do not cross absolute static thresholds. | Operating early-stage systems with highly volatile, unpredictable traffic patterns and no historical baseline. |
| Managing large-scale fleets of identical servers where manual threshold configuration per host is impossible. | Monitoring low-throughput endpoints where a single request can cause 100% variance in the metric. |