Anomaly Detection Systems
- Anomaly detection replaces static thresholds with dynamic baselines that adapt to natural system seasonality.
- Standard deviation and rolling Z-scores are used to identify statistical outliers in high-volume metric streams.
- Overly sensitive models generate massive alert noise, destroying engineer trust in the detection system.
- Anomaly detection is highly effective for seasonal business metrics but fails on volatile, low-volume infrastructure metrics.
The Problem
Static thresholds are fundamentally incapable of monitoring systems with strong seasonal patterns. For example, a food delivery app experiences massive traffic spikes at 12 PM and 6 PM, and near-zero traffic at 3 AM. If an engineer sets a static low-traffic alert to catch database dropouts, it will trigger every night at 3 AM (false positive). If they lower the threshold to accommodate the night cycle, the system will fail to alert if traffic drops by 80% during the peak Friday night rush (false negative). Engineers are left constantly adjusting thresholds to match the day of the week or holidays.
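The dilemma can be reproduced with a few lines of arithmetic. The sketch below uses made-up traffic numbers (the constants and the `static_low_traffic_alert` helper are illustrative, not taken from any real system) to show that neither threshold choice works for both the night cycle and the Friday rush.

```python
# Made-up order counts per hour, for illustration only.
NIGHT_3AM = 20         # normal quiet period
FRIDAY_PEAK = 5000     # normal Friday-evening rush
FRIDAY_OUTAGE = 1000   # the same rush after an 80% drop


def static_low_traffic_alert(value, threshold):
    """Fire when traffic falls below a fixed threshold."""
    return value < threshold


# Threshold tuned for daytime traffic: pages every night (false positive).
daytime_tuned = static_low_traffic_alert(NIGHT_3AM, threshold=100)

# Threshold lowered to tolerate the night cycle: the 3 AM page goes away,
# but the 80% drop during the rush is missed (false negative).
night_tuned_at_3am = static_low_traffic_alert(NIGHT_3AM, threshold=10)
night_tuned_outage = static_low_traffic_alert(FRIDAY_OUTAGE, threshold=10)

print(daytime_tuned, night_tuned_at_3am, night_tuned_outage)
```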
Core System Idea
Anomaly detection systems solve this by calculating dynamic, seasonality-aware baselines. Instead of comparing a metric to a fixed number, the system compares the current value to historical behavior for that specific time window (e.g., comparing this Tuesday at 2 PM to the average of the last three Tuesdays at 2 PM).
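A minimal sketch of the "same slot in prior weeks" comparison follows. It assumes an in-memory dictionary keyed by hour-aligned timestamps; the `seasonal_baseline` function and the sample values are hypothetical and stand in for whatever metric store a real system would query.

```python
from datetime import datetime, timedelta
from statistics import mean


def seasonal_baseline(history, now, weeks=3):
    """Average the values seen at the same weekday and hour in prior weeks.

    history: dict mapping hour-aligned datetime -> metric value (assumed store).
    Returns None when no prior sample exists for this slot.
    """
    slot = now.replace(minute=0, second=0, microsecond=0)
    samples = []
    for k in range(1, weeks + 1):
        past = slot - timedelta(weeks=k)
        if past in history:
            samples.append(history[past])
    return mean(samples) if samples else None


# The three Tuesdays at 2 PM before Tuesday 2024-01-23.
history = {
    datetime(2024, 1, 2, 14): 980.0,
    datetime(2024, 1, 9, 14): 1010.0,
    datetime(2024, 1, 16, 14): 1040.0,
}
baseline = seasonal_baseline(history, datetime(2024, 1, 23, 14, 25))
print(baseline)
```

Averaging only a handful of prior weeks keeps the baseline responsive to recent shifts, at the cost of being sensitive to a single unusual week.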
The core architecture uses statistical models such as rolling Z-scores, Holt-Winters exponential smoothing, or seasonal decomposition to estimate the expected value of the metric (the baseline) and its typical spread (the standard deviation) for the current time window.
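Of the models listed, the rolling Z-score is the simplest to illustrate. The sketch below is one possible formulation, assuming a fixed-size window and the sample standard deviation; the `RollingZScore` class is a hypothetical name, and it does not model seasonality on its own.

```python
from collections import deque
from statistics import mean, stdev


class RollingZScore:
    """Score each new point against a sliding window of recent points."""

    def __init__(self, window=60):
        self.values = deque(maxlen=window)

    def score(self, x):
        """Return the z-score of x versus the window, then add x to the window.

        Returns None until the window holds two points, or when the window
        has zero spread (a z-score is undefined in that case).
        """
        z = None
        if len(self.values) >= 2:
            mu = mean(self.values)
            sigma = stdev(self.values)
            if sigma > 0:
                z = (x - mu) / sigma
        self.values.append(x)
        return z


detector = RollingZScore(window=6)
warmup = [detector.score(v) for v in [10, 12, 10, 12, 10, 12]]
spike = detector.score(30)
print(spike)
```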
The system defines an "envelope" or band of normal behavior (typically 2 to 3 standard deviations from the baseline). Any data point that falls outside this band is flagged as an anomaly.
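The band check itself is small. This sketch assumes the baseline and standard deviation have already been computed elsewhere; `envelope`, `is_anomalous`, and the multiplier `k` are illustrative names.

```python
def envelope(baseline, sigma, k=3.0):
    """Return the (lower, upper) bounds of the normal band."""
    return baseline - k * sigma, baseline + k * sigma


def is_anomalous(value, baseline, sigma, k=3.0):
    """True when value falls outside the k-sigma band around the baseline."""
    lower, upper = envelope(baseline, sigma, k)
    return value < lower or value > upper


bounds = envelope(1000.0, 50.0)
print(bounds, is_anomalous(800.0, 1000.0, 50.0), is_anomalous(1100.0, 1000.0, 50.0))
```

A smaller `k` narrows the band and raises sensitivity, which also raises the false-positive rate.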
To prevent alert storms, the system applies dampening filters, requiring a metric to remain anomalous for a minimum duration or across multiple related indicators before triggering an alert.
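One simple form of dampening is a consecutive-point filter. The sketch below covers only the minimum-duration rule, not correlation across related indicators, and the `ConsecutiveFilter` class is a hypothetical name.

```python
class ConsecutiveFilter:
    """Pass an alert only after N anomalous points in a row."""

    def __init__(self, min_consecutive=3):
        self.min_consecutive = min_consecutive
        self.streak = 0

    def update(self, anomalous):
        """Feed one evaluation result; return True when the alert should fire."""
        self.streak = self.streak + 1 if anomalous else 0
        return self.streak >= self.min_consecutive


flt = ConsecutiveFilter(min_consecutive=3)
flags = [True, True, False, True, True, True, True]
fired = [flt.update(f) for f in flags]
print(fired)
```

The single normal point in the middle resets the streak, so the alert fires only on the sixth sample.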
System Flow
Real-time metrics are compared against a dynamically calculated historical baseline envelope, passing through noise filters before generating alerts.
Real-World Examples (Indicative)
Datadog's Agile anomaly algorithm is described in Datadog's documentation as a robust variant of SARIMA. With weekly seasonality configured, it compares the current metric value against the same weekday and hour in prior weeks. Sensitivity is set as a band width in deviations (values of 2 or 3 are typical). Shopify uses this on checkout_completion_rate to catch Sunday-evening drops that static thresholds miss: a 12% dip at 8 PM Sunday is invisible against a global threshold, but it is flagged because the same metric ran at 92% every prior Sunday at that hour.
Netflix's Kayvee anomaly engine runs on top of their Atlas metrics platform, processing 1B+ metric data points/day. Kayvee uses DBSCAN clustering to group microservices with similar traffic signatures and evaluates each against a 4-week rolling baseline. When stream-start rates in a specific AWS region drop 15% below baseline for 3 consecutive minutes, Kayvee fires a geo-localized alert. This detected a CDN degradation in ap-southeast-1 that had no impact on global error rate—a cross-region static threshold would have stayed silent.
Meta open-sourced Prophet in 2017, their production forecasting library for business time-series. Prophet uses additive decomposition (trend + weekly seasonality + yearly seasonality + holiday effects). Meta applies it to daily active users and ad impression volume where Black Friday creates step-changes that confuse naive Z-score models. Engineers add custom "regressor" variables (e.g., is_campaign=1) to explain expected step-changes in the baseline, preventing anomaly alerts from firing on planned marketing events.
Anti-Patterns
Running anomaly detection on highly volatile, low-volume metrics (e.g., error counts that hover between 0 and 2). This leads to massive statistical noise and constant false alerts.
Failing to account for scheduled events like marketing campaigns, load tests, or system maintenance. The system will flag these expected surges as critical anomalies.
Implementing complex, uninterpretable deep learning models for basic time-series alerting. When the model triggers an alert, engineers cannot understand why, leading them to ignore it.
Failing to provide a mechanism for engineers to mark an anomaly as a false positive, causing the system to continue alerting on the same normal behavior indefinitely.
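The anti-patterns above suggest a set of guards that can run before any alert is sent. The sketch below is one hypothetical way to combine them: the `AlertGate` class, the `MIN_MEAN_EVENTS` floor, and the slot-based feedback key are all assumptions for illustration. Returning a reason string keeps each decision explainable to the engineer on call.

```python
from datetime import datetime
from statistics import mean

MIN_MEAN_EVENTS = 100  # assumed volume floor below which scoring is skipped


class AlertGate:
    def __init__(self, planned_windows=()):
        # (start, end) datetime pairs for campaigns, load tests, maintenance.
        self.planned_windows = list(planned_windows)
        # (metric, weekday, hour) slots that engineers marked as normal.
        self.known_normal = set()

    def mark_false_positive(self, metric, ts):
        """Record engineer feedback so the same slot stops alerting."""
        self.known_normal.add((metric, ts.weekday(), ts.hour))

    def decide(self, metric, ts, recent_counts):
        """Return 'alert' or a human-readable suppression reason."""
        if mean(recent_counts) < MIN_MEAN_EVENTS:
            return "suppressed: low volume"
        if any(start <= ts < end for start, end in self.planned_windows):
            return "suppressed: planned event"
        if (metric, ts.weekday(), ts.hour) in self.known_normal:
            return "suppressed: marked normal"
        return "alert"


gate = AlertGate(planned_windows=[(datetime(2024, 1, 10, 9), datetime(2024, 1, 10, 11))])
busy = [1200, 1180, 1220]

low_volume = gate.decide("errors", datetime(2024, 1, 8, 3), [0, 1, 2])
planned = gate.decide("orders", datetime(2024, 1, 10, 10), busy)
first_sunday = gate.decide("orders", datetime(2024, 1, 7, 20), busy)
gate.mark_false_positive("orders", datetime(2024, 1, 7, 20))
next_sunday = gate.decide("orders", datetime(2024, 1, 14, 20, 30), busy)
print(low_volume, planned, first_sunday, next_sunday, sep="\n")
```

In practice, feedback would likely widen the baseline for that slot rather than silence it outright; a permanent mute is used here only to keep the sketch short.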
Design Tradeoffs
| Dimension | Dynamic Anomaly Detection | Static Threshold |
|---|---|---|
| Seasonal adaptability | Automatically accounts for daily, weekly, and holiday seasonality; no manual re-tuning required as traffic patterns shift | Cannot adapt; requires constant manual threshold adjustment when business patterns change (nights, weekends, holidays) |
| Alert comprehensibility | Can be opaque; engineers need a baseline envelope visualization to understand why the model fired | Immediately understandable; checkout_rate < 100 for 5 min requires no model knowledge to interpret |
| Low-volume suitability | Poor; statistical models produce high variance and false positives on metrics with fewer than ~100 events/min | Works for any volume, including binary (0 or 1 per interval) metrics where statistical baselines are meaningless |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Monitoring high-volume, highly seasonal metrics where "normal" changes drastically by time of day or day of week. | Monitoring critical system resources with hard physical limits (e.g., disk capacity, memory exhaustion). |
| You want to detect subtle, slow-drifting degradations that do not cross absolute static thresholds. | Operating early-stage systems with highly volatile, unpredictable traffic patterns and no historical baseline. |
| Managing large-scale fleets of identical servers where manual threshold configuration per host is impossible. | Monitoring low-throughput endpoints where a single request can cause 100% variance in the metric. |