SLO-Based Alerting
- Traditional threshold alerts fail at scale by triggering too late for slow leaks or too often for transient spikes.
- Service Level Objectives (SLOs) define target reliability, while the Error Budget is the allowable room for failure.
- Burn rate alerts calculate how fast the error budget is being consumed, triggering pages based on time-to-exhaustion.
- Multi-window, multi-burn-rate alerts sharply reduce false positives while still ensuring rapid response to catastrophic outages.
The Problem
Static threshold alerts (e.g., "mean error rate over the last 5 minutes > 1%") are fragile. If a transient network blip causes a brief 2-minute spike of 5% errors, the 5-minute mean reaches 2% and the alert fires, waking an engineer for an issue that has already self-healed. Conversely, if a subtle bug causes a continuous 0.5% error rate, the service never crosses the 1% threshold. Over the course of a week, however, this "minor" bug can affect thousands of customers and silently erode the service's reliability. Engineers are trapped between constant false alarms and silent, slow-burning production failures.
Core System Idea
SLO-based alerting shifts the focus from arbitrary thresholds to the consumption of the "Error Budget."
First, we define a Service Level Indicator (SLI), which is the ratio of good events to total events (e.g., successful HTTP requests / total HTTP requests).
Next, we set a Service Level Objective (SLO), which is the target reliability over a rolling window (e.g., 99.9% success over 30 days). The remaining percentage (0.1%) is our Error Budget—the amount of pain we are willing to let our users experience.
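The SLI and error budget definitions above reduce to a few lines of arithmetic. The following is a minimal sketch; the function names are hypothetical, and treating a window with no traffic as fully compliant is an assumption rather than something the text prescribes.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service Level Indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def error_budget_fraction(slo_target: float) -> float:
    """Share of events allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def budget_as_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Express the budget as minutes of total outage over the rolling window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def budget_as_failed_requests(slo_target: float, total_requests: int) -> float:
    """Express the budget as the number of requests allowed to fail."""
    return error_budget_fraction(slo_target) * total_requests


if __name__ == "__main__":
    print(round(sli(999_500, 1_000_000), 4))                  # 0.9995
    print(round(budget_as_downtime_minutes(0.999), 1))        # 43.2
    print(round(budget_as_failed_requests(0.999, 1_000_000)))  # 1000
```

Expressing the same budget in minutes and in requests is useful because time-based and request-based SLOs are often discussed interchangeably, although they only coincide when traffic is roughly constant.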
Instead of alerting on raw error rates, we alert on the "Burn Rate": the speed at which we are consuming this budget, measured as the observed error rate divided by the error rate the SLO allows. A burn rate of 1 means the budget will last exactly 30 days. A burn rate of 14.4 sustained for 1 hour consumes 2% of the entire 30-day budget (14.4 × 1 h / 720 h). By calculating burn rates across multiple time windows (e.g., short 1-hour windows for rapid detection, and long 6-hour windows for slow leaks), we can page engineers only when there is a real threat to our reliability target.
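The burn-rate figures quoted above can be checked with a short calculation. This is a sketch under the assumption that burn rate is defined as the observed error rate divided by the budgeted error rate (1 minus the SLO target); the function names are illustrative.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    return error_rate / budget


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the whole budget spent after `hours` at a constant burn rate."""
    return rate * hours / (window_days * 24)


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    rate = burn_rate(error_rate=0.0144, slo_target=0.999)
    print(round(rate, 2))                             # 14.4
    print(round(budget_consumed(rate, hours=1), 4))   # 0.02, i.e. 2% in one hour
    print(round(days_to_exhaustion(rate), 2))         # 2.08 days
    print(days_to_exhaustion(1.0))                    # 30.0 days
```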
System Flow
The SLI engine evaluates requests to track error budget consumption, triggering critical pages for rapid budget depletion and tickets for slow, sustained leaks.
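The routing step in this flow might look like the sketch below. It assumes the burn rates for each window have already been computed upstream, and it uses the 14.4× (1-hour) and 6× (6-hour) thresholds quoted in this article; the function and constant names are hypothetical.

```python
from typing import Optional

PAGE_THRESHOLD_1H = 14.4   # fast burn: about 2% of a 30-day budget per hour
TICKET_THRESHOLD_6H = 6.0  # slower burn: about 5% of the budget per 6 hours


def route_alert(burn_1h: float, burn_6h: float) -> Optional[str]:
    """Return 'page', 'ticket', or None for the given window burn rates."""
    if burn_1h > PAGE_THRESHOLD_1H:
        return "page"      # rapid depletion: wake someone up
    if burn_6h > TICKET_THRESHOLD_6H:
        return "ticket"    # sustained leak: handle during working hours
    return None            # budget consumption is within tolerance


if __name__ == "__main__":
    print(route_alert(burn_1h=20.0, burn_6h=3.0))  # page
    print(route_alert(burn_1h=2.0, burn_6h=7.0))   # ticket
    print(route_alert(burn_1h=1.0, burn_6h=1.0))   # None
```

Checking the fast-burn condition first means a catastrophic outage is never downgraded to a ticket just because the slower condition is also true.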
Real-World Examples (Indicative)
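Google's Site Reliability Workbook popularized the standard burn-rate thresholds for a 30-day SLO: 14.4× over 1 hour (consuming 2% of the budget) and 6× over 6 hours (consuming 5% of the budget). The workbook recommends paging on both tiers and opening a ticket at 1× over 3 days (10% of the budget); routing the 6-hour tier to a ticket or a lower priority is a less aggressive variant. For a 99.9% SLO, the total 30-day budget is equivalent to 43.2 minutes of full downtime. Datadog supports burn-rate alerts natively; shown schematically rather than in exact monitor syntax, the conditions are avg(last_1h):avg:slo.burn_rate{service:checkout} > 14.4 for P1 and avg(last_6h):avg:slo.burn_rate{service:checkout} > 6 for P2. To cut noise, each long-window condition is typically paired with a much shorter window (for example, 5 minutes alongside 1 hour), and both must exceed the threshold before the alert fires.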
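The thresholds above follow directly from the share of budget one is willing to lose within a window. The sketch below derives them and adds a long-window plus short-window confirmation check; pairing each long window with a shorter one is a common practice assumed here, and all names are illustrative.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        slo_period_days: int = 30) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * slo_period_days * 24 / window_hours


def fires(long_window_burn: float, short_window_burn: float,
          threshold: float) -> bool:
    """Alert only if the burn is both sustained (long) and still ongoing (short)."""
    return long_window_burn > threshold and short_window_burn > threshold


if __name__ == "__main__":
    print(round(burn_rate_threshold(0.02, window_hours=1), 1))  # 14.4
    print(round(burn_rate_threshold(0.05, window_hours=6), 1))  # 6.0

    # Outage that has already recovered: long window still hot, short window calm.
    print(fires(long_window_burn=16.0, short_window_burn=0.5, threshold=14.4))   # False
    # Outage still in progress: both windows exceed the threshold.
    print(fires(long_window_burn=16.0, short_window_burn=30.0, threshold=14.4))  # True
```

The short window is what lets the alert reset quickly once the incident ends, instead of staying red until the long window drains.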
Atlassian applies burn-rate alerting to Jira, Confluence, and Bitbucket Cloud. Their P1 threshold is 14.4× over 1 hour; their P3 (slow-burn) threshold is 3× over 72 hours, catching degraded-but-not-broken service states that static thresholds miss entirely. After rolling out multi-window alerting in 2020, Atlassian reduced on-call pages by ~40% while MTTR held constant, because engineers stopped waking up for transient spikes that self-resolved.
Spotify defines SLOs per squad and tracks error budget spend in their internal "SLO Scoreboard." If a squad exhausts more than 50% of their monthly budget in Week 1, they are automatically put into "reliability sprint" mode—all feature work pauses and the squad focuses exclusively on stability. This policy is contractual between Platform Engineering and product squads, and burn rate dashboards are reviewed in every weekly engineering leadership sync.
Anti-Patterns
Aiming for 100% reliability. This is impossible in practice, extremely expensive to approach, and leaves a zero error budget, so any single failure breaches the target and teams can never safely deploy new features.
Defining dozens of SLOs for a single service. This dilutes focus; teams should maintain 2-3 SLOs per service that cover critical user journeys, such as availability, latency, and data correctness.
Treating the error budget as a vanity metric rather than an operational contract. If the budget is exhausted, feature deployments must be frozen to focus on reliability.
Alerting on a short window only (e.g., 5 minutes). This causes extreme alert noise during brief, self-correcting network drops.
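Treating the budget as an operational contract usually means some automated check that gates feature deployments. The sketch below is one hypothetical way to express that gate; it measures the budget against traffic seen so far in the window, which is an assumption, and the function names are illustrative.

```python
def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    if total == 0:
        return 1.0  # assumption: nothing served, nothing spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target has no budget: any failure overspends it immediately.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def deploys_allowed(good: int, total: int, slo_target: float) -> bool:
    """Freeze feature deployments once the budget is exhausted."""
    return budget_remaining(good, total, slo_target) > 0


if __name__ == "__main__":
    print(deploys_allowed(999_500, 1_000_000, 0.999))  # True: half the budget left
    print(deploys_allowed(998_000, 1_000_000, 0.999))  # False: budget overspent
    print(deploys_allowed(999_999, 1_000_000, 1.0))    # False: 100% target, one failure
```

The last call illustrates the first anti-pattern: with a 100% target, a single failed request is enough to freeze all feature work.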
Design Tradeoffs
| Dimension | Multi-Window Burn Rate | Static Threshold |
|---|---|---|
| False positive rate | Very low; short window catches fast burns while long window confirms slow leaks before paging engineers | High; brief transient spikes and network drops trigger pages for issues that self-resolve within minutes |
| Slow-burn detection | Detects sustained low error rates (e.g., 0.5%) that never cross a raw threshold but exhaust budget over days | Blind to slow leaks below the threshold; a 0.5% error rate running for a week goes entirely undetected |
| Configuration complexity | Requires computing 2-3 burn rates, maintaining 30-day rolling windows, and coordinating error budget policies | Simple single PromQL rule like error_rate > 0.01 covers the basic case with no window algebra |
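The first two rows of the table can be reproduced with a small simulation of per-minute error rates. This is a sketch, assuming a 99.9% SLO, constant traffic per minute, and three burn-rate tiers taken from the thresholds mentioned in this article (14.4× over 1 hour, 6× over 6 hours, 3× over 72 hours). Note that a 0.5% error rate is a burn rate of 5, so in this setup it is the 72-hour tier that catches it, not the 6-hour one.

```python
SLO_TARGET = 0.999
BUDGET = 1.0 - SLO_TARGET

# (severity, window in minutes, burn-rate threshold)
TIERS = [("page", 60, 14.4), ("ticket", 360, 6.0), ("ticket", 72 * 60, 3.0)]


def trailing_mean(series, minutes):
    """Mean error rate over the last `minutes` samples (one sample per minute)."""
    window = series[-minutes:]
    return sum(window) / len(window)


def static_alert(series):
    """Static rule: mean error rate over the last 5 minutes exceeds 1%."""
    return trailing_mean(series, 5) > 0.01


def burn_rate_alerts(series):
    """Return the (severity, window) tiers whose burn rate exceeds its threshold."""
    return [(severity, minutes) for severity, minutes, threshold in TIERS
            if trailing_mean(series, minutes) / BUDGET > threshold]


# Scenario 1: three quiet days ending in a 2-minute spike of 5% errors.
spike = [0.0] * (72 * 60 - 2) + [0.05] * 2
# Scenario 2: a constant 0.5% error rate for three days.
leak = [0.005] * (72 * 60)

if __name__ == "__main__":
    print(static_alert(spike), burn_rate_alerts(spike))  # True []
    print(static_alert(leak), burn_rate_alerts(leak))    # False [('ticket', 4320)]
```

The static rule pages on the transient spike and misses the leak, while the burn-rate tiers ignore the spike and raise a ticket for the leak.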
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Operating user-facing production services with high request volumes where reliability directly impacts business revenue. | Managing pre-production environments, internal staging clusters, or low-priority internal tools. |
| Teams are suffering from severe alert fatigue and need to drastically reduce non-actionable pages. | Operating extremely low-traffic systems (e.g., 10 requests a day), where statistical burn rates are volatile and meaningless. |
| You need a data-driven framework to negotiate feature velocity versus engineering time spent on technical debt. | Building early-stage prototypes where rapid experimentation is prioritized over any form of stability. |
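The low-traffic caveat in the table is easy to see numerically. The sketch below, which assumes a 99.9% SLO and uses a hypothetical helper name, compares the effect of a single failed request at two traffic levels.

```python
def burn_rate_from_counts(failed: int, total: int, slo_target: float) -> float:
    """Burn rate for a window, computed from raw request counts."""
    if total == 0:
        return 0.0  # assumption: an empty window burns no budget
    return (failed / total) / (1.0 - slo_target)


if __name__ == "__main__":
    # One failure out of 10 requests in a day: burn rate jumps from 0 to about 100.
    print(round(burn_rate_from_counts(1, 10, 0.999), 1))
    # One failure out of a million requests: burn rate is about 0.001.
    print(round(burn_rate_from_counts(1, 1_000_000, 0.999), 4))
```

At 10 requests a day, the burn rate can only take values in steps of roughly 100, so any threshold between 0 and 100 behaves like "alert on every single error".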