SLO-Based Alerting

Traditional threshold alerts fail at scale by triggering too late for slow leaks or too often for transient spikes.

TL;DR
  • Traditional threshold alerts fail at scale by triggering too late for slow leaks or too often for transient spikes.
  • Service Level Objectives (SLOs) define target reliability, while the Error Budget is the allowable room for failure.
  • Burn rate alerts calculate how fast the error budget is being consumed, triggering pages based on time-to-exhaustion.
  • Multi-window, multi-burn-rate alerts sharply reduce false positives while still ensuring rapid response to catastrophic outages.

The Problem

Static threshold alerts (e.g., "error rate > 1%, averaged over 5 minutes") are fragile. If a service experiences a brief 2-minute spike of 5% errors due to a transient network blip, the 5-minute average reaches 2% and the alert fires, waking an engineer for an issue that has already healed itself. Conversely, if a service runs at a continuous 0.5% error rate due to a subtle bug, it never crosses the 1% threshold. Over the course of a week, however, this "minor" bug affects thousands of customers and silently erodes the service's reliability. Engineers are trapped between constant false alarms and silent, slow-burning production failures.

Core System Idea

SLO-based alerting shifts the focus from arbitrary thresholds to the consumption of the "Error Budget."

First, we define a Service Level Indicator (SLI), which is the ratio of good events to total events (e.g., successful HTTP requests / total HTTP requests).

Next, we set a Service Level Objective (SLO), which is the target reliability over a rolling window (e.g., 99.9% success over 30 days). The remaining percentage (0.1%) is our Error Budget—the amount of pain we are willing to let our users experience.
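To make the budget concrete, the arithmetic can be sketched in Python. The function name `error_budget` and the request volume are illustrative assumptions, not part of any library or of the original text.

```python
from datetime import timedelta

def error_budget(slo: float, window: timedelta, total_requests: int):
    """Return (allowed full-outage time, allowed failed requests) for an SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    budget_fraction = 1.0 - slo  # e.g. 0.001 for a 99.9% SLO
    return window * budget_fraction, total_requests * budget_fraction

# Hypothetical service: 99.9% SLO, 30-day window, 10 million requests.
downtime, failures = error_budget(0.999, timedelta(days=30), 10_000_000)
print(downtime)         # about 43 minutes 12 seconds of total outage
print(round(failures))  # about 10,000 failed requests
```

The same budget can be read either as time (if every request fails during an outage) or as a count of failed requests spread across the window.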

Instead of alerting on raw error rates, we alert on the "Burn Rate"—the speed at which we are consuming this budget. A burn rate of 1 means the budget will last exactly 30 days. A burn rate of 14.4 means we will exhaust 2% of our entire 30-day budget in just 1 hour. By calculating burn rates across multiple time windows (e.g., short 1-hour windows for rapid detection, and long 6-hour windows for slow leaks), we can accurately page engineers only when there is a real threat to our reliability target.
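The burn rate figures above follow from simple ratios. The sketch below assumes a 30-day (720-hour) SLO period and steady traffic; the helper names are hypothetical.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO window

def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate as a multiple of the error rate the SLO allows."""
    return error_rate / (1.0 - slo)

def budget_consumed(rate: float, window_hours: float) -> float:
    """Fraction of the whole-period budget burned during one window."""
    return rate * window_hours / PERIOD_HOURS

def hours_to_exhaustion(rate: float) -> float:
    """Hours until a full budget is gone if the burn rate holds."""
    return PERIOD_HOURS / rate

# 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.0144, slo=0.999)
print(f"burn rate: {rate:.1f}")
print(f"budget used in 1h: {budget_consumed(14.4, 1):.1%}")
print(f"hours to exhaustion: {hours_to_exhaustion(14.4):.0f}")
```

At a burn rate of 14.4 the full budget lasts 50 hours, which is why a single hour at that rate already justifies a page.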

System Flow

flowchart TD
    A[Request Stream] --> B["SLI Engine: Good vs Bad"]
    B --> C["Error Budget Tracker: 30-Day Window"]
    C --> D{"Burn Rate Calculator"}
    D -- "Rate over 14.4 in 1hr" --> E["Critical Page: Active Outage"]
    D -- "Rate over 3 in 6hr" --> F["Ticket: Slow Leak"]
    D -- "Rate under 1" --> G["No Alert: Budget Safe"]

The SLI engine evaluates requests to track error budget consumption, triggering critical pages for rapid budget depletion and tickets for slow, sustained leaks.
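The decision step in the diagram might be sketched as a small routing function. The thresholds come from the flowchart; the `Action` enum and `route` function are hypothetical names. The diagram does not say what happens between burn rates of 1 and 3, so this sketch assumes that range produces no alert.

```python
from enum import Enum

class Action(Enum):
    PAGE = "Critical Page: Active Outage"
    TICKET = "Ticket: Slow Leak"
    NONE = "No Alert: Budget Safe"

def route(burn_1h: float, burn_6h: float) -> Action:
    """Map burn rates over the 1-hour and 6-hour windows to an action."""
    if burn_1h > 14.4:   # fast burn: page immediately
        return Action.PAGE
    if burn_6h > 3:      # sustained slow burn: open a ticket
        return Action.TICKET
    return Action.NONE   # assumption: anything else is treated as safe

print(route(burn_1h=20.0, burn_6h=5.0).value)
print(route(burn_1h=2.0, burn_6h=4.0).value)
print(route(burn_1h=0.5, burn_6h=0.5).value)
```

Checking the page condition first means an active outage is never downgraded to a ticket.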

Real-World Examples (Indicative)

Google SRE Two-Window Alerting

Google's Site Reliability Workbook recommends these burn rate thresholds for a 30-day SLO: page when the burn rate exceeds 14.4× over 1 hour (2% of the budget consumed) or 6× over 6 hours (5% consumed), and open a ticket when it exceeds 1× over 3 days (10% consumed). For a 99.9% SLO, the total 30-day budget is 43.2 minutes of full outage. Datadog supports burn rate alerts on SLOs; in illustrative form, avg(last_1h):avg:slo.burn_rate{service:checkout} > 14.4 for P1 and avg(last_6h):avg:slo.burn_rate{service:checkout} > 6 for P2. Each alert fires only when both its long window and a shorter confirmation window exceed the threshold, which suppresses noise from incidents that have already recovered.
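A minimal sketch of a multi-window check follows. It assumes the pairing used in the SRE Workbook, where each long window is confirmed by a short window one-twelfth its length (5 minutes for 1 hour, 30 minutes for 6 hours). The `BurnRule` class, the rule names, and the sample burn rates are hypothetical; a real system would read these values from its metrics backend.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BurnRule:
    name: str
    long_window: str
    short_window: str
    threshold: float

RULES = [
    BurnRule("fast-burn", "1h", "5m", 14.4),
    BurnRule("slow-burn", "6h", "30m", 6.0),
]

def firing(rules, burn_rates):
    """Names of rules whose long AND short windows both exceed the threshold."""
    return [
        r.name for r in rules
        if burn_rates[r.long_window] > r.threshold
        and burn_rates[r.short_window] > r.threshold
    ]

# Outage that already ended: the 1h average is still high, the last 5m is healthy.
recovered = {"1h": 16.0, "5m": 0.2, "6h": 3.0, "30m": 4.0}
# Outage still in progress: both the 1h and the 5m windows are burning fast.
ongoing = {"1h": 16.0, "5m": 18.0, "6h": 3.0, "30m": 17.0}

print(firing(RULES, recovered))
print(firing(RULES, ongoing))
```

The short window acts as a reset: once errors stop, it drops below the threshold within minutes and the alert clears, even though the long window stays elevated.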

Atlassian Multi-Window Burn Rates

Atlassian applies burn-rate alerting to Jira, Confluence, and Bitbucket Cloud. Their P1 threshold is 14.4× over 1 hour; their P3 (slow-burn) threshold is 3× over 72 hours, catching degraded-but-not-broken service states that static thresholds miss entirely. After rolling out multi-window alerting in 2020, Atlassian reduced on-call pages by ~40% while MTTR held constant, because engineers stopped waking up for transient spikes that self-resolved.

Spotify Error Budget Policy

Spotify defines SLOs per squad and tracks error budget spend in their internal "SLO Scoreboard." If a squad exhausts more than 50% of their monthly budget in Week 1, they are automatically put into "reliability sprint" mode—all feature work pauses and the squad focuses exclusively on stability. This policy is contractual between Platform Engineering and product squads, and burn rate dashboards are reviewed in every weekly engineering leadership sync.

Anti-Patterns

Setting 100% SLOs

Aiming for perfect reliability. This is impossible in practice, extremely expensive to approach, and leaves an error budget of zero, so any single failure breaches the SLO and teams cannot risk deploying new features.

Using Too Many SLOs

Defining dozens of SLOs for a single service. This dilutes focus; teams should maintain 2-3 critical user journey SLOs per service—availability, latency, and data correctness.

Ignoring the Error Budget

Treating the error budget as a vanity metric rather than an operational contract. If the budget is exhausted, feature deployments must be frozen to focus on reliability.

Single-Window Burn Alerts

Alerting on a short window only (e.g., 5 minutes). This causes extreme alert noise during brief, self-correcting network drops.

Design Tradeoffs

| Dimension | Multi-Window Burn Rate | Static Threshold |
|---|---|---|
| False positive rate | Very low; short window catches fast burns while long window confirms slow leaks before paging engineers | High; brief transient spikes and network drops trigger pages for issues that self-resolve within minutes |
| Slow-burn detection | Detects sustained low error rates (e.g., 0.5%) that never cross a raw threshold but exhaust budget over days | Blind to slow leaks below the threshold; a 0.5% error rate running for a week goes entirely undetected |
| Configuration complexity | Requires computing 2-3 burn rates, maintaining 30-day rolling windows, and coordinating error budget policies | Simple single PromQL rule like error_rate > 0.01 covers the basic case with no window algebra |
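The first two rows can be checked with the two scenarios from The Problem section. This is a sketch under simplifying assumptions: constant traffic, a 99.9% SLO over 30 days, and a static rule evaluated on a 5-minute average. The function names are hypothetical.

```python
SLO = 0.999
PERIOD_MIN = 30 * 24 * 60   # 30-day SLO window, in minutes
STATIC_THRESHOLD = 0.01     # "error rate > 1%"

def static_alert_fires(error_rate, duration_min, eval_window_min=5):
    """Static rule on the average error rate over one evaluation window."""
    covered = min(duration_min, eval_window_min)
    return error_rate * covered / eval_window_min > STATIC_THRESHOLD

def budget_spent(error_rate, duration_min):
    """Fraction of the 30-day error budget consumed by the incident."""
    return (error_rate * duration_min) / ((1 - SLO) * PERIOD_MIN)

spike = (0.05, 2)               # 5% errors for 2 minutes
leak = (0.005, 7 * 24 * 60)     # 0.5% errors for one week

for name, (rate, minutes) in {"spike": spike, "leak": leak}.items():
    print(name, static_alert_fires(rate, minutes), f"{budget_spent(rate, minutes):.1%}")
```

Under these assumptions the spike trips the static alert while spending well under 1% of the budget, and the leak never trips it while spending more than the entire budget.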

Best Practices

Define SLIs from the User's Perspective

Measure success at the customer boundary (e.g., API gateway or frontend), not deep inside internal database helpers.

Use Standard Burn Rate Windows

Implement the thresholds recommended in Google's SRE Workbook: page if 2% of the budget burns in 1 hour (burn rate 14.4×) or 5% burns in 6 hours (burn rate 6×), and open a ticket if 10% burns in 3 days (burn rate 1×).

Automate SLO Calculations

Use declarative tools like OpenSLO to define SLOs in code alongside your application deployment manifests, version-controlling reliability targets.

Establish an Error Budget Policy

Create a binding agreement between product and engineering that defines exactly what happens when the error budget is exhausted (e.g., shifting 100% of sprint capacity to reliability work).

When to Use / Avoid

| Use When | Avoid When |
|---|---|
| Operating user-facing production services with high request volumes where reliability directly impacts business revenue. | Managing pre-production environments, internal staging clusters, or low-priority internal tools. |
| Teams are suffering from severe alert fatigue and need to drastically reduce non-actionable pages. | Operating extremely low-traffic systems (e.g., 10 requests a day), where statistical burn rates are volatile and meaningless. |
| You need a data-driven framework to negotiate feature velocity versus engineering time spent on technical debt. | Building early-stage prototypes where rapid experimentation is prioritized over any form of stability. |
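The low-traffic caveat is easy to see numerically. The sketch below compares one failed request in a quiet 6-hour window against one failed request in a busy one; the request counts are made-up illustrations and `window_burn_rate` is a hypothetical helper.

```python
def window_burn_rate(failed: int, total: int, slo: float) -> float:
    """Burn rate for one window, computed from raw request counts."""
    if total == 0:
        return 0.0  # assumption: no traffic means no budget burned
    return (failed / total) / (1.0 - slo)

# About 10 requests a day: a 6-hour window sees roughly 3 requests.
low_traffic = window_burn_rate(failed=1, total=3, slo=0.999)
# Busy service: the same single failure among 250,000 requests.
high_traffic = window_burn_rate(failed=1, total=250_000, slo=0.999)

print(f"low traffic:  {low_traffic:.0f}x")
print(f"high traffic: {high_traffic:.3f}x")
```

With so few requests, one failure pushes the burn rate far past any paging threshold, so the signal reflects sample size rather than service health.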