SLO / SLA / SLI Design
SLIs measure actual user-visible performance, SLOs define target reliability with an error budget as enforcement mechanism, and SLAs define legal consequences — only the SLO is an engineering artifact; the SLA is a business contract that should never be the primary alerting threshold.
- 99.9% SLO = 43.8 minutes of downtime budget per month; 99.99% = 4.38 minutes. Moving from three nines to four nines means eliminating all brownouts and planned maintenance windows, not just major incidents.
- Burn rate of 14.4× means you're consuming 2% of your 30-day error budget in a single hour — the Google SRE threshold for an immediate page. At 1× burn rate, the budget exhausts in exactly 30 days.
- SLIs must be measured at the user boundary (API gateway, CDN edge), not at internal service boundaries. A database at 100% CPU that still returns results within SLO is not causing a reliability violation.
- Error budgets create a forcing function: when the budget is exhausted, feature deploys stop and reliability work takes priority. Without an enforced policy, the budget is a dashboard number with no teeth.
- SLA thresholds should be more lenient than the SLO, leaving a margin between the internal target and the contractual penalty. If the SLO is 99.9%, an SLA of 99.7% permits three times the SLO's failure rate. Engineers alert on SLO burn rate, not SLA breach.
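The downtime figures above follow directly from the SLO target and the length of the period. A minimal sketch of that arithmetic, assuming an average month of 365.25 / 12 days (which is what yields 43.8 minutes; a strict 30-day window gives 43.2). The function and constant names are illustrative, not from any library.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60        # 525,960 minutes
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12  # 43,830 minutes (average month, an assumption)


def allowed_downtime_minutes(slo_target: float, period_minutes: float) -> float:
    """Minutes of full downtime an availability SLO permits over a period."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * period_minutes


if __name__ == "__main__":
    for target in (0.999, 0.9999):
        minutes = allowed_downtime_minutes(target, MINUTES_PER_MONTH)
        print(f"{target:.2%} -> {minutes:.2f} min/month")
    yearly = allowed_downtime_minutes(0.99999, MINUTES_PER_YEAR)
    print(f"99.999% -> {yearly:.2f} min/year")
```

Each additional nine divides the budget by ten, which is why planned maintenance windows stop fitting somewhere between three and four nines.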
The Problem
Teams without SLOs make reliability decisions reactively: after a major incident, alerts are tightened; after a quiet quarter, features ship recklessly. Product and engineering constantly argue about whether to delay a release to "fix stability issues", a subjective debate with no shared data. With a 99.9% availability target, a team has 43.8 minutes of allowed downtime per month, but without an error budget they don't know whether they've spent 2 minutes or 40 minutes this month. The on-call engineer who is paged for a 0.5% error rate lasting 10 minutes has no framework to determine whether it is burning through the budget too fast or is normal variation within tolerance.
Core System Idea
Reliability is managed through a three-layer hierarchy:
- SLI (Service Level Indicator): the raw measurement computed from telemetry. Request success SLI: (requests returning 2xx or 3xx) / total_requests. Latency SLI: (requests completing in <200ms) / total_requests. SLIs must be defined at the user-visible boundary, not at internal service-to-service boundaries where retries and failovers can absorb failures before they reach users.
- SLO (Service Level Objective): the target reliability level over a rolling window, for example 99.9% of requests succeed over 30 days. The complement defines the error budget: 0.1% of requests over 30 days may fail. For a service at 1M requests/day, that is 1,000 failures/day allowed. Burn rate is the actual consumption rate relative to the rate that exhausts the budget in exactly 30 days. A burn rate of 14.4 means the budget is consumed 14.4× faster than sustainable; at this rate, 2% of the monthly budget burns in one hour.
- SLA (Service Level Agreement): the legal contract specifying financial penalties if reliability falls below a threshold. The SLA threshold should always be more lenient than the SLO. Engineers alert on SLO burn; if they first learn of a violation through contract invocation, the monitoring has failed.

The Google SRE Workbook recommends burn-rate alerts over several windows: a fast burn (>14.4× for 1 hour) and a slower burn (>6× for 6 hours) both trigger a page, while a burn of >1× over 3 days triggers a ticket. The longer windows catch intermittent budget drains that produce no individual alarming spike.
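The two SLI ratios and the error budget described above can be sketched as follows. This is a minimal illustration over in-memory records; the `Request` type and function names are hypothetical, and a real system would compute the same ratios from gateway or edge telemetry.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status observed at the user-visible boundary
    latency_ms: float  # end-to-end latency observed at the same boundary


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that returned 2xx or 3xx."""
    if not requests:
        raise ValueError("no requests in the window")
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not requests:
        raise ValueError("no requests in the window")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def allowed_failures(total_requests: int, slo_target: float) -> int:
    """Error budget expressed as a whole number of failed requests."""
    return round(total_requests * (1.0 - slo_target))


if __name__ == "__main__":
    sample = [Request(200, 120), Request(301, 80), Request(500, 40), Request(200, 450)]
    print(availability_sli(sample), latency_sli(sample))
    print(allowed_failures(1_000_000, 0.999))
```

Note that the fast 500 counts against availability but not latency, while the slow 200 does the opposite, which is why the two SLIs are tracked separately.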
System Flow
The error budget feedback loop dynamically balances engineering velocity with system reliability — healthy budget permits feature shipping; depleted budget mandates a reliability freeze.
Real-World Examples (Indicative)
Google's SRE team codified the canonical burn rate thresholds in the SRE Workbook: page at 14.4× burn rate over 1 hour (consuming 2% of the monthly budget; at this pace the budget exhausts in ~2 days), page at 6× over 6 hours (consuming 5% of the budget; at this pace it exhausts in 5 days), and ticket at 1× over 3 days (consuming 10% of the budget). Google Search's SLO is measured at the global load balancer tier, not at individual datacenter boundaries. An entire datacenter failure that is transparently absorbed by global routing does not consume Search's error budget; only failures visible to end-user requests count. This distinction prevents internal infrastructure noise from consuming the budget of a globally redundant service.
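The thresholds quoted above are not arbitrary: each is the burn rate that consumes a chosen fraction of the budget within the alert window. A short sketch of that derivation, assuming a 30-day SLO period; the function names are illustrative.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period, in hours


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * period_hours / window_hours


def days_to_exhaustion(burn_rate: float, period_days: float = 30.0) -> float:
    """Days until a full budget is gone if the burn rate is sustained."""
    return period_days / burn_rate


if __name__ == "__main__":
    for fraction, window in ((0.02, 1), (0.05, 6), (0.10, 72)):
        rate = burn_rate_threshold(fraction, window)
        print(f"{fraction:.0%} in {window}h -> {rate:.1f}x, "
              f"exhausts in {days_to_exhaustion(rate):.1f} days")
```

Choosing a different budget fraction or window yields a different threshold from the same formula, so teams with shorter SLO periods should recompute rather than copy 14.4.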
Atlassian's reliability platform implements the two-window strategy across all production services: fast burn at 14.4× over 1 hour triggers a P1 PagerDuty page; slow burn at 3× over 72 hours triggers a P3 ticket. Without the 72-hour slow-burn window, an intermittent error rate of about 0.3% (a 3× burn against a 99.9% SLO) sustained for nine days silently consumes 90% of the monthly budget before triggering any alert. With it, the slow leak is detected once the 72-hour window fills, after roughly 30% of the budget is consumed (3 × 72 / 720), which still leaves runway to investigate and fix the root cause before the SLO is violated. Atlassian publishes a status page that reflects SLO health directly, updated from the same error-budget metric that drives internal alerting.
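The two-window evaluation described above can be sketched as a small classifier. This assumes the burn rates for each window have already been computed elsewhere; the thresholds are the ones quoted in the text, and the function name and return values are hypothetical.

```python
from typing import Optional

FAST_BURN_THRESHOLD = 14.4  # evaluated over a 1-hour window
SLOW_BURN_THRESHOLD = 3.0   # evaluated over a 72-hour window


def classify_burn(burn_1h: float, burn_72h: float) -> Optional[str]:
    """Return the alert severity for the observed burn rates, or None.

    The fast window is checked first so that a severe incident pages
    even when the slow window has also crossed its threshold.
    """
    if burn_1h > FAST_BURN_THRESHOLD:
        return "page"
    if burn_72h > SLOW_BURN_THRESHOLD:
        return "ticket"
    return None


if __name__ == "__main__":
    print(classify_burn(20.0, 2.0))  # short, severe spike
    print(classify_burn(3.5, 3.5))   # slow leak
    print(classify_burn(1.0, 1.0))   # sustainable consumption
```

Because the comparison is strict, a burn rate sitting exactly on a threshold never alerts, so thresholds are worth setting slightly below the rate that should be caught.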
Datadog computes burn rate as error_rate / (1 - slo_target). For a 99.9% SLO with a current 5-minute error rate of 5%: burn rate = 5% / 0.1% = 50×. An alert on this value can be sketched in metric-query form as avg(last_1h):avg:slo.burn_rate{service:checkout} > 14.4; treat the slo.burn_rate metric name as illustrative, since Datadog's burn rate monitors are configured against an SLO with long and short evaluation windows. Alerting on budget consumption rate rather than instantaneous error rate reduces alert fatigue from transient 30-second spikes that do not meaningfully affect the 30-day budget. Datadog's SLO widget also shows remaining budget as a percentage, a single-glance gauge that product and engineering can share as the deployment-decision artifact.
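The burn rate formula and the remaining-budget gauge mentioned above reduce to two small calculations. The sketch below is vendor-neutral and does not call any Datadog API; the function names are illustrative.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


def remaining_budget_pct(bad_events: int, total_events: int, slo_target: float) -> float:
    """Percentage of the error budget still unspent (negative when overspent)."""
    allowed = total_events * (1.0 - slo_target)
    if allowed == 0:
        return 100.0  # no traffic observed, so nothing has been consumed
    return 100.0 * (1.0 - bad_events / allowed)


if __name__ == "__main__":
    print(f"burn rate: {burn_rate(0.05, 0.999):.1f}x")
    print(f"remaining: {remaining_budget_pct(250, 1_000_000, 0.999):.1f}%")
```

A negative result from `remaining_budget_pct` is deliberate: it shows how far past the budget a service has gone rather than clamping at zero.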
Anti-Patterns
CPU utilization, memory usage, and database query counts are not SLIs — they measure internal health, not user experience. A service at 90% CPU that is returning results within SLO is not violating a reliability target. Users experience request latency and success rate, not CPU utilization.
Five nines = 5.26 minutes of downtime per year. Achieving this requires eliminating all planned maintenance windows, adding multi-zone redundancy for every dependency, and making deploys zero-downtime end to end. Engineering cost rises steeply past 99.9%. Most B2B SaaS products have SLAs of 99.9%; their SLOs should be 99.95%, not 99.999%.
Alerting at error_rate > 1% for 5 minutes catches severe outages but creates alert fatigue from transient blips and misses slow drains. A 0.5% error rate sustained for 2 hours consumes more monthly budget than a 5% rate lasting 10 minutes — but only the latter triggers the alert.
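The comparison above can be checked numerically. This sketch assumes a 99.9% SLO, a 30-day period, and constant traffic, so that budget consumption is proportional to error rate multiplied by duration; the function names are illustrative.

```python
def static_alert_fires(error_rate: float, duration_min: float,
                       threshold: float = 0.01, for_min: float = 5.0) -> bool:
    """Models a rule of the form: error_rate > threshold for at least for_min minutes."""
    return error_rate > threshold and duration_min >= for_min


def budget_fraction_consumed(error_rate: float, duration_min: float,
                             slo_target: float = 0.999, period_days: int = 30) -> float:
    """Fraction of the period's error budget used by one incident (constant traffic assumed)."""
    burn = error_rate / (1.0 - slo_target)
    return burn * duration_min / (period_days * 24 * 60)


if __name__ == "__main__":
    incidents = {"slow drain": (0.005, 120), "short spike": (0.05, 10)}
    for name, (rate, minutes) in incidents.items():
        print(f"{name}: alert={static_alert_fires(rate, minutes)}, "
              f"budget used={budget_fraction_consumed(rate, minutes):.2%}")
```

Under these assumptions the slow drain uses about 1.4% of the budget without alerting, while the spike uses about 1.2% and pages someone.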
An error budget without an enforcement policy is decorative. The policy must be in writing: when the 30-day budget reaches 0%, feature releases stop and reliability work takes priority. Without this binding agreement between product and engineering, the budget has no operational force.
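One way to give such a policy operational force is to encode it as a gate in the release pipeline. The following is a minimal sketch of that idea, not a prescribed implementation; the `Decision` enum and `release_decision` function are hypothetical, and the counts are assumed to cover the trailing 30-day window.

```python
from enum import Enum


class Decision(Enum):
    SHIP = "ship"
    FREEZE = "freeze"


def release_decision(bad_events: int, total_events: int, slo_target: float) -> Decision:
    """Freeze feature releases once the window's error budget is fully spent."""
    if total_events == 0:
        return Decision.SHIP  # no traffic observed, so no budget consumed
    # Round to whole events so float error cannot leave a sliver of budget.
    allowed = round(total_events * (1.0 - slo_target))
    return Decision.FREEZE if bad_events >= allowed else Decision.SHIP


if __name__ == "__main__":
    print(release_decision(400, 1_000_000, 0.999))
    print(release_decision(1_000, 1_000_000, 0.999))
```

A pipeline step could call this before promoting a feature build and exit non-zero on `Decision.FREEZE`, while still allowing reliability fixes through a separate path.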
Design Tradeoffs
| Dimension | Availability SLO | Latency SLO |
|---|---|---|
| Measurement | Binary ratio: successful requests / total requests | Percentile-based: fraction of requests below latency threshold |
| Failure capture | Misses degraded-but-functional scenarios — service is slow but returning 200s | Directly captures user frustration from slow responses |
| Aggregation across services | Serial dependencies multiply: composite availability is the product of the per-service ratios | Complex: P99 doesn't compose across services; requires histogram merge |
| Error budget calculation | Straightforward: allowed_failures = total_requests × (1 - SLO) | Requires histogram_quantile or explicit bucket counts in Prometheus |
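
The aggregation row can be illustrated with a short sketch. It assumes services are called in series with independent failures, and that latency histograms use non-cumulative buckets keyed by upper bound in milliseconds and all measure the same boundary (for example, several instances of one service). The function names are illustrative.

```python
from math import prod
from typing import Dict, Iterable

Histogram = Dict[float, int]  # bucket upper bound (ms) -> request count


def serial_availability(availabilities: Iterable[float]) -> float:
    """Composite availability of services called in series (independence assumed)."""
    return prod(availabilities)


def merge_histograms(*histograms: Histogram) -> Histogram:
    """Sum bucket counts; percentiles themselves cannot be averaged."""
    merged: Histogram = {}
    for histogram in histograms:
        for upper_bound, count in histogram.items():
            merged[upper_bound] = merged.get(upper_bound, 0) + count
    return merged


def fraction_within(histogram: Histogram, threshold_ms: float) -> float:
    """Latency SLI: share of requests in buckets at or below the threshold."""
    total = sum(histogram.values())
    fast = sum(c for upper, c in histogram.items() if upper <= threshold_ms)
    return fast / total


if __name__ == "__main__":
    print(f"{serial_availability([0.999, 0.999]):.6f}")
    a = {100.0: 90, 200.0: 8, float("inf"): 2}
    b = {100.0: 50, 200.0: 30, float("inf"): 20}
    print(fraction_within(merge_histograms(a, b), 200.0))
```

Two services at 99.9% each yield roughly 99.8% in series, so a composite SLO must be looser than its strictest dependency unless retries or redundancy compensate.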
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Managing production services with active users and business-critical operations | Building early-stage prototypes where speed is the only metric that matters |
| Aligning cross-functional product and engineering teams on roadmap priorities via a shared error budget | Operating internal, non-critical tools with small, highly tolerant user bases |
| Designing automated alerting systems to reduce on-call fatigue and false alarms | Working in environments where infrastructure and deployment pipelines are not yet automated |