SLO / SLA / SLI Design

SLIs measure user-visible performance, SLOs set the target reliability and the error budget that enforces it, and SLAs define contractual consequences. The SLO is the engineering target; the SLA is a business contract and should never be the primary alerting threshold.

TL;DR
  • 99.9% SLO = 43.8 minutes of downtime budget per month; 99.99% = 4.38 minutes. Moving from three nines to four nines means eliminating all brownouts and planned maintenance windows, not just major incidents.
  • Burn rate of 14.4× means you're consuming 2% of your 30-day error budget in a single hour — the Google SRE threshold for an immediate page. At 1× burn rate, the budget exhausts in exactly 30 days.
  • SLI must be measured at the user boundary (API gateway, CDN edge), not at internal service boundaries. A database at 100% CPU that still returns results within SLO is not causing a reliability violation.
  • Error budgets create a forcing function: when the budget is exhausted, feature deploys stop and reliability work takes priority. Without an enforced policy, the budget is a dashboard number with no teeth.
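  • SLA thresholds should be looser than the SLO, with enough margin that SLO alerts fire well before a contractual breach. If the SLO is 99.9% (43.8 min/month), an SLA of 99.7% allows about 131 min/month, three times the downtime. Engineers alert on SLO burn rate, not SLA breach.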
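The downtime figures quoted above follow from simple arithmetic. The sketch below is one way to reproduce them, assuming an average month of 365.25 / 12 days (the month length that yields 43.8 minutes); the function and constant names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60
AVG_DAYS_PER_MONTH = 365.25 / 12  # assumed month length; gives the 43.8-minute figure


def downtime_budget_minutes(slo: float, period_days: float = AVG_DAYS_PER_MONTH) -> float:
    """Minutes of full downtime that `slo` permits over `period_days`."""
    return (1.0 - slo) * period_days * MINUTES_PER_DAY


if __name__ == "__main__":
    for slo in (0.999, 0.9999, 0.997):
        print(f"{slo:.2%}: {downtime_budget_minutes(slo):.2f} min/month")
```

With a strict 30-day window instead of an average month, the 99.9% budget comes out slightly lower, at 43.2 minutes.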

The Problem

Teams without SLOs make reliability decisions reactively: after a major incident, alerts are tightened; after a quiet quarter, features ship recklessly. Product and engineering constantly argue about whether to delay a release to "fix stability issues", a subjective debate with no shared data. A 99.9% target allows 43.8 minutes of downtime per month, but a team that does not track its error budget cannot tell whether it has spent 2 minutes or 40 minutes so far. The on-call engineer who is paged for a 0.5% error rate lasting 10 minutes has no framework for deciding whether this is burning the budget too fast or is normal variation within tolerance.

Core System Idea

Reliability is managed through a three-layer hierarchy:

(1) SLI (Service Level Indicator): the raw measurement computed from telemetry. Request success SLI: (requests returning 2xx or 3xx) / total_requests. Latency SLI: (requests completing in <200ms) / total_requests. SLIs must be defined at the user-visible boundary, not at internal service-to-service boundaries where retries and failovers can absorb failures before they reach users.

(2) SLO (Service Level Objective): the target reliability level over a rolling window, for example 99.9% of requests succeed over 30 days. The complement defines the error budget: 0.1% of requests over 30 days may fail. For a service at 1M requests/day, that is 1,000 failures/day allowed. Burn rate is the actual consumption rate relative to the rate that would exhaust the budget in exactly 30 days. A burn rate of 14.4 means the budget is being consumed 14.4× faster than sustainable; at this rate, 2% of the 30-day budget burns in one hour.

(3) SLA (Service Level Agreement): the legal contract specifying financial penalties if reliability falls below a threshold. Always set it looser than the SLO (for example, a 99.7% SLA behind a 99.9% SLO). Engineers alert on SLO burn; if they first learn of a violation through contract invocation, the monitoring has failed.

The Google SRE Workbook recommends several alerting windows: fast burn (>14.4× over 1 hour) and medium burn (>6× over 6 hours) both page, while slow burn (1× over 3 days) opens a ticket. The slower windows catch intermittent budget drains that produce no individual alarming spike.
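The two SLI ratios and the request-count form of the error budget can be sketched as follows. This is a minimal illustration, not a production collector: the `Request` record and the function names are assumptions, and the records are taken to be captured at the user-visible boundary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code observed at the user boundary
    latency_ms: float  # end-to-end latency observed at the user boundary


def success_sli(requests: list[Request]) -> float:
    """Fraction of requests that returned 2xx or 3xx (input must be non-empty)."""
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 200.0) -> float:
    """Fraction of requests that completed in under `threshold_ms`."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def allowed_failures(total_requests: int, slo: float) -> float:
    """Error budget expressed as a number of requests that may fail."""
    return total_requests * (1.0 - slo)
```

For 1,000,000 requests and a 99.9% SLO, `allowed_failures` returns approximately 1,000, matching the figure in the text.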

System Flow

flowchart TD
    A["Raw Metrics: Latency, Errors"] --> B["Calculate SLI"]
    B --> C{"SLI within SLO Target?"}
    C -- "Yes" --> D["Error Budget Healthy"]
    D --> E["Continue Shipping Features"]
    C -- "No" --> F["Error Budget Consumed"]
    F --> G["Freeze Deployments"]
    F --> H["Trigger Burn Rate Alert"]

The error budget feedback loop dynamically balances engineering velocity with system reliability — healthy budget permits feature shipping; depleted budget mandates a reliability freeze.
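The decision at the centre of this loop can be expressed as a small gate. The sketch below assumes good and total request counts for the current SLO window are already available; the `"ship"` and `"freeze"` labels and both function names are hypothetical.

```python
def budget_remaining(good: int, total: int, slo: float) -> float:
    """Fraction of the window's error budget still unspent (negative if overspent).

    Assumes total > 0 and slo < 1, otherwise the budget is undefined.
    """
    allowed_bad = total * (1.0 - slo)
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def release_decision(good: int, total: int, slo: float = 0.999) -> str:
    """Map the remaining budget onto the two branches of the flow."""
    return "ship" if budget_remaining(good, total, slo) > 0 else "freeze"
```

A real policy would likely add intermediate states (for example, a warning band before the freeze), but the binary form mirrors the diagram.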

Real-World Examples (Indicative)

Google SRE multi-window burn rate alerting

Google's SRE team codified the canonical burn rate thresholds in the SRE Workbook: page at 14.4× burn rate over 1 hour (consuming 2% of the monthly budget; at this pace the budget exhausts in ~2 days), page at 6× over 6 hours (consuming 5% of the budget; at this pace it exhausts in 5 days), and open a ticket at 1× over 3 days (consuming 10% of the budget). Google Search's SLO is measured at the global load balancer tier, not at individual datacenter boundaries. An entire datacenter failure that is transparently absorbed by global routing does not consume Search's error budget; only failures visible to end-user requests count. This distinction prevents internal infrastructure noise from consuming the budget of a globally redundant service.
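The 14.4× and 6× figures are not arbitrary: each is the burn rate that spends a chosen fraction of a 30-day budget within a chosen window. A short sketch of that derivation, with illustrative function names:

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO window, in hours


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


def days_to_exhaustion(burn_rate: float) -> float:
    """Days until a full 30-day budget is gone at a constant burn rate."""
    return 30.0 / burn_rate
```

Spending 2% in 1 hour gives a threshold of about 14.4; spending 5% in 6 hours gives 6. At those rates a full budget lasts roughly 2.1 and 5 days respectively.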

Atlassian's two-window alerting in production

Atlassian's reliability platform implements the two-window strategy across all production services: fast burn at 14.4× over 1 hour triggers a P1 PagerDuty page; slow burn at 3× over 72 hours triggers a P3 ticket. Without the 72-hour slow-burn window, an intermittent 0.3% error rate (a 3× burn against a 99.9% SLO) sustained for days silently consumes 90% of the monthly budget before triggering any alert. With it, the slow leak is detected once the 72-hour window fills, after about 30% of the budget is spent (3 × 72 h / 720 h), which still leaves runway to investigate and fix the root cause before the SLO is violated. Atlassian publishes a status page that reflects SLO health directly, updated from the same error-budget metric that drives internal alerting.
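A two-window evaluator along these lines can be sketched in a few lines. The version below is an illustration only: it assumes a 99.9% SLO, one error-rate sample per hour with roughly equal traffic in each hour, and hypothetical `"page"` / `"ticket"` return values.

```python
from statistics import fmean
from typing import Optional

SLO = 0.999          # assumed target
BUDGET = 1.0 - SLO   # allowed error fraction


def window_burn_rate(hourly_error_rates: list[float], window_hours: int) -> float:
    """Average burn rate over the most recent `window_hours` samples."""
    recent = hourly_error_rates[-window_hours:]
    return fmean(recent) / BUDGET


def evaluate(hourly_error_rates: list[float]) -> Optional[str]:
    """Return "page", "ticket", or None for the newest sample."""
    # Fast window: the latest hour alone exceeds 14.4x.
    if len(hourly_error_rates) >= 1 and window_burn_rate(hourly_error_rates, 1) > 14.4:
        return "page"
    # Slow window: only evaluated once a full 72 hours of data exists.
    if len(hourly_error_rates) >= 72 and window_burn_rate(hourly_error_rates, 72) > 3.0:
        return "ticket"
    return None
```

For example, a single hour at a 2% error rate (20× burn) returns `"page"`, while 72 consecutive hours at 0.4% (4× burn) never trips the fast window but returns `"ticket"`.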

Datadog SLO burn rate monitors

Datadog computes burn rate as error_rate / (1 - slo_target). For a 99.9% SLO with a current 5-minute error rate of 5%: burn rate = 5% / 0.1% = 50×. An alert on the one-hour average of that burn rate, written here in illustrative form as avg(last_1h):avg:slo.burn_rate{service:checkout} > 14.4, surfaces budget consumption rate rather than instantaneous error rate, which cuts alert fatigue from transient 30-second spikes that do not meaningfully affect the 30-day budget. Datadog's SLO widget also shows remaining budget as a percentage, a single-glance gauge that product and engineering share as the deployment-decision artifact.
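The formula, and the claim that a short spike barely dents a 30-day budget, can be checked with a few lines of arithmetic. This sketch assumes constant traffic for the duration of the spike; the function names are illustrative and are not part of any Datadog API.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """error_rate / (1 - slo_target), the formula given in the text."""
    return error_rate / (1.0 - slo_target)


def budget_spent(error_rate: float, duration_minutes: float,
                 slo_target: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent by holding `error_rate` for `duration_minutes`."""
    window_minutes = window_days * 24 * 60
    return burn_rate(error_rate, slo_target) * duration_minutes / window_minutes
```

A 5% error rate against a 99.9% SLO is a burn rate of about 50. Held for 30 seconds, it spends roughly 0.06% of the 30-day budget; held for a full hour, it spends about 7%.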

Anti-Patterns

Measuring internal infrastructure as SLIs

CPU utilization, memory usage, and database query counts are not SLIs — they measure internal health, not user experience. A service at 90% CPU that is returning results within SLO is not violating a reliability target. Users experience request latency and success rate, not CPU utilization.

Setting 99.999% SLOs on non-critical services

Five nines = 5.26 minutes of downtime per year. Achieving this requires eliminating all planned maintenance windows, adding multi-zone redundancy for every dependency, and making deploys zero-downtime end to end. Engineering cost rises steeply with each additional nine past 99.9%. Most B2B SaaS products have SLAs of 99.9%; their SLOs should be 99.95%, not 99.999%.

Single-window threshold alerting

Alerting at error_rate > 1% for 5 minutes catches severe outages but creates alert fatigue from transient blips and misses slow drains. A 0.5% error rate sustained for 2 hours consumes more monthly budget than a 5% rate lasting 10 minutes — but only the latter triggers the alert.
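The comparison can be made concrete by running both incidents through a simple threshold rule. The sketch below assumes one error-rate sample per minute and equal traffic in every minute; the function names and the two sample series are illustrative.

```python
def threshold_alert_fires(per_minute_error_rates: list[float],
                          threshold: float = 0.01, for_minutes: int = 5) -> bool:
    """True if the error rate stays above `threshold` for `for_minutes` consecutive minutes."""
    streak = 0
    for rate in per_minute_error_rates:
        streak = streak + 1 if rate > threshold else 0
        if streak >= for_minutes:
            return True
    return False


def budget_cost(per_minute_error_rates: list[float]) -> float:
    """Sum of per-minute error rates: proportional to budget spent under equal traffic."""
    return sum(per_minute_error_rates)


slow_drain = [0.005] * 120  # 0.5% for 2 hours
sharp_spike = [0.05] * 10   # 5% for 10 minutes
```

The slow drain never fires the rule yet costs about 0.6 error-rate-minutes, while the spike fires it and costs about 0.5.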

No error budget policy

An error budget without an enforcement policy is decorative. The policy must be in writing: when the 30-day budget reaches 0%, feature releases stop and reliability work takes priority. Without this binding agreement between product and engineering, the budget has no operational force.

Design Tradeoffs

| Dimension | Availability SLO | Latency SLO |
| --- | --- | --- |
| Measurement | Binary ratio: successful requests / total requests | Percentile-based: fraction of requests below latency threshold |
| Failure capture | Misses degraded-but-functional scenarios (service is slow but returning 200s) | Directly captures user frustration from slow responses |
| Aggregation across services | Simple: success ratios of serial dependencies multiply; composes naturally | Complex: P99 doesn't compose across services; requires histogram merge |
| Error budget calculation | Straightforward: allowed_failures = total_requests × (1 - SLO) | Requires histogram_quantile or explicit bucket counts in Prometheus |
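The "explicit bucket counts" route for a latency SLO can be sketched as below. The histogram layout imitates Prometheus-style cumulative `le` buckets, but the data structure and function names are assumptions for illustration, and no Prometheus client library is used. Note that `le` buckets are inclusive (at or below the bound), a slight difference from a strict "<200ms" definition.

```python
from collections import Counter

# Cumulative request counts keyed by bucket upper bound in milliseconds.
# The float("inf") bucket holds the total request count.
Histogram = dict[float, int]


def merge(histograms: list[Histogram]) -> Histogram:
    """Sum bucket counts across instances; all inputs must share the same bounds."""
    merged: Counter = Counter()
    for h in histograms:
        merged.update(h)  # Counter.update adds counts key by key
    return dict(merged)


def latency_sli(hist: Histogram, threshold_ms: float) -> float:
    """Fraction of requests at or below `threshold_ms`, which must be a bucket bound."""
    return hist[threshold_ms] / hist[float("inf")]
```

Merging counts before dividing is what makes the latency SLI aggregate correctly across instances, whereas averaging per-instance P99 values does not.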

Best Practices

  • Measure SLIs at the user boundary (API gateway, CDN edge), not at internal service boundaries. Internal service health doesn't matter if the gateway absorbs failures via retries before they reach users: those retries succeed and never count against the SLO.
  • Use multi-window burn rate alerting: fast burn (14.4× for 1 hour) for immediate pages and slow burn (3–6× for 6–72 hours) for tickets. The fast window catches outages; the slow window catches budget drains that produce no single alarming spike.
  • Set the SLA threshold looser than the SLO. If the SLO is 99.9% (43.8 min/month), the SLA should be around 99.7% (~131 min/month). Engineers alert on SLO burn rate; first detecting a violation through a customer invoking the SLA contract means monitoring has already failed.
  • Enforce the error budget policy in a written document signed by product and engineering leadership: when the 30-day budget reaches 0%, deployment freeze is mandatory, not discretionary.
  • Recalibrate SLOs quarterly. A target set when the service handled 10K requests/day may be structurally impossible or trivially easy at 10M requests/day, because rare events that once happened monthly now happen hourly.
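The monthly-to-hourly shift in the last point is plain scaling. A small sketch, assuming a hypothetical failure that strikes with a fixed probability per request (the 1-in-300,000 figure is invented for illustration):

```python
def hours_between_events(requests_per_day: float, events_per_request: float) -> float:
    """Mean hours between occurrences of an event with a fixed per-request probability."""
    events_per_day = requests_per_day * events_per_request
    return 24.0 / events_per_day


ONE_IN_300K = 1 / 300_000  # hypothetical per-request probability of a rare failure
```

At 10K requests/day this event recurs about every 720 hours (30 days); at 10M requests/day it recurs about every 0.72 hours.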

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Managing production services with active users and business-critical operations | Building early-stage prototypes where speed is the only metric that matters |
| Aligning cross-functional product and engineering teams on roadmap priorities via a shared error budget | Operating internal, non-critical tools with small, highly tolerant user bases |
| Designing automated alerting systems to reduce on-call fatigue and false alarms | Working in environments where infrastructure and deployment pipelines are not yet automated |