
On-Call and Incident Response Design


TL;DR
  • Incident response requires clear, pre-defined roles (Incident Commander, Ops Lead, Communications Lead) to prevent chaotic triage.
  • Runbooks must be actionable, up-to-date, and linked directly from automated alerts to minimize Mean Time to Resolution (MTTR).
  • Severity levels (Sev-1 to Sev-3) must be defined by business and customer impact, not by how alarming an incident feels to the engineers involved.
  • Blameless postmortems treat failures as systemic opportunities to learn, converting incident data into automated toil-reduction tasks.

The Problem

When a major production outage occurs, the lack of a structured incident response process leads to chaos. Engineers self-assign tasks, duplicate debugging efforts, and argue over the root cause in public channels. Executives constantly ping the triage team for updates, pulling critical resources away from fixing the actual problem. Meanwhile, the on-call engineer is panicking, searching through outdated wiki pages for a runbook that hasn't been updated in two years. The outage drags on, customer trust plummets, and the subsequent postmortem degenerates into finger-pointing and blame.

Core System Idea

A resilient incident response system treats operational failure as a structured, repeatable process.

The core system relies on three pillars: Role-Based Incident Command, Declarative Runbooks, and Blameless Feedback Loops.

When a high-severity alert triggers, an incident is declared, and a dedicated Slack channel and video bridge are automatically provisioned. The system enforces the Incident Command System (ICS) model: the Incident Commander (IC) directs the response and delegates tasks, the Ops Lead focuses on technical triage, and the Communications Lead manages internal and external updates.
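The role split described above can be modeled in a few lines. This is a minimal sketch, not a reference implementation: the `Incident` and `Role` names, the `#inc-{id}` channel format, and the rule that one person may hold only one role at a time are illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "Incident Commander"
    OPS_LEAD = "Ops Lead"
    COMMS_LEAD = "Communications Lead"


@dataclass
class Incident:
    incident_id: int
    severity: str
    roles: dict[Role, str] = field(default_factory=dict)

    @property
    def channel(self) -> str:
        # Hypothetical naming scheme for the dedicated chat channel.
        return f"#inc-{self.incident_id}"

    def assign(self, role: Role, person: str) -> None:
        # Assumption: command, triage, and comms stay with different people.
        if any(p == person and r is not role for r, p in self.roles.items()):
            raise ValueError(f"{person} already holds another role")
        self.roles[role] = person

    def unfilled_roles(self) -> list[Role]:
        # Roles that still need an owner, in declaration order.
        return [r for r in Role if r not in self.roles]
```

A declaration bot could refuse to mark the incident as "staffed" until `unfilled_roles()` returns an empty list.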

Every alert is mapped to a specific, version-controlled runbook.

Once resolved, the incident feeds into a blameless postmortem process that focuses on systemic vulnerabilities rather than human error, producing tracked engineering tickets to automate away the recurring toil that caused the incident.

System Flow

flowchart TD
    A[Critical Alert Triggers] --> B["Escalation Engine: Page On-Call"]
    B --> C["Declare Incident and Assign Roles"]
    C --> D["Incident Commander: Direct Triage"]
    D --> E["Ops Lead: Execute Runbook"]
    D --> F["Comms Lead: Update Stakeholders"]
    E --> G[Mitigation and Resolution]
    G --> H["Blameless Postmortem and Toil Tickets"]

Automated alerts trigger a structured incident response workflow, separating operational triage, communication, and command before concluding with a blameless postmortem.

Real-World Examples (Indicative)

PagerDuty Automated Incident Escalation

PagerDuty's incident response model defines 4 severity levels with automated escalation rules. A Sev-1 (complete service failure or customer data loss) pages the primary on-call by voice call immediately and pages the backup on-call automatically if there is no acknowledgment within 5 minutes. A Slack bot creates an #inc-{id} channel, posts the alert link and the assigned Incident Commander, and posts a templated status update to #statuspage every 30 minutes. This automation removes the first 10 minutes of chaos, when engineers would otherwise scramble to find each other.
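The escalation rule in this example reduces to a small decision function. The sketch below is illustrative only and does not use any PagerDuty API; `EscalationPolicy` and `pending_pages` are hypothetical names, and continuing to page the primary after the timeout is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationPolicy:
    primary: str
    backup: str
    ack_timeout_minutes: float = 5.0  # the 5-minute window from the example


def pending_pages(
    policy: EscalationPolicy,
    minutes_since_alert: float,
    acknowledged: bool,
) -> list[str]:
    """Return who should currently be paged for an alert."""
    if acknowledged:
        return []  # someone owns the alert; stop paging
    if minutes_since_alert >= policy.ack_timeout_minutes:
        # Assumption: keep paging the primary while also paging the backup.
        return [policy.primary, policy.backup]
    return [policy.primary]
```

Keeping the rule as a pure function of elapsed time and acknowledgment state makes it easy to unit-test without a paging provider.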

Google SRE Blameless Postmortem

Google's SRE postmortem template mandates a timeline of contributing factors (never a single root cause) and requires a minimum of 3 action items with assigned owners and deadlines. After the 2015 Gmail outage (~25 minutes of impact), the postmortem identified 7 contributing causes spanning a config change, missing validation, and insufficient rollback tooling, rather than attributing blame to the engineer who pushed the config. Every Sev-1 postmortem at Google is reviewed by engineering leadership within 5 business days.
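The template requirements described here (several contributing factors, at least 3 action items, each with an owner and a deadline) can be checked mechanically before a postmortem goes to review. This is a hedged sketch; `ActionItem` and `postmortem_problems` are hypothetical names, not part of any published template or tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None
    due: Optional[date] = None


def postmortem_problems(
    contributing_factors: list[str],
    action_items: list[ActionItem],
    min_action_items: int = 3,
) -> list[str]:
    """Return a list of reasons the postmortem is not ready for review."""
    problems = []
    if len(contributing_factors) < 2:
        # A single "root cause" is exactly what the template forbids.
        problems.append("list more than one contributing factor")
    if len(action_items) < min_action_items:
        problems.append(
            f"need at least {min_action_items} action items, "
            f"found {len(action_items)}"
        )
    for item in action_items:
        if not item.owner:
            problems.append(f"action item {item.title!r} has no owner")
        if item.due is None:
            problems.append(f"action item {item.title!r} has no deadline")
    return problems
```

An empty return value means the document meets the structural minimum; it says nothing about the quality of the analysis.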

Slack 2022 Major Outage Postmortem

In February 2022, Slack experienced ~5 hours of degraded service affecting millions of users. Their published postmortem traced the incident through: a configuration change that altered DNS behavior, insufficient integration tests that missed the edge case, and a cascading failure across their connection brokers. The postmortem produced 14 tracked action items including mandatory canary testing for DNS config changes, automated regression tests for connection broker behavior, and a new "safe deployment" guardrail that restricts blast radius during rollouts to 1% of connection brokers at a time.
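A blast-radius guardrail like the 1% limit mentioned above amounts to splitting the fleet into small batches. The sketch below assumes a flat list of target names and a hypothetical `rollout_batches` helper; it does not describe Slack's actual tooling.

```python
import math


def rollout_batches(
    targets: list[str], max_fraction: float = 0.01
) -> list[list[str]]:
    """Split targets into batches no larger than max_fraction of the fleet."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    # Round down so a batch never exceeds the limit, but always
    # deploy to at least one target so small fleets can still roll out.
    batch_size = max(1, math.floor(len(targets) * max_fraction))
    return [
        targets[i : i + batch_size]
        for i in range(0, len(targets), batch_size)
    ]
```

A deployment loop would apply the change to one batch, wait for health checks, and stop at the first unhealthy batch.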

Anti-Patterns

The Hero Culture

Relying on a single senior engineer to fix every major outage. This leads to severe burnout, high turnover, and a failure to build institutional response knowledge.

Blaming the Human

Concluding that the root cause of an incident was "human error." Human error is a symptom of poor tooling, missing guardrails, or bad processes, not the cause.

Alerting Without Runbooks

Allowing alerts to page engineers without a corresponding, up-to-date runbook. If an alert doesn't have a clear, actionable response, it should not be a high-priority page.
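One way to enforce this is a lint step that flags every paging alert rule lacking a runbook link. The sketch below is an assumption-laden illustration: `AlertRule` and its fields are hypothetical and do not correspond to any specific monitoring system's schema.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    pages_on_call: bool
    runbook_url: str = ""


def paging_rules_without_runbook(rules: list[AlertRule]) -> list[str]:
    """Return names of rules that page a human but link no runbook."""
    return [
        rule.name
        for rule in rules
        if rule.pages_on_call and not rule.runbook_url.strip()
    ]
```

Run in CI, a non-empty result would fail the build until the rule either gains a runbook or is downgraded from a page.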

Executive Hijacking

Allowing executives or product managers to join triage channels and demand status updates directly from engineers who are actively writing mitigation code.

Design Tradeoffs

| Dimension | Automated Mitigation | Manual Runbook Mitigation |
| --- | --- | --- |
| MTTR | Fastest; automation executes predefined steps in seconds without human diagnostic delay | Slower; requires human to read the alert, navigate to the runbook, and execute steps under pressure |
| Safety in novel failures | Risky; scripts assume known failure patterns and can cascade if the actual state diverges from expectations | Safer; human judgment can halt mid-runbook when observed state contradicts the documented assumptions |
| Investment cost | High upfront; requires engineering time to build, test, chaos-test, and maintain remediation scripts | Low upfront; runbooks are markdown documents any team member can write and update alongside code |
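A common middle ground between the two columns is guarded automation: run the script only when the observed state matches the failure pattern it was written for, and hand off to a human otherwise. The following is a minimal sketch under that assumption; `mitigate` and its callback parameters are hypothetical names.

```python
from typing import Callable


def mitigate(
    precondition: Callable[[], bool],
    remediation: Callable[[], None],
    page_human: Callable[[str], None],
) -> str:
    """Run remediation only if the known failure pattern is confirmed."""
    if not precondition():
        # Novel failure: do not let the script act on wrong assumptions.
        page_human("observed state does not match the known failure pattern")
        return "escalated"
    try:
        remediation()
    except Exception as exc:
        # The script itself failed; fall back to the manual runbook.
        page_human(f"automated remediation failed: {exc}")
        return "escalated"
    return "automated"
```

The precondition check is what keeps the fast path from cascading in the "novel failure" case the table warns about.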

Best Practices

Define Clear Severity Levels

Establish objective, business-impact criteria (e.g., Sev-1: core business flow down for >5% of users; Sev-2: degraded performance; Sev-3: minor bug with workaround available).

Separate Triage from Communication

Never let the engineer actively fixing the problem write status updates. Appoint a separate Communications Lead to shield the triage team from interruptions.

Keep Runbooks in Git

Store runbooks as Markdown files in the same repository as the application code, updated via pull requests alongside code changes so they never drift.

Track Toil Budgets

Measure repetitive, manual operational work. If an on-call team spends more than 50% of their time on toil, halt feature work to automate those tasks.
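Two of these practices, severity classification and the toil budget, are mechanical enough to sketch in code. The thresholds come from the text above; the function names are hypothetical, and both the handling of a core-flow outage affecting 5% of users or fewer (treated here as Sev-2) and the default to Sev-3 are assumptions.

```python
def classify_severity(
    core_flow_down: bool,
    affected_user_fraction: float,
    degraded_performance: bool,
) -> str:
    """Map business-impact signals to a severity label."""
    if core_flow_down and affected_user_fraction > 0.05:
        return "Sev-1"
    if core_flow_down or degraded_performance:
        # Assumption: a small-scale core outage is handled as Sev-2.
        return "Sev-2"
    return "Sev-3"


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of on-call time spent on repetitive manual work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def should_halt_feature_work(
    toil_hours: float, total_hours: float, budget: float = 0.5
) -> bool:
    # Strictly "more than" the budget, matching the 50% rule above.
    return toil_fraction(toil_hours, total_hours) > budget
```

Because the inputs are observable facts rather than opinions, two responders given the same data should arrive at the same severity.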

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Operating production systems where downtime directly impacts customer revenue, safety, or brand reputation. | Managing early-stage, pre-revenue startups where speed of development is prioritized over operational process. |
| Engineering teams are growing, and tribal knowledge is no longer sufficient to manage system complexity. | Operating non-critical internal environments (e.g., sandbox, dev) where outages have zero business impact. |
| You need to build a sustainable, healthy on-call rotation that prevents engineer burnout and attrition. | Small, single-developer projects where the author is the sole maintainer and operator. |