
Graceful Degradation

Graceful degradation intentionally reduces feature fidelity under stress to keep core business functions operational — Amazon sheds product recommendations before checkout, never the other way around.

TL;DR
  • Classify every feature into tiers before an incident, not during one. Tier 1 (checkout, auth) is never shed; Tier 3 (recommendations, hovercards) is shed first. This decision must be made by engineering and product, not by whoever is on call at 2am.
  • Amazon disables personalized product recommendations first, when checkout DB CPU reaches 70%: the recommendation join queries are known to be expensive, so they are the first feature shed. This is planned load shedding, not reactive panic.
  • Fallback paths that hit the same saturated database make things worse, not better. A degraded-mode fallback must reduce load on the failing component, not route traffic through it.
  • Feature flags that require manual operator intervention fail during incidents — the engineer is busy. Use automated load shedding triggered by metrics thresholds.
  • Never disable a feature silently. A user who sees an empty recommendation panel retries the page, increasing load. Show a message or a static fallback — anything that signals the UI is intentionally reduced.

The Problem

A flash sale generates 10× normal traffic. The product recommendation service — which joins across 5 tables to compute personalized suggestions — consumes all available database read capacity. Checkout queries start timing out because the DB is at 100% CPU. Engineers disable the recommendation service, but the fallback is an empty panel that makes users think the page is broken — they refresh repeatedly, adding more load. The incident takes 45 minutes to resolve because the fallback path was never designed, tested, or deployed.

Core System Idea

Graceful degradation is the architectural practice of deliberately reducing feature fidelity under stress to protect core business functions. It rests on three mechanisms:

(1) Feature tiering. Classify every feature by its criticality before an incident: Tier 1 (checkout, auth, payment; never shed), Tier 2 (search, product catalog; shed under severe load), Tier 3 (recommendations, analytics, social widgets; shed first). Shed decisions should be made during normal operation, not during incidents.

(2) Automated load shedding. Trigger feature toggles based on real-time metrics (CPU, P99 latency, error rate) without requiring human intervention. When checkout DB CPU exceeds the threshold, the recommendation query is automatically bypassed and a cached fallback is served. Flapping (rapid on/off cycling) is prevented with hysteresis: disable at 80% CPU, re-enable only when CPU drops below 60%.

(3) Degraded-mode fallbacks. Every Tier 2 and Tier 3 feature must have a pre-built fallback: a pre-computed cached response, a static list, or a graceful empty state with user messaging. The fallback must reduce load on the failing component; a fallback that calls the same saturated service makes degradation worse.
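The feature tiering mechanism described above can be sketched as a small registry that is consulted on every request. This is only an illustration under stated assumptions: the `Tier` and `LoadLevel` enums, the `FEATURE_TIERS` mapping, and the `is_enabled` function are hypothetical names, and the tier assignments simply mirror the examples in the text.

```python
from enum import IntEnum


class Tier(IntEnum):
    CRITICAL = 1   # never shed (checkout, auth, payment)
    IMPORTANT = 2  # shed only under severe load (search, product catalog)
    OPTIONAL = 3   # shed first (recommendations, analytics, social widgets)


class LoadLevel(IntEnum):
    NORMAL = 0
    HIGH = 1
    SEVERE = 2


# Hypothetical registry, agreed on by product and engineering ahead of time.
FEATURE_TIERS = {
    "checkout": Tier.CRITICAL,
    "auth": Tier.CRITICAL,
    "search": Tier.IMPORTANT,
    "recommendations": Tier.OPTIONAL,
    "analytics": Tier.OPTIONAL,
}


def is_enabled(feature: str, load: LoadLevel) -> bool:
    """Return True if the feature should run at the given load level.

    An unclassified feature raises KeyError, so a missing tier is caught
    in normal operation rather than discovered during an incident.
    """
    tier = FEATURE_TIERS[feature]
    if tier is Tier.CRITICAL:
        return True                      # Tier 1 always runs
    if tier is Tier.IMPORTANT:
        return load < LoadLevel.SEVERE   # Tier 2 survives HIGH load
    return load == LoadLevel.NORMAL      # Tier 3 runs only at NORMAL load
```

Keeping the registry as data rather than scattered conditionals makes the tier decisions reviewable by product and engineering in one place.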

System Flow

flowchart TD
    A["Incoming Request"] --> B{"System Load Level?"}
    B -- "Normal" --> C["Full Feature Path"]
    B -- "High Load" --> D{"Feature Tier?"}
    D -- "Tier 1 Critical" --> E["Execute with Cached Fallback"]
    D -- "Tier 3 Optional" --> F["Shed Feature"]
    C --> G["Full Response"]
    E --> H["Degraded Response"]
    F --> H

Load level gates feature execution; critical Tier 1 paths run with cached fallbacks; Tier 3 optional features are shed entirely under high load.

Real-World Examples (Indicative)

Netflix's personalization fallback

Netflix's personalization service has a 50ms latency budget on the home page request path. If the recommendation service exceeds budget or returns an error, Netflix falls back to a pre-computed top-50 popular titles list cached in Redis and refreshed hourly. Users see a generic popular list instead of personalized results — acceptable degradation. Netflix measures the fallback rate as a key SLO metric: >0.1% fallback rate triggers an alert, because the personalization service should rarely need to degrade. This threshold-based alert distinguishes "working as designed" degradation from "system is actually broken" degradation.
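A latency-budget fallback of this kind can be sketched with a thread pool and a timeout. This is a minimal sketch, not Netflix's implementation: `with_latency_budget`, `POPULAR_TITLES`, and the executor size are hypothetical, and the in-memory list stands in for the pre-computed list that the text says is cached in Redis.

```python
import concurrent.futures
import time

# Stand-in for the pre-computed popular list; a real system would read it
# from a cache that is independent of the personalization service.
POPULAR_TITLES = ["Title A", "Title B", "Title C"]

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def with_latency_budget(primary, fallback_value, budget_seconds):
    """Run primary() and return (result, degraded).

    If primary() raises or does not finish within budget_seconds,
    the fallback value is returned and degraded is True.
    """
    future = _executor.submit(primary)
    try:
        return future.result(timeout=budget_seconds), False
    except Exception:
        # Covers both a timeout and an error raised by the primary path.
        # cancel() only helps if the call has not started running yet.
        future.cancel()
        return fallback_value, True
```

The `degraded` flag is what a caller would feed into a fallback-rate metric. Note that a timed-out call keeps running in its worker thread, so a real client would also need its own request timeout to stop loading the slow backend.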

Amazon's tiered load shedding

During Prime Day, Amazon operates explicit load-shedding tiers triggered by checkout DB CPU: at 70% CPU, disable personalized product recommendations (expensive cross-table joins); at 80%, disable review aggregation counts (read-heavy secondary queries); at 90%, serve only cached product images and disable live inventory counts. The Tier 1 checkout path (add to cart → payment → order confirmation) is never touched. Amazon engineers define and test these tiers months before Prime Day — not during it. The key insight: each shed tier has a known, measured DB query reduction, so engineers know exactly how much capacity each shed recovers.
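The tier table described above maps naturally onto a threshold lookup. The following is an illustrative sketch only: the feature names, `SHED_TIERS`, `NEVER_SHED`, and `features_to_shed` are hypothetical, and the thresholds are the indicative 70/80/90% figures from the text rather than verified production values.

```python
from typing import Set

# (cpu_threshold_percent, features shed at or above that threshold).
# Thresholds are cumulative: at 85% CPU both the 70% and 80% tiers are shed.
SHED_TIERS = [
    (70, {"personalized_recommendations"}),
    (80, {"review_aggregation_counts"}),
    (90, {"live_inventory_counts", "uncached_product_images"}),
]

# The Tier 1 checkout path must never appear in any shed tier.
NEVER_SHED = {"add_to_cart", "payment", "order_confirmation"}


def features_to_shed(cpu_percent: float) -> Set[str]:
    """Return the set of features to disable at the given checkout DB CPU."""
    shed: Set[str] = set()
    for threshold, features in SHED_TIERS:
        if cpu_percent >= threshold:
            shed |= features
    # Guard against a misconfigured table that would shed a Tier 1 feature.
    assert not (shed & NEVER_SHED), "Tier 1 features must never be shed"
    return shed
```

This sketch has no hysteresis, so a production version would pair each threshold with a lower re-enable threshold to avoid flapping.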

GitHub's incremental feature disabling

GitHub uses Scientist (its open-source Ruby library for running a candidate code path alongside the existing one in production and comparing results and timing) to measure the performance cost of individual features. During DB overload events, GitHub disables features in sequence based on their measured cost: (1) hovercards (username popup previews) at P50 latency >100ms, because these trigger a DB lookup on hover; (2) traffic graph rendering at P50 >200ms, because of its expensive time-series queries; (3) real-time issue updates via ActionCable at P50 >300ms, because these maintain open WebSocket connections that add DB polling. Each feature has a disable threshold derived from measured cost, not from intuition. Operators can also manually toggle features via an internal feature flag system.

Anti-Patterns

Undocumented feature tiers

No classification of which features can be shed under load. During an incident, the on-call engineer makes ad-hoc decisions about what to disable — wrong choices extend the incident.

Fallback paths that hit the same saturated resource

A recommendation fallback that queries the same DB to fetch "recent popular items" provides no relief. The fallback must serve from a different source (Redis cache, CDN edge, static file) that is independent of the failing component.

No user-visible degradation signal

Disabling a feature without any UI indication. Users see an empty panel, assume the page is broken, and refresh — each refresh adds load to the already-stressed system.

Load shedding with tight thresholds and no hysteresis

Disabling a feature at 80% CPU and re-enabling at 79% causes rapid on/off toggling that makes the feature unreliable and increases operational noise. Use a separate, lower re-enable threshold (e.g., disable at 80%, re-enable only when CPU drops below 60%).
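A two-threshold switch is enough to show how hysteresis prevents this toggling. The class below is a minimal sketch; `HysteresisSwitch`, `shed_at`, and `restore_below` are hypothetical names, and the 80/60 values used in the usage note are the example figures from the text.

```python
class HysteresisSwitch:
    """Shed a feature at one threshold and restore it at a lower one."""

    def __init__(self, shed_at: float, restore_below: float):
        if restore_below >= shed_at:
            raise ValueError("restore_below must be lower than shed_at")
        self.shed_at = shed_at
        self.restore_below = restore_below
        self.shedding = False

    def update(self, metric: float) -> bool:
        """Feed the latest metric sample; return True while the feature is shed."""
        if not self.shedding and metric >= self.shed_at:
            self.shedding = True        # crossed the upper threshold: shed
        elif self.shedding and metric < self.restore_below:
            self.shedding = False       # fell below the lower threshold: restore
        return self.shedding
```

With `HysteresisSwitch(shed_at=80, restore_below=60)`, a CPU series of 75, 81, 79, 65, 59, 70 sheds the feature at 81, keeps it shed at 79 and 65, and restores it at 59. A single threshold at 80 would instead flip the feature back on at 79.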

Design Tradeoffs

| Dimension | Automated Load Shedding | Manual Feature Flags |
| --- | --- | --- |
| Response time | Immediate: triggers on metric threshold | Slow: requires human action during incident |
| Flapping risk | Yes: requires hysteresis thresholds | No: only changes when operator acts |
| Incident response burden | Low: operator just monitors | High: operator must diagnose and toggle manually |
| Best for | High-traffic consumer apps with defined feature tiers | Low-traffic apps or infrequent, planned maintenance |

Best Practices

  • Classify every public-facing feature into tiers (Tier 1/2/3) before any incident occurs, with sign-off from product and engineering. This decision cannot be made under pressure.
  • Design every Tier 2/3 fallback to serve from a source that does NOT touch the failing component. Redis, CDN edge cache, and static files are the standard fallback sources.
  • Use hysteresis for automated shedding thresholds: shed at threshold T, re-enable only when the metric falls to 0.75×T. This prevents oscillation.
  • Test fallback paths in production quarterly. Fallbacks that have never been exercised have hidden bugs. Run chaos tests that force the degraded path and verify the user experience is acceptable.
  • Emit a feature.degraded{name=recommendations} metric on every fallback serving. Alert the on-call team: graceful degradation is not invisible; it's a signal that the system is under stress.
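The degraded-metric practice above can be sketched with in-process counters. This is an illustration under assumptions: `DegradationMetrics` and its methods are hypothetical, the counters stand in for whatever metrics client the system already uses, and the 0.1% alert rate in the usage note is the indicative figure from the Netflix example.

```python
from collections import Counter


class DegradationMetrics:
    """Track how often each feature serves its fallback."""

    def __init__(self, alert_rate: float):
        self.alert_rate = alert_rate   # e.g. 0.001 for a 0.1% fallback rate
        self.requests = Counter()
        self.fallbacks = Counter()

    def record(self, feature: str, degraded: bool) -> None:
        """Count one request; also count a fallback if it was degraded."""
        self.requests[feature] += 1
        if degraded:
            self.fallbacks[feature] += 1

    def fallback_rate(self, feature: str) -> float:
        """Fraction of requests that served the fallback (0.0 if no traffic)."""
        total = self.requests[feature]
        return self.fallbacks[feature] / total if total else 0.0

    def should_alert(self, feature: str) -> bool:
        """True when the fallback rate exceeds the configured alert rate."""
        return self.fallback_rate(feature) > self.alert_rate
```

With `DegradationMetrics(alert_rate=0.001)`, two fallbacks in 1,000 requests give a 0.2% rate and trigger the alert. A production version would compute the rate over a sliding time window rather than over all requests since startup.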

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Consumer web app has distinct critical and non-critical features with clear business priority ordering | Financial transaction systems where partial execution produces incorrect state (partial transfers, partial trades) |
| Traffic is highly variable: e-commerce, media streaming, news sites during breaking events | All features are equally critical and no feature can be removed without breaking the core value proposition |
| Downstream dependencies have variable SLAs and can degrade unexpectedly | Safety-critical systems where reduced functionality is unacceptable (medical, aviation, industrial control) |