Graceful Degradation
Graceful degradation intentionally reduces feature fidelity under stress to keep core business functions operational — Amazon sheds product recommendations before checkout, never the other way around.
- Classify every feature into tiers before an incident, not during one. Tier 1 (checkout, auth) is never shed; Tier 3 (recommendations, hovercards) is shed first. This decision must be made by engineering and product, not by whoever is on call at 2am.
- Amazon disables product recommendations when checkout DB CPU exceeds 80% — the recommendation join queries are known to be expensive and are the first feature shed. This is planned load shedding, not reactive panic.
- Fallback paths that hit the same saturated database make things worse, not better. A degraded-mode fallback must reduce load on the failing component, not route traffic through it.
- Feature flags that require manual operator intervention fail during incidents — the engineer is busy. Use automated load shedding triggered by metrics thresholds.
- Never disable a feature silently. A user who sees an empty recommendation panel retries the page, increasing load. Show a message or a static fallback — anything that signals the UI is intentionally reduced.
The Problem
A flash sale generates 10× normal traffic. The product recommendation service — which joins across 5 tables to compute personalized suggestions — consumes all available database read capacity. Checkout queries start timing out because the DB is at 100% CPU. Engineers disable the recommendation service, but the fallback is an empty panel that makes users think the page is broken — they refresh repeatedly, adding more load. The incident takes 45 minutes to resolve because the fallback path was never designed, tested, or deployed.
Core System Idea
Graceful degradation is the architectural practice of deliberately reducing feature fidelity under stress to protect core business functions. It relies on three mechanisms:
(1) Feature tiering: classify every feature by its criticality before an incident. Tier 1 (checkout, auth, payment) is never shed; Tier 2 (search, product catalog) is shed under severe load; Tier 3 (recommendations, analytics, social widgets) is shed first. Shed decisions should be made in normal operation, not during incidents.
(2) Automated load shedding: trigger feature toggles based on real-time metrics (CPU, P99 latency, error rate) without requiring human intervention. When checkout DB CPU exceeds the threshold, the recommendation query is automatically bypassed and a cached fallback is served. Flapping (rapid on/off cycling) is prevented with hysteresis: disable at 80% CPU, re-enable only when CPU drops below 60%.
(3) Degraded-mode fallbacks: every Tier 2 and Tier 3 feature must have a pre-built fallback, such as a pre-computed cached response, a static list, or a graceful empty state with user messaging. The fallback must reduce load on the failing component; a fallback that calls the same saturated service makes degradation worse.
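The tiering and fallback rules can be sketched as a small feature registry. This is a minimal illustration, not a reference implementation: the `Tier`, `Feature`, and `serve` names are hypothetical, and the "shed from tier N" signal is assumed to come from whatever automated trigger the system uses.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, Optional


class Tier(IntEnum):
    CRITICAL = 1   # checkout, auth, payment: never shed
    IMPORTANT = 2  # search, product catalog: shed under severe load
    OPTIONAL = 3   # recommendations, analytics, social widgets: shed first


@dataclass(frozen=True)
class Feature:
    name: str
    tier: Tier
    live: Callable[[], object]
    # The fallback must not call the component that is under stress.
    fallback: Optional[Callable[[], object]] = None

    def __post_init__(self) -> None:
        # Reject Tier 2/3 features without a fallback at registration time,
        # so the gap is found in normal operation rather than mid-incident.
        if self.tier != Tier.CRITICAL and self.fallback is None:
            raise ValueError(f"{self.name}: Tier 2/3 features need a pre-built fallback")


def serve(feature: Feature, shed_from_tier: Optional[Tier]) -> object:
    """Run the live path unless the feature's tier is currently being shed.

    shed_from_tier=None means normal operation; Tier.OPTIONAL sheds Tier 3 only;
    Tier.IMPORTANT sheds Tier 2 and Tier 3. Tier 1 always runs live.
    """
    if (
        shed_from_tier is not None
        and feature.tier != Tier.CRITICAL
        and feature.tier >= shed_from_tier
    ):
        return feature.fallback()
    return feature.live()
```

Validating the fallback when the feature is registered is one way to make "every Tier 2 and Tier 3 feature must have a pre-built fallback" a property the code enforces rather than a convention.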
System Flow
Load level gates feature execution: Tier 1 paths always run, Tier 2 features fall back to cached responses under severe load, and optional Tier 3 features are shed entirely under high load.
Real-World Examples (Indicative)
Netflix's personalization service has a 50ms latency budget on the home page request path. If the recommendation service exceeds budget or returns an error, Netflix falls back to a pre-computed top-50 popular titles list cached in Redis and refreshed hourly. Users see a generic popular list instead of personalized results — acceptable degradation. Netflix measures the fallback rate as a key SLO metric: >0.1% fallback rate triggers an alert, because the personalization service should rarely need to degrade. This threshold-based alert distinguishes "working as designed" degradation from "system is actually broken" degradation.
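The budget-then-fallback pattern in this example can be sketched as follows. This is an illustration under stated assumptions, not Netflix's code: `home_page_titles`, `FallbackStats`, and the pool size are hypothetical, and the popular-titles list is passed in as a plain list standing in for the Redis-cached copy.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

LATENCY_BUDGET_S = 0.050     # 50 ms budget from the example
ALERT_FALLBACK_RATE = 0.001  # alert when more than 0.1% of requests fall back

_pool = ThreadPoolExecutor(max_workers=8)  # hypothetical pool size


class FallbackStats:
    """Counts requests and fallbacks. Not thread-safe; a real system would use a metrics client."""

    def __init__(self) -> None:
        self.requests = 0
        self.fallbacks = 0

    def record(self, used_fallback: bool) -> None:
        self.requests += 1
        self.fallbacks += int(used_fallback)

    @property
    def rate(self) -> float:
        return self.fallbacks / self.requests if self.requests else 0.0

    def should_alert(self) -> bool:
        return self.rate > ALERT_FALLBACK_RATE


def home_page_titles(
    fetch_personalized: Callable[[], List[str]],
    cached_popular: List[str],
    stats: FallbackStats,
    budget_s: float = LATENCY_BUDGET_S,
) -> List[str]:
    """Return personalized titles if they arrive within budget, else the cached list."""
    future = _pool.submit(fetch_personalized)
    try:
        titles = future.result(timeout=budget_s)
    except Exception:
        # Covers both a blown budget (timeout) and a service error.
        # cancel() only helps if the call has not started; a running call is left to finish.
        future.cancel()
        stats.record(used_fallback=True)
        return cached_popular
    stats.record(used_fallback=False)
    return titles
```

Note that the fallback list is already in memory when the request arrives, so serving it adds no load to the personalization service or its database.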
During Prime Day, Amazon operates explicit load-shedding tiers triggered by checkout DB CPU: at 70% CPU, disable personalized product recommendations (expensive cross-table joins); at 80%, disable review aggregation counts (read-heavy secondary queries); at 90%, serve only cached product images and disable live inventory counts. The Tier 1 checkout path (add to cart → payment → order confirmation) is never touched. Amazon engineers define and test these tiers months before Prime Day — not during it. The key insight: each shed tier has a known, measured DB query reduction, so engineers know exactly how much capacity each shed recovers.
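The cumulative tiers in this example map naturally onto a threshold table. The sketch below is illustrative only: the feature names are hypothetical labels for the features described above, and treating each threshold as "at or above" is an assumption.

```python
from typing import List, Tuple

# (DB CPU threshold in percent, features shed at or above it), in shed order.
SHED_PLAN: List[Tuple[float, Tuple[str, ...]]] = [
    (70.0, ("personalized_recommendations",)),
    (80.0, ("review_aggregation_counts",)),
    (90.0, ("live_product_images", "live_inventory_counts")),
]

# Tier 1 checkout path: must never appear in the shed plan.
NEVER_SHED = frozenset({"add_to_cart", "payment", "order_confirmation"})

# Validate the plan once at import time, long before any incident.
for _threshold, _features in SHED_PLAN:
    if not NEVER_SHED.isdisjoint(_features):
        raise ValueError("shed plan must not include Tier 1 features")


def features_to_shed(db_cpu_percent: float) -> List[str]:
    """Return every feature that should be off at the given CPU level.

    Shedding is cumulative: crossing a higher threshold keeps lower tiers shed.
    """
    shed: List[str] = []
    for threshold, features in SHED_PLAN:
        if db_cpu_percent >= threshold:
            shed.extend(features)
    return shed
```

For example, `features_to_shed(85.0)` returns the recommendation and review-count features but leaves images and inventory counts live. In practice this lookup would sit behind a hysteresis check so that a reading hovering around a threshold does not toggle features on every sample.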
GitHub uses Scientist (its open-source Ruby library for running experiments that compare code paths in production) to measure the performance cost of individual features. During DB overload events, GitHub disables features in sequence based on their measured cost: (1) hovercards (username popup previews) at P50 latency >100ms, because these trigger a DB lookup on hover; (2) traffic graph rendering at P50 >200ms, because it runs expensive time-series queries; (3) real-time issue updates via ActionCable at P50 >300ms, because these maintain open WebSocket connections that add DB polling. Each feature has a disable threshold derived from measured cost, not from intuition. Operators can also manually toggle features via an internal feature flag system.
Anti-Patterns
No classification of which features can be shed under load. During an incident, the on-call engineer makes ad-hoc decisions about what to disable — wrong choices extend the incident.
A recommendation fallback that queries the same DB to fetch "recent popular items" provides no relief. The fallback must serve from a different source (Redis cache, CDN edge, static file) that is independent of the failing component.
Disabling a feature without any UI indication. Users see an empty panel, assume the page is broken, and refresh — each refresh adds load to the already-stressed system.
Disabling a feature at 80% CPU and re-enabling at 79% causes rapid on/off toggling that makes the feature unreliable and increases operational noise. Use a separate, lower re-enable threshold (e.g., disable at 80%, re-enable only when CPU drops below 60%).
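A minimal sketch of a hysteresis switch, assuming the feature is disabled at or above 80% CPU and re-enabled only below 60%. The `HysteresisSwitch` class is hypothetical and stands in for whatever toggle mechanism the system already has.

```python
class HysteresisSwitch:
    """Feature toggle with separate disable and re-enable thresholds."""

    def __init__(self, disable_at: float = 80.0, reenable_below: float = 60.0) -> None:
        if reenable_below >= disable_at:
            raise ValueError("re-enable threshold must be lower than disable threshold")
        self.disable_at = disable_at
        self.reenable_below = reenable_below
        self.enabled = True

    def update(self, cpu_percent: float) -> bool:
        """Feed one CPU reading and return whether the feature is enabled."""
        if self.enabled and cpu_percent >= self.disable_at:
            self.enabled = False
        elif not self.enabled and cpu_percent < self.reenable_below:
            self.enabled = True
        # Readings between the two thresholds leave the state unchanged.
        return self.enabled


switch = HysteresisSwitch()
readings = [75.0, 81.0, 79.0, 65.0, 61.0, 59.0, 75.0]
states = [switch.update(cpu) for cpu in readings]
# states == [True, False, False, False, False, True, True]
```

The readings of 79%, 65%, and 61% fall inside the gap between the thresholds, so the feature stays off until CPU drops to 59%, and the later 75% reading does not turn it off again.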
Design Tradeoffs
| Dimension | Automated Load Shedding | Manual Feature Flags |
|---|---|---|
| Response time | Immediate — triggers on metric threshold | Slow — requires human action during incident |
| Flapping risk | Yes — requires hysteresis thresholds | No — only changes when operator acts |
| Incident response burden | Low — operator just monitors | High — operator must diagnose and toggle manually |
| Best for | High-traffic consumer apps with defined feature tiers | Low-traffic apps or infrequent, planned maintenance |
Best Practices
Emit a feature.degraded{name=recommendations} metric every time a fallback is served, and alert the on-call team. Graceful degradation should not be invisible; it is a signal that the system is under stress.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Consumer web app has distinct critical and non-critical features with clear business priority ordering | Financial transaction systems where partial execution produces incorrect state (partial transfers, partial trades) |
| Traffic is highly variable — e-commerce, media streaming, news sites during breaking events | All features are equally critical and no feature can be removed without breaking the core value proposition |
| Downstream dependencies have variable SLAs and can degrade unexpectedly | Safety-critical systems where reduced functionality is unacceptable (medical, aviation, industrial control) |