
Health Check Design

Liveness probes restart deadlocked processes; readiness probes remove unhealthy instances from load balancers without restarting them — conflating the two causes cascading cluster-wide restarts when any downstream dependency degrades.

TL;DR
  • Liveness probes restart deadlocked processes; readiness probes remove instances from the load balancer without restarting them. Conflating the two — using a database-connected health check as a liveness probe — causes cluster-wide cascading restarts when a downstream dependency degrades.
  • Kubernetes startupProbe with failureThreshold=30, periodSeconds=10 gives Java services up to 5 minutes to boot without triggering premature liveness-probe restarts during initialization.
  • A liveness probe that calls an external dependency fails when that dependency is unavailable — causing Kubernetes to restart the container, which then hammers the recovering dependency with thundering-herd reconnects from every restarted pod simultaneously.
  • Spring Boot 2.3+ exposes /actuator/health/liveness and /actuator/health/readiness as separate endpoints. By default the liveness endpoint reports only the application's internal state and touches no external dependency; the readiness endpoint reports whether the application can accept traffic and can be configured to include checks such as the DB connection pool. Wire each to the correct probe type.
  • Cache readiness check results for 2–5 seconds. At 100 pods with a 10-second probe interval, health checks send about 10 queries per second to the database; at 1-second intervals that becomes 100 per second, which can saturate the connection pool.
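The caching bullet above can be sketched as a small time-to-live wrapper around an expensive dependency check. This is a minimal illustration, not a production implementation: the class name `CachedReadiness`, the injected `clock` parameter, and the 3-second default are assumptions made for the example, and the sketch is not thread-safe.

```python
import time
from typing import Callable


class CachedReadiness:
    """Caches the result of an expensive readiness check for ttl_seconds."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_seconds: float = 3.0,  # assumed value inside the 2-5 second range
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = float("-inf")  # forces a real check on the first call
        self._last_result = False

    def is_ready(self) -> bool:
        now = self._clock()
        if now >= self._expires_at:
            # Cache expired: run the real dependency check once and remember it.
            try:
                self._last_result = bool(self._check())
            except Exception:
                # A failing dependency means "not ready", never a crashed handler.
                self._last_result = False
            self._expires_at = now + self._ttl
        return self._last_result
```

A readiness handler would call `is_ready()` on every probe request. The cache matters most when more than one prober (for example an orchestrator and a load balancer) hits the same endpoint, because only one call per TTL window reaches the dependency.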

The Problem

A Java Spring Boot service is configured with a single livenessProbe: httpGet: /health that queries the database to verify connectivity. The primary database suffers a transient 30-second latency spike during a snapshot backup. The health check query times out, causing the liveness probe to fail. Kubernetes restarts the pod. At 50 replicas, all pods fail their liveness probes within 30 seconds — all 50 restart simultaneously. On restart, each pod immediately attempts to open a database connection pool of 10 connections — 500 new connection attempts hit the recovering database at once. The database's connection limit is 200. The thundering herd of reconnects prevents the database from recovering. A 30-second latency spike has become a 20-minute cluster-wide outage caused by the health check system itself.
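The reconnect arithmetic in this scenario is worth making explicit. The sketch below only reproduces the numbers from the paragraph above (50 replicas, a pool of 10, a limit of 200); the function name and the returned keys are made up for illustration, and real databases may queue or throttle connections rather than reject them outright.

```python
def reconnect_storm(replicas: int, pool_size: int, db_connection_limit: int) -> dict:
    """Estimates how far simultaneous pool re-creation overshoots the DB limit."""
    attempted = replicas * pool_size
    rejected = max(0, attempted - db_connection_limit)
    return {
        "attempted": attempted,
        "rejected": rejected,
        "overload_factor": attempted / db_connection_limit,
    }


storm = reconnect_storm(replicas=50, pool_size=10, db_connection_limit=200)
print(storm)  # {'attempted': 500, 'rejected': 300, 'overload_factor': 2.5}
```

At 2.5 times the connection limit, more than half of the reconnect attempts cannot succeed on the first try, which is what keeps the database under pressure after the original latency spike has passed.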

Core System Idea

A robust health check architecture separates three distinct concepts:

(1) Liveness probes determine whether the process is in an unrecoverable state: deadlocked threads, JVM OOM, or corrupted internal state. If the liveness probe fails, the container is restarted. The liveness probe must check only internal process state, never external dependencies. A process that cannot connect to a database is not deadlocked; it should be removed from the load balancer, not restarted.

(2) Readiness probes determine whether the instance can serve traffic. If the readiness probe fails, the instance is removed from the load balancer but is not restarted. The readiness probe may check local cache warmth, database connection pool availability, or internal configuration loading. The key constraint: if all instances fail readiness simultaneously, the load balancer has no backends left to route to, so a readiness check on a shared dependency can take the whole service out of rotation at once. minReadySeconds does not prevent this; it only requires a newly created pod to stay ready for a set time before a rollout counts it as available.

(3) Startup probes hold off liveness and readiness probes during the initialization phase, preventing premature restarts before a slow-starting application is fully initialized. Configure failureThreshold × periodSeconds to cover the worst-case startup time (e.g., failureThreshold=30, periodSeconds=10 allows 300 seconds for Java services loading large classpath resources). Once the startup probe succeeds, liveness and readiness probes take over.
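The liveness/readiness split can be illustrated with two handlers that differ only in what they consult. This is a minimal sketch using the Python standard library; the paths `/livez` and `/readyz`, the `dependencies_ready` flag, and the handler names are assumptions for the example rather than anything prescribed by Kubernetes.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Hypothetical flag the application sets once its dependencies are usable
# and clears when they are not (for example, when the DB pool is exhausted).
dependencies_ready = threading.Event()


def liveness_status() -> int:
    # In-process only: if this code runs at all, the process can still answer.
    return 200


def readiness_status() -> int:
    # Reflects dependency state; a 503 removes the instance from rotation
    # without asking the orchestrator to restart it.
    return 200 if dependencies_ready.is_set() else 503


ROUTES = {"/livez": liveness_status, "/readyz": readiness_status}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        route = ROUTES.get(self.path)
        status = route() if route else 404
        self.send_response(status)
        self.end_headers()
```

To serve it, something like `HTTPServer(("", 8080), ProbeHandler).serve_forever()` from `http.server` would do. The design point is that `liveness_status` never reads `dependencies_ready`, so a dependency outage can change routing but cannot trigger a restart.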

System Flow

flowchart TD
    A["Orchestrator Probe"] --> B{"Probe Type?"}
    B -- "Startup" --> C{"App Booted?"}
    C -- "No" --> D["Wait and Disable Other Probes"]
    C -- "Yes" --> E["Enable Liveness and Readiness"]
    B -- "Liveness" --> F{"Process Healthy?"}
    F -- "No" --> G["Restart Container"]
    F -- "Yes" --> H["Keep Running"]
    B -- "Readiness" --> I{"Ready for Traffic?"}
    I -- "No" --> J["Remove from Load Balancer"]
    I -- "Yes" --> K["Route Traffic"]

Startup, liveness, and readiness probes serve distinct functions — startup disables the others during boot, liveness triggers restarts for deadlocked processes only, readiness controls load balancer membership without restarting.
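The same decision flow can be written as a small lookup, which makes the asymmetry between the probe types easy to see. The enum and function names are illustrative only. The sketch follows the flowchart and omits one real-world detail: in Kubernetes, a startup probe that exhausts its failureThreshold also causes the container to be restarted.

```python
from enum import Enum


class Probe(Enum):
    STARTUP = "startup"
    LIVENESS = "liveness"
    READINESS = "readiness"


def orchestrator_action(probe: Probe, passed: bool) -> str:
    """Maps a single probe outcome to the action shown in the flowchart."""
    if probe is Probe.STARTUP:
        return "enable liveness and readiness" if passed else "wait; other probes stay disabled"
    if probe is Probe.LIVENESS:
        return "keep running" if passed else "restart container"
    # Readiness only ever changes routing, never the container lifecycle.
    return "route traffic" if passed else "remove from load balancer"


for probe in Probe:
    print(probe.value, "->", orchestrator_action(probe, passed=False))
```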

Real-World Examples (Indicative)

Kubernetes liveness-causes-cascade incident pattern

A documented failure mode at multiple companies: a liveness probe configured as httpGet: /health whose handler runs SELECT 1 against the database. When the database is overloaded, the SELECT returns in 6 seconds, above the probe's timeoutSeconds=5, so the probe fails. After failureThreshold consecutive failures (3 by default), Kubernetes restarts the container. With 50 replicas and a 10-second probe interval, all 50 pods fail within the same 30-second window and restart simultaneously. Spring Boot 2.3 (released May 2020) added dedicated probe endpoints that address this: /actuator/health/liveness reports only the application's internal liveness state and consults no external dependency, while /actuator/health/readiness reports whether the application is ready to accept traffic and can be configured to include checks such as DB pool availability. The standard Kubernetes pod spec wires these separately: a livenessProbe with httpGet path /actuator/health/liveness, and a readinessProbe with httpGet path /actuator/health/readiness. The fix eliminates the cascade because a slow database fails the readiness probe (pod removed from LB) but not the liveness probe (no restart triggered).

AWS ECS health check configuration specifics

ECS task definitions configure health checks independently of ALB target group health checks. The ECS-level check: command: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"], interval: 30, timeout: 5, retries: 3, startPeriod: 120. The startPeriod: 120 is ECS's equivalent of a startup probe — health check failures during the first 120 seconds do not count toward the retries limit. Without startPeriod, a Java service taking 90 seconds to start fails 3 consecutive checks (at 30-second intervals) and ECS marks the task as unhealthy and replaces it before it finishes booting. The ALB target group health check also runs independently: healthyThreshold=2, unhealthyThreshold=3, interval=30, timeout=5 — both must pass for traffic to reach the container.
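For reference, the container-level check described above can be assembled as a plain dictionary in the shape ECS task definitions use for `healthCheck`. The helper name and its defaults are assumptions for illustration; the sketch builds the structure only and does not call any AWS API.

```python
def ecs_health_check(
    port: int = 8080,
    path: str = "/health",
    *,
    interval: int = 30,
    timeout: int = 5,
    retries: int = 3,
    start_period: int = 120,
) -> dict:
    """Builds a container healthCheck block with the values from the text."""
    return {
        "command": ["CMD-SHELL", f"curl -f http://localhost:{port}{path} || exit 1"],
        "interval": interval,         # seconds between checks
        "timeout": timeout,           # seconds before one check counts as failed
        "retries": retries,           # consecutive failures before "unhealthy"
        "startPeriod": start_period,  # grace period during which failures are ignored
    }


health_check = ecs_health_check()
print(health_check["command"][1])  # curl -f http://localhost:8080/health || exit 1
```

The resulting dictionary would sit under a container definition's `healthCheck` key. Note that the `curl` binary must exist inside the container image for this command to work.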

Consul service registry with flapping protection

Consul uses health check results to update its service catalog, which Envoy and other proxies use for endpoint discovery. Consul's anti-flapping mechanism: DeregisterCriticalServiceAfter: "5m" — a service that continuously fails health checks is not immediately deregistered from the catalog; it waits 5 minutes before removal. This prevents a brief network partition from removing all instances of a service from Envoy's cluster config simultaneously. Consul also supports interval: "10s" and timeout: "2s" at the check definition level — shorter than the deregistration window, so a recovering service reappears in the catalog within 10 seconds of recovery while still being protected from flapping-based deregistration.
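The check settings above can be sketched as the JSON body that a service would send to the local Consul agent's registration endpoint (`/v1/agent/service/register`). The service name `orders`, the port, and the helper function are hypothetical; the sketch only builds and prints the payload and makes no network call.

```python
import json


def consul_service_registration(name: str, port: int, health_path: str = "/health") -> dict:
    """Builds a service registration body with an HTTP check attached."""
    return {
        "Name": name,
        "Port": port,
        "Check": {
            "HTTP": f"http://localhost:{port}{health_path}",
            "Interval": "10s",                       # how often the agent runs the check
            "Timeout": "2s",                         # per-check timeout
            "DeregisterCriticalServiceAfter": "5m",  # removal delay while critical
        },
    }


registration = consul_service_registration("orders", 8080)  # hypothetical service
print(json.dumps(registration, indent=2))
```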

Anti-Patterns

Database queries inside liveness probes

Querying external dependencies in liveness probes causes the orchestrator to restart healthy containers when any dependency degrades. A container that cannot connect to a database is not deadlocked — it should be drained and removed from the load balancer, not killed and restarted under thundering-herd conditions.

No startup probe on slow-initializing services

A common mistake is configuring a liveness probe with initialDelaySeconds=30 on a Java service that takes 90 seconds to load its Spring context. With the default periodSeconds=10 and failureThreshold=3, the liveness probe starts firing at 30 seconds, fails three times (at 30s, 40s, 50s), and Kubernetes kills the container at about 50 seconds, 40 seconds before it would have been ready. Use startupProbe instead of initialDelaySeconds.
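The timeline in this anti-pattern, and the startup-probe sizing that replaces it, both reduce to simple arithmetic. The sketch below is an idealized model that ignores probe jitter and timeout duration; the function names are made up for the example, and the 50% buffer is the rule of thumb given under Best Practices.

```python
import math


def liveness_kill_time(initial_delay_s: int, period_s: int, failure_threshold: int) -> int:
    """Seconds after container start at which the Nth consecutive failure lands."""
    return initial_delay_s + (failure_threshold - 1) * period_s


def startup_failure_threshold(worst_case_boot_s: float, period_s: int, buffer: float = 0.5) -> int:
    """failureThreshold so that threshold x period covers boot time plus a buffer."""
    return math.ceil(worst_case_boot_s * (1 + buffer) / period_s)


kill_at = liveness_kill_time(initial_delay_s=30, period_s=10, failure_threshold=3)
print(kill_at)        # 50
print(90 - kill_at)   # 40 seconds short of a 90-second boot
print(startup_failure_threshold(worst_case_boot_s=60, period_s=10))  # 9
```

For the 90-second service in this example, the same rule gives a failureThreshold of 14 at periodSeconds=10, which tolerates up to 140 seconds of startup.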

Unprotected health check endpoints

A /health endpoint that opens a new database connection on every probe request, without connection reuse, turns the health check system into a load source. At 50 pods with 10-second probe intervals, this creates 5 new database connections per second from health checks alone, a steady load generated by the monitoring system itself.
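The probe-generated load is the pod count divided by the probe period. A minimal sketch, with a hypothetical function name and an assumed `queries_per_probe` parameter for endpoints that touch the database more than once per request:

```python
def probe_queries_per_second(pods: int, period_s: float, queries_per_probe: int = 1) -> float:
    """Average dependency hits per second caused by health probes alone."""
    return pods * queries_per_probe / period_s


print(probe_queries_per_second(pods=50, period_s=10))   # 5.0
print(probe_queries_per_second(pods=100, period_s=10))  # 10.0
print(probe_queries_per_second(pods=100, period_s=1))   # 100.0
```

This is an average; in practice probes are not evenly spread, so short bursts can exceed it.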

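Flapping readiness with failureThreshold=1

Setting the readiness probe's failureThreshold=1 means a single failed probe removes an instance from the load balancer. A transient 1-second network blip can fail one probe on every instance at once, so all instances leave rotation together and return together: requests fail while nothing is ready, then arrive as a spike when the instances come back. Keep failureThreshold at 2 or more (the Kubernetes default is 3) so that one missed probe does not change load balancer membership.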
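The effect of the threshold can be seen in a small state machine that counts consecutive probe results, modelled loosely on how failureThreshold and successThreshold behave. The class is an illustrative sketch, not the kubelet's actual implementation.

```python
class ReadinessState:
    """Tracks ready/unready using consecutive-failure and consecutive-success counts."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ready = True
        self._fails = 0
        self._successes = 0

    def observe(self, probe_passed: bool) -> bool:
        if probe_passed:
            self._fails = 0
            self._successes += 1
            if not self.ready and self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._successes = 0
            self._fails += 1
            if self.ready and self._fails >= self.failure_threshold:
                self.ready = False
        return self.ready


blip = [True, False, True]  # one missed probe between two good ones
fragile = ReadinessState(failure_threshold=1)
tolerant = ReadinessState(failure_threshold=3)
print([fragile.observe(p) for p in blip])   # [True, False, True]
print([tolerant.observe(p) for p in blip])  # [True, True, True]
```

With a threshold of 1 the instance flaps out and back in on a single blip; with 3 it stays in rotation and only leaves after three consecutive failures.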

Design Tradeoffs

| Dimension | Shallow Liveness Check | Deep Readiness Check |
| --- | --- | --- |
| What it checks | Internal process state only: thread liveness, memory, deadlock | External dependency reachability: DB connection, cache warmth |
| Failure action | Container restart; use only for truly unrecoverable states | Load balancer removal; use for transient dependency unavailability |
| Cascade risk | High if external dependencies are included, because restarts amplify load | Low, because removal from the LB does not generate thundering-herd reconnects |
| Frequency safety | Safe at 1-second intervals (purely in-process) | Requires caching (2–5s) to avoid probe traffic saturating dependencies |

Best Practices

  • Wire liveness probes only to in-process checks: thread deadlock detection, JVM OOM state, or a simple 200 OK from an in-memory handler that does not touch any external resource. If in doubt, make the liveness probe return 200 unconditionally; a process whose request handling is deadlocked cannot respond at all.
  • Use startup probes for any service with initialization longer than 30 seconds. Set failureThreshold × periodSeconds to the worst-case startup time plus a 50% buffer. A Java service with a 60-second worst-case boot needs failureThreshold=9, periodSeconds=10 (90 seconds of startup tolerance).
  • Cache readiness check results for 2–5 seconds when the check involves an external call. At 100 pods with a 10-second probe interval, uncached checks generate 10 database queries/second from health infrastructure alone.
  • Implement graceful shutdown: on SIGTERM, immediately fail the readiness probe (pod removed from load balancer, new requests stop arriving) while keeping the liveness probe passing. Let in-flight requests drain, within the limit set by terminationGracePeriodSeconds, before the process exits.
  • Set minReadySeconds on Deployments so that a new pod must stay ready for a sustained window (15–30 seconds) before it counts as available and the rollout moves on. This keeps a rolling update from replacing healthy pods with pods that pass readiness once and then fail.
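The graceful-shutdown practice above can be sketched with a signal handler that flips readiness while leaving liveness alone. The names `shutting_down`, `install_sigterm_handler`, and the two status functions are assumptions for the example; draining in-flight requests and stopping the server are left out.

```python
import signal
import threading

# Set once SIGTERM arrives; read by the readiness handler on every probe.
shutting_down = threading.Event()


def handle_sigterm(signum, frame) -> None:
    # Only record the request to stop; the main loop drains and exits later.
    shutting_down.set()


def install_sigterm_handler() -> None:
    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handle_sigterm)


def readiness_during_shutdown() -> int:
    # Failing readiness stops new traffic from being routed to this instance.
    return 503 if shutting_down.is_set() else 200


def liveness_during_shutdown() -> int:
    # Liveness keeps passing so the container is not killed while draining.
    return 200
```

An application would call `install_sigterm_handler()` at startup, serve the two status functions from its probe endpoints, and exit once in-flight work has finished.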

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Running applications in container orchestrators like Kubernetes or ECS where automated recovery is required | Running simple, single-instance legacy applications where manual restarts are the recovery mechanism |
| Operating microservices with complex, multi-stage startup and initialization phases | Serverless environments where execution lifecycles are managed entirely by the cloud provider |
| Managing dynamic load balancers that route traffic based on real-time node availability | Static, hardcoded routing environments where automated failover is not configured |