Health Check Design
Liveness probes restart deadlocked processes; readiness probes remove unhealthy instances from load balancers without restarting them — conflating the two causes cascading cluster-wide restarts when any downstream dependency degrades.
- Liveness probes restart deadlocked processes; readiness probes remove instances from the load balancer without restarting them. Conflating the two — using a database-connected health check as a liveness probe — causes cluster-wide cascading restarts when a downstream dependency degrades.
- Kubernetes startupProbe with failureThreshold=30, periodSeconds=10 gives Java services up to 5 minutes to boot without triggering premature liveness-probe restarts during initialization.
- A liveness probe that calls an external dependency fails when that dependency is unavailable, causing Kubernetes to restart the container, which then hammers the recovering dependency with thundering-herd reconnects from every restarted pod simultaneously.
- Spring Boot 2.3+ exposes /actuator/health/liveness (internal application state only, no external dependencies) and /actuator/health/readiness (where a DB connection check can be included) as separate endpoints. Wire each to the correct probe type.
- Cache readiness check results for 2–5 seconds. At 100 pods with a 10-second probe interval, uncached checks send 10 health check queries per second to the database; at 1-second intervals, 100 queries per second can saturate the connection pool.
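The caching advice can be sketched as a small time-to-live wrapper around the real dependency check. This is a minimal illustration, not a library API: the class name `CachedReadiness`, the injected `clock`, and the 3-second default are assumptions chosen for the example. Each pod holds its own cache, so the wrapper bounds dependency hits to one per TTL per pod no matter how many probers call it.

```python
import threading
import time
from typing import Callable


class CachedReadiness:
    """Serve readiness results from a short-lived cache (hypothetical helper)."""

    def __init__(self, check: Callable[[], bool], ttl_seconds: float = 3.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check            # the expensive check, e.g. a DB ping
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the cache can be tested
        self._expires_at = float("-inf")
        self._last = False
        self._lock = threading.Lock()  # probes may arrive on several threads

    def is_ready(self) -> bool:
        with self._lock:
            now = self._clock()
            if now >= self._expires_at:
                # Cache expired: run the real check once and remember the result.
                self._last = self._check()
                self._expires_at = now + self._ttl
            return self._last


def probe_requests_per_second(pods: int, period_seconds: float) -> float:
    """Fleet-wide probe rate if every probe reaches the dependency uncached."""
    return pods / period_seconds
```

With the figures above, `probe_requests_per_second(100, 10)` gives 10.0 and `probe_requests_per_second(100, 1)` gives 100.0.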
The Problem
A Java Spring Boot service is configured with a single livenessProbe: httpGet: /health that queries the database to verify connectivity. The primary database suffers a transient 30-second latency spike during a snapshot backup. The health check query times out, causing the liveness probe to fail. Kubernetes restarts the pod. At 50 replicas, all pods fail their liveness probes within 30 seconds — all 50 restart simultaneously. On restart, each pod immediately attempts to open a database connection pool of 10 connections — 500 new connection attempts hit the recovering database at once. The database's connection limit is 200. The thundering herd of reconnects prevents the database from recovering. A 30-second latency spike has become a 20-minute cluster-wide outage caused by the health check system itself.
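The arithmetic behind this outage can be checked with a short sketch. The function names are hypothetical, and the model assumes every restarted pod opens its full pool at the same moment, which is the worst case described above.

```python
def reconnect_storm(replicas: int, pool_size: int, db_connection_limit: int) -> dict:
    """Connection attempts after a simultaneous restart, and how many exceed the limit."""
    attempts = replicas * pool_size
    over_limit = max(0, attempts - db_connection_limit)
    return {"attempts": attempts, "over_limit": over_limit}


def max_simultaneous_restarts(pool_size: int, db_connection_limit: int) -> int:
    """Largest number of pods that can reconnect at once without exceeding the limit."""
    return db_connection_limit // pool_size


storm = reconnect_storm(replicas=50, pool_size=10, db_connection_limit=200)
```

For the scenario above, `storm` reports 500 attempts, 300 of them beyond the 200-connection limit, and `max_simultaneous_restarts(10, 200)` shows that only 20 pods could have reconnected together safely.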
Core System Idea
A robust health check architecture separates three distinct concepts:
(1) Liveness probes determine whether the process is in an unrecoverable state: deadlocked threads, JVM OOM, or corrupted internal state. If the liveness probe fails, the container is restarted. The liveness probe must check only internal process state, never external dependencies. A process that cannot connect to a database is not deadlocked; it should be removed from the load balancer, not restarted.
(2) Readiness probes determine whether the instance can serve traffic. If the readiness probe fails, the instance is removed from the load balancer but is not restarted. The readiness probe may check local cache warmth, database connection pool availability, or internal configuration loading. The key constraint: if all instances fail readiness simultaneously, the load balancer has no healthy targets and routing is disrupted. During rollouts, setting minReadySeconds on the Deployment reduces this risk by requiring each new pod to stay ready for a sustained window before the rollout proceeds.
(3) Startup probes disable liveness and readiness probes during the initialization phase, preventing premature restarts before a slow-starting application is fully initialized. Configure failureThreshold × periodSeconds to match the worst-case startup time (e.g., failureThreshold=30 × periodSeconds=10 = 300 seconds for Java services loading large classpath resources). Once the startup probe succeeds, it hands off to the liveness and readiness probes.
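The split between liveness and readiness can be sketched with the Python standard library. The paths `/livez` and `/readyz`, the function names, and the `dependencies_ok` callback are assumptions made for illustration; the point is only that the liveness branch never calls the dependency check.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def probe_status(path: str, dependencies_ok: Callable[[], bool]) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        # Liveness: answering at all proves the process is not deadlocked.
        # No external dependency is consulted here.
        return 200
    if path == "/readyz":
        # Readiness: may consult dependencies; failure only removes the
        # instance from the load balancer, it does not restart it.
        return 200 if dependencies_ok() else 503
    return 404


def make_handler(dependencies_ok: Callable[[], bool]) -> type:
    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            self.send_response(probe_status(self.path, dependencies_ok))
            self.end_headers()

        def log_message(self, format: str, *args) -> None:
            pass  # keep probe traffic out of the access log

    return ProbeHandler


def serve(port: int, dependencies_ok: Callable[[], bool]) -> None:
    """Blocking helper; call it from the service's own startup code."""
    HTTPServer(("", port), make_handler(dependencies_ok)).serve_forever()
```

When the dependency check fails, `/readyz` returns 503 while `/livez` keeps returning 200, so the orchestrator drains the instance instead of restarting it.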
System Flow
Startup, liveness, and readiness probes serve distinct functions — startup disables the others during boot, liveness triggers restarts for deadlocked processes only, readiness controls load balancer membership without restarting.
Real-World Examples (Indicative)
A documented failure mode at multiple companies: a liveness probe configured as httpGet: /health whose handler runs SELECT 1 against the database. When the database is overloaded, the SELECT returns in 6 seconds, above the probe's timeoutSeconds=5, so every probe counts as a failure. After failureThreshold consecutive failures (3 by default), Kubernetes marks the container not live and restarts it. With 50 replicas and a 10-second probe interval, all 50 pods cross that threshold within the same 30-second window and restart simultaneously. Spring Boot 2.3 (released May 2020) added separate availability probes that help avoid this: /actuator/health/liveness reports only the application's internal liveness state and does not call external systems, while /actuator/health/readiness reports readiness state and can be extended with dependency checks such as the DB health indicator (for example, management.endpoint.health.group.readiness.include=readinessState,db). The standard Kubernetes pod spec wires these separately: livenessProbe: httpGet: path: /actuator/health/liveness, readinessProbe: httpGet: path: /actuator/health/readiness. With the database check in the readiness group only, a slow database fails the readiness probe (pod removed from the LB) but not the liveness probe (no restart triggered), which eliminates the cascade.
ECS task definitions configure health checks independently of ALB target group health checks. The ECS-level check: command: ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"], interval: 30, timeout: 5, retries: 3, startPeriod: 120. The startPeriod: 120 is ECS's equivalent of a startup probe — health check failures during the first 120 seconds do not count toward the retries limit. Without startPeriod, a Java service taking 90 seconds to start fails 3 consecutive checks (at 30-second intervals) and ECS marks the task as unhealthy and replaces it before it finishes booting. The ALB target group health check also runs independently: healthyThreshold=2, unhealthyThreshold=3, interval=30, timeout=5 — both must pass for traffic to reach the container.
Consul uses health check results to update its service catalog, which Envoy and other proxies use for endpoint discovery. Consul's guard against flapping-driven deregistration is DeregisterCriticalServiceAfter: "5m": a service that continuously fails health checks is not immediately deregistered from the catalog; Consul waits until the check has been critical for 5 minutes before removing the registration. This prevents a brief network partition from deleting every instance of a service from the catalog at once. During that window the instance stays registered but is marked critical, so health-aware lookups skip it. Consul also supports interval: "10s" and timeout: "2s" at the check definition level. Because these are much shorter than the deregistration window, a recovering service is marked passing again within about 10 seconds of recovery while still being protected from flapping-based deregistration.
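As a rough sketch, the check settings above could be assembled into a service registration payload of the shape accepted by the Consul agent's service registration endpoint. The service name, port, and health URL are hypothetical, and `parse_duration` and `validate_check` are illustrative helpers, not Consul functions.

```python
import json


def parse_duration(value: str) -> int:
    """Convert a simple duration such as "10s" or "5m" to seconds (s, m, h only)."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(value[:-1]) * units[value[-1]]


registration = {
    "Name": "orders",  # hypothetical service name
    "Port": 8080,      # hypothetical port
    "Check": {
        "HTTP": "http://localhost:8080/health",  # hypothetical endpoint
        "Interval": "10s",
        "Timeout": "2s",
        "DeregisterCriticalServiceAfter": "5m",
    },
}


def validate_check(check: dict) -> None:
    """Require timeout < interval < deregistration window, as described above."""
    timeout = parse_duration(check["Timeout"])
    interval = parse_duration(check["Interval"])
    deregister = parse_duration(check["DeregisterCriticalServiceAfter"])
    if not timeout < interval < deregister:
        raise ValueError("expected Timeout < Interval < DeregisterCriticalServiceAfter")


validate_check(registration["Check"])
payload = json.dumps(registration, indent=2)  # request body for the agent API
```

Keeping the ordering check next to the payload catches a common misconfiguration, such as a timeout longer than the interval, before the registration is sent.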
Anti-Patterns
Querying external dependencies in liveness probes causes the orchestrator to restart healthy containers when any dependency degrades. A container that cannot connect to a database is not deadlocked — it should be drained and removed from the load balancer, not killed and restarted under thundering-herd conditions.
Configuring a liveness probe with initialDelaySeconds=30 on a Java service that takes 90 seconds to load its Spring context kills the service before it can start. With periodSeconds=10 and failureThreshold=3 (the Kubernetes defaults), the liveness probe starts firing at 30 seconds, fails three times (at 30s, 40s, 50s), and Kubernetes kills the container at 50 seconds, 40 seconds before it would have been ready. Use startupProbe instead of initialDelaySeconds.
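The timeline can be reproduced with a simplified model. It assumes probes fire exactly at initial_delay + k × period and fail for as long as the service is still booting; it ignores timeoutSeconds and kubelet scheduling jitter, and `probe_kill_time` is a hypothetical name.

```python
from typing import Optional


def probe_kill_time(boot_seconds: int, initial_delay: int, period: int,
                    failure_threshold: int) -> Optional[int]:
    """Return the second at which the container is killed, or None if it survives boot."""
    failures = 0
    t = initial_delay
    while t < boot_seconds:           # every probe before boot completes fails
        failures += 1
        if failures >= failure_threshold:
            return t                  # threshold reached: container is killed
        t += period
    return None                       # service finished booting first


# Liveness probe with initialDelaySeconds=30 and default period/threshold.
killed_at = probe_kill_time(boot_seconds=90, initial_delay=30, period=10,
                            failure_threshold=3)

# Startup probe with failureThreshold=30, periodSeconds=10 on the same service.
with_startup_probe = probe_kill_time(boot_seconds=90, initial_delay=0, period=10,
                                     failure_threshold=30)
```

Under these assumptions `killed_at` is 50, matching the timeline above, while `with_startup_probe` is `None` because the 90-second boot finishes well inside the startup probe's budget.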
A /health endpoint that opens a new database connection on every probe request, without connection reuse, turns monitoring into load. At 50 pods with 10-second probe intervals, health checks alone create 5 new database connections per second (50 ÷ 10), each paying the full connection setup cost, so the monitoring system itself generates a steady background load on the database.
Setting probe failureThreshold=1 means a single failed probe removes an instance from the load balancer. A transient 1-second network blip that coincides with one probe cycle can mark every instance unready at the same moment, leaving the load balancer with no healthy targets until the next successful probe; whichever instances recover first then absorb the full traffic spike.
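The effect of the threshold can be illustrated with a small state tracker that counts consecutive probe results. This is a sketch loosely modeled on how failureThreshold and successThreshold behave; `ReadinessGate` is a hypothetical name, not a Kubernetes API.

```python
class ReadinessGate:
    """Track readiness from consecutive probe results (illustrative model)."""

    def __init__(self, failure_threshold: int = 3, success_threshold: int = 1) -> None:
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.ready = True
        self._failures = 0
        self._successes = 0

    def observe(self, success: bool) -> bool:
        """Record one probe result and return the resulting readiness state."""
        if success:
            self._failures = 0
            self._successes += 1
            if not self.ready and self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._successes = 0
            self._failures += 1
            if self.ready and self._failures >= self.failure_threshold:
                self.ready = False
        return self.ready


blip = [True, False, True]  # one failed probe caused by a transient network blip
strict = [ReadinessGate(failure_threshold=1).observe(r) for r in blip]
tolerant_gate = ReadinessGate(failure_threshold=3)
tolerant = [tolerant_gate.observe(r) for r in blip]
```

With a threshold of 1 the single failed probe flips the instance to unready, whereas with a threshold of 3 the same blip is absorbed and the instance stays in rotation throughout.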
Design Tradeoffs
| Dimension | Shallow Liveness Check | Deep Readiness Check |
|---|---|---|
| What it checks | Internal process state only: thread liveness, memory, deadlock | External dependency reachability: DB connection, cache warmth |
| Failure action | Container restart — use only for truly unrecoverable states | Load balancer removal — use for transient dependency unavailability |
| Cascade risk | High if external dependencies are included — restarts amplify load | Low — removal from LB does not generate thundering-herd reconnects |
| Frequency safety | Safe at 1-second intervals — purely in-process | Requires caching (2–5s) to avoid probe traffic saturating dependencies |
Best Practices
- For liveness, return 200 OK from an in-memory handler that does not touch any external resource. If in doubt, make the liveness probe return 200 unconditionally: a deadlocked process cannot respond at all.
- Set the startup probe's failureThreshold × periodSeconds to the worst-case startup time plus a 50% buffer. A Java service with a 60-second worst-case boot needs failureThreshold=9, periodSeconds=10 (90 seconds of startup tolerance).
- [...] terminationGracePeriodSeconds) before the process exits.
- Set minReadySeconds on Deployments so that a rolling restart does not replace every pod before the new ones have proven stable: a value of 15–30 seconds requires each new pod to stay ready for a sustained window before the rollout proceeds.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Running applications in container orchestrators like Kubernetes or ECS where automated recovery is required | Running simple, single-instance legacy applications where manual restarts are the recovery mechanism |
| Operating microservices with complex, multi-stage startup and initialization phases | Serverless environments where execution lifecycles are managed entirely by the cloud provider |
| Managing dynamic load balancers that route traffic based on real-time node availability | Static, hardcoded routing environments where automated failover is not configured |