Health Check Endpoints — What a Proper /health Looks Like
Picture a bouncer at the entrance of a busy club. A "liveness" health check is like asking the bouncer, "Are you alive?": a quick, superficial check. A "readiness" check is like asking, "Are you ready to let people in?", which might involve checking the coat room and the bar. Running a deep DB query for liveness is like asking the bouncer to fetch a drink for every person trying to get in.
The Setup
You build a /health endpoint that Kubernetes polls to monitor your service. Every 5 seconds, the probe runs a real database query. The script below simulates that probe colliding with a user request on a pool that holds a single connection.
What Does This Print?
import asyncio


class DatabasePool:
    def __init__(self):
        self.available_connections = 1

    async def query(self):
        if self.available_connections <= 0:
            raise RuntimeError("Timeout: No database connections available!")
        self.available_connections -= 1
        await asyncio.sleep(0.5)  # Simulate long query
        self.available_connections += 1


db_pool = DatabasePool()


async def check_health():
    try:
        # Naive deep probe runs actual DB query on every orchestrator ping
        await db_pool.query()
        return "200 OK"
    except Exception as e:
        return f"500 Internal Error: {e}"


async def handle_traffic_surge():
    # High concurrent request load and Kubernetes health-probe collide
    results = await asyncio.gather(
        db_pool.query(),  # Real user traffic consuming DB connection
        check_health(),   # Orchestrator health check running concurrently
    )
    print(f"Results: {results}")


asyncio.run(handle_traffic_surge())
The Output
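Results: [None, '500 Internal Error: Timeout: No database connections available!']
The user query took the pool's only connection, so the health check failed immediately and returned a 500 error. The first entry is None because query() returns nothing on success. If this endpoint backs a liveness probe, Kubernetes restarts the container once the probe has failed failureThreshold times in a row (3 by default). The restart kills in-flight client requests and pushes more load onto the remaining instances, which can cascade into a full outage.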
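To make the restart rule concrete, here is a minimal sketch of how an orchestrator might count consecutive probe failures. The LivenessTracker class is hypothetical and written only for illustration. The threshold of 3 mirrors Kubernetes' default failureThreshold, and the success range (200 to 399) matches how Kubernetes interprets HTTP probe responses.

```python
from dataclasses import dataclass


@dataclass
class LivenessTracker:
    """Toy model of an orchestrator counting probe failures (hypothetical)."""
    failure_threshold: int = 3  # assumption: mirrors the Kubernetes default
    consecutive_failures: int = 0
    restarts: int = 0

    def record(self, status_code: int) -> str:
        # Status codes from 200 to 399 count as success; anything else fails.
        if 200 <= status_code < 400:
            self.consecutive_failures = 0
            return "healthy"
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Threshold reached: the container would be restarted here.
            self.consecutive_failures = 0
            self.restarts += 1
            return "restart"
        return "failing"


tracker = LivenessTracker()
outcomes = [tracker.record(code) for code in (200, 500, 500, 500, 200)]
print(outcomes)  # ['healthy', 'failing', 'failing', 'restart', 'healthy']
```

Three failed deep probes in a row are enough to trigger a restart in this model, even though the process itself never stopped working.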
Why Python Does This
Nothing here is unique to Python, but asyncio makes the collision easy to reproduce: every coroutine runs on a single event loop and draws from the same connection pool. A health check that executes a database query competes with client requests for slots in that pool. The toy pool above fails fast when it is empty. Many real pools queue the caller instead, so under saturation the probe blocks until the orchestrator's probe timeout expires. Either way, the service is reported as unhealthy while it is merely busy. Health checks should therefore distinguish liveness (is the event loop running?) from readiness (is the DB reachable?), and should read a connection-health flag that is refreshed periodically in the background instead of running a deep query on every ping.
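The blocking variant of this failure can be sketched as follows. This is an illustration under stated assumptions, not the behavior of any particular driver: WaitingPool and deep_probe are hypothetical names, the pool queues callers with an asyncio.Semaphore, and the 0.1 second timeout stands in for the orchestrator's probe timeout.

```python
import asyncio


class WaitingPool:
    """Hypothetical pool that queues callers instead of failing fast."""

    def __init__(self, size: int = 1):
        self._slots = asyncio.Semaphore(size)

    async def query(self, duration: float) -> None:
        # Wait for a free slot, then hold it for the length of the query.
        async with self._slots:
            await asyncio.sleep(duration)


async def deep_probe(pool: WaitingPool, timeout: float) -> str:
    try:
        # The probe needs a connection, so it queues behind user traffic.
        await asyncio.wait_for(pool.query(0.01), timeout=timeout)
        return "200 OK"
    except asyncio.TimeoutError:
        return "500 probe timed out waiting for a connection"


async def main() -> str:
    pool = WaitingPool(size=1)
    user_request = asyncio.create_task(pool.query(0.5))
    await asyncio.sleep(0)  # Let the user request take the only slot
    status = await deep_probe(pool, timeout=0.1)
    await user_request  # The user request itself completes normally
    return status


probe_status = asyncio.run(main())
print(probe_status)
```

The user request finishes without error, yet the probe reports a failure, which is the mismatch that leads an orchestrator to restart a working instance.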
The Fix
import asyncio


class DatabasePool:
    def __init__(self):
        self.available_connections = 1

    async def query(self):
        if self.available_connections <= 0:
            raise RuntimeError("Timeout: No database connections available!")
        self.available_connections -= 1
        await asyncio.sleep(0.5)
        self.available_connections += 1


db_pool = DatabasePool()

# Fix: keep a cached status instead of running a heavy query on every probe
db_healthy_cache = True


async def update_health_cache():
    # Background task: refresh the cached status on a fixed interval
    global db_healthy_cache
    while True:
        try:
            await db_pool.query()
            db_healthy_cache = True
        except Exception:
            db_healthy_cache = False
        await asyncio.sleep(10)  # Wait before the next refresh


async def check_health():
    # Read the cached status immediately without touching the connection pool
    if db_healthy_cache:
        return "200 OK"
    return "500 Internal Error: DB unavailable"


async def handle_traffic_surge():
    # Start the refresher; without this the cached flag would never be updated
    refresher = asyncio.create_task(update_health_cache())
    await asyncio.sleep(0.6)  # Let the first refresh finish before the surge
    results = await asyncio.gather(
        db_pool.query(),  # Real user traffic consuming the DB connection
        check_health(),   # Probe answers from the cache, no pool access
    )
    print(f"Results: {results}")  # Results: [None, '200 OK']
    refresher.cancel()  # Stop the background loop so the script can exit


asyncio.run(handle_traffic_surge())
Keeping liveness checks lightweight (for example, reading only internal state or a cached result) and separating them from readiness probes stops health checks from competing for critical resources. The application no longer fails its own probes under load, and the orchestrator can make restart and routing decisions from accurate signals. The trade-off is staleness: with a 10-second refresh interval, the cached status can lag a real database outage by up to 10 seconds.
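One way to express the liveness and readiness split is sketched below. The HealthState class, the 30 second freshness window, and the 503 status for "not ready" are assumptions made for illustration. Kubernetes treats any status outside the 200 to 399 range as a probe failure, so 500 would work as well.

```python
import asyncio
import time
from typing import Optional


class HealthState:
    """Hypothetical in-memory record of the last background DB check."""

    def __init__(self, max_age_seconds: float = 30.0):
        self.db_ok = False
        self.last_checked: Optional[float] = None
        self.max_age_seconds = max_age_seconds

    def record(self, ok: bool) -> None:
        # Called by a background refresher, never by the probe handlers.
        self.db_ok = ok
        self.last_checked = time.monotonic()

    def is_fresh(self) -> bool:
        if self.last_checked is None:
            return False
        return time.monotonic() - self.last_checked <= self.max_age_seconds


async def liveness() -> str:
    # Reaching this line shows the event loop is still scheduling handlers.
    return "200 OK"


async def readiness(state: HealthState) -> str:
    # Ready only if the last background check succeeded and is recent.
    if state.db_ok and state.is_fresh():
        return "200 OK"
    return "503 Service Unavailable"


async def demo():
    state = HealthState()
    before = (await liveness(), await readiness(state))
    state.record(ok=True)  # Simulate a successful background check
    after = (await liveness(), await readiness(state))
    return before, after


before, after = asyncio.run(demo())
print(before)  # ('200 OK', '503 Service Unavailable')
print(after)   # ('200 OK', '200 OK')
```

With this split, a saturated or unreachable database only takes the instance out of rotation through the readiness probe. The liveness probe keeps passing, so the container is not restarted for a problem that a restart cannot fix.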
How This Fails in Real Systems
During a promotional sale event, a Kubernetes cluster experienced rolling restarts of its API pods. The heavy load saturated the database connections, causing liveness probes to time out. Kubernetes terminated pods that were otherwise healthy, and the site was completely down for 45 minutes until the deep database ping was removed from the probe.