Health Check Endpoints — What a Proper /health Looks Like

Mental Model

Picture a bouncer at the entrance of a busy club. A "liveness" health check is like asking the bouncer, "Are you alive?": a quick, superficial check. A "readiness" check is like asking, "Are you ready to let people in?", which might involve checking the coat room and the bar. Running a deep DB query for liveness is like asking the bouncer to fetch a drink for every person trying to get in: the door goes unattended exactly when the line is longest.

Rule: When designing health checks, never execute blocking, synchronous, or resource-heavy external calls directly within liveness probes.

The Setup

You build a /health endpoint that Kubernetes probes every 5 seconds. On each probe, the handler runs a real database query. The script below simulates one probe arriving while a user request already holds the pool's only connection.

What Does This Print?

Broken code
Python
import asyncio

class DatabasePool:
    def __init__(self):
        self.available_connections = 1

    async def query(self):
        if self.available_connections <= 0:
            raise RuntimeError("Timeout: No database connections available!")
        self.available_connections -= 1
        await asyncio.sleep(0.5) # Simulate long query
        self.available_connections += 1

db_pool = DatabasePool()

async def check_health():
    try:
        # Naive deep probe runs actual DB query on every orchestrator ping
        await db_pool.query()
        return "200 OK"
    except Exception as e:
        return f"500 Internal Error: {e}"

async def handle_traffic_surge():
    # High concurrent request load and Kubernetes health-probe collide
    results = await asyncio.gather(
        db_pool.query(),  # Real user traffic consuming DB connection
        check_health()     # Orchestrator health check running concurrently
    )
    print(f"Results: {results}")

asyncio.run(handle_traffic_surge())
Before reading on, predict what /health returns when the probe and a user request compete for the same database connection.

The Output

What actually happens
Results: [None, '500 Internal Error: Timeout: No database connections available!']

The user request took the pool's only connection, so the health check's query failed immediately and the endpoint returned a 500. An orchestrator that sees its liveness probe fail repeatedly treats the instance as dead and restarts it. The restart kills in-flight client requests and pushes load onto the remaining instances, which can turn a busy period into a cascading outage.
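To make the restart decision concrete, here is a minimal sketch of how an orchestrator might turn probe results into restarts. It is a simplified model, not Kubernetes' actual implementation: the function name `restarts_triggered` is hypothetical, and the threshold mirrors the `failureThreshold` probe setting, which Kubernetes sets to 3 by default.

```python
def restarts_triggered(probe_results, failure_threshold=3):
    """Count the restarts a simplified orchestrator would trigger.

    probe_results is a sequence of booleans (True means the probe passed).
    A restart happens after failure_threshold consecutive failures.
    """
    restarts = 0
    consecutive_failures = 0
    for passed in probe_results:
        if passed:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # the new container starts with a clean count
    return restarts


# A short blip is tolerated; sustained pool exhaustion is not.
print(restarts_triggered([True, False, True, False, False, True]))  # 0
print(restarts_triggered([True, False, False, False, True]))        # 1
```

With a deep liveness probe, every probe that lands during pool exhaustion counts as a failure, so a surge lasting a few probe periods is enough to cross the threshold.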

Why Python Does This

Python's asyncio tasks run on a single event loop, and every task that needs the database draws from the same connection pool. A health check that runs a real database query therefore competes with client requests for pool slots. When client requests saturate the pool, the probe either fails fast (as in the demo above) or waits for a connection until the orchestrator's probe timeout expires. Either way the service is flagged as unhealthy even though the process itself is fine. Instead of making deep system calls on every ping, health checks should distinguish liveness (is the event loop running?) from readiness (is the DB connected?), and should report a connection status that is refreshed periodically in the background rather than queried on every probe.
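The liveness/readiness split can be sketched with two small handlers. This is an illustration under assumptions, not a framework API: `HealthState`, `liveness`, and `readiness` are hypothetical names, and the handlers return plain `(status, message)` tuples instead of real HTTP responses. The 503 for "not ready" is a conventional choice; Kubernetes HTTP probes treat any status from 200 up to (but not including) 400 as success.

```python
import asyncio


class HealthState:
    """Hypothetical in-memory status shared by both probes."""

    def __init__(self):
        self.db_ok = False  # pessimistic until a background check succeeds


async def liveness(state):
    # Touches nothing outside the process: if this coroutine runs,
    # the event loop is responding.
    return 200, "alive"


async def readiness(state):
    # Reads a flag only; it never opens a database connection.
    if state.db_ok:
        return 200, "ready"
    return 503, "database unavailable"


async def main():
    state = HealthState()
    before = await asyncio.gather(liveness(state), readiness(state))
    state.db_ok = True  # a background checker would normally set this
    after = await asyncio.gather(liveness(state), readiness(state))
    return before, after


before, after = asyncio.run(main())
print(before)  # [(200, 'alive'), (503, 'database unavailable')]
print(after)   # [(200, 'alive'), (200, 'ready')]
```

Note that liveness stays at 200 in both cases: a database problem should take the instance out of rotation, not get it restarted.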

The Fix

Corrected pattern
Python
import asyncio

class DatabasePool:
    def __init__(self):
        self.available_connections = 1

    async def query(self):
        if self.available_connections <= 0:
            raise RuntimeError("Timeout: No database connections available!")
        self.available_connections -= 1
        await asyncio.sleep(0.5)
        self.available_connections += 1

db_pool = DatabasePool()
# Fix: Keep a cached check status rather than running heavy queries synchronously
db_healthy_cache = True

async def update_health_cache():
    global db_healthy_cache
    while True:
        try:
            await db_pool.query()
            db_healthy_cache = True
        except Exception:
            db_healthy_cache = False
        await asyncio.sleep(10) # Wait 10 seconds before the next background check

async def check_health():
    # Read the cached status immediately without blocking connection pools
    if db_healthy_cache:
        return "200 OK"
    return "500 Internal Error: DB unavailable"

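async def handle_traffic_surge():
    # Start the background refresher; without this the cache is never updated
    refresher = asyncio.create_task(update_health_cache())
    await asyncio.sleep(0.6)  # Let the first background check finish
    results = await asyncio.gather(
        db_pool.query(),   # User traffic takes the only connection
        check_health()     # Probe answers from the cache, not the pool
    )
    print(f"Results: {results}")
    refresher.cancel()  # Stop the refresher so the script can exit

asyncio.run(handle_traffic_surge())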

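This version prints Results: [None, '200 OK']: the health check answers from the cached flag even while the user request holds the only connection. Separating liveness from readiness, and keeping the liveness check lightweight (for example, checking only internal state or a cached result), stops health checks from competing for critical resources. The application no longer fails its own probes under load, so the orchestrator restarts it only when the process is genuinely stuck and can use readiness to decide where to route traffic.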
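A cached flag has one weakness: if the background task dies or hangs, the cache keeps reporting its last value forever. One possible refinement, sketched here under assumptions, is to timestamp each check and to bound each background ping with `asyncio.wait_for`. The names `CachedDbHealth`, `probe_once`, `fast_ping`, and `hung_ping` are hypothetical, and the intervals are scaled down so the demo finishes quickly.

```python
import asyncio
import time


class CachedDbHealth:
    """Hypothetical cache of the last DB check, with an age limit."""

    def __init__(self, max_age):
        self.max_age = max_age
        self.ok = False
        self.checked_at = float("-inf")  # nothing recorded yet

    def record(self, ok):
        self.ok = ok
        self.checked_at = time.monotonic()

    def is_ready(self):
        # A success only counts if it is recent enough.
        fresh = (time.monotonic() - self.checked_at) <= self.max_age
        return self.ok and fresh


async def probe_once(cache, ping, timeout):
    """Run one DB ping, bounded by a timeout, and record the outcome."""
    try:
        await asyncio.wait_for(ping(), timeout=timeout)
        cache.record(True)
    except Exception:  # includes the timeout raised by wait_for
        cache.record(False)


async def fast_ping():
    await asyncio.sleep(0.01)  # stand-in for a quick "SELECT 1"


async def hung_ping():
    await asyncio.sleep(5)  # stand-in for a query stuck behind a full pool


async def demo():
    cache = CachedDbHealth(max_age=0.2)
    outcomes = [cache.is_ready()]        # nothing recorded yet
    await probe_once(cache, fast_ping, timeout=0.1)
    outcomes.append(cache.is_ready())    # fresh success
    await asyncio.sleep(0.3)
    outcomes.append(cache.is_ready())    # success, but too old
    await probe_once(cache, hung_ping, timeout=0.05)
    outcomes.append(cache.is_ready())    # ping timed out
    return outcomes


outcomes = asyncio.run(demo())
print(outcomes)  # [False, True, False, False]
```

In a real service the age limit would be a small multiple of the refresh interval, so a single slow check does not flip readiness.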

How This Fails in Real Systems

During a promotional sale, the API pods in a Kubernetes cluster began restarting one after another. Heavy load had saturated the database connections, so the liveness probes, which pinged the database, timed out. The cluster terminated pods that were otherwise healthy, and the site was completely down for 45 minutes until the deep ping was removed.
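That failure mode can be reproduced in a rough simulation. The sketch below is not the incident's actual code: it models the pool as an `asyncio.Semaphore` in which callers wait for a free connection, and the pool size, request duration, and probe timeout are assumed values scaled down to run in about a second.

```python
import asyncio

POOL_SIZE = 2        # assumed pool size for the simulation
PROBE_TIMEOUT = 0.1  # stand-in for the orchestrator's probe timeout


async def user_request(pool):
    async with pool:  # hold a connection for the whole query
        await asyncio.sleep(0.3)


async def deep_probe(pool):
    async with pool:  # waits in line behind user traffic
        await asyncio.sleep(0.01)
    return "ok"


async def shallow_probe():
    return "ok"  # no pool access at all


async def probe_with_timeout(coro):
    try:
        return await asyncio.wait_for(coro, timeout=PROBE_TIMEOUT)
    except asyncio.TimeoutError:
        return "timeout"


async def simulate():
    pool = asyncio.Semaphore(POOL_SIZE)
    # Six requests on two connections keep the pool busy for about 0.9 s.
    traffic = [asyncio.create_task(user_request(pool)) for _ in range(6)]
    await asyncio.sleep(0)  # let the first requests take the connections
    deep, shallow = await asyncio.gather(
        probe_with_timeout(deep_probe(pool)),
        probe_with_timeout(shallow_probe()),
    )
    await asyncio.gather(*traffic)
    return deep, shallow


deep, shallow = asyncio.run(simulate())
print(deep, shallow)  # timeout ok
```

The process is equally healthy in both cases; only the probe that queues for a connection reports otherwise.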

Key Takeaway

When designing health checks, never execute blocking, synchronous, or resource-heavy external calls directly within liveness probes.
Common mistake: Developers implement "deep" health checks that run a resource-intensive operation (such as a full database query) on every probe, believing this gives a more accurate health status. Under high load it instead creates a self-inflicted denial-of-service risk.