Load Balancing Architecture
- Layer 4 (L4) load balancers route traffic at the transport layer (IP/TCP) with extreme throughput and low CPU usage.
- Layer 7 (L7) load balancers inspect application-layer data (HTTP headers, cookies, paths) for intelligent routing and SSL termination.
- Sticky sessions introduce cascading failure risks; stateless backends with distributed session stores are preferred.
- Active-active setups with DNS-level routing prevent the load balancer itself from becoming a single point of failure (SPOF).
The Problem
As application traffic grows, a single server instance inevitably runs out of CPU, memory, or network bandwidth. Scaling up the server (vertical scaling) eventually hits a hard physical limit and still leaves a single point of failure. Scaling out (horizontal scaling) to multiple servers raises a new challenge: how do you distribute incoming client requests evenly across them? Without an intelligent, highly available load balancing architecture, some servers sit idle while others crash under heavy load, and deployments drop active user connections, causing visible errors.
Core System Idea
A load balancing architecture acts as the traffic cop of your infrastructure, distributing incoming client requests across a pool of healthy backend servers.
To scale to millions of concurrent connections, modern architectures use a multi-tiered approach. At the edge, a Layer 4 (L4) Load Balancer (operating at the TCP/UDP layer) receives the initial traffic. It is extremely fast and CPU-efficient because it only inspects IP addresses and TCP ports, routing packets directly to a pool of Layer 7 (L7) Load Balancers (operating at the HTTP/HTTPS layer).
The L7 load balancers terminate SSL/TLS connections, inspect HTTP headers, cookies, and paths, and make intelligent routing decisions (e.g., routing /api/orders to the Order Service and /static/* to an object store). They also perform continuous health checks on backend instances, automatically removing dead nodes from the rotation.
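The L7 routing and health-check behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular load balancer is implemented: the `Backend`, `Pool`, and `PathRouter` names, the longest-prefix-match rule, and the round-robin pick are all assumptions made for the example, and the `healthy` flag stands in for the result of a real health check.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Backend:
    name: str
    healthy: bool = True  # would be updated by a periodic health check


@dataclass
class Pool:
    backends: list
    _counter: count = field(default_factory=count)

    def pick(self):
        """Round-robin over the backends currently marked healthy."""
        live = [b for b in self.backends if b.healthy]
        if not live:
            return None  # no healthy backend left in this pool
        return live[next(self._counter) % len(live)]


class PathRouter:
    def __init__(self):
        self.routes = []  # list of (path prefix, pool)

    def add_route(self, prefix, pool):
        self.routes.append((prefix, pool))
        # Longest prefix first, so the most specific route matches first.
        self.routes.sort(key=lambda r: len(r[0]), reverse=True)

    def route(self, path):
        for prefix, pool in self.routes:
            if path.startswith(prefix):
                return pool.pick()
        return None  # no route matched


# Hypothetical pools mirroring the /api/orders and /static/* example.
orders = Pool([Backend("orders-1"), Backend("orders-2")])
static = Pool([Backend("object-store")])
router = PathRouter()
router.add_route("/api/orders", orders)
router.add_route("/static/", static)
```

Marking a backend unhealthy (`orders.backends[0].healthy = False`) takes it out of rotation on the next call to `route`, which is the "automatic removal of dead nodes" behaviour in miniature.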
System Flow
A multi-tiered load balancing architecture routes traffic from DNS through high-throughput L4 balancers to intelligent L7 balancers, which distribute requests to backend instances.
Real-World Examples (Indicative)
AWS Network Load Balancer (NLB) operates at L4 with sub-100μs latency in DSR (Direct Server Return) mode, forwarding TCP packets at line rate without terminating connections. AWS ALB adds ~5ms overhead for TLS termination but enables HTTP/2 multiplexing, gRPC health checks, and path-based routing (/api/* → ECS task group, /static/* → S3 bucket). Large Kubernetes deployments on EKS use NLB as the outer tier routing to ALB Ingress Controllers, separating high-throughput L4 from intelligent L7 routing.
Stack Overflow serves 1.5+ billion page views/month through two HAProxy 2.x instances in active-passive configuration with keepalived VRRP. The primary holds a virtual IP and the standby takes over in under 1 second on failure. HAProxy is tuned with maxconn 60000 and per-server weights proportional to hardware capability—Dell R730xd nodes with 384GB RAM are assigned 4× the weight of lighter servers, so the least-connections algorithm naturally sends more traffic to the more capable hardware.
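To see why weights and least-connections combine this way, here is a simplified sketch of weighted least-connections selection. It is not HAProxy's actual implementation; the `Server` class, the `pick_least_conn` and `assign` helpers, and the tie-breaking rule (prefer the heavier server) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int
    active: int = 0  # connections currently open on this server


def pick_least_conn(servers):
    """Return the server with the lowest active/weight ratio.

    Ties go to the server with the higher weight (an assumed rule).
    """
    return min(servers, key=lambda s: (s.active / s.weight, -s.weight))


def assign(servers, n):
    """Open n long-lived connections, one at a time."""
    for _ in range(n):
        pick_least_conn(servers).active += 1


big = Server("heavy-node", weight=4)
light = Server("light-node", weight=1)
assign([big, light], 10)
```

With weights of 4 and 1 and ten connections that stay open, the sketch ends with 8 connections on the heavy node and 2 on the light one, matching the 4:1 weight ratio.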
Cloudflare's internal L4 load balancer "Unimog" uses XDP (eXpress Data Path) and eBPF programs loaded directly into the NIC driver to forward packets at ~1 million packets/second per CPU core, bypassing the Linux kernel network stack entirely. This achieves 100Gbps+ throughput on commodity servers. Consistent hashing on the 5-tuple (src IP, src port, dst IP, dst port, protocol) ensures the same TCP connection always reaches the same L7 backend without a centralized session table.
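The idea of hashing the 5-tuple so that a connection keeps landing on the same backend, with no shared session table, can be illustrated with a generic hash ring. This is a textbook-style sketch and not Unimog's actual data structure; the `HashRing` class, the use of SHA-256, the virtual-node count, and the backend names are assumptions for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # First 8 bytes of SHA-256 as an integer position on the ring.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        # Each backend owns `vnodes` points on the ring to even out load.
        self._ring = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def lookup(self, five_tuple):
        """Map (src IP, src port, dst IP, dst port, protocol) to a backend."""
        key = "|".join(str(part) for part in five_tuple)
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["l7-a", "l7-b", "l7-c"])
conn = ("203.0.113.7", 51512, "198.51.100.1", 443, "tcp")
backend = ring.lookup(conn)  # every packet of this connection maps here
```

Because the mapping depends only on the packet's own fields, any L4 node holding the same backend list reaches the same answer independently. Removing a backend only remaps the connections that were assigned to it; the rest keep their backend.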
Anti-Patterns
- Binding a user's session to a specific backend server. If that server dies, the user loses their session data; if one server gets hot, it causes uneven load that cannot self-correct.
- Shutting down backend instances during rolling deployments without allowing active connections to finish processing, resulting in dropped requests and user-facing errors.
- Running a single load balancer instance without a hot-standby or active-active peer, meaning a single hardware or software failure takes down the entire application.
- Configuring health checks to run too frequently (e.g., every 500ms) with heavy database queries, effectively DDoS-ing your own backend servers.
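The rolling-deployment anti-pattern above is usually avoided with connection draining: the balancer stops sending new requests to an instance, then waits for in-flight requests to finish before the instance is shut down. A minimal sketch of that bookkeeping follows; the `DrainingBackend` class and its method names are hypothetical, and a real balancer would tie this to its health-check and deployment tooling.

```python
import threading


class DrainingBackend:
    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_acquire(self):
        """Called before forwarding a request; returns False once draining."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Called when a forwarded request has completed."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout):
        """Refuse new requests and wait for in-flight ones to finish.

        Returns True if the backend went idle within `timeout` seconds.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout
            )
```

A deployment script would call `drain(timeout=...)` and stop the instance only once it returns `True`, or once the timeout it is willing to tolerate has expired.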
Design Tradeoffs
| Dimension | Layer 4 Load Balancing | Layer 7 Load Balancing |
|---|---|---|
| Throughput | Line-rate packet forwarding; Cloudflare Unimog handles 100Gbps+ using XDP/eBPF, bypassing the kernel network stack | Lower throughput; requires a TLS handshake per connection and HTTP parsing per request, with TLS termination adding ~5ms of overhead |
| Routing intelligence | Routes on IP and port only; no awareness of HTTP path, headers, cookies, gRPC methods, or application content | Full HTTP/2, gRPC, path, header, cookie, and query string routing; enables A/B testing and canary splits |
| SSL termination | Pass-through only; TLS session reaches backend servers unchanged, requiring each server to hold the certificate | Terminates TLS centrally; enables centralized certificate management, HTTP/2 multiplexing, and header injection |
Best Practices
Expose a lightweight health check endpoint (e.g., /healthz) that returns a simple 200 OK without performing expensive database joins or third-party API calls.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| You run horizontally scaled backend applications that need even traffic distribution and high availability. | You run a single, monolithic server instance that can easily handle your entire traffic volume. |
| You need to perform zero-downtime rolling deployments by gracefully routing traffic away from terminating nodes. | Your application has strict, stateful in-memory requirements that cannot be externalized to a shared cache. |
| You need to route traffic intelligently based on URL paths, headers, or client device types. | You are building a simple, internal-only tool with minimal traffic and no high-availability requirements. |