Load Balancing Architecture

TL;DR
  • Layer 4 (L4) load balancers route traffic at the transport layer (IP/TCP) with extreme throughput and low CPU usage.
  • Layer 7 (L7) load balancers inspect application-layer data (HTTP headers, cookies, paths) for intelligent routing and SSL termination.
  • Sticky sessions introduce cascading failure risks; stateless backends with distributed session stores are preferred.
  • Active-active setups with DNS-level routing prevent the load balancer itself from becoming a single point of failure (SPOF).

The Problem

As application traffic grows, a single server instance inevitably runs out of CPU, memory, or network bandwidth. Simply scaling up the server (vertical scaling) hits a hard physical limit and introduces a single point of failure. When you scale out (horizontal scaling) to multiple servers, you face a new challenge: how do you distribute incoming client requests evenly across these servers? Without an intelligent, highly available load balancing architecture, some servers will sit idle while others crash under heavy load, and deployments will drop active user connections, causing visible errors.

Core System Idea

A load balancing architecture acts as the traffic cop of your infrastructure, distributing incoming client requests across a pool of healthy backend servers.

To scale to millions of concurrent connections, modern architectures use a multi-tiered approach. At the edge, a Layer 4 (L4) Load Balancer (operating at the TCP/UDP layer) receives the initial traffic. It is extremely fast and CPU-efficient because it only inspects IP addresses and TCP ports, routing packets directly to a pool of Layer 7 (L7) Load Balancers (operating at the HTTP/HTTPS layer).

The L7 load balancers terminate SSL/TLS connections, inspect HTTP headers, cookies, and paths, and make intelligent routing decisions (e.g., routing /api/orders to the Order Service and /static/* to an object store). They also perform continuous health checks on backend instances, automatically removing dead nodes from the rotation.
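
The path-based routing decision described above can be sketched as a longest-prefix match over a route table. This is a minimal illustration under stated assumptions, not how any particular load balancer implements it: the route table, the pool names (`order-service`, `object-store`, `default-web`), and the `pick_pool` function are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical route table: path prefix -> backend pool name.
ROUTES: Dict[str, str] = {
    "/api/orders": "order-service",
    "/static/": "object-store",
    "/": "default-web",
}


def pick_pool(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the pool whose prefix is the longest match for `path`.

    Returns None when no prefix matches at all.
    """
    best = None
    for prefix in routes:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return routes[best] if best is not None else None
```

With this table, `pick_pool("/api/orders/42")` selects `order-service` and `pick_pool("/about")` falls through to `default-web`. Real L7 balancers typically also match on host, headers, and cookies; this sketch covers only the path.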

System Flow

flowchart TD
    Client[Client Traffic] --> DNS[Anycast DNS]
    DNS -- "Route to Nearest" --> L4[L4 Load Balancer]
    L4 -- "TCP Round Robin" --> L7_A[L7 Load Balancer A]
    L4 -- "TCP Round Robin" --> L7_B[L7 Load Balancer B]
    L7_A -- "Least Connections" --> App1[App Instance 1]
    L7_A -- "Least Connections" --> App2[App Instance 2]
    L7_B -- "Graceful Drain" --> App3[App Instance 3]

A multi-tiered load balancing architecture routes traffic from DNS through high-throughput L4 balancers to intelligent L7 balancers, which distribute requests to backend instances.

Real-World Examples (Indicative)

AWS NLB + ALB Tiered Architecture

AWS Network Load Balancer (NLB) operates at L4, forwarding TCP packets at line rate without terminating connections; the latency figure cited for it is under 100μs in DSR (Direct Server Return) mode. AWS Application Load Balancer (ALB) adds roughly 5ms of overhead for TLS termination but enables HTTP/2 multiplexing, gRPC health checks, and path-based routing (for example, /api/* to an ECS task group and /static/* to an S3 bucket). Large Kubernetes deployments on EKS can place NLB as the outer tier in front of ALB Ingress Controllers, separating high-throughput L4 forwarding from intelligent L7 routing.

HAProxy at Stack Overflow

Stack Overflow serves 1.5+ billion page views/month through two HAProxy 2.x instances in an active-passive configuration with keepalived VRRP. The primary holds a virtual IP and the standby takes over in under 1 second on failure. HAProxy is tuned with maxconn 60000 and per-server weights proportional to hardware capability: Dell R730xd nodes with 384GB RAM are assigned 4× the weight of lighter servers, so the least-connections algorithm naturally sends more traffic to the more capable hardware.
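
To see why weights and least-connections combine this way, here is a minimal sketch of weighted least-connections selection. It is an illustration of the general idea, not HAProxy's actual implementation: the `Backend` class, the `pick_backend` function, and the rule "choose the lowest ratio of active connections to weight" are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backend:
    name: str
    weight: int      # relative capacity (hypothetical values)
    active: int = 0  # connections currently in flight


def pick_backend(backends: List[Backend]) -> Backend:
    """Pick the backend with the fewest active connections per unit of weight.

    Ties go to the backend that appears first in the list.
    """
    return min(backends, key=lambda b: b.active / b.weight)


# Simulate ten long-lived connections arriving with none finishing.
big = Backend("big", weight=4)
small = Backend("small", weight=1)
for _ in range(10):
    pick_backend([big, small]).active += 1
```

After the loop, `big` holds 8 connections and `small` holds 2, so the 4:1 weighting is reflected in how the load settles.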

Cloudflare Unimog (XDP/eBPF L4)

Cloudflare's internal L4 load balancer "Unimog" uses XDP (eXpress Data Path) and eBPF programs loaded directly into the NIC driver to forward packets at ~1 million packets/second per CPU core, bypassing the Linux kernel network stack entirely. This achieves 100Gbps+ throughput on commodity servers. Consistent hashing on the 5-tuple (src IP, src port, dst IP, dst port, protocol) ensures the same TCP connection always reaches the same L7 backend without a centralized session table.
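
The idea of mapping a connection's 5-tuple to a backend without a shared session table can be sketched with rendezvous (highest-random-weight) hashing, which is one form of consistent hashing. This is not Unimog's actual mechanism, which the text above does not detail; the `pick_backend` function, the backend names, and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib
from typing import Sequence, Tuple

FiveTuple = Tuple[str, int, str, int, str]  # src IP, src port, dst IP, dst port, protocol


def pick_backend(flow: FiveTuple, backends: Sequence[str]) -> str:
    """Score every backend against the flow and return the highest scorer.

    The result depends only on the flow and the backend set, so any node
    running this function reaches the same answer with no shared state.
    """
    key = "|".join(str(part) for part in flow)

    def score(backend: str) -> int:
        digest = hashlib.sha256(f"{key}|{backend}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return max(backends, key=score)


flow = ("203.0.113.7", 51514, "198.51.100.10", 443, "tcp")
pool = ["l7-a", "l7-b", "l7-c"]
chosen = pick_backend(flow, pool)
```

A useful property of this scheme is that removing a backend other than `chosen` does not move the flow, because the winning score is unchanged. Only flows that were mapped to the removed backend get reassigned.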

Anti-Patterns

Sticky Sessions (Session Affinity)

Binding a user's session to a specific backend server. If that server dies, the user loses their session data; if one server gets hot, it causes uneven load that cannot self-correct.
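
The alternative, a stateless backend with an external session store, can be sketched as follows. The in-memory `SessionStore` here is only a stand-in for a shared distributed store, and `AppServer` and `add_to_cart` are hypothetical names used for illustration.

```python
import copy
from typing import Dict, List


class SessionStore:
    """Stand-in for a shared session store reachable by every backend (assumption)."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def get(self, session_id: str) -> dict:
        # Return a copy so callers cannot mutate stored state by accident.
        return copy.deepcopy(self._data.get(session_id, {}))

    def put(self, session_id: str, session: dict) -> None:
        self._data[session_id] = copy.deepcopy(session)


class AppServer:
    """Keeps no session state of its own; everything lives in the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> List[str]:
        session = self.store.get(session_id)
        session.setdefault("cart", []).append(item)
        self.store.put(session_id, session)
        return session["cart"]


store = SessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)
server_a.add_to_cart("s1", "book")
cart = server_b.add_to_cart("s1", "pen")  # a different server sees the same cart
```

Because either server can handle any request for session `s1`, the load balancer is free to route each request independently, and losing one server does not lose the session.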

Neglecting Connection Draining

Shutting down backend instances during rolling deployments without allowing active connections to finish processing, resulting in dropped requests and user-facing errors.
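
A minimal sketch of what draining involves: stop admitting new requests, then wait for in-flight requests to finish or for a timeout to expire. The `DrainableBackend` class and its method names are hypothetical; a real deployment would rely on the load balancer's or orchestrator's own draining feature rather than code like this.

```python
import threading
import time


class DrainableBackend:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False

    def try_begin_request(self) -> bool:
        """Admit a request unless the backend is draining."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def drain(self, timeout_s: float, poll_s: float = 0.01) -> bool:
        """Refuse new requests, then wait for in-flight ones to complete.

        Returns True if the backend emptied before the timeout, else False.
        """
        with self._lock:
            self._draining = True
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll_s)
        with self._lock:
            return self._in_flight == 0
```

A deployment script would call `drain` with its chosen timeout and terminate the instance only after it returns, treating a `False` result as "some requests were still running when the timeout expired".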

Single Load Balancer Instance (SPOF)

Running a single load balancer instance without a hot-standby or active-active peer, meaning a single hardware or software failure takes down the entire application.

Aggressive Health Checks

Configuring health checks to run too frequently (e.g., every 500ms) with heavy database queries, effectively DDoS-ing your own backend servers.
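
One common way to keep checks infrequent without reacting to a single blip is to require several consecutive failures before removing a node and several consecutive successes before restoring it. The sketch below illustrates that bookkeeping only; the `HealthTracker` class and the threshold values (`fall=3`, `rise=2`) are assumptions chosen for the example, not recommended settings.

```python
class HealthTracker:
    """Tracks one backend's health from a stream of probe results."""

    def __init__(self, fall: int = 3, rise: int = 2) -> None:
        self.fall = fall      # consecutive failures needed to mark unhealthy
        self.rise = rise      # consecutive successes needed to mark healthy
        self.healthy = True
        self._streak = 0      # consecutive probes disagreeing with current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


tracker = HealthTracker(fall=3, rise=2)
probes = [False, False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
```

In this run the two early failures are forgiven by the success that follows, the node is removed only after three failures in a row, and it returns after two successes in a row.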

Design Tradeoffs

| Dimension | Layer 4 Load Balancing | Layer 7 Load Balancing |
| --- | --- | --- |
| Throughput | Line-rate packet forwarding; Cloudflare Unimog handles 100Gbps+ using XDP/eBPF, bypassing the kernel network stack | Lower throughput; requires HTTP parse and TLS handshake per connection, adding ~5ms overhead per request |
| Routing intelligence | Routes on IP and port only; no awareness of HTTP path, headers, cookies, gRPC methods, or application content | Full HTTP/2, gRPC, path, header, cookie, and query string routing; enables A/B testing and canary splits |
| SSL termination | Pass-through only; TLS session reaches backend servers unchanged, requiring each server to hold the certificate | Terminates TLS centrally; enables centralized certificate management, HTTP/2 multiplexing, and header injection |

Best Practices

Use Least-Connections Routing

For variable-latency workloads (e.g., requests that take varying times to process), use least-connections instead of round-robin to prevent overloading specific servers.

Implement Connection Draining

Configure a draining timeout (typically 30-90 seconds) in your CI/CD pipeline to allow active requests to complete gracefully before terminating an instance.

Deploy Active-Active with Keepalived

Run multiple load balancer instances in an active-active configuration, using Anycast DNS or VRRP (for example, via keepalived) so that failover happens automatically, without manual intervention.

Offload SSL/TLS Termination

Terminate SSL/TLS at the L7 load balancer to relieve backend application servers of the heavy cryptographic CPU overhead.

Keep Health Checks Lightweight

Use a dedicated, fast endpoint (e.g., /healthz) that returns a simple 200 OK without performing expensive database joins or third-party API calls.

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| You run horizontally scaled backend applications that need even traffic distribution and high availability. | You run a single, monolithic server instance that can easily handle your entire traffic volume. |
| You need to perform zero-downtime rolling deployments by gracefully routing traffic away from terminating nodes. | Your application has strict, stateful in-memory requirements that cannot be externalized to a shared cache. |
| You need to route traffic intelligently based on URL paths, headers, or client device types. | You are building a simple, internal-only tool with minimal traffic and no high-availability requirements. |