API Gateway Design
An API Gateway centralizes cross-cutting concerns (TLS termination, JWT validation, rate limiting, request routing) so that downstream microservices implement none of them. It becomes a single point of failure, however, if it stores state, blocks on slow backends, or embeds business logic.
- An API Gateway centralizes TLS termination, JWT validation, rate limiting, and routing so downstream microservices implement none of these cross-cutting concerns independently.
- Netflix Zuul 2 was rewritten from Apache Tomcat (thread-per-connection) to Netty (non-blocking I/O) specifically because 100K concurrent connections required 100K threads at 512KB each — 50GB of RAM just for thread stacks. Non-blocking handles 100K connections with a CPU-sized thread pool.
- A Lambda authorizer response cached with TTL=300s at AWS API Gateway cuts authorizer invocations by roughly 3,000× (about 99.97%) when each client sends around 10 requests per second. Without caching, authorization at 10K RPS costs about $172.80/day, or roughly $5,184 per 30-day month, in Lambda request charges alone.
- Gateway business logic is the single most dangerous anti-pattern: a gateway that runs database queries or complex transformations becomes the bottleneck for every team and is impossible to deploy independently.
- Use the Backend-for-Frontend (BFF) pattern — separate gateway instances for mobile, web, and third-party partners — rather than a single gateway with all three clients' conflicting routing rules.
The Problem
A company with 15 microservices finds that each team independently implements JWT validation using different libraries, with different token expiry handling, different rate limiting thresholds, and different CORS headers. The security team discovers that 4 services have different CORS policies, 2 have no rate limiting at all, and 3 handle expired tokens differently (one returns 200 with an empty body). Every mobile app client must know the address of each service and manage authentication with each independently. Adding a new service means updating every mobile client. Adding IP allowlisting requires deploying to all 15 services. Centralization is needed — but done wrong, the gateway becomes a bottleneck for every team.
Core System Idea
An API Gateway is a reverse proxy that intercepts all inbound requests and applies cross-cutting concerns before routing to downstream services. Four design constraints determine whether a gateway helps or creates a new bottleneck:
- Statelessness: the gateway must hold no session state. JWT validation is done in-process using a shared public key (no network call per request). Rate limiting state lives in Redis, not in gateway memory. A stateless gateway can scale horizontally by adding instances behind a load balancer.
- Non-blocking I/O: gateway threads must never block waiting for a slow downstream service. Gateways built on event-driven, non-blocking I/O (Netflix Zuul 2 on Netty, Nginx, Envoy) handle 100K concurrent connections with a thread pool sized to CPU count. Thread-per-request gateways (old Zuul, Java Servlet-based) exhaust thread pools when downstream services slow down.
- No business logic: the gateway handles route configuration, auth, rate limiting, and observability, and never database queries, aggregation logic, or domain validation. Business logic in the gateway creates deployment coupling between the gateway team and every downstream team.
- Aggressive downstream timeouts: set a connect timeout of 50ms and a read timeout of 500ms–2s per downstream service. Without timeouts, one slow service exhausts the gateway's connection pool for all other routes.
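The non-blocking constraint can be illustrated with a small sketch. It is not gateway code: `call_backend` is a hypothetical stand-in for a downstream call, and `asyncio.sleep` simulates network wait. The point is only that one event-loop thread can hold many in-flight requests at once, because a waiting request does not occupy a thread.

```python
import asyncio
import time


async def call_backend(request_id: int, latency_s: float = 0.05) -> int:
    """Hypothetical downstream call; asyncio.sleep stands in for network I/O."""
    await asyncio.sleep(latency_s)  # yields the thread while "waiting"
    return request_id


async def serve(concurrent_requests: int) -> tuple[int, float]:
    """Run all requests concurrently on a single event-loop thread."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(call_backend(i) for i in range(concurrent_requests))
    )
    return len(results), time.perf_counter() - start


if __name__ == "__main__":
    handled, elapsed = asyncio.run(serve(1_000))
    print(f"{handled} requests in {elapsed:.2f}s on one thread")
```

Handled one after another on a blocking thread, 1,000 calls of 50ms each would take about 50 seconds; run concurrently on the event loop they finish in a small fraction of that.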
System Flow
The gateway validates the JWT and checks rate limits against Redis before routing to backend services, so every backend service shares a single auth and rate-limit enforcement point.
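The order of checks in this flow can be sketched as follows. Everything here is hypothetical: `Request`, `Gateway`, and the injected `verify_token` and `allow_request` callables are illustrative names, the rate-limit check would be Redis-backed in a real deployment, and a real gateway would proxy the request rather than return the upstream name.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Request:
    path: str
    token: Optional[str]
    client_id: str


@dataclass
class Gateway:
    # Both callables are stand-ins supplied by the caller (assumptions).
    verify_token: Callable[[str], bool]    # in-process JWT signature check
    allow_request: Callable[[str], bool]   # rate-limit check (Redis-backed in production)
    routes: Dict[str, str] = field(default_factory=dict)  # path prefix -> upstream name

    def handle(self, req: Request) -> tuple[int, str]:
        # 1. Authenticate before doing anything else.
        if req.token is None or not self.verify_token(req.token):
            return 401, "invalid or missing token"
        # 2. Enforce the rate limit for the authenticated client.
        if not self.allow_request(req.client_id):
            return 429, "rate limit exceeded"
        # 3. Route by path prefix.
        for prefix, upstream in self.routes.items():
            if req.path.startswith(prefix):
                return 200, upstream
        return 404, "no route"
```

Because the checks are injected, the gateway object itself holds no session or counter state, which matches the statelessness constraint.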
Real-World Examples (Indicative)
Netflix's original Zuul used a Tomcat thread-per-request model. At 100K concurrent requests, this required 100K threads — each JVM thread consumes ~512KB of stack space, meaning 50GB of RAM consumed entirely by thread stacks before any request processing. Zuul 2 was rewritten in 2016 on Netty's non-blocking event loop: the same 100K concurrent connections are handled with a thread pool sized to the number of CPU cores (typically 16–32 threads). The thread pool processes I/O events; threads never block on downstream network calls. Zuul 2 handles Netflix's 2M+ requests/second at the ingress tier with filter chain processing: inbound filters (authentication, request logging, rate limiting), endpoint filters (routing), and outbound filters (response modification, metrics). Dynamic filter updates push new routing rules to all Zuul instances without a restart.
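The thread-stack figure is simple arithmetic, reproduced here as a sketch. The 512KB stack size is the value quoted above, not a measured one, and the function name is illustrative.

```python
def thread_stack_memory_gib(threads: int, stack_kib: int = 512) -> float:
    """Memory reserved for thread stacks alone, in GiB."""
    return threads * stack_kib / (1024 * 1024)


if __name__ == "__main__":
    # Thread-per-connection: one stack per concurrent connection.
    print(f"100K threads: {thread_stack_memory_gib(100_000):.1f} GiB")
    # Event loop: one stack per worker thread, e.g. 32 on a 32-core host.
    print(f"32 threads:   {thread_stack_memory_gib(32) * 1024:.0f} MiB")
```

At 512KB per stack, 100K threads reserve about 48.8 GiB (roughly the 50GB quoted), while a 32-thread event loop reserves 16 MiB.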
AWS API Gateway (REST API) supports a Lambda authorizer that validates JWTs and returns an IAM policy document. Without caching, each request triggers a Lambda invocation at $0.0000002 per invocation: at 10,000 RPS, that is 864M invocations/day, or about $172.80/day (roughly $5,184 per 30-day month) in Lambda request charges alone for authorization. With AuthorizerResultTtlInSeconds=300, the authorizer's response is cached per token for 5 minutes. A user making requests at 10 RPS generates 1 Lambda invocation per 5 minutes instead of 10 per second: 3,000× fewer invocations, which would reduce the request charges to roughly $1.73/month if all traffic followed that pattern. The context object in the authorizer response passes decoded JWT claims to downstream Lambda functions via event.requestContext.authorizer, eliminating re-decoding at the service layer.
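The authorizer arithmetic can be checked with a few lines of Python. This is a sketch, not a billing calculator: it counts Lambda request charges only at the quoted $0.0000002 per invocation, assumes a 30-day month, and assumes every client sends a steady stream of requests with one token, so each cached result absorbs `per_client_rps * ttl_s` requests. The function and parameter names are illustrative.

```python
PRICE_PER_INVOCATION = 0.0000002  # USD per Lambda request, as quoted above
SECONDS_PER_DAY = 86_400


def monthly_authorizer_cost(
    total_rps: float, per_client_rps: float, ttl_s: int, days: int = 30
) -> float:
    """Lambda request charges for the authorizer; ttl_s=0 means no caching."""
    # With caching, one invocation serves every request a client sends within the TTL.
    requests_per_invocation = max(1.0, per_client_rps * ttl_s)
    invocations = total_rps * SECONDS_PER_DAY * days / requests_per_invocation
    return invocations * PRICE_PER_INVOCATION


if __name__ == "__main__":
    uncached = monthly_authorizer_cost(10_000, per_client_rps=10, ttl_s=0)
    cached = monthly_authorizer_cost(10_000, per_client_rps=10, ttl_s=300)
    print(f"uncached: ${uncached:,.2f}/month, cached: ${cached:,.2f}/month")
```

The saving depends heavily on how many requests each token makes within the TTL; clients that send one request every few minutes see almost no benefit from the cache.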
Kong (Nginx + OpenResty + LuaJIT) processes rate limiting, OAuth2, and request logging inline with the Nginx event loop using Lua plugins — no separate process spawned per plugin, no cross-process IPC. The rate limiting plugin uses Redis sorted sets with an atomic Lua script to implement sliding window counting; Kong adds ~0.5–1ms of gateway processing overhead per request at 1M+ RPM production deployments. Kong's declarative configuration (decK) treats route, plugin, and consumer configurations as YAML files in git, synchronized to the Kong database via CI/CD — the gateway configuration is version-controlled and deployed with the same GitOps workflow as application code.
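Sliding-window counting over a sorted collection of timestamps can be sketched in memory as follows. This is not Kong's plugin code: the class name is hypothetical, a `deque` stands in for the Redis sorted set, and the comments name the Redis commands each step would roughly correspond to. In Redis, the three steps would need to run atomically (for example inside one Lua script); this single-threaded sketch sidesteps that.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any trailing `window_s` seconds."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Evict timestamps that fell out of the window (ZREMRANGEBYSCORE analogue).
        while hits and hits[0] <= now - self.window_s:
            hits.popleft()
        # Count what remains in the window (ZCARD analogue).
        if len(hits) >= self.limit:
            return False
        # Record this request (ZADD analogue).
        hits.append(now)
        return True
```

Passing `now` explicitly keeps the sketch deterministic; a caller would typically supply `time.time()`.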
Anti-Patterns
Writing database queries, complex data transformations, or domain validation inside gateway filter code. The gateway processes every request — a slow database query in a gateway filter adds latency to every route, not just the routes that need that data. Gateway code cannot be deployed or rolled back independently of the routing configuration.
Failing to set connect and read timeouts for downstream services. A payment service that degrades to 30-second responses holds gateway connections for 30 seconds each. At 1,000 concurrent users, 1,000 gateway connections are blocked, starving all other routes.
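A read timeout on the downstream call can be sketched with `asyncio.wait_for`. The `slow_backend` and `proxy` functions are hypothetical stand-ins, and the 0.5s default is just the lower end of the read-timeout range suggested earlier.

```python
import asyncio


async def slow_backend(delay_s: float) -> str:
    """Hypothetical downstream service that takes delay_s to answer."""
    await asyncio.sleep(delay_s)
    return "ok"


async def proxy(delay_s: float, read_timeout_s: float = 0.5) -> tuple[int, str]:
    """Forward a request, giving up once the read timeout elapses."""
    try:
        body = await asyncio.wait_for(slow_backend(delay_s), timeout=read_timeout_s)
        return 200, body
    except asyncio.TimeoutError:
        # Fail fast so the connection is released for other routes.
        return 504, "upstream timed out"


if __name__ == "__main__":
    print(asyncio.run(proxy(0.01)))   # healthy backend
    print(asyncio.run(proxy(30.0)))   # degraded backend, cut off at the timeout
```

With the timeout in place, a backend that degrades to 30-second responses ties up each gateway connection for at most the timeout rather than the full 30 seconds.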
Using a single gateway configuration for web, mobile, and third-party API partners. Mobile clients need smaller payloads and different error codes than web clients; third-party partners need API versioning and stricter rate limits. A single gateway serves none of these well. Use the Backend-for-Frontend (BFF) pattern: separate gateway instances per client type, each optimized for its consumer.
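Aggregating responses from five services synchronously in the gateway before returning to the client. If any one of the five services fails or is slow, the entire response fails or is slow. Failures compound: the aggregate success rate is the product of the downstream success rates, so the failure rate is approximately the sum of the downstream failure rates when each is small (five services at 99.9% each yield about 99.5%). Latency is the sum of all downstream latencies when the calls are made sequentially, and that of the slowest service when they are made in parallel.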
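The cost of synchronous aggregation can be worked through numerically. The sketch below assumes the downstream services fail independently; the availability and latency figures are made-up inputs for illustration.

```python
from math import prod
from typing import Sequence


def aggregate_availability(availabilities: Sequence[float]) -> float:
    """All calls must succeed, so independent success rates multiply."""
    return prod(availabilities)


def sequential_latency_ms(latencies_ms: Sequence[float]) -> float:
    """Calls made one after another: latencies add up."""
    return sum(latencies_ms)


def parallel_latency_ms(latencies_ms: Sequence[float]) -> float:
    """Calls fanned out in parallel: the slowest call dominates."""
    return max(latencies_ms)


if __name__ == "__main__":
    availability = aggregate_availability([0.999] * 5)
    latencies = [20, 35, 50, 80, 400]  # illustrative per-service latencies in ms
    print(f"availability: {availability:.4%}")
    print(f"sequential: {sequential_latency_ms(latencies)} ms")
    print(f"parallel:   {parallel_latency_ms(latencies)} ms")
```

Even when the calls are fanned out in parallel, one slow service (400ms here) sets the response time for the whole aggregate.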
Design Tradeoffs
| Dimension | Centralized API Gateway | Direct Client-to-Service |
|---|---|---|
| Cross-cutting concerns | Auth, rate limiting, logging implemented once — consistent across all services | Each service implements independently; inevitable divergence in behavior |
| Latency | Extra network hop: 0.5–2ms gateway processing overhead | Zero gateway overhead; direct connection |
| SPOF risk | Gateway must be deployed multi-zone HA; single misconfiguration affects all routes | No gateway SPOF; each service is individually exposed with its own failure domain |
| Team autonomy | Gateway config is shared; changes require gateway team coordination | Teams deploy and configure their own endpoints completely independently |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Running a microservices architecture with multiple public-facing services requiring consistent auth, rate limiting, and CORS enforcement | Running a simple monolithic architecture where a single process handles all client types — a gateway adds latency with no organizational benefit |
| Supporting multiple client types (mobile, web, IoT) with different payload and authentication requirements | Ultra-low-latency applications where even 1ms of gateway overhead violates the SLA — use direct TCP or co-located service mesh instead |
| Security teams need a single enforcement point for auth policies, IP allowlisting, and TLS configuration | Engineering team lacks the operational capacity to monitor and scale an additional infrastructure tier |