Backend Architectures · #29

API Gateway Design

An API Gateway centralizes cross-cutting concerns (TLS termination, JWT validation, rate limiting, request routing) so that downstream microservices implement none of them. The gateway itself becomes a single point of failure if it stores state, blocks on slow backends, or embeds business logic.

Published May 29, 2026 · By MortalApps · 5 min read · 965 words

TL;DR

An API Gateway centralizes TLS termination, JWT validation, rate limiting, and routing so downstream microservices implement none of these cross-cutting concerns independently.
Netflix Zuul 2 was rewritten from Apache Tomcat (thread-per-connection) to Netty (non-blocking I/O) specifically because 100K concurrent connections required 100K threads at 512KB each — 50GB of RAM just for thread stacks. Non-blocking handles 100K connections with a CPU-sized thread pool.
A Lambda authorizer response cached with TTL=300s at AWS API Gateway cuts authorizer invocations by about 99.97% (3,000×) for a client sending 10 RPS, compared to per-request invocation. Without caching, auth at 10K RPS costs about $172.80/day (roughly $5,184 per 30-day month) in Lambda request charges.
Gateway business logic is the single most dangerous anti-pattern: a gateway that runs database queries or complex transformations becomes the bottleneck for every team and is impossible to deploy independently.
Use the Backend-for-Frontend (BFF) pattern — separate gateway instances for mobile, web, and third-party partners — rather than a single gateway with all three clients' conflicting routing rules.

The Problem

A company with 15 microservices finds that each team independently implements JWT validation using different libraries, with different token expiry handling, different rate limiting thresholds, and different CORS headers. The security team discovers that 4 services have different CORS policies, 2 have no rate limiting at all, and 3 handle expired tokens differently (one returns 200 with an empty body). Every mobile app client must know the address of each service and manage authentication with each independently. Adding a new service means updating every mobile client. Adding IP allowlisting requires deploying to all 15 services. Centralization is needed — but done wrong, the gateway becomes a bottleneck for every team.

Core System Idea

An API Gateway is a reverse proxy that intercepts all inbound requests and applies cross-cutting concerns before routing to downstream services. Four design constraints determine whether a gateway helps or creates a new bottleneck:

(1) Statelessness: the gateway must hold no session state. JWT validation is done in-process using a shared public key (no network call per request). Rate limiting state lives in Redis, not in gateway memory. A stateless gateway can scale horizontally by adding instances behind a load balancer.

(2) Non-blocking I/O: gateway threads must never block waiting for a slow downstream service. Gateways built on event-driven, non-blocking I/O (Netflix Zuul 2 on Netty, Nginx, Envoy) handle 100K concurrent connections with a thread pool sized to CPU count. Thread-per-request gateways (old Zuul, Java Servlet-based) exhaust thread pools when downstream services slow down.

(3) No business logic: the gateway owns route configuration, auth, rate limiting, and observability, and never database queries, aggregation logic, or domain validation. Business logic in the gateway creates deployment coupling between the gateway team and every downstream team.

(4) Aggressive downstream timeouts: set a connect timeout of 50ms and a read timeout of 500ms–2s per downstream service. Without timeouts, one slow service exhausts the gateway's connection pool for all other routes.
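A minimal sketch of the statelessness constraint: verifying a token entirely in-process, with no session lookup and no network call per request. To stay within the Python standard library it uses HS256 (a shared HMAC secret) as a stand-in for the public-key verification described above; a real gateway would more likely verify an asymmetric algorithm such as RS256 through a JWT library. The names `sign_jwt`, `verify_jwt`, and `SECRET` are illustrative assumptions, not part of any gateway product.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET = b"demo-secret"  # assumption: stand-in for a cached verification key


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _signature(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def sign_jwt(claims: dict, secret: bytes = SECRET) -> str:
    """Build an HS256 token (used here only to produce sample input)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = _signature(f"{header}.{payload}", secret)
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_jwt(token: str, secret: bytes = SECRET,
               now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if signature and expiry are valid, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        # Refuse "none" and any algorithm the gateway did not expect.
        if not isinstance(header, dict) or header.get("alg") != "HS256":
            return None
        expected = _signature(f"{header_b64}.{payload_b64}", secret)
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError:
        # Covers malformed structure, bad base64, and bad JSON.
        return None
    if not isinstance(claims, dict):
        return None
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        return None  # expired or missing expiry: reject
    return claims
```

Because the function depends only on the token and a key held in memory, any gateway instance can validate any request, which is what allows instances to be added behind a load balancer without shared session state.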

System Flow

flowchart TD
    A["Client Request"] --> B["API Gateway"]
    B --> C["Auth and Rate Limit Check"]
    C --> D["Redis Cache"]
    D --> C
    C -- "Allowed" --> E["Route Request"]
    E --> F["User Service"]
    E --> G["Order Service"]
    E --> H["Inventory Service"]
    C -- "Rejected" --> I["401 or 429 Response"]

The gateway validates JWT and checks rate limits against Redis before routing to backend services — all three services share a single auth and rate-limit enforcement point.

Real-World Examples (Indicative)

Netflix Zuul 2 — thread model rewrite

Netflix's original Zuul used a Tomcat thread-per-request model. At 100K concurrent requests, this required 100K threads — each JVM thread consumes ~512KB of stack space, meaning 50GB of RAM consumed entirely by thread stacks before any request processing. Zuul 2 was rewritten in 2016 on Netty's non-blocking event loop: the same 100K concurrent connections are handled with a thread pool sized to the number of CPU cores (typically 16–32 threads). The thread pool processes I/O events; threads never block on downstream network calls. Zuul 2 handles Netflix's 2M+ requests/second at the ingress tier with filter chain processing: inbound filters (authentication, request logging, rate limiting), endpoint filters (routing), and outbound filters (response modification, metrics). Dynamic filter updates push new routing rules to all Zuul instances without a restart.
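The thread-stack arithmetic can be checked with a few lines. This is a back-of-the-envelope sketch that takes the ~512KB-per-thread figure from the text as given (actual JVM stack size depends on platform and the `-Xss` setting); `thread_stack_bytes` is a hypothetical helper name.

```python
def thread_stack_bytes(threads: int, stack_kib: int = 512) -> int:
    """Memory reserved for thread stacks alone, in bytes."""
    return threads * stack_kib * 1024


blocking = thread_stack_bytes(100_000)  # one thread per concurrent request
event_loop = thread_stack_bytes(32)     # pool sized to CPU cores

print(f"thread-per-request: {blocking / 2**30:.1f} GiB")  # 48.8 GiB, i.e. roughly 50GB
print(f"event loop:         {event_loop / 2**20:.0f} MiB")  # 16 MiB
```

The gap of three orders of magnitude comes entirely from the thread count, which is why the rewrite targeted the concurrency model rather than per-request processing cost.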

AWS API Gateway with cached Lambda authorizer

AWS API Gateway (REST API) supports a Lambda authorizer that validates JWTs and returns an IAM policy document. Without caching, each request triggers a Lambda invocation at $0.0000002/invocation: at 10,000 RPS, that is 864M invocations/day, or about $172.80/day (roughly $5,184 per 30-day month) in Lambda request charges alone for authorization. With AuthorizerResultTtlInSeconds=300, the authorizer's response is cached per token for 5 minutes. A user making requests at 10 RPS generates 1 Lambda invocation per 5 minutes instead of 10 per second, which is 3,000× fewer invocations; if all traffic follows that pattern, the request charge falls to about $1.73/month for the same traffic. The context object in the authorizer response passes decoded JWT claims to downstream Lambda functions via event.requestContext.authorizer, eliminating re-decoding at the service layer.
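The cost arithmetic can be reproduced as follows. This sketch counts only the per-invocation request charge quoted above (it ignores Lambda duration charges and any free tier) and assumes a 30-day month and that every token sends a steady 10 RPS; `monthly_auth_cost` is a hypothetical helper.

```python
PRICE_PER_INVOCATION = 0.0000002  # USD per invocation, as quoted in the text
SECONDS_PER_DAY = 86_400


def monthly_auth_cost(rps: float, requests_per_invocation: float = 1.0,
                      days: int = 30) -> float:
    """Request-charge cost of authorizer invocations over `days` days."""
    requests = rps * SECONDS_PER_DAY * days
    return requests / requests_per_invocation * PRICE_PER_INVOCATION


uncached = monthly_auth_cost(10_000)
# With a 300 s TTL, a token sending 10 RPS triggers one invocation per
# 10 * 300 = 3,000 requests.
cached = monthly_auth_cost(10_000, requests_per_invocation=10 * 300)

print(f"uncached:  ${uncached:,.2f}/month")
print(f"cached:    ${cached:,.2f}/month")
print(f"reduction: {1 - cached / uncached:.2%}")
```

The saving depends on how many requests each token sends within one TTL window: many distinct low-volume tokens would see a much smaller reduction than this steady 10 RPS assumption suggests.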

Kong Gateway plugin architecture

Kong (Nginx + OpenResty + LuaJIT) processes rate limiting, OAuth2, and request logging inline with the Nginx event loop using Lua plugins — no separate process spawned per plugin, no cross-process IPC. The rate limiting plugin uses Redis sorted sets with an atomic Lua script to implement sliding window counting; Kong adds ~0.5–1ms of gateway processing overhead per request at 1M+ RPM production deployments. Kong's declarative configuration (decK) treats route, plugin, and consumer configurations as YAML files in git, synchronized to the Kong database via CI/CD — the gateway configuration is version-controlled and deployed with the same GitOps workflow as application code.
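To make the sliding-window idea concrete, here is a minimal in-memory sketch. It is not Kong's plugin code and does not use Redis: a per-key queue of timestamps stands in for the shared store, and `SlidingWindowLimiter` is a hypothetical name. In a real gateway the counters would live outside the process so that the gateway instances stay stateless.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key: str, now: float) -> bool:
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # caller would answer with 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(limit=2, window_seconds=1.0)
decisions = [limiter.allow("client-a", t) for t in (0.0, 0.5, 0.9, 1.0)]
```

Unlike a fixed window that resets on a clock boundary, the window here moves with each request, so a burst straddling two boundaries cannot exceed the limit.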

Anti-Patterns

Business logic in the gateway

Writing database queries, complex data transformations, or domain validation inside gateway filter code. The gateway processes every request — a slow database query in a gateway filter adds latency to every route, not just the routes that need that data. Gateway code cannot be deployed or rolled back independently of the routing configuration.

No downstream timeouts

Failing to set connect and read timeouts for downstream services. A payment service that degrades to 30-second responses holds gateway connections for 30 seconds each. At 1,000 concurrent users, 1,000 gateway connections are blocked, starving all other routes.
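A minimal sketch of the remedy, using `asyncio` from the Python standard library: each downstream call is wrapped in a read timeout so a slow service releases its gateway slot quickly. The backend is simulated with `asyncio.sleep`, and `call_backend` and `proxy` are hypothetical names; a real gateway would apply the same pattern around its HTTP client.

```python
import asyncio


async def call_backend(delay_s: float) -> str:
    # Stand-in for a non-blocking network call to a downstream service.
    await asyncio.sleep(delay_s)
    return "200 OK"


async def proxy(delay_s: float, read_timeout_s: float) -> str:
    """Forward a request, giving up after `read_timeout_s` seconds."""
    try:
        return await asyncio.wait_for(call_backend(delay_s), timeout=read_timeout_s)
    except asyncio.TimeoutError:
        # The slow call is cancelled; the gateway answers immediately.
        return "504 Gateway Timeout"


async def demo() -> list:
    # A healthy route and a degraded route are served concurrently.
    return await asyncio.gather(
        proxy(delay_s=0.01, read_timeout_s=0.2),
        proxy(delay_s=1.0, read_timeout_s=0.05),
    )


results = asyncio.run(demo())
```

The degraded route fails fast with a 504 while the healthy route completes normally, so the slow service no longer holds resources that other routes need.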

Monolithic gateway for all clients

Using a single gateway configuration for web, mobile, and third-party API partners. Mobile clients need smaller payloads and different error codes than web clients; third-party partners need API versioning and stricter rate limits. A single gateway serves none of these well. Use the Backend-for-Frontend (BFF) pattern: separate gateway instances per client type, each optimized for its consumer.

Synchronous fan-out aggregation

Aggregating responses from five services synchronously in the gateway before returning to the client. If any one of the five services fails or is slow, the entire response fails or is slow. The combined failure rate is approximately the sum of the downstream failure rates; latency is the sum of the downstream latencies when the calls run sequentially, and that of the slowest call when they run in parallel.
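How failure and latency compound can be worked through numerically. This sketch assumes downstream failures are independent, which is a simplification; the helper names and the sample figures (1% failure rate, latencies in milliseconds) are illustrative, not measurements.

```python
import math


def combined_failure_rate(rates) -> float:
    """Probability that at least one of several independent calls fails."""
    return 1 - math.prod(1 - p for p in rates)


def combined_latency_ms(latencies, parallel: bool) -> float:
    """Sequential calls add up; parallel calls wait for the slowest one."""
    return max(latencies) if parallel else sum(latencies)


failure = combined_failure_rate([0.01] * 5)  # close to the simple sum, 0.05
sequential = combined_latency_ms([40, 60, 80, 120, 200], parallel=False)
fanned_out = combined_latency_ms([40, 60, 80, 120, 200], parallel=True)
```

For small failure rates the exact figure is close to the plain sum, so five services at 1% each give the aggregated endpoint a failure rate of nearly 5%.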

Design Tradeoffs

Dimension	Centralized API Gateway	Direct Client-to-Service
Cross-cutting concerns	Auth, rate limiting, logging implemented once — consistent across all services	Each service implements independently; inevitable divergence in behavior
Latency	Extra network hop: 0.5–2ms gateway processing overhead	Zero gateway overhead; direct connection
SPOF risk	Gateway must be deployed multi-zone HA; single misconfiguration affects all routes	No gateway SPOF; each service is individually exposed with its own failure domain
Team autonomy	Gateway config is shared; changes require gateway team coordination	Teams deploy and configure their own endpoints completely independently

Best Practices

Keep the gateway stateless: validate JWTs in-process using cached public keys (no network call per request), store rate-limit counters in Redis, and store session state in downstream services. A stateless gateway scales horizontally by adding instances.

Build on non-blocking I/O (Nginx, Envoy, Netty). A thread-per-request gateway will exhaust its thread pool when any downstream service degrades, and at that point one slow service stalls every route behind the gateway.

Set per-route downstream timeouts: connect timeout 50ms, read timeout proportional to the route's P99 + 50% buffer (e.g., 300ms for a payment route with P99=200ms). Without per-route timeouts, one slow service's slowdown cascades to the gateway's entire connection pool.
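The P99-plus-buffer rule can be sketched as a small calculation. This assumes latency samples per route are available from metrics; `p99_ms` and `read_timeout_ms` are hypothetical helpers, and `statistics.quantiles` is used with its default interpolation method.

```python
import statistics

CONNECT_TIMEOUT_MS = 50  # connect timeout suggested in the text


def p99_ms(latency_samples_ms) -> float:
    """99th-percentile latency from observed samples (needs at least 2 samples)."""
    # quantiles(n=100) returns 99 cut points; the last one is P99.
    return statistics.quantiles(latency_samples_ms, n=100)[98]


def read_timeout_ms(p99: float, buffer: float = 0.5) -> float:
    """Read timeout = P99 plus a proportional safety buffer."""
    return p99 * (1 + buffer)


payment_timeout = read_timeout_ms(200)     # P99 = 200ms gives 300ms
samples = list(range(1, 201))              # illustrative latencies, 1..200 ms
measured_timeout = read_timeout_ms(p99_ms(samples))
```

Recomputing the value periodically from fresh samples keeps the timeout tied to how the route actually behaves, instead of to a number chosen once and never revisited.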

Deploy the Backend-for-Frontend pattern: separate gateway instances for mobile, web, and partner APIs. Each BFF is optimized for its consumer's payload size, authentication scheme, and rate limits.

Treat gateway configuration as code: store routes, plugins, and rate limits as YAML in git with CI/CD validation and canary deployment. A gateway misconfiguration that removes all authentication is a production security incident.
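A CI validation step of this kind could look roughly like the following. It assumes the YAML has already been parsed into a list of dictionaries, and the plugin names, the `public` flag, and `validate_routes` are all hypothetical; they do not correspond to any specific gateway's schema.

```python
REQUIRED_PLUGINS = {"jwt-auth", "rate-limit"}  # hypothetical plugin names


def validate_routes(routes) -> list:
    """Return one error per non-public route missing a required plugin."""
    errors = []
    for route in routes:
        if route.get("public", False):
            continue  # explicitly public routes (e.g. health checks) are exempt
        missing = REQUIRED_PLUGINS - set(route.get("plugins", []))
        if missing:
            errors.append(f"{route['path']}: missing {', '.join(sorted(missing))}")
    return errors


config = [
    {"path": "/orders", "plugins": ["jwt-auth", "rate-limit"]},
    {"path": "/users", "plugins": ["rate-limit"]},  # auth removed by mistake
    {"path": "/health", "public": True},
]
errors = validate_routes(config)
```

Failing the pipeline whenever `errors` is non-empty turns an accidental removal of authentication into a rejected change instead of a production incident.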

When to Use / Avoid

Use When	Avoid When
Running a microservices architecture with multiple public-facing services requiring consistent auth, rate limiting, and CORS enforcement	Running a simple monolithic architecture where a single process handles all client types — a gateway adds latency with no organizational benefit
Supporting multiple client types (mobile, web, IoT) with different payload and authentication requirements	Ultra-low-latency applications where even 1ms of gateway overhead violates the SLA — use direct TCP or co-located service mesh instead
Security teams need a single enforcement point for auth policies, IP allowlisting, and TLS configuration	Engineering team lacks the operational capacity to monitor and scale an additional infrastructure tier