Rate Limiting Architectures
Rate limiting protects downstream services from resource exhaustion by enforcing consumption thresholds per client. The algorithm choice (token bucket vs. sliding window) determines how burst traffic is handled; the implementation choice (centralized Redis vs. distributed local counters) determines how accurate limits are at scale.
- Token bucket allows controlled bursts (a client can spend saved-up tokens in a short window); sliding window counter caps the exact request count within any rolling time window, so a client cannot exceed the limit by straddling a window boundary. Choose based on whether your downstream service tolerates burst traffic.
- GitHub enforces 5,000 requests/hour for authenticated users with a fixed window; the boundary abuse vulnerability (double-limit burst at window edge) is mitigated by a secondary sliding-window limit of roughly 900 requests/minute (about 15 requests/second) plus a cap of about 100 concurrent requests.
- Stripe's token bucket is implemented in Redis: bucket capacity 100 tokens, refill rate 100 tokens/second. The atomic Lua script reads token count, calculates tokens earned since last request, updates count, and returns allow/deny — all in a single Redis round-trip.
- Cloudflare uses HyperLogLog probabilistic counting for IP-level rate limiting across 200+ PoPs — exact coordination across all PoPs adds 50–100ms latency per request. The tradeoff: ±5–10% overcounting is acceptable for DDoS mitigation; per-API-key exact limits use centralized Redis.
- Always implement a local in-memory fallback. If the Redis rate-limit cluster fails with no fallback, the rate limiter becomes a total API outage rather than a degraded mode.
The Problem
A SaaS platform's API is called by a customer's data team running a migration script that fires 2,000 requests per second without any throttling. The script exhausts the application's PostgreSQL connection pool in 45 seconds — 100 connections all occupied by the migration queries. Other customers' requests queue at the connection pool and time out. The platform is down for all customers due to one customer's script, and the platform's own health check endpoints are timing out because they also require database connections. Rate limiting applied at the API layer would have rejected the script's requests above the threshold and preserved connection pool availability for all other customers.
Core System Idea
A rate limiting architecture intercepts requests before they reach backend services and evaluates them against consumption thresholds keyed by context (user ID, API key, tenant ID, or IP address). Two algorithms are in common production use:
(1) Token bucket: each key has a bucket with a fixed capacity and a continuous refill rate. Each request consumes one token; if the bucket is empty, the request is rejected. The bucket accumulates tokens up to capacity while the client is idle, which is what allows legitimate bursts. Implementation: two values per key in Redis (tokens_remaining, last_refill_timestamp). A Lua script runs atomically: compute tokens earned since the last request as (now - last_refill) * refill_rate, add them to the current count (capped at capacity), subtract 1 if at least one token is available (otherwise deny), and write both the new count and the new timestamp back to Redis.
(2) Sliding window counter: tracks the exact count of requests within the last N seconds. Implementation: a Redis sorted set with request timestamps as scores; count the members in the [now - window, now] range. This eliminates the fixed-window boundary abuse vulnerability (a fixed window allows 2× the limit by bursting at the end of one window and the start of the next). Memory cost is higher: each request adds an entry to the sorted set rather than updating two scalar values.
For distributed systems spanning multiple servers or PoPs, centralized Redis provides exact counts at the cost of one Redis round-trip per request (1–2ms latency overhead). Distributed local counters (HyperLogLog, probabilistic approximate counting) are accurate to ±5–10% but add no network latency, which makes them appropriate for DDoS mitigation where overcounting is acceptable.
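The token bucket arithmetic above can be sketched in memory. This is a minimal illustration, not a production limiter: the `TokenBucket` class and its field names are hypothetical, the clock is passed in explicitly so the behavior is deterministic, and there is no Redis or locking involved.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """In-memory bucket holding the same two values the text keeps in Redis."""
    capacity: float      # maximum tokens the bucket can hold
    refill_rate: float   # tokens added per second
    tokens: float        # tokens_remaining
    last_refill: float   # last_refill_timestamp, in seconds

    def allow(self, now: float) -> bool:
        # Tokens earned since the last request, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # Spend one token if available; otherwise reject.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(capacity=5, refill_rate=1.0, tokens=5, last_refill=0.0)
burst = [bucket.allow(now=0.0) for _ in range(7)]  # 5 allowed, then 2 rejected
after_wait = bucket.allow(now=2.0)                 # 2 tokens refilled, 1 spent
```

A full bucket absorbs a burst of five requests at once, then admits traffic only as fast as tokens refill. In a shared deployment the same read-compute-write sequence would need to run atomically, which is the role the Lua script plays.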
System Flow
The rate limiter performs an atomic Redis evaluation per request — a Lua script reads current state, updates the counter, and returns allow or deny in a single round-trip.
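As a rough sketch of where that evaluation sits in the request path, the following shows a handler that derives the limiting key, asks a limiter for a decision, and only then calls the backend. All names here (`Decision`, `rate_limit_key`, `handle`, `stub_check`) are hypothetical, the `Authorization` header is assumed to carry the API key, and the stub limiter stands in for the Redis round-trip.

```python
from typing import Callable, NamedTuple


class Decision(NamedTuple):
    allowed: bool
    limit: int
    remaining: int
    retry_after: int  # seconds; only used when allowed is False


def rate_limit_key(headers: dict[str, str], remote_ip: str) -> str:
    """Prefer the API key as the limiting context; fall back to the client IP."""
    api_key = headers.get("Authorization")
    return f"key:{api_key}" if api_key else f"ip:{remote_ip}"


def handle(headers: dict[str, str], remote_ip: str,
           check: Callable[[str], Decision],
           backend: Callable[[], str]) -> tuple[int, dict[str, str], str]:
    decision = check(rate_limit_key(headers, remote_ip))  # one limiter call per request
    out = {
        "X-RateLimit-Limit": str(decision.limit),
        "X-RateLimit-Remaining": str(decision.remaining),
    }
    if not decision.allowed:
        out["Retry-After"] = str(decision.retry_after)
        return 429, out, ""  # rejected before the backend is touched
    return 200, out, backend()


# Stand-in limiter (hypothetical): allows 2 requests per key, then denies.
seen: dict[str, int] = {}


def stub_check(key: str) -> Decision:
    seen[key] = seen.get(key, 0) + 1
    return Decision(seen[key] <= 2, 2, max(0, 2 - seen[key]), 1)


backend_calls: list[str] = []


def backend() -> str:
    backend_calls.append("query")
    return "ok"


responses = [handle({"Authorization": "sk_test"}, "203.0.113.7", stub_check, backend)
             for _ in range(3)]
```

The third request is answered with 429 without the backend being invoked, which is the property that keeps a runaway client from consuming database connections.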
Real-World Examples (Indicative)
GitHub enforces 5,000 requests/hour for authenticated OAuth tokens using a fixed hourly window. The fixed window has a known vulnerability: a client can make 5,000 requests in the last second of one window and 5,000 in the first second of the next, hitting 10,000 requests in 2 seconds. GitHub mitigates this with a secondary rate limit: a sliding window that caps at approximately 100 concurrent requests and 900 requests/minute (roughly 15 requests/second). When the primary hourly limit is hit, GitHub returns HTTP 403 (or 429) with X-RateLimit-Remaining: 0, and X-RateLimit-Reset gives the Unix timestamp of the next window reset; Retry-After, when present on secondary-limit responses, is a number of seconds to wait. Response headers on every request: X-RateLimit-Limit: 5000, X-RateLimit-Remaining: 4832, X-RateLimit-Reset: 1683000000. GitHub Enterprise Server implements the same algorithm using Redis Cluster for distributed state across its application tier.
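The boundary burst described above can be reproduced with a small fixed-window counter. This is an illustrative sketch only: `allow_fixed` and the in-memory `counts` table are hypothetical and say nothing about how GitHub actually stores its counters.

```python
from collections import defaultdict

LIMIT = 5000    # requests per window
WINDOW = 3600   # window length in seconds

counts: dict[tuple[str, int], int] = defaultdict(int)  # (key, window index) -> requests


def allow_fixed(key: str, now: int) -> bool:
    window_index = now // WINDOW  # the counter resets whenever this value changes
    if counts[(key, window_index)] >= LIMIT:
        return False
    counts[(key, window_index)] += 1
    return True


# 5,000 requests in the last second of window 0, then 5,000 in the first second of window 1.
accepted = sum(allow_fixed("client", 3599) for _ in range(5000))
accepted += sum(allow_fixed("client", 3600) for _ in range(5000))
one_more = allow_fixed("client", 3600)  # window 1 is now exhausted
```

All 10,000 requests are accepted within two seconds because each half lands in a different window, which is why a secondary short-interval limit is layered on top.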
Stripe implements token bucket rate limiting for their API in Redis. Per API key: bucket capacity 100 tokens, refill rate 100 tokens/second (sustained limit of 100 RPS with burst allowance). The atomic Lua script, with KEYS[1] holding the token count, KEYS[2] the last-refill timestamp, and ARGV[1] the current time in seconds: local now = tonumber(ARGV[1]); local tokens = tonumber(redis.call('get', KEYS[1]) or 100); local last = tonumber(redis.call('get', KEYS[2]) or now); tokens = math.min(100, tokens + (now - last) * 100); local allowed = 0; if tokens >= 1 then tokens = tokens - 1; allowed = 1 end; redis.call('set', KEYS[1], tokens); redis.call('set', KEYS[2], now); return allowed. Both values are written back on every call so that the next call measures elapsed time from this one. The entire evaluate-and-update runs as one atomic script invocation, preventing race conditions from concurrent requests. When limited, Stripe returns HTTP 429 with error.type = 'rate_limit_error' and Retry-After: 1. Stripe's API key sub-limits apply separately: the POST /v1/charges endpoint has its own 100 requests/second cap independent of the account-level token bucket.
Cloudflare operates rate limiting at 200+ PoPs worldwide. For IP-level DDoS mitigation, exact centralized counting would require synchronizing request counts across all PoPs — adding 50–100ms of cross-PoP coordination overhead per request, worse than the DDoS itself. Cloudflare uses HyperLogLog approximate counting at each PoP independently: each PoP maintains its own local count, accepting ±5–10% inaccuracy. When any PoP sees an IP exceeding 10,000 requests/minute (even with counting error), it blocks the IP and propagates the block rule to all other PoPs via Cloudflare's anycast control plane within ~1 second. For per-customer API key rate limits that require exactness (Cloudflare customers configuring limits on their zones), Cloudflare routes all requests for that customer to a centralized Redis Cluster, accepting the 1–2ms Redis latency overhead in exchange for exact enforcement.
Anti-Patterns
Querying PostgreSQL to count requests in a rolling window: SELECT COUNT(*) FROM api_requests WHERE user_id=? AND created_at > now() - interval '1 minute'. At 10K RPS, this generates 10,000 count queries per second against the primary database — the rate limiter creates the resource exhaustion problem it was supposed to prevent.
Using only a fixed hourly window with no burst cap. A client that learns the window reset time can fire 10,000 requests in a 2-second span straddling the boundary, double the hourly limit, because half the requests are charged to the old window and half to the new one.
Rate limiter depends entirely on Redis, with no fallback if Redis becomes unavailable. When Redis is down, every request fails the rate limit check — the rate limiter becomes a total API outage. Implement a local in-memory fallback with approximate counting (Guava RateLimiter in Java, token_bucket in Python) that activates when Redis is unreachable.
Returning 429 Too Many Requests with no Retry-After or X-RateLimit-Reset header. Well-behaved clients have no information about when to retry — they immediately retry and contribute to the traffic spike that triggered the limit.
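The Redis-dependency anti-pattern above can be avoided with a thin wrapper that switches to a per-process limiter when the shared store is unreachable. The sketch below is one possible shape, under stated assumptions: `FallbackLimiter`, `LocalTokenBucket`, and `unreachable_store` are hypothetical names, and the built-in `ConnectionError` stands in for whatever exception a real Redis client raises.

```python
class LocalTokenBucket:
    """Per-process approximate limiter, used only while the shared store is down."""

    def __init__(self, capacity: float, refill_rate: float) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.state: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str, now: float) -> bool:
        tokens, last = self.state.get(key, (self.capacity, now))
        tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
        allowed = tokens >= 1
        self.state[key] = (tokens - 1 if allowed else tokens, now)
        return allowed


class FallbackLimiter:
    def __init__(self, primary, local: LocalTokenBucket) -> None:
        self.primary = primary  # callable(key, now) -> bool; may raise ConnectionError
        self.local = local
        self.degraded = False   # exposed so monitoring can alert on it

    def allow(self, key: str, now: float) -> bool:
        try:
            allowed = self.primary(key, now)
            self.degraded = False
            return allowed
        except ConnectionError:
            # Shared store unreachable: enforce an approximate per-instance limit.
            self.degraded = True
            return self.local.allow(key, now)


def unreachable_store(key: str, now: float) -> bool:
    raise ConnectionError("rate-limit store unreachable")  # simulated outage


limiter = FallbackLimiter(unreachable_store, LocalTokenBucket(capacity=2, refill_rate=1.0))
results = [limiter.allow("tenant_42", now=0.0) for _ in range(3)]
```

During the simulated outage requests are still limited rather than uniformly failed, at the cost of each instance enforcing its own copy of the limit.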
Design Tradeoffs
| Dimension | Token Bucket | Sliding Window Counter |
|---|---|---|
| Burst handling | Allows controlled bursts up to bucket capacity — idle clients accumulate tokens | No burst allowance — each request counts equally against the rolling window |
| Memory per key | Constant: 2 Redis values per key (token count, last refill timestamp) | Higher: sorted set entry per request within the window; scales with request rate |
| Boundary vulnerability | Not applicable — token refill is continuous, not window-based | Eliminates fixed-window double-burst; counts are accurate across time boundaries |
| Best for | APIs where bursty-but-bounded client behavior is acceptable (batch imports, retries) | Strict enforcement where every request in the window must count equally (billing APIs) |
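To make the right-hand column concrete, here is a minimal in-memory sketch of the sliding window approach, with a deque of timestamps standing in for the Redis sorted set. The `SlidingWindowLog` class is hypothetical and assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowLog:
    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.timestamps: deque[float] = deque()  # one entry per accepted request

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


limiter = SlidingWindowLog(limit=5, window=60.0)
# Five requests just before a minute boundary, five just after it.
boundary = [limiter.allow(59.0) for _ in range(5)] + [limiter.allow(61.0) for _ in range(5)]
entries_held = len(limiter.timestamps)  # memory grows with accepted requests
later = limiter.allow(119.5)            # the t=59 entries have expired by now
```

The second group of five is rejected because the first five are still inside the rolling 60-second window, whereas a fixed one-minute window would have accepted all ten. The stored entry count also shows the memory tradeoff: state grows with the number of requests in the window instead of staying at two scalars.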
Best Practices
Return Retry-After and X-RateLimit-Remaining headers on every response, not just on 429s. Well-behaved clients use X-RateLimit-Remaining to slow their request rate before hitting the limit, which reduces 429s or avoids them entirely.
Implement a local in-memory fallback (Guava RateLimiter, Python token_bucket) that activates when Redis is unreachable. The local limiter is per-instance rather than global, so the effective limit becomes configured_limit × instance_count, but this is acceptable for the failure mode. The alternative (no fallback) is a total API outage when Redis is down.
Distinguish 429 Too Many Requests (client limit exceeded) from 503 Service Unavailable (system overload). Clients must not retry 429 responses immediately; Retry-After tells them when. Including Retry-After prevents retry storms that re-trigger the limit immediately after it resets.
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Exposing public APIs vulnerable to scraping, brute-force, or accidental runaway client loops | Building internal, low-latency microservices within a trusted private network where client behavior is controlled |
| Operating multi-tenant SaaS where one customer's traffic volume can exhaust shared infrastructure for others | The Redis round-trip overhead (1–2ms per request) violates a strict sub-millisecond latency SLA |
| Protecting expensive downstream resources — third-party APIs that charge per call, or database connection pools | Traffic is completely predictable and auto-scaling handles all load variations without resource exhaustion |