
LLM Gateway Design

Every team that hits its third LLM provider builds an internal gateway — the question is whether they planned it or bolted it on after the first runaway bill.

TL;DR
  • A gateway adds 5–15ms overhead per request in exchange for unified rate limiting, cost tracking, and automatic fallback across providers.
  • Fallback chains must be heterogeneous: routing from GPT-4o to GPT-4o-mini on failure still fails if OpenAI is down; include Anthropic or a self-hosted model.
  • Caching responses to identical prompts (after normalization) at the gateway is often the single highest-leverage cost optimization: repeated FAQ queries hit the LLM once, not thousands of times.
  • Streaming responses require special handling: a gateway that buffers the full response before forwarding defeats the latency benefit of streaming entirely.

The Problem

A startup has three product teams each calling OpenAI directly. One team's poorly bounded loop generates 10M tokens overnight — the company's monthly budget, gone in 8 hours. There's no shared rate limiting, no cost attribution by team, and no fallback when OpenAI returns 429s during peak hours. This is the canonical forcing function for an LLM gateway: without a single control plane, every team reinvents auth, retries, and cost controls independently, inconsistently, and expensively.

Core System Idea

An LLM gateway is a reverse proxy that sits between all internal services and external LLM providers, exposing a unified API (typically OpenAI-compatible) to consumers. It handles:
  • routing requests to the optimal provider based on cost, latency, or model capability;
  • enforcing per-tenant and per-team rate limits via a shared token bucket (backed by Redis);
  • tracking token usage per request for cost attribution;
  • managing provider credentials centrally in a secrets store;
  • executing fallback chains when a provider is degraded;
  • forwarding streaming responses without buffering.

Open-source implementations include LiteLLM (Python, supports 100+ models) and Portkey's gateway; OpenRouter is a hosted routing service rather than something you self-host. Cloud providers offer managed building blocks in the same space: the AI gateway capabilities in Azure API Management, Vertex AI endpoints on Google Cloud, and Amazon Bedrock, which can be fronted by Amazon API Gateway.
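The per-team rate limit described above can be sketched as a token bucket. This is a minimal in-memory version for illustration only: the class name `TokenBucket`, its fields, and the explicit `now` parameter are assumptions, and a shared gateway would keep the same state in Redis so that all gateway instances see one bucket per team.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    """In-memory token bucket (hypothetical sketch; production state would live in Redis)."""
    capacity: float        # maximum burst size, in LLM tokens
    refill_per_sec: float  # sustained rate, in LLM tokens per second
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self) -> None:
        # Start full so a team can burst up to `capacity` immediately.
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def try_consume(self, amount: float, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then deduct `amount` if enough tokens remain."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False  # caller would answer with HTTP 429
```

A gateway would hold one bucket per team or tenant and call `try_consume` with the request's estimated token count before forwarding anything to a provider.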

System Flow

flowchart TD
    A["Internal Service"] --> B["LLM Gateway"]
    B --> C["Rate Limiter / Cost Tracker"]
    C --> D["Dynamic Router"]
    D --> E["OpenAI / Anthropic"]
    D --> F["Fallback Provider"]
    E --> B
    F --> B
    B --> A

Gateway enforces limits and routes before any token touches a provider; fallback is automatic on provider error or latency breach.
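The "Dynamic Router" step in the flow above could look roughly like the sketch below: drop unhealthy providers and those breaching a latency threshold, then pick the cheapest of what remains. The `ProviderStatus` fields, the provider names, and the cost-then-latency ordering are illustrative assumptions, not how any particular gateway decides.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProviderStatus:
    """Snapshot of one provider as seen by the gateway (hypothetical fields)."""
    name: str
    healthy: bool
    p95_latency_ms: float
    cost_per_1k_tokens: float


def pick_provider(
    candidates: List[ProviderStatus], max_latency_ms: float
) -> Optional[ProviderStatus]:
    """Return the cheapest healthy provider within the latency budget, or None."""
    eligible = [
        p for p in candidates
        if p.healthy and p.p95_latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None  # caller decides whether to queue, reject, or relax the budget
    # Cheapest first; ties broken by lower latency.
    return min(eligible, key=lambda p: (p.cost_per_1k_tokens, p.p95_latency_ms))
```

Returning `None` instead of raising keeps the policy decision (fail fast versus degrade) with the caller.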

Real-World Examples (Indicative)

LiteLLM (used by Uber, Cisco, and thousands of teams)

Open-source gateway that normalizes 100+ LLM providers behind an OpenAI-compatible API. Adds ~5ms of overhead per request. Provides per-team rate limits backed by Redis, real-time cost tracking per model, and automatic fallback chains. Teams at Uber use it to route cheaper models for low-stakes tasks and GPT-4 class models for high-accuracy workflows, reducing LLM spend by 40–60%.

Cloudflare AI Gateway

Runs at Cloudflare's edge across 200+ PoPs and caches responses to repeated, identical prompts. A customer support bot receiving the same FAQ 10,000 times per day makes the LLM call once, and Cloudflare serves the cached response for the other 9,999. Also provides request logging, rate limiting per API key, and spend alerts. Primary value: prompt caching as a cost control mechanism at the network edge.

OpenRouter

Public LLM routing service offering 100+ models via a single API endpoint. Routes to the cheapest available provider for a given model family in real time. Teams use it to avoid managing multiple API keys and to access models not otherwise available in their region. Adds ~20–50ms routing overhead but eliminates the operational burden of multi-provider credential management.

Anti-Patterns

Buffering streaming responses

A gateway that accumulates the full LLM response before forwarding pushes time-to-first-token out to the entire generation time. For a 500-token response at 50 tokens/sec, the user sees nothing for 10 seconds instead of watching text appear almost immediately. Use chunked transfer encoding (typically carrying server-sent events) and pipe tokens through as they arrive.
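A non-buffering proxy can be sketched with async generators: each chunk is re-yielded as soon as it arrives, and the gateway keeps only running counters for accounting. The names `StreamStats`, `proxy_stream`, and `fake_upstream` are hypothetical, and `fake_upstream` merely stands in for a provider's streaming response.

```python
import asyncio
from typing import AsyncIterator, List


class StreamStats:
    """Running counters only; the response body itself is never accumulated."""

    def __init__(self) -> None:
        self.chunks = 0
        self.chars = 0


async def fake_upstream(chunks: List[str]) -> AsyncIterator[str]:
    """Stand-in for a provider stream (hypothetical)."""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as real network I/O would
        yield chunk


async def proxy_stream(
    upstream: AsyncIterator[str], stats: StreamStats
) -> AsyncIterator[str]:
    """Forward each chunk immediately while updating usage counters."""
    async for chunk in upstream:
        stats.chunks += 1
        stats.chars += len(chunk)
        yield chunk  # handed to the client before the next chunk is read


async def main() -> None:
    stats = StreamStats()
    async for piece in proxy_stream(fake_upstream(["Hel", "lo", "!"]), stats):
        print(piece, end="", flush=True)
    print()


if __name__ == "__main__":
    asyncio.run(main())
```

The property worth testing is that the first chunk reaches the consumer before the upstream has produced the second one.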

Homogeneous fallback chains

Routing from GPT-4o to GPT-4o-mini on failure only helps with capacity issues. If OpenAI has an outage, both fail. A production fallback chain must include at least one provider from a different company (Anthropic, Cohere, or a self-hosted model).
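One way to make this concrete is to validate, at configuration time, that a chain spans more than one vendor, and then try routes in order at request time. The sketch below assumes a hypothetical `ProviderError` exception and a `Route` tuple of (vendor, model, callable); the callables are placeholders for real provider clients.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised when a provider call fails (hypothetical error type)."""


# (vendor, model, call), where call(prompt) returns the completion text.
Route = Tuple[str, str, Callable[[str], str]]


def is_heterogeneous(chain: Sequence[Route]) -> bool:
    """A chain survives a single-vendor outage only if it spans two or more vendors."""
    return len({vendor for vendor, _, _ in chain}) >= 2


def complete_with_fallback(chain: Sequence[Route], prompt: str) -> Tuple[str, str]:
    """Try each route in order; return (model, text) from the first that succeeds."""
    errors: List[str] = []
    for vendor, model, call in chain:
        try:
            return model, call(prompt)
        except ProviderError as exc:
            errors.append(f"{vendor}/{model}: {exc}")
    raise ProviderError("all routes failed: " + "; ".join(errors))
```

Rejecting a homogeneous chain when configuration is loaded surfaces the problem long before an outage does.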

Per-service API key management

Each microservice storing its own provider API keys means no central revocation, no spend visibility, and a secret leak that's impossible to fully audit. All credentials must live in one secrets store (AWS Secrets Manager, HashiCorp Vault), accessed only by the gateway.

No prompt cache

Sending every LLM request to the provider even for identical prompts is a direct money leak. FAQ chatbots, doc search, and classification pipelines routinely get identical inputs, so cache responses keyed on a hash of the normalized prompt.
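A prompt-hash cache could look something like the following. It is an exact-match cache on a normalized prompt, not an embedding-based similarity cache. The names `cache_key` and `PromptCache` are hypothetical, and the sketch assumes ISO-8601 timestamps are the only dynamic field that needs stripping; a real deployment would also need a TTL and would likely include sampling parameters in the key.

```python
import hashlib
import json
import re
from typing import Callable, Dict

# Assumption: ISO-8601 timestamps are the only dynamic field in prompts.
_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


def cache_key(model: str, prompt: str) -> str:
    """Hash the model plus the prompt with timestamps and extra whitespace removed."""
    normalized = " ".join(_TIMESTAMP.sub("<ts>", prompt).split())
    payload = json.dumps({"model": model, "prompt": normalized}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PromptCache:
    """Exact-match response cache in memory (no TTL or eviction in this sketch)."""

    def __init__(self, call_provider: Callable[[str, str], str]) -> None:
        self._call_provider = call_provider
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str) -> str:
        key = cache_key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = self._call_provider(model, prompt)
        self._store[key] = response
        return response
```

Tracking `hits` and `misses` makes the cache hit rate observable, which is the number that justifies keeping the cache.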

Ignoring cost attribution

Aggregating all LLM spend under one bill with no per-team or per-feature breakdown makes it impossible to identify runaway consumers until the invoice arrives.
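Per-team attribution reduces to pricing each logged request and summing by team. In the sketch below the `PRICES` table, the model names, and the `RequestLog` fields are all hypothetical placeholders rather than real provider rates.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Hypothetical prices in USD per 1M tokens as (input, output). Not real rates.
PRICES: Dict[str, Tuple[float, float]] = {
    "model-large": (5.00, 15.00),
    "model-small": (0.50, 1.50),
}


@dataclass(frozen=True)
class RequestLog:
    """The minimum a gateway needs to record per request for attribution."""
    team: str
    model: str
    input_tokens: int
    output_tokens: int


def request_cost(entry: RequestLog) -> float:
    """Cost of one request in USD, pricing input and output tokens separately."""
    in_price, out_price = PRICES[entry.model]
    return (entry.input_tokens * in_price + entry.output_tokens * out_price) / 1_000_000


def spend_by_team(entries: Iterable[RequestLog]) -> Dict[str, float]:
    """Total spend in USD per team."""
    totals: Dict[str, float] = defaultdict(float)
    for entry in entries:
        totals[entry.team] += request_cost(entry)
    return dict(totals)
```

The same aggregation keyed on feature or tenant instead of team gives the other breakdowns mentioned above.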

Design Tradeoffs

| Dimension | Simple Proxy | Intelligent Router |
| --- | --- | --- |
| Routing logic | Static (single provider) | Dynamic (cost, latency, health) |
| Fallback | Manual config change | Automatic with fallback chain |
| Added latency | 2–5ms | 5–15ms (routing + health checks) |
| Cost visibility | None / per-service | Centralized with per-tenant attribution |

Best Practices

  • Use an OpenAI-compatible API surface internally. Many providers and most gateways support this format, and it lets you swap LiteLLM, Portkey, or a custom implementation without touching consumer code.
  • Set hard token budgets per team per day with automatic cutoff, not just alerting. Alerts get ignored; hard stops prevent incidents.
  • Implement response caching keyed on a hash of the normalized prompt (after stripping dynamic fields like timestamps). Even a 20% cache hit rate removes roughly a fifth of provider calls.
  • For streaming: use async, non-buffering proxying. Test explicitly that your gateway does not accumulate the full response before forwarding, because this is a common silent bug.
  • Run active health checks against each provider every 30 seconds; failover should be automatic, not triggered by a page to an on-call engineer.
  • Log every request with: provider used, model, input tokens, output tokens, latency, cost, and requesting team. This is your cost audit trail.
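The hard-cutoff budget from the list above can be sketched as a counter keyed on team and calendar day that refuses requests once the limit would be exceeded. `DailyTokenBudget` and `BudgetExceeded` are hypothetical names, and the in-memory dictionary stands in for a shared store.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Tuple


class BudgetExceeded(Exception):
    """Raised when a request would push a team past its daily token limit."""


class DailyTokenBudget:
    """Hard per-team daily token cap (in-memory sketch; a real gateway needs shared state)."""

    def __init__(self, limits: Dict[str, int]) -> None:
        self._limits = limits
        self._used: Dict[Tuple[str, date], int] = defaultdict(int)

    def reserve(self, team: str, tokens: int, today: date) -> int:
        """Record usage and return the remaining budget, or raise if over the limit."""
        key = (team, today)
        limit = self._limits[team]
        if self._used[key] + tokens > limit:
            raise BudgetExceeded(
                f"{team} would use {self._used[key] + tokens} of {limit} tokens"
            )
        self._used[key] += tokens
        return limit - self._used[key]
```

Because the key includes the date, the budget resets on the next day without a separate reset job; old keys would still need expiring in a long-running process.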

When to Use / Avoid

| Use When | Avoid When |
| --- | --- |
| Using 2+ LLM providers or planning to | Single provider, early prototype stage |
| Multiple teams share LLM access and need cost attribution | Solo developer with one API key |
| Provider outages have business impact, so automatic fallback is needed | Latency budget is so tight that 5–15ms overhead is unacceptable |
| LLM spend is significant enough to warrant optimization | LLM usage is infrequent and cost is negligible |