LLM Gateway Design
Every team that hits its third LLM provider builds an internal gateway — the question is whether they planned it or bolted it on after the first runaway bill.
- A gateway adds 5–15ms overhead per request in exchange for unified rate limiting, cost tracking, and automatic fallback across providers.
- Fallback chains must be heterogeneous: routing from GPT-4o to GPT-4o-mini on failure still fails if OpenAI is down; include Anthropic or a self-hosted model.
- Caching repeated prompts at the gateway is often the highest-leverage cost optimization: identical FAQ queries hit the LLM once, not thousands of times. Exact-match (hash) caching only catches identical requests; matching reworded prompts requires a semantic cache.
- Streaming responses require special handling: a gateway that buffers the full response before forwarding defeats the latency benefit of streaming entirely.
The Problem
A startup has three product teams each calling OpenAI directly. One team's poorly bounded loop generates 10M tokens overnight — the company's monthly budget, gone in 8 hours. There's no shared rate limiting, no cost attribution by team, and no fallback when OpenAI returns 429s during peak hours. This is the canonical forcing function for an LLM gateway: without a single control plane, every team reinvents auth, retries, and cost controls independently, inconsistently, and expensively.
Core System Idea
An LLM gateway is a reverse proxy that sits between all internal services and external LLM providers, exposing a unified API (typically OpenAI-compatible) to consumers. It handles:
- routing requests to the best-fit provider based on cost, latency, or model capability;
- enforcing per-tenant and per-team rate limits via a shared token bucket (backed by Redis);
- tracking token usage per request for cost attribution;
- managing provider credentials centrally in a secrets store;
- executing fallback chains when a provider is degraded;
- forwarding streaming responses without buffering.
Open-source implementations include LiteLLM (Python, supports 100+ models) and Portkey. OpenRouter fills a similar role as a hosted routing service rather than a self-hosted deployment. Cloud providers offer managed equivalents: the AI gateway capabilities in Azure API Management, Google's Vertex AI endpoints, and Amazon Bedrock.
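The shared token bucket mentioned above can be sketched as follows. This is a minimal in-memory illustration, not how any particular gateway implements it: the `TokenBucket` class and its field names are hypothetical, and a real deployment would hold the same state in Redis so that every gateway replica sees one shared budget.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TokenBucket:
    """In-memory token bucket (illustrative; production state would live in Redis)."""
    capacity: float        # maximum burst, measured in LLM tokens
    refill_per_sec: float  # sustained budget, in LLM tokens per second
    tokens: float = field(init=False)
    updated_at: float = field(init=False)

    def __post_init__(self) -> None:
        # Start full so a team can burst up to `capacity` immediately.
        self.tokens = self.capacity
        self.updated_at = time.monotonic()

    def try_consume(self, amount: float, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then deduct `amount` if it fits."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False  # caller should answer with HTTP 429
```

Keeping one bucket per team (keyed by team ID) is what bounds a runaway loop: once the bucket is empty, further requests are rejected at the gateway instead of reaching the provider.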
System Flow
The gateway enforces limits and picks a route before any token reaches a provider; fallback triggers automatically on a provider error or a latency breach.
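A rough sketch of that flow, under stated assumptions: `handle_request`, `allow`, and the `(name, call)` chain format are hypothetical names chosen for illustration, and providers are modeled as plain callables. Latency is checked after the call returns, which is a simplification; a real gateway would more likely pass a timeout to its HTTP client and cancel the slow call.

```python
import time
from typing import Callable, List, Sequence, Tuple

Provider = Callable[[str], str]  # takes a prompt, returns a completion


class RateLimited(Exception):
    """Raised before any provider is contacted."""


class AllProvidersFailed(Exception):
    """Raised when every entry in the fallback chain errored or was too slow."""


def handle_request(
    prompt: str,
    team: str,
    allow: Callable[[str], bool],
    chain: Sequence[Tuple[str, Provider]],
    latency_budget_s: float = 30.0,
) -> Tuple[str, str]:
    # Step 1: enforce the limit first, so a rejected request costs nothing.
    if not allow(team):
        raise RateLimited(team)

    # Step 2: walk the chain in order; fall through on error or latency breach.
    errors: List[str] = []
    for name, call in chain:
        started = time.monotonic()
        try:
            reply = call(prompt)
        except Exception as exc:  # provider error, e.g. a 429 or 5xx
            errors.append(f"{name}: {exc}")
            continue
        if time.monotonic() - started > latency_budget_s:
            errors.append(f"{name}: exceeded latency budget")
            continue
        return name, reply

    raise AllProvidersFailed("; ".join(errors))
```

Returning the provider name alongside the reply lets the caller log which link in the chain actually served the request.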
Real-World Examples (Indicative)
LiteLLM: Open-source gateway that normalizes 100+ LLM providers behind an OpenAI-compatible API. Adds ~5ms of overhead per request. Provides per-team rate limits backed by Redis, real-time cost tracking per model, and automatic fallback chains. Teams at Uber reportedly use it to route cheaper models for low-stakes tasks and GPT-4 class models for high-accuracy workflows, reducing LLM spend by 40–60%.
Cloudflare AI Gateway: Runs at Cloudflare's edge across 200+ PoPs, caching responses to identical requests globally. A customer support bot receiving the same FAQ 10,000 times per day makes the LLM call once, and Cloudflare serves the cached response for the other 9,999. Also provides request logging, rate limiting per API key, and spend alerts. Primary value: prompt caching as a cost control mechanism at the network edge.
OpenRouter: Public LLM routing service offering 100+ models via a single API endpoint. Routes to the cheapest available provider for a given model family in real time. Teams use it to avoid managing multiple API keys and to access models not otherwise available in their region. Adds ~20–50ms routing overhead but eliminates the operational burden of multi-provider credential management.
Anti-Patterns
A gateway that accumulates the full LLM response before forwarding adds the entire generation time to the time to first token. For a 500-token response at 50 tokens/sec, the client waits 10 seconds before seeing anything. Stream with chunked transfer encoding (or server-sent events) and pipe tokens through as they arrive.
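The difference between the two behaviors can be shown with plain Python generators. This is only a sketch: `forward_streaming`, `forward_buffered`, and `first_chunk_delay_s` are hypothetical helpers, and the upstream is modeled as an iterable of text chunks rather than a real HTTP stream.

```python
from typing import Iterable, Iterator


def forward_streaming(upstream: Iterable[str]) -> Iterator[str]:
    """Pass each chunk to the client as soon as the provider emits it."""
    for chunk in upstream:
        yield chunk


def forward_buffered(upstream: Iterable[str]) -> Iterator[str]:
    """Anti-pattern: drain the whole upstream before sending anything."""
    yield "".join(upstream)


def first_chunk_delay_s(total_tokens: int, tokens_per_sec: float, buffered: bool) -> float:
    """Generation time the client waits before its first token arrives."""
    if buffered:
        return total_tokens / tokens_per_sec  # waits for the full response
    return 1 / tokens_per_sec                 # waits for one token only


print(first_chunk_delay_s(500, 50, buffered=True))   # 10.0
print(first_chunk_delay_s(500, 50, buffered=False))  # 0.02
```

Total transfer time is the same either way; what buffering destroys is the perceived latency, since the user stares at an empty response for the full generation time.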
Routing from GPT-4o to GPT-4o-mini on failure only helps with capacity issues. If OpenAI has an outage, both fail. A production fallback chain must include at least one provider from a different company (Anthropic, Cohere, or a self-hosted model).
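One way to catch a homogeneous chain before it ships is a config-time check like the sketch below. The `MODEL_VENDOR` registry and `validate_chain` function are hypothetical; only the two GPT-4o model names come from the text, and the other entries are placeholders rather than real model identifiers.

```python
from typing import Mapping, Sequence, Set

# Hypothetical registry mapping model names to the company that serves them.
MODEL_VENDOR: Mapping[str, str] = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "anthropic-model-placeholder": "anthropic",
    "self-hosted-placeholder": "self-hosted",
}


def distinct_vendors(chain: Sequence[str]) -> Set[str]:
    """Vendors represented in a fallback chain (KeyError on unknown models)."""
    return {MODEL_VENDOR[model] for model in chain}


def validate_chain(chain: Sequence[str], min_vendors: int = 2) -> None:
    """Reject chains that a single vendor outage would take down entirely."""
    vendors = distinct_vendors(chain)
    if len(vendors) < min_vendors:
        raise ValueError(
            f"fallback chain {list(chain)} relies on {sorted(vendors)} only; "
            f"need at least {min_vendors} independent vendors"
        )
```

Running this check when the gateway loads its routing config turns an outage-time surprise into a deploy-time error.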
Each microservice storing its own provider API keys means no central revocation, no spend visibility, and a secret leak that's impossible to fully audit. All credentials must live in one secrets store (AWS Secrets Manager, HashiCorp Vault), accessed only by the gateway.
Sending every LLM request to the provider even for identical prompts is a direct money leak. FAQ chatbots, doc search, and classification pipelines routinely get identical inputs — cache at the prompt hash level.
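Caching at the prompt hash level might look like the following sketch. The `cache_key` function and `PromptCache` class are hypothetical, the store is an in-process dict where a gateway would more likely use Redis or an edge cache, and the key deliberately includes the model and temperature so that different request settings never share an entry.

```python
import hashlib
import json
from typing import Callable, Dict, List

Messages = List[Dict[str, str]]


def cache_key(model: str, messages: Messages, temperature: float = 0.0) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the request."""
    payload = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class PromptCache:
    """Exact-match response cache (no TTL or eviction in this sketch)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, model: str, messages: Messages,
                    call: Callable[[str, Messages], str]) -> str:
        key = cache_key(model, messages)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        reply = call(model, messages)  # only cache misses reach the provider
        self._store[key] = reply
        return reply
```

An exact-match key like this only helps when inputs repeat verbatim, and it suits deterministic settings best; a production cache would also need an expiry policy so stale answers age out.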
Aggregating all LLM spend under one bill with no per-team or per-feature breakdown makes it impossible to identify runaway consumers until the invoice arrives.
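A per-team, per-feature ledger is enough to make a runaway consumer visible the same day. The sketch below is illustrative only: `CostLedger`, the model names, and every price in `PRICE_PER_MTOK` are made-up placeholders, not real provider pricing.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Tuple

# Hypothetical prices in USD per million tokens: (input, output).
PRICE_PER_MTOK: Dict[str, Tuple[Decimal, Decimal]] = {
    "big-model": (Decimal("5.00"), Decimal("15.00")),
    "small-model": (Decimal("0.50"), Decimal("1.50")),
}


class CostLedger:
    """Accumulates spend keyed by (team, feature)."""

    def __init__(self) -> None:
        self.spend: Dict[Tuple[str, str], Decimal] = defaultdict(Decimal)

    def record(self, team: str, feature: str, model: str,
               input_tokens: int, output_tokens: int) -> Decimal:
        """Price one request from its token counts and add it to the ledger."""
        in_price, out_price = PRICE_PER_MTOK[model]
        cost = (in_price * input_tokens + out_price * output_tokens) / Decimal(1_000_000)
        self.spend[(team, feature)] += cost
        return cost

    def by_team(self) -> Dict[str, Decimal]:
        """Roll feature-level spend up to one total per team."""
        totals: Dict[str, Decimal] = defaultdict(Decimal)
        for (team, _feature), cost in self.spend.items():
            totals[team] += cost
        return dict(totals)
```

Because the gateway sees every request, it can call `record` with the token counts the provider reports, then compare `by_team()` against budgets and alert long before the invoice arrives.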
Design Tradeoffs
| Dimension | Simple Proxy | Intelligent Router |
|---|---|---|
| Routing logic | Static (single provider) | Dynamic (cost, latency, health) |
| Fallback | Manual config change | Automatic with fallback chain |
| Added latency | 2–5ms | 5–15ms (routing + health checks) |
| Cost visibility | None / per-service | Centralized with per-tenant attribution |
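
The "dynamic" routing column can be read as a selection over live provider statistics. The sketch below assumes a hypothetical `ProviderStats` record that some health checker keeps up to date; the field names and the tie-breaking rule (cheapest first, then fastest) are illustrative choices, not a standard.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProviderStats:
    name: str
    usd_per_mtok: float    # blended price per million tokens
    p95_latency_ms: float  # recent observed latency
    healthy: bool          # result of the latest health check


def pick_provider(candidates: Sequence[ProviderStats],
                  max_latency_ms: float) -> ProviderStats:
    """Cheapest healthy provider within the latency target."""
    eligible = [
        p for p in candidates
        if p.healthy and p.p95_latency_ms <= max_latency_ms
    ]
    if not eligible:
        raise LookupError("no healthy provider meets the latency target")
    # Sort key: price first, latency as the tie-breaker.
    return min(eligible, key=lambda p: (p.usd_per_mtok, p.p95_latency_ms))
```

A static proxy skips all of this, which is where its lower overhead comes from; the router pays a few milliseconds to evaluate health and cost on each request.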
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| Using 2+ LLM providers or planning to | Single provider, early prototype stage |
| Multiple teams share LLM access and need cost attribution | Solo developer with one API key |
| Provider outages have business impact — need automatic fallback | Latency budget is so tight that 5–15ms overhead is unacceptable |
| LLM spend is significant enough to warrant optimization | LLM usage is infrequent and cost is negligible |