← System Design AI Systems
System Design

AI Cost Optimization

GPT-4o costs $2.50/1M input tokens; GPT-4o-mini costs $0.15/1M — routing 80% of requests to the cheaper model cuts your LLM bill by 70%+ without touching quality for simple tasks.

TL;DR
  • GPT-4o costs $2.50/1M input tokens; GPT-4o-mini costs $0.15/1M — routing 80% of requests to the cheaper model cuts your LLM bill by 70%+ without touching quality for simple tasks.
  • Prompt caching (Anthropic, OpenAI, Google) cuts costs on repeated system prompts by 80–90% — the single highest-leverage optimization for most production apps.
  • Continuous batching with vLLM or TGI lets one A100 GPU serve 50–100 concurrent users at <200ms P99 vs 5–10 users with naive serial inference — same hardware, 10× throughput.
  • Token budgets must have hard stops, not just alerts — alert-only strategies fail because engineers are asleep when the runaway loop fires.
  • Cache inference results for high-cardinality-but-repetitive workloads (FAQ bots, classification pipelines) — a 20% hit rate meaningfully reduces provider spend.

The Problem

An e-commerce company launches an AI-powered product description generator. It routes all requests to GPT-4 regardless of task complexity — a request to rephrase a 10-word title uses the same model as a request to write a 500-word SEO article. Three weeks in, the LLM line item is 40% of the entire infrastructure bill, growing 15% week-over-week, with no breakdown by feature or team. This is the default outcome when cost optimization is treated as a post-launch concern rather than a design constraint.

Core System Idea

AI cost optimization requires a cost-aware inference layer with four levers: (1) Prompt caching — send static system prompts (instructions, few-shot examples) as cacheable prefixes; Anthropic charges $0.30/1M tokens for cache reads vs $3.00/1M for full input — a 10× reduction for the static portion. (2) Model routing — classify request complexity with a lightweight model (or heuristic) and route simple tasks to cheap models (GPT-4o-mini, Claude Haiku at $0.25/1M) and complex tasks to expensive ones (GPT-4o, Claude Sonnet). (3) Inference batching — use continuous batching (vLLM, TGI, TensorRT-LLM) to serve multiple requests on the same GPU pass; padding and scheduling are handled by the serving layer. (4) Response caching — cache LLM outputs keyed on a normalized prompt hash; even a 15–20% hit rate cuts spend for FAQ and classification workloads.

System Flow

flowchart TD A["Request"] --> B["Prompt Cache Check"] B -- "Cache Hit" --> G["Response"] B -- "Miss" --> C["Complexity Classifier"] C -- "Simple" --> D["Cheap Model"] C -- "Complex" --> E["Full Model"] D --> F["Response Cache"] E --> F F --> G

Prompt cache check first, then complexity-based routing to cheap vs full model; response cached for reuse.

Real-World Examples Indicative

Cursor IDE

Routes between Claude Sonnet (complex multi-file edits requiring deep reasoning), Claude Haiku (autocomplete and single-line suggestions), and GPT-4o-mini (quick explanations) based on a task complexity classifier. The 16× price difference between Haiku and Sonnet makes routing accuracy worth significant engineering investment — getting the classifier wrong in the expensive direction costs real money at millions of daily requests.

Vercel's v0

Uses Anthropic prompt caching for the large static system prompt (4000+ tokens) describing React component patterns and design constraints. The prompt is identical across all users and cached at the provider — cache reads cost $0.30/1M vs $3.00/1M for uncached input. For a product generating hundreds of thousands of components per day, this alone saves tens of thousands of dollars monthly.

Together.ai / Replicate (inference providers)

Use continuous batching via vLLM to serve multiple concurrent user requests in a single GPU forward pass. A single A100 (80GB) running Llama-3-70B with continuous batching handles 50–100 concurrent users at <200ms P99 latency, vs 5–10 users with naive serial inference. The same GPU revenue serves 10× more users — this is how inference providers price competitively.

Anti-Patterns

Alert-only token budgets

Setting a Slack alert when spend hits $1000/day but no hard stop. The alert fires at 3am, nobody sees it, the runaway loop runs until morning. Hard stops — automatic request rejection or queue throttling — are the only reliable protection.

Caching raw prompt strings

Two prompts that differ only in whitespace or timestamp produce different cache keys but identical LLM outputs. Normalize prompts before hashing: strip variable fields, lowercase, canonicalize whitespace. Otherwise your cache hit rate is near zero.

Routing all traffic to the flagship model

GPT-4o-class models are designed for tasks that genuinely require their capability. Classification, summarization of short text, and simple Q&A don't — and routing them to a flagship model is a 10–16× overpayment.

Batch sizes tuned for throughput without latency guard

A batch that waits 500ms to fill costs you P99 latency on every interactive request. Tune batch timeout separately for async and interactive queues.

No per-feature cost attribution

Aggregating all LLM spend under one line item makes it impossible to identify that the search autocomplete feature is using 60% of the budget while serving 5% of users.

Design Tradeoffs

DimensionBatch InferencePer-Request Direct
Cost per tokenLow (amortized GPU overhead)Higher (dedicated compute slot)
Tail latencyHigh (waits for batch fill)Low (immediate dispatch)
Best forAsync, offline, batch workloadsReal-time interactive use
ToolingvLLM, TGI, TensorRT-LLMOpenAI API, Anthropic API

Best Practices

Enable prompt caching on every call that has a static system prompt. Anthropic requires ≥1024 tokens for cache eligibility; OpenAI's prompt caching is automatic. Check your usage dashboard — if cache hit rate is under 60% for a steady-state workload, your prompt structure is wrong.
Track cost per feature, per user tier, and per model in your logging pipeline. This data is the prerequisite for every other optimization decision.
Set hard token limits per request (e.g., max_tokens=500 for classification tasks) rather than relying on models to self-truncate. Output tokens cost 3–4× more than input tokens at most providers.
Build model routing with a conservative fallback: when the complexity classifier is uncertain, route to the better model. A wrong answer costs more in user trust than the token savings are worth.
For self-hosted inference, benchmark continuous batching throughput with realistic concurrency before choosing GPU SKU — a single A100 with vLLM often outperforms two T4s with naive serving at a fraction of the cost.

When to Use / Avoid

Use WhenAvoid When
LLM spend is >5% of infrastructure costLLM usage is low-volume and cost is negligible
High volume of repetitive prompts (FAQ, classification)Every request is unique and non-cacheable
Multiple model tiers are available for the taskOne model is demonstrably better and the cost is acceptable
Interactive and async workloads can be separatedAll workloads require sub-100ms latency