AI Cost Optimization
- GPT-4o costs $2.50/1M input tokens; GPT-4o-mini costs $0.15/1M — routing 80% of requests to the cheaper model cuts your LLM bill by 70%+ without touching quality for simple tasks.
- Prompt caching (Anthropic, OpenAI, Google) cuts the cost of repeated system prompts by up to 90%, depending on the provider's cached-token discount; for most production apps it is the single highest-leverage optimization.
- Continuous batching with vLLM or TGI lets one A100 GPU serve 50–100 concurrent users at <200ms P99 vs 5–10 users with naive serial inference — same hardware, 10× throughput.
- Token budgets must have hard stops, not just alerts — alert-only strategies fail because engineers are asleep when the runaway loop fires.
- Cache inference results for high-cardinality-but-repetitive workloads (FAQ bots, classification pipelines) — a 20% hit rate meaningfully reduces provider spend.
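The routing claim above can be checked with simple arithmetic. This is a minimal sketch that assumes input-token prices only and the same token volume on either model; the function name is illustrative.

```python
def blended_price(cheap_share: float, cheap_price: float, full_price: float) -> float:
    """Average $/1M input tokens when `cheap_share` of traffic goes to the cheap model."""
    return cheap_share * cheap_price + (1.0 - cheap_share) * full_price


GPT_4O = 2.50       # $/1M input tokens, as quoted in the text
GPT_4O_MINI = 0.15  # $/1M input tokens, as quoted in the text

blended = blended_price(0.80, GPT_4O_MINI, GPT_4O)
savings = 1.0 - blended / GPT_4O
print(f"blended: ${blended:.2f}/1M, savings: {savings:.1%}")
```

With an 80/20 split the blended input price is about $0.62/1M, roughly a 75% reduction, which is consistent with the "70%+" figure. Output-token prices are ignored here and would shift the exact number.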
The Problem
An e-commerce company launches an AI-powered product description generator. It routes all requests to GPT-4 regardless of task complexity — a request to rephrase a 10-word title uses the same model as a request to write a 500-word SEO article. Three weeks in, the LLM line item is 40% of the entire infrastructure bill, growing 15% week-over-week, with no breakdown by feature or team. This is the default outcome when cost optimization is treated as a post-launch concern rather than a design constraint.
Core System Idea
AI cost optimization requires a cost-aware inference layer with four levers:
- Prompt caching: send static system prompts (instructions, few-shot examples) as cacheable prefixes. Anthropic charges $0.30/1M tokens for cache reads vs $3.00/1M for full input, a 10× reduction for the static portion.
- Model routing: classify request complexity with a lightweight model (or heuristic) and route simple tasks to cheap models (GPT-4o-mini, Claude Haiku at $0.25/1M) and complex tasks to expensive ones (GPT-4o, Claude Sonnet).
- Inference batching: use continuous batching (vLLM, TGI, TensorRT-LLM) to serve multiple requests on the same GPU pass; padding and scheduling are handled by the serving layer.
- Response caching: cache LLM outputs keyed on a normalized prompt hash; even a 15–20% hit rate cuts spend for FAQ and classification workloads.
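The model-routing lever can start as a plain heuristic before any classifier is trained. The sketch below is one possible shape, not a recommended rule set: the keyword list and the word-count threshold are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass

CHEAP_MODEL = "gpt-4o-mini"
FULL_MODEL = "gpt-4o"

# Hypothetical signals; tune against labelled requests before relying on them.
COMPLEX_KEYWORDS = ("refactor", "prove", "architecture", "multi-step", "seo article")
MAX_SIMPLE_WORDS = 150


@dataclass(frozen=True)
class RoutingDecision:
    model: str
    reason: str


def route(prompt: str) -> RoutingDecision:
    """Pick a model tier from cheap, inspectable signals."""
    text = prompt.lower()
    for keyword in COMPLEX_KEYWORDS:
        if keyword in text:
            return RoutingDecision(FULL_MODEL, f"keyword:{keyword}")
    if len(text.split()) > MAX_SIMPLE_WORDS:
        return RoutingDecision(FULL_MODEL, "long prompt")
    return RoutingDecision(CHEAP_MODEL, "default")
```

Returning a reason alongside the model makes misroutes easier to audit later, which matters because routing errors in the expensive direction are the costly ones.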
System Flow
Prompt cache check first, then complexity-based routing to cheap vs full model; response cached for reuse.
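The flow can be sketched as a small gateway. This version reads the "cache check" as a response-cache lookup; provider-side prompt caching happens inside the model call and is not modeled. `CostAwareGateway`, the injected model callables, and `is_complex` are all hypothetical names.

```python
import hashlib
from typing import Callable, Dict

ModelFn = Callable[[str], str]


class CostAwareGateway:
    """Cache lookup first, then complexity-based routing, then store the result."""

    def __init__(self, cheap: ModelFn, full: ModelFn, is_complex: Callable[[str], bool]):
        self.cheap = cheap
        self.full = full
        self.is_complex = is_complex
        self.cache: Dict[str, str] = {}
        self.hits = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self.cache:          # 1. cache check
            self.hits += 1
            return self.cache[key]
        model = self.full if self.is_complex(prompt) else self.cheap  # 2. route
        answer = model(prompt)         # 3. call the chosen tier
        self.cache[key] = answer       # 4. store for reuse
        return answer
```

The in-memory dict stands in for whatever shared cache a deployment would use; eviction and TTLs are left out to keep the control flow visible.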
Real-World Examples (Indicative)
A coding assistant routes between Claude Sonnet (complex multi-file edits requiring deep reasoning), Claude Haiku (autocomplete and single-line suggestions), and GPT-4o-mini (quick explanations) based on a task complexity classifier. At the input prices quoted above ($0.25/1M for Haiku vs $3.00/1M for Sonnet) the gap is 12×, which makes routing accuracy worth significant engineering investment: getting the classifier wrong in the expensive direction costs real money at millions of daily requests.
A UI-generation product uses Anthropic prompt caching for the large static system prompt (4000+ tokens) describing React component patterns and design constraints. The prompt is identical across all users and cached at the provider; cache reads cost $0.30/1M vs $3.00/1M for uncached input. For a product generating hundreds of thousands of components per day, this alone saves tens of thousands of dollars monthly.
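The "tens of thousands of dollars monthly" figure follows from the quoted prices. The estimate below assumes 100,000 requests per day (the low end of "hundreds of thousands"), a 30-day month, and that every request hits the cache; it ignores any cache-write surcharge and expiry misses, so treat it as an upper bound for that volume.

```python
def monthly_prompt_cache_savings(
    prompt_tokens: int,
    requests_per_day: int,
    full_price: float,     # $/1M uncached input tokens
    cached_price: float,   # $/1M cache-read tokens
    days: int = 30,
) -> float:
    """Dollars saved per month on the static prefix alone."""
    million_tokens_per_day = prompt_tokens * requests_per_day / 1_000_000
    return million_tokens_per_day * (full_price - cached_price) * days


# Assumed volume: 100,000 requests/day with the 4000-token prompt from the text.
estimate = monthly_prompt_cache_savings(4000, 100_000, 3.00, 0.30)
print(f"~${estimate:,.0f} saved per month")
```

Under these assumptions the saving is about $32,400 per month, and it scales linearly with request volume.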
Inference providers use continuous batching via vLLM to serve multiple concurrent user requests in a single GPU forward pass. A single A100 (80GB) running Llama-3-70B (quantized, since 16-bit weights alone need roughly 140GB) with continuous batching handles 50–100 concurrent users at <200ms P99 latency, vs 5–10 users with naive serial inference. The same GPU serves 10× more users, which is how inference providers price competitively.
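A toy simulation shows where the throughput gain comes from. It is not how vLLM is implemented: it assumes every forward pass takes the same time regardless of batch size (real step time grows with the batch), that every request needs at least one token, and it counts only decode steps.

```python
from collections import deque
from typing import List


def serial_steps(lengths: List[int]) -> int:
    """Forward passes needed when requests run one at a time."""
    return sum(lengths)


def continuous_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Forward passes needed when up to `max_batch` requests share each pass
    and a finished request's slot is refilled before the next pass."""
    waiting = deque(lengths)
    active: List[int] = []  # tokens still to generate per in-flight request
    steps = 0
    while waiting or active:
        # Refill free slots before every pass (the "continuous" part).
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One pass emits one token for every active request.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
        steps += 1
    return steps


requests = [100] * 20  # 20 requests, 100 output tokens each
print(serial_steps(requests), continuous_batching_steps(requests, max_batch=10))
```

With ten slots, twenty equal-length requests need one tenth of the forward passes, which is the idealized version of the 10× figure above.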
Anti-Patterns
Setting a Slack alert when spend hits $1000/day but no hard stop. The alert fires at 3am, nobody sees it, the runaway loop runs until morning. Hard stops — automatic request rejection or queue throttling — are the only reliable protection.
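A hard stop can be as small as a guard that is charged before each request is dispatched. In this sketch the class name, the $1,500 limit in the usage note, and the `on_alert` callback are hypothetical; the daily reset and any cross-process coordination are left out.

```python
from typing import Callable, Optional


class BudgetExceeded(RuntimeError):
    """Raised when a request would push spend past the hard limit."""


class SpendGuard:
    def __init__(
        self,
        hard_limit_usd: float,
        alert_at_usd: float,
        on_alert: Optional[Callable[[float], None]] = None,
    ):
        self.hard_limit_usd = hard_limit_usd
        self.alert_at_usd = alert_at_usd
        self.on_alert = on_alert or (lambda spent: None)
        self.spent_usd = 0.0
        self._alerted = False

    def charge(self, estimated_cost_usd: float) -> None:
        """Call before dispatching; rejects the request instead of overspending."""
        if self.spent_usd + estimated_cost_usd > self.hard_limit_usd:
            raise BudgetExceeded(
                f"limit ${self.hard_limit_usd:.2f} reached at ${self.spent_usd:.2f}"
            )
        self.spent_usd += estimated_cost_usd
        if not self._alerted and self.spent_usd >= self.alert_at_usd:
            self._alerted = True  # alert once; the hard stop does the protecting
            self.on_alert(self.spent_usd)
```

For example, `SpendGuard(hard_limit_usd=1500.0, alert_at_usd=1000.0)` still sends the $1,000 alert, but a runaway loop is rejected at $1,500 whether or not anyone reads it.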
Two prompts that differ only in whitespace or timestamp produce different cache keys but identical LLM outputs. Normalize prompts before hashing: strip variable fields, lowercase, canonicalize whitespace. Otherwise your cache hit rate is near zero.
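The normalization steps above might look like the following. The timestamp pattern covers only ISO-8601-style values and is an assumption about what the variable fields look like; lowercasing is unsafe for prompts where case carries meaning (code, for instance). Including the model name in the key is a design choice of this sketch.

```python
import hashlib
import re

# Assumed variable field: ISO-8601-like timestamps such as 2024-05-01T12:30:00Z.
ISO_TIMESTAMP = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?(\.\d+)?(Z|[+-]\d{2}:\d{2})?"
)


def normalize_prompt(prompt: str) -> str:
    text = ISO_TIMESTAMP.sub("<ts>", prompt)  # strip variable fields first
    text = text.lower()                       # case-insensitive matching
    return " ".join(text.split())             # canonicalize whitespace


def cache_key(prompt: str, model: str) -> str:
    """Stable key: identical after normalization means identical key."""
    payload = f"{model}\n{normalize_prompt(prompt)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Keying on the model as well prevents a cheap-tier answer from being served for a request that was routed to the full model.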
GPT-4o-class models are designed for tasks that genuinely require their capability. Classification, summarization of short text, and simple Q&A don't, and routing them to a flagship model is a 12–16× overpayment at the input prices quoted above.
A batch that waits 500ms to fill costs you P99 latency on every interactive request. Tune batch timeout separately for async and interactive queues.
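For request-level batching done in application code, separate tuning can be expressed as one policy per queue. The policy values below are hypothetical placeholders, and the collector is a minimal sketch built on the standard-library `queue` module; serving layers with continuous batching handle this internally.

```python
import queue
import time
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class BatchPolicy:
    max_size: int
    max_wait_s: float


# Hypothetical starting points; measure before adopting.
INTERACTIVE = BatchPolicy(max_size=8, max_wait_s=0.010)
ASYNC = BatchPolicy(max_size=64, max_wait_s=0.500)


def collect_batch(q: queue.Queue, policy: BatchPolicy) -> List[Any]:
    """Return when the batch is full or the wait budget is spent, whichever is first."""
    batch: List[Any] = []
    deadline = time.monotonic() + policy.max_wait_s
    while len(batch) < policy.max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The interactive queue gives up on filling the batch after 10ms, so the added tail latency is bounded by that wait rather than by the 500ms the async queue can afford.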
Aggregating all LLM spend under one line item makes it impossible to identify that the search autocomplete feature is using 60% of the budget while serving 5% of users.
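Attribution only requires tagging each call with a feature name at the point where tokens are counted. A minimal sketch, assuming input-token pricing only (output tokens are billed separately and are ignored here); `SpendLedger` and the feature names are hypothetical.

```python
from collections import defaultdict
from typing import Dict


class SpendLedger:
    """Accumulates LLM spend per feature so the bill can be broken down."""

    def __init__(self, price_per_million: Dict[str, float]):
        self.price_per_million = price_per_million
        self.spend_by_feature: Dict[str, float] = defaultdict(float)

    def record(self, feature: str, model: str, input_tokens: int) -> float:
        cost = input_tokens / 1_000_000 * self.price_per_million[model]
        self.spend_by_feature[feature] += cost
        return cost

    def shares(self) -> Dict[str, float]:
        """Fraction of total spend attributable to each feature."""
        total = sum(self.spend_by_feature.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.spend_by_feature.items()}


ledger = SpendLedger({"gpt-4o": 2.50, "gpt-4o-mini": 0.15})
ledger.record("search-autocomplete", "gpt-4o", 600_000)
ledger.record("checkout-help", "gpt-4o", 400_000)
print(ledger.shares())
```

The same tag can carry a team or tenant identifier, which is what turns a single line item into something a budget owner can act on.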
Design Tradeoffs
| Dimension | Batch Inference | Per-Request Direct |
|---|---|---|
| Cost per token | Low (amortized GPU overhead) | Higher (dedicated compute slot) |
| Tail latency | High (waits for batch fill) | Low (immediate dispatch) |
| Best for | Async, offline, batch workloads | Real-time interactive use |
| Tooling | vLLM, TGI, TensorRT-LLM | OpenAI API, Anthropic API |
Best Practices
When to Use / Avoid
| Use When | Avoid When |
|---|---|
| LLM spend is >5% of infrastructure cost | LLM usage is low-volume and cost is negligible |
| High volume of repetitive prompts (FAQ, classification) | Every request is unique and non-cacheable |
| Multiple model tiers are available for the task | One model is demonstrably better and the cost is acceptable |
| Interactive and async workloads can be separated | All workloads require sub-100ms latency |