← System Design AI Systems
System Design

Context Window Management

Active context management is non-negotiable for robust, cost-effective LLM applications in production.

TL;DR
  • Active context management is non-negotiable for robust, cost-effective LLM applications in production.
  • Finite context windows require explicit strategies: chunking, summarization, selective retrieval, and rolling windows.
  • Mismanaging context leads directly to hallucination spikes, instruction drift, and unpredictable API costs.
  • Prompt caching (Anthropic, OpenAI, Google) can cut context costs 80–90% — but only if content is ordered correctly.
  • The "lost in the middle" phenomenon means LLMs attend poorly to content injected mid-context; recency and primacy matter.

The Problem

Large Language Models operate with a finite context window — a hard token limit covering both input and output. In production, long-running workflows, multi-turn conversations, and document processing inevitably exceed this boundary. Without proactive management, systems lose state, instructions drift, and responses become incoherent or hallucinated. Concretely: GPT-4 Turbo offers 128K tokens, Claude 3.5 Sonnet 200K, Gemini 1.5 Pro up to 1M — but larger windows are more expensive per call, and empirical research shows that LLMs attend poorly to content injected in the middle of long contexts (the "lost in the middle" phenomenon, Liu et al. 2023). Context size is not a substitute for context management.

Core System Idea

The solution is a Context Orchestration Layer — a stateful intermediary between application logic and the LLM API. This layer curates, compresses, and injects only the most relevant content into each prompt. It maintains a memory hierarchy: the system prompt (instructions, persona) stays fixed; recent turns are kept verbatim; older history is progressively summarized or evicted; long-term facts are retrieved from a vector store on demand. Critically, content ordering follows provider KV cache semantics: static/cacheable content comes first (system prompt, retrieved documents), dynamic content last. This structure lets providers like Anthropic and OpenAI reuse cached prefixes across calls, reducing latency and cost dramatically. The layer also runs pre-flight token counting before every call to prevent overages. Frameworks like LangChain, LlamaIndex, and Semantic Kernel formalize this pattern as "memory management."

System Flow

flowchart TD A[User Input] --> B[Application Logic] B --> C[Context Orchestrator] C --> D["Long-Term Memory\n(Vector DB / KV Store)"] D --> C C --> E[LLM API] E --> F[LLM Response] F --> C C --> B

Context Orchestrator sits between application logic and the LLM, pulling from long-term memory and managing what enters each prompt.

Real-World Examples Indicative

ChatGPT / Claude.ai

Conversational AI platforms manage multi-turn context through rolling windows and server-side summarization. When a conversation grows long, older turns are summarized or dropped. System prompts are cached at the prefix level so repeated instructions don't cost full tokens every turn.

Customer Support Bots (Intercom, Zendesk AI)

These inject CRM data, past ticket history, and account details into each LLM call. A context management layer summarizes prior interactions and selectively retrieves relevant past cases — preventing the prompt from growing unbounded as a conversation continues across sessions.

Legal Document Review (Casetext, Harvey AI)

Legal documents routinely exceed 100K tokens. These platforms chunk documents by section, embed each chunk, and retrieve only the most relevant passages per query. The LLM never sees the full document — only the curated, relevant excerpts, plus the legal question and prior analysis.

Anti-Patterns

Naive history concatenation

Appending the full raw conversation history to every prompt. Rapidly exhausts token limits and forces the LLM to "forget" early instructions due to the recency bias in attention.

Blind summarization

Summarizing without prioritizing instructions. Critical constraints and task-specific rules get collapsed into vague summaries, causing instruction drift mid-workflow.

Static content at the end

Placing system prompts or shared context after dynamic user content. This defeats provider-level KV caching — the cacheable prefix must be stable and come first.

Ignoring the "lost in the middle" effect

Injecting the most important facts in the middle of a long context. LLMs show measurably worse recall for mid-context content; critical facts belong at the start or end.

Assuming large windows eliminate the problem

1M-token windows exist but every token costs money, latency scales with context length, and relevance degrades with noise. Large windows are headroom, not a strategy.

No prompt injection defense

Accepting user content directly into the context without sanitization. Adversarial inputs can override system instructions (prompt injection), a well-documented security concern in production AI systems.

Design Tradeoffs

DimensionSliding Window / Rolling TruncationSummarizationRetrieval (RAG)
Implementation costLowMediumHigh
Information lossHigh (drops exact content)Medium (lossy compression)Low (preserves source)
Latency addedNoneMedium (summarization call)Medium (embedding + search)
Best forShort conversationsLong conversationsDocument-heavy workflows

Best Practices

Order context for KV cache: system prompt → retrieved documents → conversation history → current user message. Static content must come first.
Run pre-flight token counting (e.g., tiktoken for OpenAI, tokenizer APIs) before every LLM call to catch overages early.
Put the most critical instructions at the top of the system prompt and repeat key constraints near the end — exploit primacy and recency effects.
Use semantic chunking (split on meaning boundaries, not fixed character counts) for cleaner summarization and retrieval.
Implement a memory tier: verbatim recent turns → rolling summary → vector retrieval → eviction. Never jump straight from verbatim to eviction.
Monitor instruction adherence in production with an eval layer; context drift is operationally silent until it causes visible failures.

When to Use / Avoid

Use WhenAvoid When
Multi-turn or agentic workflows span more than 3–4 interactionsSingle-turn, stateless LLM calls with no history dependency
Token cost is a critical operational metricThe task is purely batch/offline with no latency sensitivity
Instruction adherence must hold across a long sessionContext is always short and fits comfortably in one prompt
External data (CRM, docs, KB) must be dynamically injectedThe system prompt is the only context and never changes