KV Cache Explained: Prefill, Decode, Context Windows & Why LLMs Need So Much Memory

The invisible memory consumer behind first-token latency, slow long chats, and out-of-memory crashes, explained from first principles.

June 2026 · 16 min read · By MortalApps

Imagine you have just purchased a high-end GPU with 24GB of VRAM. You load a well-optimized 8B model at 16-bit precision; the weights take up about 16GB, leaving 8GB to spare. Everything runs smoothly. Then you paste in a long PDF, or continue an hour-long coding session with hundreds of lines of context. Your system freezes. The terminal outputs a fatal error: Out of Memory. Your application crashes.

You did not load any new model weights. The neural network did not grow larger. So where did those 8 gigabytes go?

The answer is the KV cache, the most misunderstood component in LLM inference. Understanding it explains nearly every puzzling behavior in local AI: why the first token takes longer than expected, why long conversations gradually slow to a crawl, and why context windows have physical memory limits rather than just software ones.

TL;DR

Prefill: the model reads your entire prompt before generating anything. The longer the prompt, the longer you wait for the first word.
Decode: the model generates one token at a time. Speed here depends on how fast your hardware can move data, not how fast it can calculate.
KV Cache: a temporary memory store that saves the model's understanding of past tokens so it does not have to re-read the whole conversation on every step. It grows as you chat.
For short conversations, the model weights use most of your memory. For long ones, the KV cache takes over, and that is what causes crashes mid-conversation.

Table of Contents

What Is the KV Cache?
The Real Cost of Generating a Token
Prefill vs Decode
Why the First Token Is Slow
What Is Actually Stored in the KV Cache
KV Cache vs Model Weights
Why LLMs Need So Much Memory
Context Window Memory Costs
Why Long Conversations Become Slower
Flash Attention vs KV Cache
Prefill vs Decode by Use Case
KV Cache Compression
Why This Matters for Local AI
Visual Walkthrough
Common Misconceptions
FAQ

What Is the KV Cache?

The KV (Key-Value) Cache is a dynamic, temporary memory bank that LLMs use to speed up text generation. As a model processes your prompt and generates a response, it performs billions of calculations to work out how every token relates to every other token, a mechanism called self-attention.

Without optimization, the model would recalculate these relationships from scratch for every new token it generates. For a conversation with 5,000 tokens, generating the 5,001st token would require recomputing the relationships across all 5,000 previous tokens. That is O(N²) work for every single generated token, which would make practical AI inference prohibitively expensive.

The KV cache solves this by saving the intermediate results (the "Keys" and "Values" from the attention mechanism) in GPU memory. Instead of recomputing history, the model looks it up. The tradeoff: speed at the cost of VRAM.
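
To make the saving concrete, here is a rough counting sketch. It tallies how many per-token Key/Value projections are computed while generating a response, with and without a cache. The function names and the counting model are illustrative assumptions; real engines have many other costs.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every step re-projects K and V for the whole sequence seen so far."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step  # tokens visible when generating this step
    return total


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The prompt is projected once; each later step projects one new token."""
    return prompt_len + max(new_tokens - 1, 0)


# Hypothetical workload: 5,000-token conversation, 100 new tokens.
without = kv_projections_without_cache(5_000, 100)
with_cache = kv_projections_with_cache(5_000, 100)
print(f"without cache: {without:,} projections")
print(f"with cache:    {with_cache:,} projections")
```

The gap widens as either the prompt or the response gets longer, which is why every practical inference engine caches Keys and Values.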

The KV cache is the model's working memory. The weights are its education. Both live in VRAM, but only one of them grows with every token in the conversation.

The Real Cost of Generating a Token

A language model is not a brain that reads sentences. It is a deterministic mathematical engine that performs massive matrix multiplications to predict one token at a time. Here is what happens for every single token generated:

Tokenization: Your text is split into tokens (subword units, each assigned a numeric ID). "Hello, world" becomes [Hello] [,] [world].
Embeddings: Each token ID is mapped to a high-dimensional vector representing its meaning in the model's learned concept space.
Transformer Layers & Self-Attention: Inside each layer, every token computes how much "attention" it should pay to every other token. The word "bank" in "the bank of the river" learns to attend strongly to "river" rather than "money".
Output Distribution: After all layers, the model produces a probability score over its entire vocabulary (tens to hundreds of thousands of tokens) and picks the next token, either the highest-scoring candidate (greedy decoding) or a sample drawn from the distribution.

Apart from tokenization, which runs once on the input, this pipeline runs once per generated token. A 500-word response requires hundreds of iterations. The KV cache exists to make that repetition practical.
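
The final step of the pipeline can be sketched in a few lines. This is a minimal illustration of turning raw scores (logits) into probabilities and picking the top candidate; the three-word vocabulary and the scores are made up, and real engines usually sample rather than always taking the maximum.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits, vocab):
    """Return the highest-probability token and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best], probs[best]


# Hypothetical scores for the token following "the bank of the ...".
vocab = ["river", "money", "loan"]
token, prob = greedy_pick([2.0, 0.5, 0.1], vocab)
print(token, round(prob, 3))
```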

Prefill vs Decode

LLM inference has two fundamentally distinct phases with different hardware bottlenecks. Nearly every performance behavior (slow startup, sluggish generation, OOM errors at long context) traces back to understanding how these phases interact.

Figure: The LLM inference pipeline. Prefill processes the full prompt in parallel; Decode generates tokens sequentially using the KV cache.

Phase 1

Prefill: The Reading Phase

When you submit a prompt, the model processes all tokens simultaneously. GPUs thrive on parallel operations, so compute utilization spikes toward 100%. The goal is twofold: understand the full context of your prompt, and produce the very first output token. Because all prompt tokens are processed in parallel, Prefill is compute-bound; its speed is limited by the GPU's raw TFLOPS, not its memory bandwidth.
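
A back-of-the-envelope estimate shows why Prefill time tracks TFLOPS. The sketch below assumes the common rule of thumb of roughly 2 floating-point operations per parameter per token, and it ignores the quadratic attention term, so treat it as a lower bound. The hardware figures and the utilization factor are hypothetical.

```python
def estimate_prefill_seconds(params, prompt_tokens, peak_tflops, utilization=0.5):
    """Rough Prefill time using ~2 FLOPs per parameter per token (an assumption)."""
    flops_needed = 2.0 * params * prompt_tokens
    flops_per_second = peak_tflops * 1e12 * utilization
    return flops_needed / flops_per_second


# Hypothetical setup: 8B parameters, 10,000-token prompt, 100 TFLOPS peak.
ttft_seconds = estimate_prefill_seconds(8e9, 10_000, 100.0)
print(f"Estimated Prefill time: {ttft_seconds:.1f} s")
```

Doubling the prompt length doubles this estimate; doubling the usable TFLOPS halves it. Memory bandwidth does not appear in the formula at all.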

Phase 2

Decode: The Writing Phase

Once the first token is produced, the model enters Decode. Language models generate autoregressively, one token at a time, each depending on all previous tokens. Within a single sequence, this phase cannot be parallelized across tokens. To generate each new token, the GPU must load the full model weights from VRAM into its compute cores, perform a small amount of math, and read the KV cache to calculate attention over all prior tokens. Because of this constant high-volume data movement, Decode is memory-bandwidth-bound; its speed is limited by how fast VRAM can transfer data (GB/s), not by TFLOPS.

Why the First Token Is Slow

The noticeable pause after hitting Enter, before the AI starts typing, is called Time to First Token (TTFT). It is a direct consequence of the Prefill bottleneck.

During Prefill, every token must mathematically compare itself against every previous token. The workload scales quadratically:

Prompt Length	Scenario	Prefill Complexity	User Experience
100 tokens	Short question	~10,000 comparisons	Instant first token
1,000 tokens	Code snippet or document	~1 million comparisons	Noticeable pause (~1s)
10,000 tokens	Long PDF or codebase	~100 million comparisons	Visible wait (several seconds)
100,000 tokens	Full book or large repo	~10 billion comparisons	Long processing delay

This is why RAG (Retrieval-Augmented Generation) architectures retrieve only the most relevant chunks rather than feeding entire documents into the prompt, reducing Prefill cost dramatically.
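
The comparison counts in the table above are simply the prompt length squared. A tiny sketch reproduces them; it counts token-to-token comparisons for a single attention head in a single layer, and it ignores causal masking, which roughly halves the real number.

```python
def prefill_comparisons(prompt_tokens: int) -> int:
    """Token-to-token comparisons for one head in one layer (upper bound)."""
    return prompt_tokens * prompt_tokens


for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7,} tokens -> {prefill_comparisons(n):,} comparisons")
```

A 10x longer prompt means 100x more comparisons, which is the whole argument for retrieving only the relevant chunks.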

What Is Actually Stored in the KV Cache

Inside each transformer layer, the self-attention mechanism uses three components, Query (Q), Key (K), and Value (V), to determine how tokens should relate to one another.

Think of a conference room. You are searching for a database expert (your Query). Everyone wears a name tag listing their expertise (their Key). When your Query matches someone's Key, you listen to their contribution (their Value).

When the model processes a token, it computes Q, K, and V for that token. Here is the key insight: the Keys and Values of past tokens never change. Only the new token needs fresh Q, K, and V vectors. So the model stores the K and V vectors of every prior token in VRAM, appends the new token's K and V, and lets the new Query look up against that growing database. This is the KV cache.
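
The lookup described above can be sketched as a toy single-head cache in plain Python. The class name, the two-dimensional vectors, and their values are all made up for illustration; real engines store large tensors per layer and per head on the GPU.

```python
import math


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class KVCache:
    """Grows by one (key, value) pair per processed token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = KVCache()
# Hypothetical K and V vectors for three already-processed tokens.
cache.append([1.0, 0.0], [10.0, 0.0])
cache.append([0.0, 1.0], [0.0, 10.0])
cache.append([1.0, 1.0], [5.0, 5.0])

# A new token arrives: only its own Q, K, and V are computed.
new_q, new_k, new_v = [1.0, 0.0], [1.0, 0.0], [10.0, 0.0]
cache.append(new_k, new_v)  # past entries are reused untouched
output = attend(new_q, cache.keys, cache.values)
print(len(cache), [round(x, 3) for x in output])
```

Nothing about the three earlier tokens was recomputed; the new Query simply read their stored Keys and Values.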

Engineering Note: Grouped Query Attention (GQA)

In older transformer architectures, every attention head stored its own unique Key and Value vectors, causing cache size to scale with the number of heads. Modern models use Grouped Query Attention (GQA), where multiple Query heads share a single Key-Value head. Llama 3, Qwen 3, and most current models use GQA, reducing KV cache memory by up to 8x compared to the original Multi-Head Attention design. This is why modern 8B models have manageable cache requirements despite having many attention heads.
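
The grouping can be sketched as a simple index mapping. The query-head counts below (32 for Llama 3 8B, 64 for Llama 3 70B) come from the published model configurations rather than from this article, so treat them as assumptions; the helper names are hypothetical.

```python
def gqa_cache_reduction(n_query_heads: int, n_kv_heads: int) -> float:
    """How many times smaller the cache is than with one KV head per query head."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    return n_query_heads / n_kv_heads


def kv_head_for_query_head(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the shared KV head that a given query head reads from."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


print(gqa_cache_reduction(32, 8))  # assumed Llama 3 8B layout
print(gqa_cache_reduction(64, 8))  # assumed Llama 3 70B layout
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])
```

With 64 query heads sharing 8 KV heads, each KV head serves a group of 8 query heads, which is where the "up to 8x" figure comes from.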

KV Cache vs Model Weights

Confusing these two is the most common mistake in LLM memory planning. They are fundamentally different in nature:

Feature	Model Weights	KV Cache
Purpose	Long-term knowledge and reasoning	Active working memory for the current conversation
Nature	Static, read-only after training	Dynamic, grows with every token generated
Size	Fixed (e.g., about 16GB for an 8B model at 16-bit, about 5GB at 4-bit)	Variable, scales linearly with context length
Shared across users?	Yes, all users share the same weights	No, each conversation has its own unique cache
Analogy	A lifetime of education and expertise	Notes taken on a notepad during a meeting

This distinction matters in production: if you host a 70B model, the weights are shared across all users. But a user uploading a 50,000-word document consumes gigabytes of VRAM for their unique KV cache, while a user asking a one-line question consumes almost nothing. Cache memory scales per-user; weight memory does not.
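
A rough sketch of that per-user accounting follows. It uses the Llama 3 70B per-token cache size derived later in this article; the 40GB weight figure and the conversion of 50,000 words to about 65,000 tokens are assumptions for illustration.

```python
# Per-token cache for Llama 3 70B at 16-bit: K+V x 2 bytes x layers x KV heads x head dim.
BYTES_PER_TOKEN_70B = 2 * 2 * 80 * 8 * 128


def serving_memory_gb(weights_gb, user_context_tokens):
    """Shared weights plus one private KV cache per active conversation."""
    cache_bytes = sum(tokens * BYTES_PER_TOKEN_70B for tokens in user_context_tokens)
    return weights_gb + cache_bytes / 1e9


# Hypothetical load: one user with a ~65,000-token document, one with a 50-token question.
users = [65_000, 50]
total_gb = serving_memory_gb(40.0, users)
print(f"Total: {total_gb:.1f} GB")
for tokens in users:
    print(f"  {tokens:>6,} tokens -> {tokens * BYTES_PER_TOKEN_70B / 1e9:.2f} GB of cache")
```

The weights are counted once no matter how many users connect; every additional conversation adds only its own cache.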

Why LLMs Need So Much Memory

VRAM consumption during inference comes from four distinct sources. For most developers, only the first two matter in practice:

Figure: Where LLM memory goes. Weights are the fixed cost; the KV cache is the variable one that grows with every token.

Model Weights: The AI's permanent knowledge. A fixed cost determined by model size and quantization. A 32B model at 4-bit uses roughly 20GB regardless of conversation length.
KV Cache: The AI's active working memory. Dynamic, scales linearly with the number of tokens in your context window. This is what causes OOM errors mid-conversation.
Activations: Temporary tensors used during the forward pass. These spike massively during Prefill (when all prompt tokens are processed simultaneously) and then largely disappear during Decode.
Runtime Overhead: Memory consumed by the inference engine (llama.cpp, vLLM, Ollama) for CUDA contexts, memory fragmentation buffers, and batching infrastructure. Typically 0.5–2GB depending on the engine.

The Rule Most Developers Miss

For short contexts, model weights dominate your memory profile. For long contexts, the KV cache dominates. In production inference systems serving many concurrent users with large contexts, the KV cache frequently becomes the single largest memory consumer, larger than the model weights themselves.
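
You can estimate where the crossover happens by dividing the weight size by the per-token cache size. The sketch below uses the 131,072 bytes-per-token figure for Llama 3 8B from the next section; the weight sizes are approximate and the function name is hypothetical.

```python
import math


def crossover_tokens(weights_bytes: float, kv_bytes_per_token: int) -> int:
    """Smallest context length at which the KV cache is at least as big as the weights."""
    return math.ceil(weights_bytes / kv_bytes_per_token)


KV_BYTES_PER_TOKEN_8B = 131_072  # Llama 3 8B at 16-bit, from this article

print(crossover_tokens(16e9, KV_BYTES_PER_TOKEN_8B))  # ~16GB weights at 16-bit
print(crossover_tokens(5e9, KV_BYTES_PER_TOKEN_8B))   # ~5GB weights at 4-bit
```

Note the side effect of weight quantization: shrinking the weights moves the crossover point earlier, so the cache starts to dominate at a much shorter context.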

Context Window Memory Costs

Every token in your prompt and every token the model generates adds a new Key-Value entry to the cache. The context window limit is not an arbitrary software setting; it reflects the physical memory available for the cache to grow into.

Figure: KV cache memory grows linearly with context length. At 128K tokens, the 16-bit cache for an 8B model exceeds the model weights themselves.

How Much Memory Does One Token Use?

For Llama 3 architecture models using GQA and 16-bit precision:

Model	Layers	KV Heads	Head Dim	Per-Token Cache
Llama 3 8B	32	8	128	~131 KB
Llama 3 70B	80	8	128	~328 KB

The formula: 2 (K+V) × 2 bytes (fp16) × layers × kv_heads × head_dim. For Llama 3 8B: 2 × 2 × 32 × 8 × 128 = 131,072 bytes per token.
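
The formula translates directly into code. This is a minimal sketch with a hypothetical function name; the layer, head, and dimension values are the ones from the table above.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: one K and one V vector per layer per KV head."""
    return 2 * bytes_per_value * layers * kv_heads * head_dim


llama3_8b = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
llama3_70b = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(f"Llama 3 8B:  {llama3_8b:,} bytes per token")
print(f"Llama 3 70B: {llama3_70b:,} bytes per token")
```

Passing bytes_per_value=1 models an 8-bit cache and halves both results.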

How the Cache Scales with Context

Tokens	Scenario	Llama 3 8B Cache	Llama 3 70B Cache
8K	Short coding session	~1.1 GB	~2.7 GB
32K	Long document or extended chat	~4.3 GB	~10.7 GB
50K	Short book or large codebase	~6.6 GB	~16.4 GB
128K	Full book or enterprise knowledge base	~17.2 GB	~42.9 GB

At 128K context, the Llama 3 8B KV cache (17GB) exceeds the model weights themselves. Running Llama 3 70B at full 128K context requires budgeting over 40GB of VRAM purely for conversation memory.
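
The scaling table can be recomputed from the per-token figures. The sketch assumes that 8K, 32K, and 128K mean multiples of 1,024 tokens while 50K means 50,000 tokens, and that 1 GB is 10^9 bytes; those conventions are my reading of the table, not something the article states.

```python
KV_BYTES_PER_TOKEN = {
    "Llama 3 8B": 2 * 2 * 32 * 8 * 128,
    "Llama 3 70B": 2 * 2 * 80 * 8 * 128,
}

CONTEXTS = {"8K": 8 * 1024, "32K": 32 * 1024, "50K": 50_000, "128K": 128 * 1024}


def cache_size_gb(tokens: int, bytes_per_token: int) -> float:
    """Total 16-bit KV cache size in GB (10^9 bytes) for a given context length."""
    return tokens * bytes_per_token / 1e9


for label, tokens in CONTEXTS.items():
    sizes = ", ".join(
        f"{model}: {cache_size_gb(tokens, per_token):.1f} GB"
        for model, per_token in KV_BYTES_PER_TOKEN.items()
    )
    print(f"{label:>4} -> {sizes}")
```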

What Context Window Can Your Hardware Handle?

Practical context limits when running a quantized 8B-class model (~5GB weights) and reserving room for the KV cache:

Hardware	Practical Context Window	Typical Use Case
16GB RAM / VRAM	4K – 8K tokens	General chat, short coding assistance
32GB RAM / VRAM	8K – 32K tokens	RAG pipelines, long document summarization
64GB unified memory	32K – 64K tokens	Multi-document analysis, large codebases
128GB+ workstation	64K – 128K+ tokens	Full-scale book analysis, enterprise serving

Three Variables That Change These Numbers

GQA adoption: Models without GQA consume cache memory significantly faster than the figures above.
Weight quantization: Running a 4-bit model instead of 16-bit frees more VRAM headroom for the cache.
KV cache quantization: Enabling 8-bit or 4-bit cache compression can double or quadruple these practical limits.
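
For your own hardware, a theoretical ceiling is easy to compute: subtract weights and overhead from total memory and divide by the per-token cache size. The sketch below is an optimistic upper bound; it ignores activation spikes during Prefill and the memory used by the operating system and other applications, which is presumably why the practical ranges in the table above are more conservative. All input values are hypothetical.

```python
def max_context_tokens(total_gb, weights_gb, overhead_gb, kv_bytes_per_token):
    """Upper bound on cached tokens that fit in the memory left after weights and overhead."""
    free_bytes = (total_gb - weights_gb - overhead_gb) * 1e9
    if free_bytes <= 0:
        return 0
    return int(free_bytes // kv_bytes_per_token)


KV_BYTES_PER_TOKEN_8B = 131_072  # Llama 3 8B at 16-bit, from this article

# Hypothetical 16GB machine, ~5GB quantized weights, 1.5GB runtime overhead.
ceiling = max_context_tokens(16, 5, 1.5, KV_BYTES_PER_TOKEN_8B)
print(f"Theoretical ceiling: {ceiling:,} tokens")
```

Treat the result as the point where you are guaranteed to run out, not as a safe operating target.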

Why Long Conversations Become Slower

A 20-message chat feels snappy. A 500-message chat feels sluggish. The mechanism is straightforward once you understand Decode phase dynamics.

During Decode, token generation speed is bounded by memory bandwidth. To produce each new token, the GPU must not only load the model weights but also stream the entire KV cache through its compute cores to calculate attention scores over all prior tokens.

As your conversation grows to 30,000 tokens, the KV cache might occupy 10GB (roughly the 16-bit figure for a 70B-class model with GQA). To generate each token, the GPU must read that 10GB across the memory bus. At 20 tokens per second, that means 200GB of cache reads every second on top of the weight reads, which approaches the physical memory bandwidth ceiling of many GPUs. Token generation speed measurably degrades as the cache grows.

Token generation is not bottlenecked by how fast your GPU thinks. It is bottlenecked by how fast it can read memory.
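
A simple bandwidth ceiling captures this effect. The sketch assumes every generated token requires reading all the weights and the whole cache exactly once, and it uses a hypothetical 1,000 GB/s memory bandwidth; real throughput will be lower.

```python
def decode_tokens_per_second_ceiling(bandwidth_gb_s, weights_gb, cache_gb):
    """Upper bound on Decode speed when each token reads weights plus cache once."""
    return bandwidth_gb_s / (weights_gb + cache_gb)


# Hypothetical GPU with 1,000 GB/s of bandwidth and ~5GB of quantized weights.
for cache_gb in (0.0, 2.0, 5.0, 10.0):
    ceiling = decode_tokens_per_second_ceiling(1000.0, 5.0, cache_gb)
    print(f"cache {cache_gb:>4.1f} GB -> at most {ceiling:.0f} tokens/sec")
```

In this example a 10GB cache cuts the ceiling to a third of its empty-cache value, with no change to the model or the hardware.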

Flash Attention vs KV Cache

These two optimizations are often confused because both relate to the attention mechanism. They solve entirely different problems:

Optimization	What It Solves	Phase Affected	What It Does
Flash Attention	Activation memory explosion during Prefill	Prefill	Standard attention materializes an N×N matrix in VRAM (32K context = 32,000×32,000 entries, instant OOM). Flash Attention computes attention in small SRAM blocks, avoiding this. It saves computation memory, not cache memory.
KV Cache	Redundant recomputation during Decode	Decode	Stores Keys and Values from past tokens so the model does not re-run attention over them on every step. It saves compute time at the cost of persistent VRAM.

Flash Attention and KV Cache complement each other: Flash Attention makes long Prefill feasible, KV Cache makes long Decode practical. Modern inference engines use both.
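
The core trick behind block-wise attention is an "online" softmax: process the Keys in small blocks while keeping a running maximum, a running denominator, and a running weighted sum, so the full score matrix never has to exist. The sketch below shows this for a single query in plain Python and checks it against the naive version. It is not the actual Flash Attention kernel, which is a fused GPU implementation that tiles both Queries and Keys; the 1/sqrt(d) scaling is omitted for brevity and all data is made up.

```python
import math


def naive_attention(query, keys, values):
    """Reference version: materializes every score before normalizing."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def blockwise_attention(query, keys, values, block_size=2):
    """Same result, but only one block of scores is alive at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [sum(q * k for q, k in zip(query, key)) for key in block_keys]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, block_values))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


query = [0.3, 0.7]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
result_naive = naive_attention(query, keys, values)
result_block = blockwise_attention(query, keys, values, block_size=2)
print(result_naive, result_block)
```

Both functions return the same numbers up to floating-point rounding; only the peak temporary memory differs.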

Prefill vs Decode by Use Case

Depending on what you are building, the hardware bottleneck shifts entirely. Optimizing the wrong phase wastes infrastructure spend:

Use Case 1

Chat Applications

Users care primarily about Decode throughput; text should stream smoothly at a readable pace. A fast Time to First Token improves perceived responsiveness, but sustained streaming speed is what defines the user experience. Optimize for memory bandwidth.

Use Case 2

RAG Systems

You feed the model a large document (50K-token Prefill) and ask for a short summary (100-token Decode). The Decode phase is nearly free. Prefill throughput is the absolute bottleneck. Optimize for TFLOPS and consider chunked prefill strategies.

Use Case 3

Coding Assistants

Tools like Copilot send background requests containing large amounts of surrounding code (long Prefill) and need autocomplete suggestions immediately (fast Decode). Both phases must be fast. This is the hardest infrastructure profile to optimize for simultaneously.

Use Case 4

Enterprise Inference Servers

Systems like vLLM use continuous batching (in-flight batching) to overlap work. Because Prefill is compute-bound and Decode is memory-bound, the engine can process User A's Prefill while simultaneously running User B's Decode, maximizing GPU utilization. This requires careful tradeoff management between latency, throughput, and cache memory allocation.

KV Cache Compression

Because the KV cache grows so large, AI researchers have developed dedicated compression techniques. Note that standard model quantization (GGUF, AWQ, GPTQ) compresses model weights; it does not compress the KV cache, which defaults to 16-bit precision unless explicitly configured otherwise.

KV Cache Quantization

The most practical technique for local AI users. By reducing the numerical precision of stored Keys and Values:

Method	Memory Reduction	Quality Impact	Availability
8-bit KV Cache (e.g., q8_0, FP8)	~50% reduction	Negligible	llama.cpp `-ctk q8_0 -ctv q8_0`
4-bit KV Cache (e.g., q4_0, INT4)	~75% reduction	Minor on most tasks	llama.cpp, vLLM, LM Studio (support varies by engine and version)
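
To see what "reducing numerical precision" means in practice, here is a simplified sketch of symmetric 8-bit block quantization: each block of values is stored as small integers plus one shared scale. It is written in the spirit of q8_0-style formats, assuming blocks of 32 values with a 16-bit scale, but it is not llama.cpp's actual implementation, and the input values are made up.

```python
def quantize_block(values):
    """Symmetric 8-bit quantization of one block: integer codes plus one scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 127.0 if peak > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate values from the integer codes."""
    return [c * scale for c in codes]


block = [0.02 * i - 0.3 for i in range(32)]  # hypothetical K or V values
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(block, restored))

fp16_bytes = 2 * len(block)  # 16-bit storage
q8_bytes = len(block) + 2    # one byte per code plus an assumed 16-bit scale
print(f"worst error: {worst_error:.6f} (scale {scale:.6f})")
print(f"{fp16_bytes} bytes -> {q8_bytes} bytes per block")
```

The worst-case error is bounded by half the scale, which is why 8-bit caches usually cost little quality. The per-block scale is also why the real saving is slightly under 50%.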

TurboQuant

A more advanced technique from Google DeepMind, presented at ICLR 2026. TurboQuant uses a two-step approach: first, a randomized Hadamard transform spreads outlier values across dimensions, making quantization easier. Second, the Quantized Johnson-Lindenstrauss (QJL) transform removes the bias this introduces. The result is KV cache compression to 3-bit precision with near-zero accuracy loss, achieving roughly 6x memory reduction, making 128K+ context windows practical on consumer hardware that previously could not support them. Early llama.cpp community implementations are already available. See our detailed TurboQuant breakdown for a full explanation.

KV Compression in Practice

Without KV cache compression, running a 128K context window on a 24GB GPU is out of reach for most models: the 16-bit cache of an 8B model alone takes about 17GB at that length, before counting the weights. With 8-bit KV cache quantization enabled, that same GPU can handle roughly twice the context length. With TurboQuant-style 3-bit compression, the limit extends further still, at the cost of slightly more complex configuration.
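
The effect of bit width on a 128K-token cache can be worked out directly. The sketch uses the Llama 3 8B figures from this article and ignores per-block scale metadata, so real compressed sizes will differ slightly from these idealized numbers.

```python
def cache_gb(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Idealized KV cache size in GB for a given storage precision."""
    values_stored = 2 * layers * kv_heads * head_dim * tokens  # K and V
    return values_stored * bits_per_value / 8 / 1e9


TOKENS_128K = 128 * 1024

for bits in (16, 8, 4, 3):
    size = cache_gb(TOKENS_128K, layers=32, kv_heads=8, head_dim=128, bits_per_value=bits)
    print(f"{bits:>2}-bit cache at 128K tokens: {size:.2f} GB")
```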

Why This Matters for Local AI

If you run models locally with Ollama, llama.cpp, LM Studio, or Open WebUI, the KV cache directly determines your stability, speed, and maximum context length.

The "Slow Death" Crash

You loaded a model that just fits your GPU. As the conversation grows, the KV cache pushes memory usage over the edge. Depending on your setup, the driver or inference engine spills the overflow into system RAM. Speed drops from 30 tokens/sec to 1 token/sec. The fix: always leave 20–30% of VRAM empty after loading a model; that headroom is for your KV cache to grow into.

Ollama: The num_ctx parameter in your Modelfile controls the pre-allocated context window size. Setting it to 128K on a 16GB machine will typically cause an OOM error or heavy offloading to system RAM. Match this value to your hardware. For 16GB, 4K–8K is a safe starting point for most models.
llama.cpp: The -c flag sets the context size in tokens, which determines the pre-allocated KV cache size. Enable 8-bit KV caching with -ctk q8_0 -ctv q8_0 (depending on the version, quantizing the V cache may also require Flash Attention to be enabled with -fa); this is one of the highest-impact single-flag optimizations available and is particularly valuable on MacBooks and 24GB GPUs.
LM Studio: Provides a "Context Length" slider and options to enable Flash Attention and KV cache quantization. Do not slide context length to the maximum without checking your available headroom; the UI does not warn you before an OOM crash.
General rule: If you experience a model that starts fast but degrades over a long session, the KV cache is the culprit. Either reduce num_ctx, enable cache quantization, or use a smaller model to free more headroom.

Visual Walkthrough: One Prompt, Start to Finish

Here is what actually happens when you type: "Write a poem about a cybernetic forest."

T = 0ms: Input

You press Enter. The text is tokenized into a sequence of numeric IDs and converted to embedding vectors.

T = 50ms: Prefill Begins

All prompt tokens hit the GPU simultaneously. Compute utilization spikes toward 100%. Q, K, and V vectors are calculated for every token in the prompt in parallel.

T = 100ms: KV Cache Created

The GPU stores the K and V matrices for all prompt tokens in VRAM. The KV cache is born. It will remain in memory until you clear the conversation or the session ends.

T = 500ms: First Token Generated

"Neon" appears on screen. Prefill ends. The GPU was busy computing for ~450ms before you saw anything; this is your TTFT.

T = 510ms: Decode Begins

To generate the next word, the model computes Q, K, and V for "Neon" only. It compares the new Query against the Keys stored in the KV cache to determine which prior tokens to attend to.

T = 550ms: Cache Grows

The model outputs "leaves". The K and V vectors for "Neon" are appended to the cache. The cache is now slightly larger. This cycle repeats for every subsequent token.

Response Complete

The full poem has been generated. The KV cache now contains the Keys and Values for every token in the prompt and response, consuming VRAM until the session is cleared.
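
The walkthrough can be replayed as a small simulation that tracks how many tokens are cached after Prefill and after each Decode step. The 9-token prompt length is a guess for the example sentence, and the per-token size is the Llama 3 8B figure used elsewhere in this article.

```python
def simulate_cache_growth(prompt_tokens, decode_steps, kv_bytes_per_token):
    """Yield (step, cached_tokens, cache_bytes) after Prefill and each Decode step."""
    cached = prompt_tokens  # Prefill stores K and V for the whole prompt at once
    yield 0, cached, cached * kv_bytes_per_token
    for step in range(1, decode_steps + 1):
        cached += 1  # each Decode step appends one token's K and V
        yield step, cached, cached * kv_bytes_per_token


# Hypothetical: 9 prompt tokens, then 3 Decode steps.
timeline = list(simulate_cache_growth(9, 3, 131_072))
for step, cached, size in timeline:
    phase = "prefill" if step == 0 else f"decode {step}"
    print(f"{phase:>9}: {cached} tokens cached, {size / 1e6:.2f} MB")
```

The cache only ever grows during a conversation; nothing in this loop removes an entry until the session is cleared.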

Common Misconceptions

Misconception 1

"More VRAM automatically makes a model smarter."
VRAM capacity determines your maximum context window; it controls how much conversation history the model can hold in its KV cache. Intelligence is determined by the model weights. More VRAM means longer memory, not deeper reasoning.

Misconception 2

"The context window limit is just a software setting I can increase freely."
Context length is a physical memory constraint. Setting num_ctx=128000 on a 16GB machine asks for more KV cache than the available VRAM can hold, which typically ends in a crash or in a severe slowdown from offloading. The limit is real, not arbitrary.

Misconception 3

"Quantizing my model weights will fix my OOM errors during long chats."
Weight quantization shrinks the model footprint at load time. If the OOM error occurs mid-conversation as context grows, the KV cache is the cause, not the weights. You need KV cache quantization (-ctk q8_0 -ctv q8_0 in llama.cpp) or a reduced context window, not weight compression.

Frequently Asked Questions

What is the KV cache in simple terms?

The KV cache is a temporary memory bank where an LLM saves its intermediate mathematical results (the "Keys" and "Values" of the attention mechanism) for every past token. By storing these, the model avoids recomputing the entire conversation history each time it generates a new word, trading VRAM for speed.

What is the difference between Prefill and Decode in LLM inference?

Prefill is the initial reading phase where the model processes your entire prompt in parallel; it is compute-bound (limited by TFLOPS) and determines Time to First Token. Decode is the generation phase where the model produces one token at a time using the KV cache; it is memory-bandwidth-bound (limited by GB/s) and determines how fast text streams to your screen.

Why is the first token slow?

Before generating any output, the model must process your entire prompt simultaneously during Prefill. Every token compares itself against every other token, so the work scales quadratically. A 10,000-token prompt involves on the order of 100 million token-to-token comparisons, creating a noticeable delay before the first word appears. This is called Time to First Token (TTFT).

Which consumes more memory: model weights or the KV cache?

For short conversations, model weights dominate. For long contexts, the KV cache often overtakes them. A Llama 3 8B model at 128K context requires roughly 17GB for the KV cache alone, more than the model weights. In production inference systems with many concurrent users, the cache is frequently the single largest memory consumer.

Why does VRAM usage increase during a conversation?

Every token you send and every token the model generates gets appended to the KV cache as a new Key-Value pair. The longer the conversation, the physically larger the cache grows in VRAM. Once the cache fills your remaining VRAM, the system either crashes or falls back to much slower system RAM offloading.

Why does a model fit in memory initially but crash later?

You successfully loaded the Model Weights into VRAM but left insufficient headroom for the KV Cache to grow. The weights are a fixed cost; the cache is dynamic. As the conversation lengthens, the cache fills the remaining VRAM until the system crashes. Always leave 20–30% of VRAM empty after loading a model.

What is Grouped Query Attention (GQA) and why does it matter?

GQA is an architectural optimization where multiple Query heads share a single Key and Value head. In older Multi-Head Attention models, every head stored its own K-V pairs, causing the cache to explode in size. GQA reduces the number of stored K-V pairs by up to 8x with minimal quality loss. Most modern models including Llama 3, Qwen 3, and Mistral use GQA, which is a major reason their cache requirements are manageable.

What is the difference between Flash Attention and KV Cache?

They solve different problems. Flash Attention reduces the temporary activation memory used during Prefill by computing attention in blocks rather than materializing a full N×N matrix. KV Cache reduces redundant recomputation during Decode by storing and reusing past Key-Value pairs. Flash Attention saves Prefill memory; KV Cache saves Decode compute.

Why does Ollama become slower during long chats?

As the KV cache grows, the GPU must load the entire cache across its memory bus to calculate attention for each new token. A 30,000-token conversation might produce a 10GB KV cache. At 20 tokens per second, the GPU must read that 10GB cache 20 times every second, quickly saturating memory bandwidth and causing measurable slowdowns.

Does quantization reduce KV cache size?

Standard weight quantization (GGUF, AWQ, GPTQ) shrinks model weights but does not affect the KV cache, which defaults to 16-bit precision. To reduce cache size you must explicitly enable KV cache quantization: 8-bit halves the memory requirement, 4-bit reduces it by 75%. In llama.cpp, use -ctk q8_0 -ctv q8_0 to enable 8-bit KV caching.

Can KV cache be stored in system RAM instead of VRAM?

Yes, this is called offloading. However, system RAM bandwidth is far slower than GPU VRAM, which means the memory bus bottleneck during Decode becomes dramatically worse. A model generating 30 tokens per second from VRAM may drop to 1–3 tokens per second when the cache is offloaded to system RAM.

Does clearing a chat delete the KV cache?

Yes. The KV cache is tied to your active conversation session. When you start a new chat or clear the context, the VRAM allocated to that cache is freed. This is why starting a fresh conversation often immediately improves performance after a long sluggish session.

What is TurboQuant and how does it help?

TurboQuant is a KV cache compression technique from Google DeepMind, presented at ICLR 2026. It applies a randomized Hadamard transform followed by the Quantized Johnson-Lindenstrauss (QJL) method to compress the cache to 3-bit precision with near-zero accuracy loss, for a roughly 6x memory reduction. This makes very long context windows (128K+) practical on consumer hardware. Early llama.cpp community implementations are already available.

Key Takeaways

First-token latency is compute-bound. Prefill requires massive parallel matrix multiplications to process your prompt. Longer prompts mean longer waits.
Token generation is memory-bandwidth-bound. Decode speed is limited by how fast your GPU can read the model weights and the KV cache across its memory bus, not by TFLOPS.
Context length equals physical memory. Every token adds to the KV cache. Exhausting VRAM with a bloated cache is the primary cause of mid-conversation crashes.
Weight quantization and KV cache quantization are different. Quantizing your model weights frees load-time memory. To survive long contexts, you need KV cache quantization separately.
GQA and KV compression are what make long contexts practical on consumer hardware. Without them, 128K context windows would require server-grade infrastructure.

Disclaimer

The technical descriptions, memory estimates, and performance figures in this article reflect our understanding at the time of writing. LLM architectures and inference tooling evolve rapidly. Always consult the latest documentation for your specific model and inference engine before making infrastructure decisions.
