Transformer Systems

Multi-Query Attention (MQA)

Multi-Query Attention (MQA) is the extreme variant of Grouped-Query Attention (GQA) in which all query heads share a single Key and Value head.

Published June 1, 2026 · By MortalApps · 5 min read · ~854 words

TL;DR

Multi-Query Attention (MQA) is the extreme variant of Grouped-Query Attention in which all query heads share a single Key and Value head (that is, the number of KV heads is 1).
It delivers the largest KV cache reduction that head sharing can achieve, with a compression factor equal to the number of query heads.
While maximizing inference throughput and minimizing VRAM, MQA often suffers from observable capacity degradation in complex, long-context reasoning tasks.
It remains heavily utilized in highly optimized code models and architectures designed for deployment on edge devices with strict memory limits.

Why This Matters

For specialized models deployed on consumer hardware (such as GPUs with 24 GB of VRAM) or mobile devices, memory limits are hard constraints. MQA shrinks the KV cache so aggressively that context lengths can be extended substantially without triggering out-of-memory (OOM) errors. It sits at the extreme end of the memory-versus-quality Pareto frontier, letting system architects serve long context windows on hardware whose memory would otherwise be exhausted by the KV cache.
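To make the context-length claim concrete, here is a minimal sketch that estimates how many tokens fit in a fixed KV cache budget. The 4 GiB budget, the 32-layer / 32-head / 128-dimension configuration, and the 2-byte element size are illustrative assumptions, not figures for any particular model, and the helper name `max_tokens_in_budget` is hypothetical.

```python
def max_tokens_in_budget(budget_bytes, n_layers, n_kv_heads, d_head, bytes_per_elem=2):
    """Return how many tokens of KV cache fit in budget_bytes (batch size 1)."""
    # Each token stores one K and one V vector per KV head, per layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem
    return budget_bytes // per_token_bytes


# Assumed budget: 4 GiB of VRAM left over for the KV cache after loading weights.
budget = 4 * 1024**3

mha_tokens = max_tokens_in_budget(budget, n_layers=32, n_kv_heads=32, d_head=128)
mqa_tokens = max_tokens_in_budget(budget, n_layers=32, n_kv_heads=1, d_head=128)

print(mha_tokens)  # 8192
print(mqa_tokens)  # 262144
```

Under these assumptions the same budget holds 32 times as many tokens with MQA, which matches the number of query heads.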

Core Intuition

If MHA assigns one librarian to every student, and GQA assigns one librarian to a study group, MQA assigns a single librarian to the entire library. The librarian (representing the K and V tensors) only has one conceptual representation of the facts. Every student (Q) must query this exact same representation. While extremely fast and requiring almost no desk space (VRAM) for the librarian, the nuance of the information retrieved can degrade because the K/V projection lacks multidimensional diversity. The model loses the ability to simultaneously view the context from many distinct angles.
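The single-librarian picture can be sketched for one decoding step in plain Python. This is a simplified illustration rather than a production kernel: it omits the learned projections, masking, and batching, and the function names `softmax` and `mqa_attention` are chosen here for illustration. The point it shows is that every query head scores against the same K and V lists.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mqa_attention(queries, keys, values):
    """Scaled dot-product attention with one shared K/V head.

    queries: n_heads vectors of length d_head (one per query head)
    keys, values: seq_len vectors of length d_head (the single shared head)
    Returns n_heads output vectors of length d_head.
    """
    d_head = len(keys[0])
    scale = 1.0 / math.sqrt(d_head)
    outputs = []
    for q in queries:  # every query head reads the SAME keys and values
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_head)]
        outputs.append(out)
    return outputs


# Two query heads, two cached positions, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out = mqa_attention(queries, keys, values)
```

In this toy example the two heads still produce different outputs because their queries differ, but the only diversity comes from the query side: the cache holds one K and one V vector per position regardless of how many query heads exist.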

Technical Deep Dive

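The KV cache size for MQA drops the head-count term entirely. With $L$ layers, head dimension $d_{\text{head}}$, sequence length $s$, and $b$ bytes per element, the formula simplifies to:

$$\text{KV}_{\text{MQA}} = 2 \cdot L \cdot d_{\text{head}} \cdot s \cdot b$$

Notice that the number of heads (query or KV) is entirely absent from the formula; the leading 2 counts the K and V tensors. For a 32-layer model with a 128-dimensional head and 32 query heads, stored at 2 bytes per element (FP16), MQA shrinks the per-token, per-layer KV footprint from 16,384 bytes to just 512 bytes. Across all 32 layers and a span of 1,024 tokens, this drops the memory requirement from 512 MB to a mere 16 MB. This reduction sharply cuts the memory bandwidth spent reading the KV cache at each decoding step, moving the workload closer to a compute-bound state.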
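The arithmetic in this section can be checked with a short script. It is a sketch under one stated assumption: elements are stored at 2 bytes each (FP16 or BF16), which is the element size that reproduces the 512 MB and 16 MB totals. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_elem=2, batch=1):
    """Total KV cache size in bytes; the factor 2 covers both K and V."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_elem * batch


# Per token, per layer: 32 KV heads (MHA) versus 1 KV head (MQA).
mha_per_token_layer = kv_cache_bytes(n_layers=1, n_kv_heads=32, d_head=128, seq_len=1)
mqa_per_token_layer = kv_cache_bytes(n_layers=1, n_kv_heads=1, d_head=128, seq_len=1)

# Full model: 32 layers over 1,024 tokens.
mha_total = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=1024)
mqa_total = kv_cache_bytes(n_layers=32, n_kv_heads=1, d_head=128, seq_len=1024)

print(mha_per_token_layer, mqa_per_token_layer)  # 16384 512
print(mha_total // 1024**2, mqa_total // 1024**2)  # 512 16 (MiB)
```

Setting `n_kv_heads=1` is the only change between the two configurations, so the 32x ratio comes entirely from the number of query heads that share the single KV head.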

Key Takeaways

MQA uses a single KV head for all Query heads, setting the number of KV heads to 1.

It provides a KV cache reduction factor exactly equal to the number of query heads.

Decoding shifts from memory-bound toward compute-bound because the much smaller K/V tensors need far less memory traffic and are more likely to stay resident in L2 cache.

While the industry has largely migrated to GQA for frontier models, MQA remains the gold standard for maximal VRAM compression on edge deployments.
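The relationship between MHA, GQA, and MQA in these takeaways can be expressed as a single parameter: the number of KV heads. The sketch below maps each query head to the KV head it reads from, assuming contiguous grouping of query heads (a common convention, but an assumption here). The name `kv_head_for_query_head` is hypothetical.

```python
def kv_head_for_query_head(q_head, n_heads, n_kv_heads):
    """Index of the KV head that query head q_head attends over."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads sharing one KV head
    return q_head // group_size


n_heads = 8
mha = [kv_head_for_query_head(h, n_heads, 8) for h in range(n_heads)]
gqa = [kv_head_for_query_head(h, n_heads, 2) for h in range(n_heads)]
mqa = [kv_head_for_query_head(h, n_heads, 1) for h in range(n_heads)]

print(mha)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(gqa)  # [0, 0, 0, 0, 1, 1, 1, 1]
print(mqa)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

With one KV head, every query head maps to index 0, which is why the cache shrinks by a factor equal to the number of query heads.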
