Transformer Systems

Grouped-Query Attention (GQA)


Published June 1, 2026 · By MortalApps · 5 min read · ~961 words

TL;DR

Grouped-Query Attention (GQA) dramatically reduces Key-Value (KV) cache memory footprints by sharing a single set of Key/Value heads across multiple Query heads.
It interpolates between Multi-Head Attention (MHA) and Multi-Query Attention (MQA), balancing memory efficiency against model quality.
For a model with $H$ query heads and $G$ KV heads, GQA shrinks the KV cache by a factor of $H/G$ relative to MHA.
Frontier models such as Llama 3 70B use GQA, pairing 64 query heads with 8 KV heads to keep long-context serving manageable.


Why This Matters

As context windows scale from 8K to 128K tokens and beyond, the KV cache can rival or exceed the model weights in size. A single Llama 3 70B request at a 128K context length consumes approximately 43 GB (40 GiB) of GPU memory for the KV cache alone, assuming FP16 storage. Without GQA, standard MHA would demand roughly 344 GB for the same request, making concurrent multi-user serving impractical on a standard 8-GPU node. GQA is a large part of what makes long-context LLMs economically feasible in production environments.
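The sizes quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the publicly documented Llama 3 70B shape (80 layers, head dimension 128) and FP16 storage at 2 bytes per element; `kv_cache_bytes` is a hypothetical helper name, not part of any library.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor 2: one cached tensor for keys, one for values.
    return 2 * layers * batch * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Llama 3 70B-like shape: 80 layers, head_dim 128, FP16 cache.
SEQ_LEN = 128 * 1024
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=SEQ_LEN)

print(f"GQA (8 KV heads):  {gqa / 1e9:.1f} GB")
print(f"MHA (64 KV heads): {mha / 1e9:.1f} GB")
print(f"Reduction factor:  {mha // gqa}x")
```

Under these assumptions the script reports about 42.9 GB for GQA and 343.6 GB for MHA. A cache stored at 1 byte per element would halve both numbers without changing the 8x ratio.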

Core Intuition

In MHA, every query head gets its own dedicated key and value head. This is akin to assigning a dedicated librarian (KV) to every single student (Q) in a library. In MQA, all students share exactly one librarian: the cache is as small as it can get, but a single KV head becomes a bottleneck for complex lookups, leading to quality degradation. GQA groups students into cohorts and assigns one librarian per cohort. Sharing 1 KV head across 8 Q heads cuts the KV cache footprint by 87.5% (1 - 1/8) while keeping enough distinct KV heads to maintain quality close to MHA.
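The cohort analogy maps onto a simple index calculation. The sketch below assumes the common convention that consecutive query heads form one group; `kv_head_for_query` is a hypothetical name used only for illustration, and a real model may order its heads differently.

```python
def kv_head_for_query(q_head, num_q_heads, num_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size

NUM_Q, NUM_KV = 64, 8
mapping = [kv_head_for_query(q, NUM_Q, NUM_KV) for q in range(NUM_Q)]

# The first 8 query heads form one cohort and all read KV head 0.
print(mapping[:10])

cache_saving = 1 - NUM_KV / NUM_Q
print(f"KV cache reduction vs MHA: {cache_saving:.1%}")
```

Setting `NUM_KV = NUM_Q` recovers MHA (every student has a librarian) and `NUM_KV = 1` recovers MQA (one librarian for everyone).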

Technical Deep Dive

The KV cache size for GQA follows directly from the tensor shapes and scales linearly with sequence length:

$$\text{KV cache bytes} = 2 \cdot L \cdot B \cdot G \cdot D \cdot S \cdot b$$

Where $L$ is the number of layers, $B$ is batch size, $G$ is the number of KV groups (heads), $D$ is head dimension, and $S$ is sequence length. The leading 2 counts the key and value tensors, and $b$ is the number of bytes per element (2 for FP16). Because Llama 3 70B operates with 64 query heads and 8 KV heads ($G = 8$), it requires exactly 8 times less KV cache memory than an MHA equivalent in which each of the $H = 64$ query heads has its own KV head ($G = H$). Optimized attention kernels broadcast the KV heads across the Q heads during the attention matrix multiplication, mapping a shape of (B, 8, S, D) to (B, 64, S, D) implicitly within the CUDA kernel without ever materializing the duplicated tensors in global memory.
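The broadcast semantics can be illustrated without a GPU. The following is a toy pure-Python sketch, not a production kernel: it omits the batch dimension and causal masking, uses small made-up sizes, and all function names are hypothetical. It computes attention once by indexing the shared KV heads and once by physically duplicating them, then checks that the outputs agree.

```python
import copy
import math
import random

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v):
    """Scaled dot-product attention for one head; q, k, v are [S][D] lists."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[d] for w, v_row in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def gqa_attention(q_heads, k_heads, v_heads):
    """Query head h reads KV head h // group_size; KV heads are indexed, not copied."""
    group_size = len(q_heads) // len(k_heads)
    return [attend(q, k_heads[h // group_size], v_heads[h // group_size])
            for h, q in enumerate(q_heads)]

def random_heads(num_heads, seq_len, head_dim, rng):
    return [[[rng.gauss(0.0, 1.0) for _ in range(head_dim)]
             for _ in range(seq_len)] for _ in range(num_heads)]

rng = random.Random(0)
H, G, S, D = 8, 2, 5, 4  # toy sizes; batch dimension omitted
q = random_heads(H, S, D, rng)
k = random_heads(G, S, D, rng)
v = random_heads(G, S, D, rng)

out_shared = gqa_attention(q, k, v)

# Reference path: physically duplicate each KV head H // G times, then run MHA.
k_full = [copy.deepcopy(k[h // (H // G)]) for h in range(H)]
v_full = [copy.deepcopy(v[h // (H // G)]) for h in range(H)]
out_full = [attend(q[h], k_full[h], v_full[h]) for h in range(H)]

max_diff = max(abs(a - b)
               for head_s, head_f in zip(out_shared, out_full)
               for row_s, row_f in zip(head_s, head_f)
               for a, b in zip(row_s, row_f))
print(f"max |shared - duplicated| = {max_diff}")
```

The two paths produce the same output, but the duplicated path stores H key/value heads while the shared path stores only G. That difference is what an optimized kernel avoids paying in global memory.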

Key Takeaways

GQA groups multiple query heads to share a single KV head pair.

It reduces KV cache memory by an exact factor of $H/G$ (query heads divided by KV heads) compared to standard MHA.

It preserves near-MHA accuracy while achieving near-MQA inference speeds.

For a fixed memory budget, the number of KV heads determines the maximum batch sizes and sequence lengths achievable in modern LLM serving.

Architecture Setup	Query Heads (H)	KV Heads (G)	KV Cache Size (32 layers, 128 dim, 1024 tokens)	Savings vs MHA
MHA	32	32	16 MB per layer (512 MB total)	1x
GQA	32	8	4 MB per layer (128 MB total)	4x
MQA	32	1	0.5 MB per layer (16 MB total)	32x

Sizes correspond to FP16 storage (2 bytes per element) and a batch size of 1.
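The same arithmetic can be inverted to show how the KV head count bounds serving capacity. The sketch below assumes the table's configuration (32 layers, head dimension 128, 32 query heads, FP16) and a hypothetical 16 GiB budget reserved for the KV cache; the budget and the helper name `kv_bytes_per_token` are illustrative assumptions.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    # Factor 2 covers the key tensor and the value tensor.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

LAYERS, HEAD_DIM, Q_HEADS = 32, 128, 32
BUDGET_BYTES = 16 * 1024**3  # hypothetical 16 GiB reserved for the KV cache

configs = {"MHA": Q_HEADS, "GQA": 8, "MQA": 1}  # name -> number of KV heads
tokens_in_budget = {}
for name, kv_heads in configs.items():
    per_token = kv_bytes_per_token(LAYERS, kv_heads, HEAD_DIM)
    tokens_in_budget[name] = BUDGET_BYTES // per_token
    print(f"{name}: {per_token // 1024} KiB per token, "
          f"{tokens_in_budget[name]:,} cached tokens fit in the budget")
```

Under these assumptions the budget holds 32,768 cached tokens with MHA, 131,072 with GQA, and 1,048,576 with MQA. Those tokens can be spent on a few long sequences or many short ones, which is why the KV head count caps the product of batch size and sequence length.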
