Grouped-Query Attention (GQA)
Source: mortalapps.com
- Grouped-Query Attention (GQA) dramatically reduces Key-Value (KV) cache memory footprints by sharing a single set of Key/Value heads across multiple Query heads.
- It interpolates between Multi-Head Attention (MHA) and Multi-Query Attention (MQA): one KV head per query head recovers MHA, a single shared KV head recovers MQA, and intermediate group counts trade memory efficiency against reasoning quality.
- For a model with $H$ query heads and $G$ KV heads, GQA achieves an $H/G$ reduction factor in the KV cache size.
- Frontier models like Llama 3 rely heavily on GQA; Llama 3 70B uses 8 KV heads for 64 query heads to keep the KV cache manageable at extreme context lengths.
Why This Matters
As context windows scale from 8K to 128K tokens and beyond, the KV cache can rival or exceed the model weights in memory. A single Llama 3 70B request at a 128K context length consumes roughly 43 GB (40 GiB) of GPU memory for a 16-bit KV cache alone. Without GQA, standard MHA would demand over 300 GB for the same request, making concurrent multi-user serving impractical on a standard 8-GPU node. GQA is a key architectural choice that makes long-context LLMs economically feasible in production environments.
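The figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes` is hypothetical, and the model shape (80 layers, head dimension 128) and the 16-bit cache precision are assumptions not stated in the text.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor 2: one tensor for keys, one for values.
    return 2 * layers * batch * kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed Llama 3 70B shape: 80 layers, head dimension 128, 16-bit cache.
SEQ_LEN = 128 * 1024

gqa_bytes = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=SEQ_LEN)
mha_bytes = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=SEQ_LEN)

print(f"GQA: {gqa_bytes / 1e9:.1f} GB")      # GQA: 42.9 GB
print(f"MHA: {mha_bytes / 1e9:.1f} GB")      # MHA: 343.6 GB
print(f"ratio: {mha_bytes // gqa_bytes}x")   # ratio: 8x
```

Under these assumptions the cache costs 320 KiB per token with GQA, so the total grows linearly with both context length and the number of concurrent requests.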
Core Intuition
In MHA, every query head gets its own dedicated key and value head. This is akin to assigning a dedicated librarian (KV) to every single student (Q) in a library. In MQA, all students share exactly one librarian, which is incredibly fast but creates bottlenecks in complex lookups, leading to quality degradation. GQA groups students into cohorts and assigns one librarian per cohort. By sharing 1 KV head across 8 Q heads, the KV cache footprint drops by 87.5% relative to MHA while retaining enough representational capacity to maintain high reasoning performance.
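The cohort analogy amounts to a mapping from query head index to KV head index. A minimal sketch follows, assuming contiguous grouping (consecutive query heads share a KV head), which is a common convention rather than something the text specifies; `kv_head_for` is a hypothetical helper name.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head shared by the given query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# 64 query heads under the three sharing schemes.
mha = [kv_head_for(q, 64, 64) for q in range(64)]  # one KV head per query head
gqa = [kv_head_for(q, 64, 8) for q in range(64)]   # 8 query heads per KV head
mqa = [kv_head_for(q, 64, 1) for q in range(64)]   # every query head shares one

print(gqa[:10])           # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

savings = 1 - 8 / 64      # fraction of KV heads removed relative to MHA
print(f"{savings:.1%}")   # 87.5%
```

MHA and MQA fall out as the two extremes of the same function, which is why GQA is described as interpolating between them.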
Technical Deep Dive
The KV cache size for GQA can be calculated deterministically and scales linearly with sequence length:

$$\text{KV cache size} = 2 \times L \times B \times G \times D \times S \times \text{bytes per element}$$

Where $L$ is the number of layers, $B$ is batch size, $G$ is the number of KV groups (heads), $D$ is head dimension, and $S$ is sequence length; the factor of 2 accounts for storing both keys and values. If Llama 3 70B operates with 64 query heads and 8 KV heads ($G = 8$), it requires exactly 8 times less memory than an MHA equivalent where $G = 64$. Optimized attention kernels broadcast the KV heads across the Q heads dynamically during the attention matrix multiplication, mapping a shape of (B, 8, S, D) to (B, 64, S, D) implicitly within the CUDA kernel without ever materializing the duplicated tensors in global memory.
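The broadcast described above can be illustrated in NumPy. This is a sketch of the indexing only, not of a fused CUDA kernel: the function names are hypothetical, the attention is non-causal for brevity, and the tensor sizes in the usage lines are small arbitrary values. The grouped version reshapes the query heads into groups so each KV head broadcasts over its group; the reference version explicitly duplicates K and V to show that both produce the same result.

```python
import numpy as np


def gqa_attention(q, k, v):
    """q: (B, H, S, D); k, v: (B, G, S, D) with H divisible by G. Non-causal."""
    B, H, S, D = q.shape
    G = k.shape[1]
    # View the H query heads as G groups of H // G heads each.
    q = q.reshape(B, G, H // G, S, D)
    # Add a singleton axis so each KV head broadcasts over its group.
    k = k[:, :, None]                                   # (B, G, 1, S, D)
    v = v[:, :, None]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(D)    # (B, G, H//G, S, S)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    out = weights @ v                                   # (B, G, H//G, S, D)
    return out.reshape(B, H, S, D)


def mha_reference(q, k, v):
    """Same computation with K/V explicitly duplicated to H heads."""
    rep = q.shape[1] // k.shape[1]
    k = np.repeat(k, rep, axis=1)                       # (B, H, S, D)
    v = np.repeat(v, rep, axis=1)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((2, 64, 16, 32))   # 64 query heads
k = rng.standard_normal((2, 8, 16, 32))    # 8 KV heads
v = rng.standard_normal((2, 8, 16, 32))

out = gqa_attention(q, k, v)
print(out.shape)                                   # (2, 64, 16, 32)
print(np.allclose(out, mha_reference(q, k, v)))    # True
```

Only the 8-head K and V tensors would need to live in the cache; the 64-head view exists solely as a broadcasting pattern during the matrix multiplications.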