
Grouped-Query Attention (GQA)

Source: mortalapps.com
TL;DR
  • Grouped-Query Attention (GQA) dramatically reduces Key-Value (KV) cache memory footprints by sharing a single set of Key/Value heads across multiple Query heads.
  • It interpolates between Multi-Head Attention (MHA) and Multi-Query Attention (MQA), trading a tunable amount of memory efficiency against model quality.
  • For a model with $H_q$ query heads and $G$ KV heads, GQA shrinks the KV cache by a factor of $H_q / G$ relative to MHA.
  • Frontier models like Llama 3 70B rely heavily on GQA, using 8 KV heads for 64 query heads to keep long contexts manageable.

Why This Matters

As context windows scale from 8K to 128K tokens and beyond, the KV cache can rival or exceed the model weights in size. A single Llama 3 70B request at a 128K context length consumes approximately 42 GB of GPU memory for the KV cache alone, assuming 16-bit keys and values. Without GQA, standard MHA would demand over 300 GB for the same request, which makes concurrent multi-user serving impractical on a standard 8-GPU node once the weights are also resident. GQA is a large part of what makes long-context LLMs economically feasible in production environments.
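The figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not taken from the source: the function name `kv_cache_bytes` is hypothetical, and the 80 layers, head dimension of 128, and 2 bytes per element are assumed values for a Llama 3 70B-style model with a 16-bit cache.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading 2 accounts for storing both a key and a value tensor.
    return 2 * layers * batch * kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed shape (not stated in the text): 80 layers, head_dim 128, 16-bit cache.
LAYERS, HEAD_DIM, SEQ_LEN = 80, 128, 128 * 1024

gqa_bytes = kv_cache_bytes(LAYERS, kv_heads=8, head_dim=HEAD_DIM, seq_len=SEQ_LEN)
mha_bytes = kv_cache_bytes(LAYERS, kv_heads=64, head_dim=HEAD_DIM, seq_len=SEQ_LEN)

print(f"GQA (8 KV heads):  {gqa_bytes / 1e9:.1f} GB")
print(f"MHA (64 KV heads): {mha_bytes / 1e9:.1f} GB")
print(f"Reduction factor:  {mha_bytes // gqa_bytes}x")
```

Under these assumptions the GQA cache comes to about 42.9 GB (40 GiB) and the MHA equivalent to about 343.6 GB, an 8x difference. A quantized cache or a different layer count would shift the absolute numbers but not the ratio.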

Core Intuition

In MHA, every query head gets its own dedicated key and value head. This is akin to assigning a dedicated librarian (KV) to every single student (Q) in a library. In MQA, all students share exactly one librarian, which minimizes memory use but forces every lookup through a single shared representation and can degrade quality. GQA groups students into cohorts and assigns one librarian per cohort. By sharing 1 KV head across 8 Q heads, the KV cache footprint drops by 87.5% while retaining enough representational capacity to keep quality close to MHA.
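The cohort idea amounts to an integer division. The sketch below assumes contiguous grouping (query heads 0 to 7 share KV head 0, and so on), which is a common convention but not something the text specifies; `kv_head_for` is a hypothetical helper name.

```python
def kv_head_for(query_head, num_query_heads=64, num_kv_heads=8):
    """Return the index of the KV head that serves a given query head."""
    # Assumes contiguous groups of equal size.
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Build the "cohorts": which query heads share each KV head.
cohorts = {}
for q in range(64):
    cohorts.setdefault(kv_head_for(q), []).append(q)

# Fraction of KV cache saved relative to one KV head per query head.
savings = 1 - 8 / 64

print(cohorts[0])        # query heads served by KV head 0
print(f"{savings:.1%}")  # share of KV cache memory saved
```

Setting `num_kv_heads` equal to `num_query_heads` recovers MHA, and setting it to 1 recovers MQA, which is the sense in which GQA interpolates between the two.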

Technical Deep Dive

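The KV cache size for GQA follows directly from the model shape and scales linearly with sequence length:

$$\text{KV cache bytes} = 2 \cdot L \cdot B \cdot G \cdot D \cdot S \cdot P$$

Where $L$ is the number of layers, $B$ is batch size, $G$ is the number of KV groups (heads), $D$ is head dimension, and $S$ is sequence length. The leading 2 counts keys and values separately, and $P$ is the number of bytes per element (2 for a 16-bit cache). If Llama 3 70B operates with 64 query heads and 8 KV heads ($G = 8$), its KV cache requires exactly 8 times less memory than an MHA equivalent where $G = 64$. At execution time the KV heads are broadcast across the Q heads during the attention matrix multiplication, mapping a shape of (B, 8, S, D) to (B, 64, S, D). Optimized attention kernels typically do this implicitly, inside the CUDA kernel, without materializing the duplicated tensors in global memory; naive implementations that explicitly repeat the KV heads give up part of the bandwidth benefit.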
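The broadcast can be illustrated in NumPy by reshaping the query heads into groups and letting matrix multiplication broadcast each KV head over its group. This is a sketch under stated assumptions (contiguous grouping, causal masking, small toy shapes, hypothetical function names), not the kernel any particular framework uses; a fused GPU kernel achieves the same effect through indexing rather than array views.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gqa_attention(q, k, v):
    """q: (B, Hq, S, D); k, v: (B, G, S, D), with Hq divisible by G."""
    B, Hq, S, D = q.shape
    G = k.shape[1]
    group = Hq // G
    # View the Hq query heads as G groups of `group` heads each.
    qg = q.reshape(B, G, group, S, D)
    # A singleton axis lets each KV head broadcast over its group
    # without copying the key or value data.
    kg = k[:, :, None, :, :]  # (B, G, 1, S, D)
    vg = v[:, :, None, :, :]
    scores = qg @ np.swapaxes(kg, -1, -2) / np.sqrt(D)  # (B, G, group, S, S)
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((S, S), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ vg  # (B, G, group, S, D)
    return out.reshape(B, Hq, S, D)


def repeated_kv_attention(q, k, v):
    """Reference: explicitly duplicate each KV head, then run per-head attention."""
    group = q.shape[1] // k.shape[1]
    k_full = np.repeat(k, group, axis=1)  # (B, Hq, S, D), materialized copy
    v_full = np.repeat(v, group, axis=1)
    S, D = q.shape[2], q.shape[3]
    scores = q @ np.swapaxes(k_full, -1, -2) / np.sqrt(D)
    future = np.triu(np.ones((S, S), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v_full


rng = np.random.default_rng(0)
q = rng.standard_normal((2, 64, 5, 16))  # toy sizes: B=2, Hq=64, S=5, D=16
k = rng.standard_normal((2, 8, 5, 16))   # G=8 KV heads
v = rng.standard_normal((2, 8, 5, 16))

out = gqa_attention(q, k, v)
ref = repeated_kv_attention(q, k, v)
print(out.shape, np.allclose(out, ref))
```

Both paths produce the same output; the difference is that the reference path allocates key and value tensors 8 times larger than the cache, which is the copy that fused kernels avoid.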

Key Takeaways

  • GQA groups multiple query heads so that each group shares a single Key/Value head.
  • It reduces KV cache memory by a factor of $H_q / G$ (query heads divided by KV heads) compared to standard MHA.
  • It preserves near-MHA accuracy while achieving near-MQA inference speeds.
  • The KV head count is a primary determinant of the maximum batch sizes and sequence lengths achievable in modern LLM serving.