GPU Memory Fragmentation

Source: mortalapps.com
TL;DR
  • GPU memory fragmentation occurs when free memory is divided into small, non-contiguous blocks, leading to Out-Of-Memory (OOM) errors despite sufficient total free space.
  • Addressing it keeps long-running services up and maximizes usable VRAM for highly dynamic AI workloads.
  • The primary optimization idea is decoupling virtual memory from physical memory using the low-level CUDA Virtual Memory Management (VMM) APIs.
  • The most important engineering insight is that relying on default memory allocators (cudaMalloc) in long-running processes with variable-sized allocations tends to produce severe external fragmentation over time.

Why This Matters

In production LLM serving, sequence lengths are highly dynamic, so tensors are constantly created, resized, and destroyed. Standard memory allocation tends to leave "Swiss cheese" VRAM: plenty of aggregate free space, but no single contiguous block large enough for a new 1 GB tensor request. This paradox limits system concurrency, forces the allocator to flush its caches and retry, and can crash a production service with an OOM error.
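The "Swiss cheese" effect can be reproduced with a toy model. The sketch below is an illustration only: it treats VRAM as 16 equal units and uses a simple first-fit policy, and the helper names (`first_fit`, `largest_free_run`) are hypothetical, not part of any CUDA or PyTorch API.

```python
def largest_free_run(slots):
    """Return the length of the longest run of consecutive free (None) slots."""
    best = run = 0
    for s in slots:
        run = run + 1 if s is None else 0
        best = max(best, run)
    return best


def first_fit(slots, size, tag):
    """Place `size` contiguous units in the first gap that fits.

    Returns the start index, or None when no contiguous gap is large enough.
    """
    run = 0
    for i, s in enumerate(slots):
        run = run + 1 if s is None else 0
        if run == size:
            start = i - size + 1
            slots[start:i + 1] = [tag] * size
            return start
    return None


# Toy VRAM of 16 units: fill it with 1-unit tensors, then free every other one.
vram = [None] * 16
for i in range(16):
    first_fit(vram, 1, f"t{i}")
for i in range(0, 16, 2):
    vram[i] = None

total_free = vram.count(None)          # aggregate free space
biggest_hole = largest_free_run(vram)  # largest contiguous free block
result = first_fit(vram, 4, "big")     # a 4-unit request

print(total_free, biggest_hole, result)  # prints: 8 1 None
```

Half of the toy VRAM is free, yet the 4-unit request fails because no hole is wider than one unit. That is the same failure mode as an OOM error with plenty of free memory reported.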

Core Intuition

Imagine a large parking lot. If motorcycles (small tensors) park randomly across the lot, leaving only a half-space between each, a large bus (a large tensor) cannot park, even if half the total lot is empty. Virtual Memory Management fixes this by acting like a highly efficient valet: it gives the bus driver a single contiguous virtual ticket, while secretly cutting the bus into pieces and parking them in the scattered physical spaces throughout the lot.

Technical Deep Dive

Frameworks like PyTorch use a "Caching Allocator" to avoid the latency of cudaMalloc. It requests large blocks from the driver and sub-allocates them to tensors. When a tensor is freed, however, PyTorch keeps the memory in its pool instead of returning it to the driver. As a result, nvidia-smi might show 95% of VRAM in use while much of that memory sits in the allocator's cached pools, unused and heavily fragmented.
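The gap between what tensors actually use and what the allocator holds is simple arithmetic. The sketch below uses made-up readings (30 GiB allocated, 76 GiB reserved) and a hypothetical helper name; only the subtraction itself reflects the text above.

```python
GiB = 1024 ** 3


def cached_but_unused(allocated_bytes, reserved_bytes):
    """Bytes held by the caching allocator but not backing any live tensor."""
    if allocated_bytes > reserved_bytes:
        raise ValueError("allocated bytes cannot exceed reserved bytes")
    return reserved_bytes - allocated_bytes


def unused_fraction(allocated_bytes, reserved_bytes):
    """Share of the reserved pool that is idle cache (0.0 when nothing is reserved)."""
    if reserved_bytes == 0:
        return 0.0
    return cached_but_unused(allocated_bytes, reserved_bytes) / reserved_bytes


# Hypothetical readings, chosen for illustration only.
allocated = 30 * GiB   # memory backing live tensors
reserved = 76 * GiB    # memory the allocator has taken from the driver

idle = cached_but_unused(allocated, reserved)
print(idle // GiB)                                     # prints: 46
print(round(unused_fraction(allocated, reserved), 2))  # prints: 0.61
```

In PyTorch, the two inputs can be read with `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`, and `torch.cuda.memory_summary()` gives a more detailed breakdown. A large idle share alongside OOM errors is a hint that the cached blocks are too fragmented to serve the failing request.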

To resolve physical fragmentation at the driver level, NVIDIA introduced Virtual Memory Management (VMM) APIs. These separate memory allocation into four distinct steps:

cuMemAddressReserve: Reserves a massive, contiguous virtual address range.

cuMemCreate: Allocates a physical memory handle. The size must be a multiple of the allocation granularity, typically 2MB, which can be queried with cuMemGetAllocationGranularity.

cuMemMap: Maps the physical allocation to the reserved virtual address range.

cuMemSetAccess: Grants necessary read/write permissions.
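The four steps above can be modeled in a few lines to show how scattered physical pages end up behind one contiguous virtual range. This is a pure-Python simulation, not a CUDA binding: the `ToyVmm` class and its methods are hypothetical, the 2 MiB page size is assumed from the text, and each page is treated as its own physical allocation for simplicity.

```python
MiB = 1024 * 1024
PAGE = 2 * MiB  # assumed granularity, matching the 2MB figure in the text


def round_up(size, granularity=PAGE):
    """Round a request up to a whole number of pages."""
    return -(-size // granularity) * granularity


class ToyVmm:
    """Hypothetical model of a GPU address space; not a real CUDA binding."""

    def __init__(self, free_physical_pages):
        self.free_physical = list(free_physical_pages)  # scattered free page ids
        self.next_virtual = 0    # next unreserved virtual page index
        self.page_table = {}     # virtual page -> physical page
        self.accessible = set()  # virtual pages with read/write enabled

    def reserve(self, size):
        """Step 1 (cf. cuMemAddressReserve): contiguous virtual range, no memory yet."""
        pages = round_up(size) // PAGE
        start = self.next_virtual
        self.next_virtual += pages
        return start, pages

    def create(self, pages):
        """Step 2 (cf. cuMemCreate): claim physical pages, adjacent or not."""
        if pages > len(self.free_physical):
            raise MemoryError("not enough physical pages")
        claimed = self.free_physical[:pages]
        self.free_physical = self.free_physical[pages:]
        return claimed

    def map(self, start, physical):
        """Step 3 (cf. cuMemMap): back consecutive virtual pages with claimed pages."""
        for offset, page in enumerate(physical):
            self.page_table[start + offset] = page

    def set_access(self, start, pages):
        """Step 4 (cf. cuMemSetAccess): enable read/write on the mapped range."""
        self.accessible.update(range(start, start + pages))


# Physical pages 1, 4, 6 and 9 are free; everything between them is in use.
gpu = ToyVmm(free_physical_pages=[1, 4, 6, 9])
start, pages = gpu.reserve(7 * MiB)   # 7 MiB rounds up to 4 pages (8 MiB)
gpu.map(start, gpu.create(pages))
gpu.set_access(start, pages)
print(gpu.page_table)  # prints: {0: 1, 1: 4, 2: 6, 3: 9}
```

Virtual pages 0 through 3 are consecutive, so a kernel would see one contiguous buffer, while the backing physical pages are spread across the device. A contiguity-based allocator would have rejected the same 7 MiB request, since no two free physical pages are adjacent.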

Key Takeaways

  • Fragmentation limits long-running dynamic workloads by creating unusable holes in physical memory.
  • PyTorch's caching allocator holds freed memory to avoid cudaMalloc latency, masking true usage.
  • CUDA VMM APIs (cuMemAddressReserve, cuMemMap) separate virtual addresses from physical allocations, allowing non-contiguous physical pages to appear virtually contiguous.
  • Physical allocations via cuMemCreate must be sized in multiples of the allocation granularity, typically 2MB.