GPU Memory Systems

Unified Memory and Page Fault Handling

Unified Memory (UVM) creates a single, coherent virtual address space spanning both CPU system memory and GPU VRAM.

Published June 1, 2026 · By MortalApps · 5 min read · ~830 words

TL;DR

Unified Memory (UVM) creates a single, coherent virtual address space spanning both CPU system memory and GPU VRAM.
The core purpose is simplifying memory management for developers by automatically migrating data via demand-driven page faults.
The primary optimization is explicit memory prefetching, which moves pages ahead of use so that most of the latency of on-demand page migration is hidden behind other work.
The most important engineering insight is that a kernel relying entirely on demand page faulting will typically run far below its achievable throughput; UVM is an architectural convenience, not a high-performance primitive.

Why This Matters

For enormous datasets or graph algorithms whose memory access patterns are sparse and unpredictable, Unified Memory lets GPU kernels address more data than fits in physical VRAM by paging it in from CPU memory on demand and evicting other pages to make room. However, if latency-sensitive AI kernels lean heavily on UVM without aggressive prefetching, the hidden migration costs cause severe pipeline stalls that leave most of the GPU's compute capability idle.

Core Intuition

Think of Unified Memory as a massive shared library. Both the CPU and GPU have local reading desks, but only one physical copy of a specific book exists. When the GPU asks for a book that is currently on the CPU's desk, it generates a "page fault." The librarian pauses the GPU's reading, physically walks the book over to the GPU's desk via the PCIe bus, and then resumes the process. This walking time is invisible to the user's high-level code, but highly visible to the stopwatch.

Technical Deep Dive

Managed memory is allocated primarily through cudaMallocManaged. When a GPU SM issues a global load, the GPU MMU first looks up the virtual address in the TLB. On a miss, it walks the device page table. If the page is not resident in VRAM and has no valid mapping, the GPU hardware raises a page fault.
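The allocation and first-touch behavior described above can be sketched as follows. This is a minimal illustration, not a benchmark: the kernel name `increment` and the helper `run_managed_demo` are hypothetical, and it assumes a platform where managed pages migrate on demand (on platforms without concurrent managed access, the driver may instead move the data at kernel launch).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds 1.0f to every element. The first GPU touch of a page that is not
// resident in VRAM is where a demand page fault can occur.
__global__ void increment(float* data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical helper: returns true if every element ends up equal to 2.0f.
bool run_managed_demo(size_t n) {
    float* data = nullptr;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) return false;

    // Host writes populate the pages in CPU memory first.
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;

    unsigned int threads = 256;
    unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    increment<<<blocks, threads>>>(data, n);   // no explicit cudaMemcpy anywhere
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;

    // Reading on the host again brings the pages back to CPU memory as needed.
    for (size_t i = 0; ok && i < n; ++i) ok = (data[i] == 2.0f);

    cudaFree(data);
    return ok;
}

int main() {
    bool ok = run_managed_demo((size_t)1 << 20);
    std::printf("managed demo %s\n", ok ? "passed" : "failed");
    return ok ? 0 : 1;
}
```

The same pointer is used on both sides; the cost of moving the pages is paid inside the kernel launch and the host loop rather than in a visible copy call.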

The faulting warp stalls, though other warps on the same SM can continue if they have independent work. The CUDA driver services the fault by allocating physical GPU memory, typically in 2MB chunks, and migrating the data over PCIe; the amount actually copied per fault can be smaller and is chosen by driver heuristics. Newer systems extend the same model to ordinary system allocations through Heterogeneous Memory Management (HMM) and Address Translation Services (ATS). HMM uses the Linux kernel to resolve host virtual addresses and migrate pages without specialized hardware. ATS relies on hardware coherency (like NVLink on POWER architectures) so the GPU can translate addresses through the CPU's page tables, avoiding most of the software page-faulting overhead.
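Which of these paths a given machine takes can be probed at runtime. The sketch below queries four CUDA device attributes; the struct `UvmSupport` and helper `query_uvm_support` are hypothetical names, and reading "pageable access without host page tables" as the HMM-style software path is an interpretation rather than something the runtime reports directly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical container for the attributes of interest.
struct UvmSupport {
    int managed;         // device supports managed allocations
    int concurrent;      // CPU and GPU may touch managed memory concurrently
    int pageable;        // GPU can access ordinary pageable (malloc) memory
    int hostPageTables;  // that access goes through the host's page tables
};

// Fills `out` for the given device; returns false if any query fails.
bool query_uvm_support(int device, UvmSupport* out) {
    return cudaDeviceGetAttribute(&out->managed,
               cudaDevAttrManagedMemory, device) == cudaSuccess
        && cudaDeviceGetAttribute(&out->concurrent,
               cudaDevAttrConcurrentManagedAccess, device) == cudaSuccess
        && cudaDeviceGetAttribute(&out->pageable,
               cudaDevAttrPageableMemoryAccess, device) == cudaSuccess
        && cudaDeviceGetAttribute(&out->hostPageTables,
               cudaDevAttrPageableMemoryAccessUsesHostPageTables,
               device) == cudaSuccess;
}

int main() {
    UvmSupport s = {};
    if (!query_uvm_support(0, &s)) {
        std::printf("attribute query failed\n");
        return 1;
    }
    std::printf("managed memory:            %d\n", s.managed);
    std::printf("concurrent managed access: %d\n", s.concurrent);
    std::printf("pageable memory access:    %d\n", s.pageable);
    std::printf("uses host page tables:     %d\n", s.hostPageTables);
    // Interpretation (an assumption, see lead-in): pageable=1 with
    // hostPageTables=1 suggests ATS; pageable=1 with hostPageTables=0
    // suggests a software path such as HMM.
    return 0;
}
```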

Key Takeaways

UVM lets GPU kernels dereference pointers to data that currently lives in CPU memory; automatic driver-level page fault handling hides the PCIe transfers from the code, but not from the runtime.

GPU-resident pages are allocated in 2MB chunks, so even small access footprints can trigger migrations much larger than the data actually touched.

HMM provides software coherency via the Linux kernel, while ATS provides highly efficient hardware-level coherency.

Relying on runtime page faults rather than explicit prefetching degrades bandwidth and causes severe SM execution stalls.
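The prefetching pattern these takeaways point to might look like the sketch below. The names `scale`, `prefetch_hint`, and `run_prefetch_demo` are hypothetical, and the `cudaMemPrefetchAsync` call assumes the CUDA 12.x-or-earlier signature that takes an integer destination device; newer toolkits use a location-struct form, so treat this as illustrative rather than portable.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, size_t n, float factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Prefetch is only a hint here: if the platform rejects it, clear the error
// and let demand faulting handle the pages instead.
static void prefetch_hint(const void* ptr, size_t bytes, int dst, cudaStream_t s) {
    // ASSUMPTION: CUDA 12.x-or-earlier signature (int destination device).
    if (cudaMemPrefetchAsync(ptr, bytes, dst, s) != cudaSuccess) cudaGetLastError();
}

// Hypothetical helper: returns true if every element ends up equal to 3.0f.
bool run_prefetch_demo(size_t n) {
    int device = 0;
    float* data = nullptr;
    cudaStream_t stream;
    size_t bytes = n * sizeof(float);

    if (cudaGetDevice(&device) != cudaSuccess) return false;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return false;
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(data); return false; }

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // pages start on the host

    // Move the pages to the GPU before the kernel needs them.
    prefetch_hint(data, bytes, device, stream);

    unsigned int threads = 256;
    unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    scale<<<blocks, threads, 0, stream>>>(data, n, 3.0f);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // Bring the results back ahead of the host loop below.
    prefetch_hint(data, bytes, cudaCpuDeviceId, stream);
    ok = (cudaStreamSynchronize(stream) == cudaSuccess) && ok;

    for (size_t i = 0; ok && i < n; ++i) ok = (data[i] == 3.0f);

    cudaStreamDestroy(stream);
    cudaFree(data);
    return ok;
}

int main() {
    bool ok = run_prefetch_demo((size_t)1 << 20);
    std::printf("prefetch demo %s\n", ok ? "passed" : "failed");
    return ok ? 0 : 1;
}
```

Because the prefetches and the kernel share one stream, they execute in order: the bulk transfer completes before the kernel starts, so the kernel should see resident pages instead of faulting on first touch.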
