Asynchronous Compute and Overlapped Execution
Source: mortalapps.com
- Hardware limitations demand overlapping memory fetching with compute.
- Tensor Memory Accelerator (TMA) is the autonomous hardware DMA engine.
- TMA bypasses the SM execution pipeline completely.
- Driven by cp.async.bulk.tensor and mbarrier synchronization.
Why This Matters
An infrastructure engineer cannot afford to halt a 20 PFLOPS Tensor Core engine to wait for a global memory fetch. The transition from synchronous to asynchronous execution is the defining characteristic of modern AI performance engineering, directly enabling the throughput of frontier LLMs.
Core Intuition
Synchronous compute is driving to the store, buying groceries, driving home, and then starting to cook. Asynchronous compute is hiring a delivery driver (TMA). You call the driver (Async Copy), start cooking the ingredients you already have (Compute), and the driver autonomously drops the new groceries directly into your fridge (Shared Memory) while you never leave the kitchen.
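To put rough numbers on the analogy, the sketch below compares the total time of a tiled loop run synchronously against the same loop with a double-buffered overlap. The function names and the per-tile timings are illustrative assumptions, not measurements, and the model ignores launch overhead and bandwidth contention.

```python
def synchronous_time(num_tiles: int, t_copy: float, t_compute: float) -> float:
    """Every tile is fetched, then computed; nothing overlaps."""
    return num_tiles * (t_copy + t_compute)


def overlapped_time(num_tiles: int, t_copy: float, t_compute: float) -> float:
    """Idealized double buffering: the copy of tile i+1 runs during the
    compute of tile i, so each steady-state step costs the slower of the two."""
    if num_tiles == 0:
        return 0.0
    # The first copy cannot be hidden, and the last compute has nothing left to hide.
    return t_copy + (num_tiles - 1) * max(t_copy, t_compute) + t_compute


# Hypothetical timings in arbitrary units: 4 tiles, copy = 2, compute = 3.
sync_total = synchronous_time(4, 2.0, 3.0)
async_total = overlapped_time(4, 2.0, 3.0)
print(sync_total, async_total)
```

In this model the copy cost disappears almost entirely once compute is the slower stage; when the copy is slower, the loop becomes memory-bound and overlap can only hide the compute.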
Technical Deep Dive
Historically, each thread issued its own LDG instructions, so the SM's execution units spent issue slots and registers computing addresses and managing memory transactions. Hopper and Blackwell add the Tensor Memory Accelerator (TMA), an autonomous copy engine. When a thread issues cp.async.bulk.tensor (typically a single elected thread per tile), it hands the TMA a multi-dimensional "tensor map" descriptor created on the host, along with the coordinates of the tile to fetch. The TMA operates in the "async proxy", a memory-access path that the PTX memory model treats separately from ordinary loads and stores, so ordering between the two requires an explicit proxy fence. The TMA handles index calculation, out-of-bounds checking, and multi-dimensional strides itself, streams the data directly from HBM into SMEM, and signals completion through an mbarrier.
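The issue-then-wait pattern described above can be modeled on the CPU. The following is a conceptual sketch in plain Python, not GPU code: a background thread stands in for the TMA, `TransactionBarrier` is a hypothetical stand-in for an mbarrier that counts delivered bytes, and zero-filling out-of-bounds elements is an assumed fill policy. Because of CPython's global interpreter lock it illustrates the ordering of operations, not a real speedup.

```python
import threading

ELEM_BYTES = 4  # assumption: 4-byte elements


class TransactionBarrier:
    """Toy stand-in for one mbarrier phase: wait() blocks until the
    expected number of bytes has been reported as delivered."""

    def __init__(self, expected_bytes: int):
        self._remaining = expected_bytes
        self._cond = threading.Condition()

    def complete_tx(self, nbytes: int) -> None:
        with self._cond:
            self._remaining -= nbytes
            self._cond.notify_all()

    def wait(self) -> None:
        with self._cond:
            self._cond.wait_for(lambda: self._remaining <= 0)


def copy_tile(src, row0, col0, tile_h, tile_w, dst, barrier):
    """Plays the copy engine: computes indices, zero-fills out-of-bounds
    elements, writes the tile into dst, then signals the barrier."""
    rows, cols = len(src), len(src[0])
    for r in range(tile_h):
        for c in range(tile_w):
            rr, cc = row0 + r, col0 + c
            in_bounds = 0 <= rr < rows and 0 <= cc < cols
            dst[r][c] = src[rr][cc] if in_bounds else 0
    barrier.complete_tx(tile_h * tile_w * ELEM_BYTES)


def issue_async_copy(src, row0, col0, tile_h, tile_w, dst):
    """Returns a barrier immediately; the copy proceeds on another thread."""
    barrier = TransactionBarrier(tile_h * tile_w * ELEM_BYTES)
    args = (src, row0, col0, tile_h, tile_w, dst, barrier)
    threading.Thread(target=copy_tile, args=args, daemon=True).start()
    return barrier


def tiled_sum(src, tile_h, tile_w):
    """Sums a non-empty 2D list tile by tile, prefetching the next tile
    into the other buffer while computing on the current one."""
    rows, cols = len(src), len(src[0])
    origins = [(r, c) for r in range(0, rows, tile_h)
               for c in range(0, cols, tile_w)]
    buffers = [[[0] * tile_w for _ in range(tile_h)] for _ in range(2)]
    total = 0
    pending = issue_async_copy(src, *origins[0], tile_h, tile_w, buffers[0])
    for i in range(len(origins)):
        pending.wait()  # tile i has fully landed in buffers[i % 2]
        current = buffers[i % 2]
        if i + 1 < len(origins):
            # Kick off the next copy before computing, so the two overlap.
            pending = issue_async_copy(src, *origins[i + 1], tile_h, tile_w,
                                       buffers[(i + 1) % 2])
        total += sum(sum(row) for row in current)
    return total
```

The consumer never computes addresses or checks bounds; it only issues a copy, does useful work, and waits on the barrier when it actually needs the data. Two buffers suffice here because the copy for tile i+1 always targets the buffer that the previous iteration has already finished reading.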