Tensor Computing

Asynchronous Compute and Overlapped Execution

Hardware limitations demand overlapping memory fetching with compute. Tensor Memory Accelerator (TMA) is the autonomous hardware DMA engine.

Published June 1, 2026 · By MortalApps · 3 min read · ~498 words

TL;DR

Hardware limitations demand overlapping memory fetching with compute.
Tensor Memory Accelerator (TMA) is the autonomous hardware DMA engine.
Once a copy is issued, the TMA moves the data without occupying the SM execution pipeline.
Driven by cp.async.bulk.tensor and mbarrier synchronization.


Why This Matters

An infrastructure engineer cannot afford to let a Tensor Core engine rated at 20 PFLOPS sit idle while it waits on a global memory fetch. The shift from synchronous to asynchronous execution is a defining characteristic of modern AI performance engineering, and it directly enables the throughput that frontier LLMs depend on.

Core Intuition

Synchronous compute is driving to the store, buying groceries, driving home, and then starting to cook. Asynchronous compute is hiring a delivery driver (TMA). You call the driver (Async Copy), start cooking the ingredients you already have (Compute), and the driver autonomously drops the new groceries directly into your fridge (Shared Memory) while you never leave the kitchen.
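To put rough numbers on the analogy, here is a minimal timing model in C++. It is a sketch under stated assumptions, not a measurement: every tile costs the same fixed time to load and to compute, there are two buffers, and the names TileCost, serial_time and overlapped_time are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative model only: each tile costs `load` time units to fetch and
// `compute` time units to process. Units are arbitrary.
struct TileCost {
    double load;
    double compute;
};

// Synchronous schedule ("drive to the store, then cook"):
// every tile is fetched and then processed, one after another.
double serial_time(int tiles, TileCost c) {
    assert(tiles >= 0);
    return tiles * (c.load + c.compute);
}

// Double-buffered schedule ("the driver delivers while you cook"):
// the fetch of tile i+1 runs while tile i is processed. Only the first load
// and the last compute are fully exposed; each step in between costs
// whichever of the two stages is slower.
double overlapped_time(int tiles, TileCost c) {
    assert(tiles >= 0);
    if (tiles == 0) return 0.0;
    return c.load + (tiles - 1) * std::max(c.load, c.compute) + c.compute;
}
```

Under this model, 8 tiles with a load cost of 3 and a compute cost of 1 take 32 units serially and 25 units overlapped. The overlapped schedule is bounded by the slower stage, so overlap hides the faster stage but cannot speed up the slower one.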

Technical Deep Dive

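Historically, LDG (global load) instructions forced the SM's execution units to actively manage memory transactions: every thread computed its own addresses and issued its own loads. Hopper and Blackwell feature the Tensor Memory Accelerator (TMA), an autonomous copy engine. Typically a single elected thread issues cp.async.bulk.tensor, passing a multi-dimensional "tensor map" descriptor (created on the host) together with the coordinates of the tile to fetch. The TMA operates in the "async proxy", a memory-access path that the PTX memory model treats separately from ordinary loads and stores, which is why completion is tracked through an mbarrier rather than through the issuing thread's program order. The TMA handles index calculation, multi-dimensional strides, and out-of-bounds handling in hardware, streaming the data directly from global memory (HBM) into shared memory (SMEM).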
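The work that the tensor map takes over can be seen in a CPU-side reference model. The C++ sketch below is not the CUDA API. TensorMap2D and load_tile are hypothetical names, the tensor is assumed to be 2D and row-major, and out-of-bounds elements are assumed to be zero-filled. It only spells out the per-element index arithmetic and bounds checks that software would otherwise perform for every tile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor, loosely modeled on the idea of a tensor map:
// shape, row pitch and tile shape are recorded once, so each copy request
// only has to supply tile coordinates.
struct TensorMap2D {
    const float* base;  // start of the tensor in "global memory"
    int rows;           // logical tensor height
    int cols;           // logical tensor width
    int row_stride;     // elements between consecutive rows (>= cols)
    int tile_rows;      // tile height
    int tile_cols;      // tile width
};

// Copies the tile whose top-left element is (row0, col0) into a dense buffer,
// standing in for shared memory. Elements that fall outside the tensor are
// left at zero instead of being read (an assumption of this model).
std::vector<float> load_tile(const TensorMap2D& map, int row0, int col0) {
    assert(map.base != nullptr && map.row_stride >= map.cols);
    std::vector<float> tile(
        static_cast<std::size_t>(map.tile_rows) * map.tile_cols, 0.0f);
    for (int r = 0; r < map.tile_rows; ++r) {
        for (int c = 0; c < map.tile_cols; ++c) {
            const int gr = row0 + r;  // global row index
            const int gc = col0 + c;  // global column index
            const bool in_bounds =
                gr >= 0 && gr < map.rows && gc >= 0 && gc < map.cols;
            if (in_bounds) {
                tile[static_cast<std::size_t>(r) * map.tile_cols + c] =
                    map.base[static_cast<std::size_t>(gr) * map.row_stride + gc];
            }
        }
    }
    return tile;
}
```

For a 3x5 tensor stored with a row pitch of 8, a 2x4 tile requested at row 2, column 3 overlaps only two valid elements, and the other six come back as zeros. With TMA this bookkeeping is done by the copy engine rather than by code running on the SM.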

Key Takeaways

TMA is a dedicated hardware engine for bulk data movement.

Eliminates software pointer arithmetic and bounds checking.

Operates in the async proxy: completion is tracked with an mbarrier, and proxy fences order it against ordinary shared-memory accesses.

Overlapping TMA loads with MMA compute is mandatory for peak efficiency.
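The overlap pattern behind these takeaways is a double-buffered loop: request the next tile, compute on the current one, and only then wait. The C++ sketch below is a host-side analogy, not TMA code. std::async stands in for the copy engine, std::future::get() stands in for waiting on the mbarrier, and fetch_tile, compute_tile and overlapped_sum are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a slow global-memory fetch: copies one tile
// out of `source` into a fresh buffer.
std::vector<float> fetch_tile(const std::vector<float>& source,
                              std::size_t tile, std::size_t tile_size) {
    assert((tile + 1) * tile_size <= source.size());
    const auto first =
        source.begin() + static_cast<std::ptrdiff_t>(tile * tile_size);
    return std::vector<float>(first,
                              first + static_cast<std::ptrdiff_t>(tile_size));
}

// Hypothetical stand-in for the compute stage: reduces one tile to a sum.
float compute_tile(const std::vector<float>& tile) {
    return std::accumulate(tile.begin(), tile.end(), 0.0f);
}

// Double-buffered loop: the fetch of tile i+1 is in flight while tile i
// is being reduced.
float overlapped_sum(const std::vector<float>& source, std::size_t tile_size) {
    assert(tile_size > 0 && source.size() % tile_size == 0);
    const std::size_t num_tiles = source.size() / tile_size;
    if (num_tiles == 0) return 0.0f;

    // Prologue: request the first tile before any compute starts.
    std::future<std::vector<float>> in_flight =
        std::async(std::launch::async, [&source, tile_size] {
            return fetch_tile(source, 0, tile_size);
        });

    float total = 0.0f;
    for (std::size_t i = 0; i < num_tiles; ++i) {
        // Wait only for tile i (the analogue of waiting on the barrier).
        std::vector<float> current = in_flight.get();
        if (i + 1 < num_tiles) {
            // Issue the next fetch, then go straight back to compute.
            in_flight = std::async(std::launch::async,
                                   [&source, tile_size, next = i + 1] {
                                       return fetch_tile(source, next, tile_size);
                                   });
        }
        total += compute_tile(current);
    }
    return total;
}
```

The result matches a plain serial sum. What changes is the schedule: the wait for tile i+1 happens after tile i has been processed, so the fetch latency is hidden behind compute whenever compute takes at least as long as the fetch.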
