PTX Assembly and SASS Fundamentals
PTX is the virtual machine instruction set; SASS is the physical binary assembly.
Source: mortalapps.com
- PTX is the virtual machine instruction set; SASS is the physical binary assembly.
- Modern AI kernels rely on bulk asynchronous PTX copies such as cp.async.bulk.tensor, the instruction that drives the Tensor Memory Accelerator (TMA).
- Blackwell introduces the massive tcgen05 family for TMEM and MMA management.
- SASS exposes the hidden reality of hardware: Scoreboards, B-registers, and precise register allocation.
Why This Matters
PyTorch hides the hardware, and Triton hides the CUDA programming model. But when a production kernel runs 10% below the Roofline ceiling, the cause is often visible only in the SASS. Infrastructure engineers must read SASS to verify that the compiler actually emitted TMA copies, allocated TMEM, and avoided register spills.
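To make the "10% below the Roofline ceiling" comparison concrete, here is a minimal sketch of the standard Roofline bound: attainable throughput is the lower of the compute roof and the memory roof (bandwidth times arithmetic intensity). The function names and every hardware number below are hypothetical placeholders chosen for illustration, not figures for any real GPU.

```python
def attainable_tflops(peak_tflops: float,
                      bandwidth_tb_per_s: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline ceiling in TFLOP/s: min(compute roof, memory roof)."""
    memory_roof = bandwidth_tb_per_s * intensity_flop_per_byte  # TB/s * FLOP/B = TFLOP/s
    return min(peak_tflops, memory_roof)


def roofline_gap(measured_tflops: float, ceiling_tflops: float) -> float:
    """Fraction of the ceiling that the kernel fails to reach."""
    return 1.0 - measured_tflops / ceiling_tflops


# Hypothetical GPU and kernel, for illustration only.
ceiling = attainable_tflops(peak_tflops=1000.0,
                            bandwidth_tb_per_s=4.0,
                            intensity_flop_per_byte=500.0)
gap = roofline_gap(measured_tflops=900.0, ceiling_tflops=ceiling)
print(f"ceiling = {ceiling:.0f} TFLOP/s, gap = {gap:.0%}")
```

A gap of this size says nothing about its cause; it only tells you that the next step is to inspect the generated code.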
Core Intuition
PTX is the blueprint. SASS is the physical building. The compiler (ptxas) translates the blueprint to the building. Sometimes, the compiler makes poor assumptions. Reading PTX tells you what you asked the GPU to do; reading SASS tells you what the GPU is actually doing.
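One way to check "what the GPU is actually doing" is to scan a SASS dump for telltale opcodes. The sketch below counts local-memory stores and loads (STL and LDL, which usually, though not always, indicate register spills) and TMA-style opcodes. The SASS excerpt is a hand-written, simplified sample in the style of `cuobjdump -sass` output, and treating the UTMA prefix as the marker for TMA instructions is an assumption to confirm against your own disassembly.

```python
import re
from collections import Counter

# Hand-written, simplified sample; real dumps also carry encoding comments.
SAMPLE_SASS = """\
        /*0000*/                   LDC R1, c[0x0][0x28] ;
        /*0010*/                   UTMALDG.2D [UR8], [UR4] ;
        /*0020*/                   STL [R1+0x10], R12 ;
        /*0030*/                   HMMA.16816.F32 R4, R8, R10, R4 ;
        /*0040*/              @!P0 LDL R12, [R1+0x10] ;
        /*0050*/                   EXIT ;
"""

# Address comment, optional predicate guard (@P0, @!P1, @UP0, ...), then the opcode.
LINE_RE = re.compile(
    r"/\*[0-9a-fA-F]{4,}\*/\s+(?:@!?U?P[0-9T]+\s+)?([A-Z][A-Z0-9_]*)"
)


def opcode_histogram(sass_text: str) -> Counter:
    """Count base opcodes (modifiers such as .2D are ignored)."""
    counts = Counter()
    for line in sass_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize(sass_text: str) -> dict:
    """Report the signals discussed in the text: local-memory traffic and TMA use."""
    ops = opcode_histogram(sass_text)
    return {
        "local_stores": ops["STL"],   # likely spill stores
        "local_loads": ops["LDL"],    # likely spill loads
        "tma_ops": sum(n for op, n in ops.items() if op.startswith("UTMA")),
    }


print(summarize(SAMPLE_SASS))
```

In practice the input would come from a real disassembly rather than the embedded sample, and nonzero local-memory counts are a prompt to look closer, not proof of a spill.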
Technical Deep Dive
PTX ISA 8.6 introduces the tcgen05 family for Blackwell, targeting sm_100a (the architecture-specific variant of compute capability 10.0).
TMEM Management: tcgen05.alloc, tcgen05.dealloc, tcgen05.relinquish_alloc_permit. TMEM is not addressed via standard pointers; it requires explicit allocation and lifecycle tracking in PTX.
Data Movement: tcgen05.ld and tcgen05.st move data between registers and TMEM; tcgen05.cp copies data from SMEM into TMEM.
MMA: tcgen05.mma initiates the actual tensor math.
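The allocation lifecycle described above can be sketched as a small bookkeeping model. This is a toy, not a hardware simulator: the class name is hypothetical, and the capacity of 512 columns and the rule that an allocation is a power of two of at least 32 columns are assumptions that should be checked against the PTX ISA documentation. Real hardware also differs in behavior, for example by potentially blocking in tcgen05.alloc where this model raises an error.

```python
class TmemAllocator:
    """Toy model of one CTA's TMEM column bookkeeping (hypothetical helper)."""

    TOTAL_COLUMNS = 512  # assumption: verify against the PTX ISA docs

    def __init__(self) -> None:
        self.live = {}      # base column -> number of columns
        self.permit = True  # cleared by relinquish_alloc_permit()

    def alloc(self, ncols: int) -> int:
        """Models tcgen05.alloc: returns the base column of the new region."""
        if not self.permit:
            raise RuntimeError("alloc after relinquish_alloc_permit")
        is_pow2 = ncols > 0 and (ncols & (ncols - 1)) == 0
        if not is_pow2 or ncols < 32 or ncols > self.TOTAL_COLUMNS:
            raise ValueError("ncols must be a power of two in [32, 512]")
        cursor = 0
        for start in sorted(self.live):       # first-fit search
            if start - cursor >= ncols:
                break                         # the gap before this region fits
            cursor = start + self.live[start]
        if cursor + ncols > self.TOTAL_COLUMNS:
            raise MemoryError("not enough free TMEM columns")
        self.live[cursor] = ncols
        return cursor

    def dealloc(self, base: int, ncols: int) -> None:
        """Models tcgen05.dealloc: must match a live allocation exactly."""
        if self.live.get(base) != ncols:
            raise ValueError("dealloc does not match a live allocation")
        del self.live[base]

    def relinquish_alloc_permit(self) -> None:
        """Models tcgen05.relinquish_alloc_permit: no further allocs allowed."""
        self.permit = False

    def leaked_columns(self) -> int:
        return sum(self.live.values())


tmem = TmemAllocator()
accumulator = tmem.alloc(128)     # e.g. columns for an MMA accumulator
scales = tmem.alloc(32)
tmem.relinquish_alloc_permit()    # this CTA will not allocate again
tmem.dealloc(scales, 32)
tmem.dealloc(accumulator, 128)
print("leaked columns:", tmem.leaked_columns())
```

The point of the model is the discipline it enforces: every allocation has an explicit owner and an explicit release, which is what "lifecycle tracking in PTX" amounts to.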
Furthermore, asynchronous bulk memory movement is handled by cp.async.bulk.tensor, which replaces hand-written LDG loops with a single TMA request. A companion instruction, cp.reduce.async.bulk.tensor (available since Hopper and carried forward on Blackwell), lets the TMA apply a reduction (.add, .min, .max, .xor, among others) as it writes a shared-memory tile into global memory, without spending SM execution cycles on the reduction itself.
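The arithmetic effect of such a bulk reduction can be emulated in a few lines. The sketch below models only the element-wise semantics (combine each tile element with the destination element at the matching coordinate); it does not model asynchrony, tensor maps, or completion tracking, and the function and table names are hypothetical.

```python
import operator

# Subset of reduction operators named in the text.
REDUCE_OPS = {
    ".add": operator.add,
    ".min": min,
    ".max": max,
    ".xor": operator.xor,
}


def reduce_tile_into(global_tensor, tile, row0, col0, op):
    """Combine a 2D tile into global_tensor in place, starting at (row0, col0)."""
    combine = REDUCE_OPS[op]
    for r, tile_row in enumerate(tile):
        for c, value in enumerate(tile_row):
            current = global_tensor[row0 + r][col0 + c]
            global_tensor[row0 + r][col0 + c] = combine(current, value)
    return global_tensor


destination = [[1, 2, 3, 4],
               [5, 6, 7, 8]]
partial_sums = [[10, 20],
                [30, 40]]

# Accumulate the tile into the right half of the destination.
reduce_tile_into(destination, partial_sums, row0=0, col0=2, op=".add")
print(destination)
```

On the GPU the same combination happens inside the copy engine, so the SM threads that produced the tile never issue the read-modify-write themselves.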