Topic Hub
Tensor Computing & GPU Execution Foundations
Matrix Multiplication (GEMM) Execution Mechanics
Tensor Layouts and Memory Ordering
SIMD vs SIMT Execution Models
Streaming Multiprocessor (SM) Architecture
CUDA Warp Scheduling and Divergence
Tensor Core Execution Systems
GPU Occupancy and Register Allocation
Arithmetic Intensity and Roofline Modeling
Compute-Bound vs Memory-Bound Workloads
Instruction-Level Parallelism in GPUs
PTX Assembly and SASS Fundamentals
Thread Block and Grid Synchronization
Asynchronous Compute and Overlapped Execution
Cooperative Groups and Distributed Shared Memory
GPU Hardware Generations (H100 → B200 → Future Architectures)