Topic Hub

Quantization, FP8 & Low-Precision AI Systems

FP16 vs BF16 vs FP8 Runtime Behavior

NVFP4 and Blackwell FP4 Systems

MXFP4 Microscaling Architectures

Dynamic Scaling Factors

Weight-Only Quantization

Activation Quantization

AWQ Quantization Systems

GPTQ Quantization

MR-GPTQ Runtime Optimization

SmoothQuant Outlier Suppression

Hadamard Outlier Mitigation

KV Cache Quantization

Low-Precision Matrix Multiplication

Phase-Aware Quantization (Mix-Quant)

Calibration and Accuracy Recovery