Arithmetic Intensity and Roofline Modeling

Arithmetic Intensity = FLOPS / Bytes Transferred. The Roofline Model maps peak compute against peak memory bandwidth.

Source: mortalapps.com
TL;DR
  • Arithmetic Intensity ($I$) = FLOPS / Bytes Transferred.
  • The Roofline Model maps peak compute against peak memory bandwidth.
  • B200 Ridge Point for FP4 is ~2500 FLOPS/Byte.
  • Most LLM inference phases fail to reach this ridge point and remain memory-bound.

Why This Matters

You cannot optimize a system if you don't know its theoretical ceiling. The Roofline Model tells an infrastructure engineer whether to spend time optimizing the memory layout (if the workload is memory-bound) or the inner math loop (if it is compute-bound). Blindly applying compute optimizations to a memory-bound workload yields essentially no speedup, because its runtime is set by data movement rather than by math.
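The "no speedup" claim can be checked with a minimal sketch. It assumes the simplest roofline time model, in which a kernel's runtime is the larger of its compute time and its memory time (perfect overlap, no latency effects). The function name `kernel_time` and the workload size are hypothetical; the peak figures are the B200 numbers this article quotes in the Technical Deep Dive.

```python
def kernel_time(flops, bytes_moved, peak_flops, peak_bw):
    """Idealized runtime in seconds: the slower of compute and memory traffic."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / peak_bw
    return max(compute_time, memory_time)


# Hypothetical memory-bound kernel: 1 GFLOP of math over 1 GB of traffic.
FLOPS, BYTES = 1e9, 1e9
PEAK_FLOPS, PEAK_BW = 20e15, 8e12  # B200 FP4 compute and HBM3e bandwidth (from the article)

baseline = kernel_time(FLOPS, BYTES, PEAK_FLOPS, PEAK_BW)
# "Optimize the math": pretend the compute units became twice as fast.
faster_math = kernel_time(FLOPS, BYTES, 2 * PEAK_FLOPS, PEAK_BW)
# "Optimize the memory layout": pretend only half the bytes are moved.
less_traffic = kernel_time(FLOPS, BYTES / 2, PEAK_FLOPS, PEAK_BW)

print(f"baseline      : {baseline * 1e6:.1f} us")
print(f"2x compute    : {faster_math * 1e6:.1f} us")
print(f"half the bytes: {less_traffic * 1e6:.1f} us")
```

Under this model the doubled compute rate leaves the runtime unchanged, while halving the traffic halves it. Real kernels overlap compute and memory imperfectly, so treat this as a bound rather than a prediction.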

Core Intuition

If you have a hyper-fast blender (Tensor Cores) but a very slow conveyor belt feeding it fruit (Memory Bandwidth), the blender will spend most of its time empty. The only way to keep the blender busy is to chop the same piece of fruit a thousand times before asking for the next one. This is Arithmetic Intensity (Data Reuse).
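To put rough numbers on the analogy, the sketch below compares an elementwise vector add with a square matrix multiply. It assumes 2-byte elements and ideal caching, so every operand crosses the memory bus exactly once and the result is written once. The function names and this byte-counting model are illustrative assumptions.

```python
def vector_add_intensity(n, bytes_per_elem=2):
    """c = a + b over n elements: n additions, 2n elements read, n written."""
    flops = n
    bytes_moved = 3 * n * bytes_per_elem
    return flops / bytes_moved


def matmul_intensity(n, bytes_per_elem=2):
    """C = A @ B for n x n matrices: 2*n^3 FLOPs, 2*n^2 elements read, n^2 written."""
    flops = 2 * n**3
    bytes_moved = 3 * n**2 * bytes_per_elem
    return flops / bytes_moved


for n in (256, 4096):
    print(f"n={n}: vector add {vector_add_intensity(n):.3f}, "
          f"matmul {matmul_intensity(n):.1f} FLOPS/Byte")
```

In this model the vector add stays at 1/6 FLOPS/Byte for every `n`, because each element is touched once and discarded. The matrix multiply grows as `n / 3`, because each fetched element is reused across `n` multiply-adds. That reuse is the "chopping the same fruit many times" of the analogy.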

Technical Deep Dive

The Roofline Model establishes two physical ceilings for an algorithm:

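Compute Roof (Flat): Peak compute throughput (e.g., 20 PFLOPS for FP4 on B200).

Memory Roof (Slanted): Peak memory bandwidth (e.g., 8 TB/s HBM3e on B200). Attainable throughput under this roof is arithmetic intensity times bandwidth, which is why it slopes.

The formula for Arithmetic Intensity is:

$$I = \frac{\text{FLOPs performed}}{\text{Bytes transferred}}$$

The Ridge Point is where the slanted roof meets the flat roof:

$$I_{\text{ridge}} = \frac{\text{Peak compute}}{\text{Peak memory bandwidth}} = \frac{20 \times 10^{15}\ \text{FLOPS}}{8 \times 10^{12}\ \text{Bytes/s}} = 2500\ \text{FLOPS/Byte}$$

If a kernel's $I$ is less than 2500 FLOPS/Byte, it is fundamentally bounded by the memory bandwidth.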
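The two roofs and the ridge point can be sketched in a few lines, using the B200 figures quoted above. The function names, the layer dimensions, and the byte-counting model for a linear layer (4-bit weights, 2-byte activations, each tensor moved once) are assumptions for illustration, not measured values.

```python
PEAK_FLOPS = 20e15  # FP4 compute roof quoted above, in FLOP/s
PEAK_BW = 8e12      # HBM3e memory bandwidth quoted above, in bytes/s


def ridge_point(peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Arithmetic intensity at which the slanted roof meets the flat roof."""
    return peak_flops / peak_bw


def attainable_flops(intensity, peak_flops=PEAK_FLOPS, peak_bw=PEAK_BW):
    """Roofline: the lower of the flat compute roof and the slanted memory roof."""
    return min(peak_flops, intensity * peak_bw)


def linear_layer_intensity(batch, d_in, d_out, weight_bytes=0.5, act_bytes=2.0):
    """Intensity of y = x @ W with x: (batch, d_in) and W: (d_in, d_out).

    Assumed traffic: W read once, x read once, y written once.
    """
    flops = 2 * batch * d_in * d_out
    bytes_moved = d_in * d_out * weight_bytes + batch * (d_in + d_out) * act_bytes
    return flops / bytes_moved


print(f"ridge point: {ridge_point():.0f} FLOPS/Byte")
for batch in (1, 64, 4096):
    i = linear_layer_intensity(batch, d_in=8192, d_out=8192)
    bound = "memory" if i < ridge_point() else "compute"
    print(f"batch={batch:5d}  I={i:8.1f}  "
          f"attainable={attainable_flops(i) / 1e15:6.2f} PFLOPS  ({bound}-bound)")
```

Under these assumptions a single-token step (batch 1) lands near 4 FLOPS/Byte and a batch of 64 near 241, both far below the ridge. Only the batch of 4096, at roughly 3277 FLOPS/Byte, crosses into compute-bound territory. The exact crossover depends on the traffic model, so read the numbers as orders of magnitude.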

Key Takeaways

  • Arithmetic Intensity = Data Reuse.
  • Reaching the B200 FP4 ridge point requires ~2500 math operations per byte fetched.
  • Operator Fusion increases arithmetic intensity by keeping intermediate results on-chip instead of writing them back to HBM.
  • HBM bandwidth is the primary bottleneck for most AI workloads.
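As a rough illustration of the fusion takeaway, the sketch below counts memory traffic for an elementwise chain such as y = relu(a * x + b) over n elements. It compares three separate kernels, each reading its input from HBM and writing its output back, with one fused kernel. The traffic model and function names are assumptions; the scalars a and b are treated as free, each step counts as one FLOP per element, and elements are 2 bytes.

```python
def unfused_intensity(n, num_ops=3, bytes_per_elem=2):
    """Each of num_ops kernels reads n elements from HBM and writes n back."""
    flops = num_ops * n
    bytes_moved = num_ops * 2 * n * bytes_per_elem
    return flops / bytes_moved


def fused_intensity(n, num_ops=3, bytes_per_elem=2):
    """One kernel reads n elements, applies all ops in registers, writes n."""
    flops = num_ops * n
    bytes_moved = 2 * n * bytes_per_elem
    return flops / bytes_moved


n = 1 << 20
print(f"unfused: {unfused_intensity(n):.2f} FLOPS/Byte")
print(f"fused  : {fused_intensity(n):.2f} FLOPS/Byte")
```

In this model fusion multiplies intensity by the number of fused operations (0.25 to 0.75 FLOPS/Byte here). The fused kernel is still far below the 2500 FLOPS/Byte ridge, so it remains memory-bound, but it moves a third of the bytes and runs correspondingly faster.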