Activation Quantization
Activation quantization compresses the dynamic, continuously changing input vectors of a neural network layer alongside the static model weights.
Source: mortalapps.com
- The core purpose is to move matrix multiplications past the memory-bandwidth gains of weight-only quantization and into true compute acceleration, by enabling native integer (INT8) or FP8 Tensor Core math.
- The primary optimization difficulty is managing the massive, systematic outliers uniquely present in activation channels, which destroy naive quantization scales.
- The critical engineering insight is that hardware GEMM kernels cannot apply scales along the inner reduction dimension, so outliers must be suppressed by offline mathematical smoothing before quantization.
Why This Matters
While weight-only quantization drastically reduces memory traffic, it does not accelerate the actual mathematical computation (FLOPs), because the weights are dequantized back to a floating-point format before the multiply. To unlock the compute density of integer or FP8 Tensor Cores for large prefill operations and high-throughput batching, both the weights and the activations must be quantized (for example W8A8 or W4A4). Without activation quantization, systems hit a strict compute ceiling that leaves the latest generations of hardware severely underutilized.
Core Intuition
Model weights are statically trained; their distribution is relatively stable, centered near zero, and predictable, making them easy to compress. Activations, however, are dynamic and chaotic. Specifically, in models exceeding roughly 6 billion parameters, activations exhibit structural anomalies: certain hidden channels consistently output values up to 100 times larger than the average. If you compress an entire tensor into 256 discrete levels (8-bit) using a single global scale, that one massive outlier dictates the range. Consequently, nearly all the "normal" values collapse into the zero bin or its immediate neighbors, erasing most of the information the activations carry.
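To make the collapse concrete, here is a minimal sketch in plain Python, assuming symmetric per-tensor INT8 quantization (one scale, codes in [-127, 127]). The activation values and the helper names `quantize_int8`, `dequantize`, and `mean_abs_error` are made up for illustration; they are not taken from any particular library.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: one scale for all values."""
    scale = max(abs(v) for v in values) / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


def mean_abs_error(original, codes, scale):
    """Average absolute reconstruction error over the original values."""
    restored = dequantize(codes, scale)
    return sum(abs(o - r) for o, r in zip(original, restored)) / len(original)


# Hypothetical activation vector with typical magnitudes well below 1.
normal = [0.12, -0.31, 0.05, 0.44, -0.27, 0.18, -0.09, 0.36]
# Same vector plus one outlier channel roughly 100x larger.
with_outlier = normal + [60.0]

codes_a, scale_a = quantize_int8(normal)
codes_b, scale_b = quantize_int8(with_outlier)

# Error on the normal values only, without and with the outlier present.
err_without = mean_abs_error(normal, codes_a, scale_a)
err_with = mean_abs_error(normal, codes_b[:len(normal)], scale_b)

# How many distinct codes the normal values occupy in each case.
distinct_without = len(set(codes_a))
distinct_with = len(set(codes_b[:len(normal)]))

print("codes without outlier:", codes_a)
print("codes with outlier:   ", codes_b)
print("mean abs error:", err_without, "vs", err_with)
```

With the outlier present, the eight normal values share only the codes -1, 0, and 1, while without it each value gets its own code. The outlier itself is represented exactly; the cost is paid entirely by everything else in the tensor.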
Technical Deep Dive
For a standard linear layer computation $Y = XW$:
- $X \in \mathbb{R}^{T \times C_{in}}$ represents the input activations across $T$ tokens and $C_{in}$ channels.
- $W \in \mathbb{R}^{C_{in} \times C_{out}}$ represents the weights.
If a specific channel in $X$ contains an extreme outlier, standard per-token quantization (one scale per row of $X$) degrades badly, because the outlier dominates the scale for that token's entire row. Conversely, per-channel activation quantization (one scale per column of $X$) is mathematically sound but incompatible with how hardware GEMM kernels work. Integer GEMM kernels such as those in cuBLAS accumulate the products $\sum_k X_q[t,k]\, W_q[k,j]$ in integer arithmetic, so a scale can be factored out of the dot product sum only if it depends on the outer indices: the token $t$ or the output channel $j$. A scale that varies along the inner reduction index $k$ would have to be applied to every product inside the sum, which defeats the integer accumulation the hardware provides.
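The factoring argument can be checked numerically. The sketch below uses plain Python with small made-up integer matrices; `token_scale`, `out_scale`, and `channel_scale` are hypothetical names and values chosen for illustration. It shows that per-token and per-output-channel scales can be applied once after an integer matrix product, while per-input-channel scales cannot be recovered by any rescale of the form r[t] * c[j] applied to the integer result.

```python
def matmul(a, b):
    """Plain matrix product; integer inputs give integer accumulators."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


# Hypothetical INT8 codes: X_q is tokens x channels, W_q is channels x outputs.
X_q = [[3, -7, 100], [1, 4, -90]]
W_q = [[2, -1], [5, 3], [-4, 6]]

token_scale = [0.5, 0.25]          # one per row of X (outer index t)
out_scale = [0.1, 0.2]             # one per column of W (outer index j)
channel_scale = [0.01, 0.02, 1.0]  # one per column of X (inner index k)

# Outer scales: integer GEMM first, then one rescale per output element.
acc = matmul(X_q, W_q)
y_fast = [[token_scale[t] * out_scale[j] * acc[t][j] for j in range(2)]
          for t in range(2)]

# Reference: dequantize both operands, then multiply in floating point.
X_f = [[token_scale[t] * x for x in row] for t, row in enumerate(X_q)]
W_f = [[out_scale[j] * w for j, w in enumerate(row)] for row in W_q]
y_ref = matmul(X_f, W_f)
max_diff = max(abs(y_fast[t][j] - y_ref[t][j])
               for t in range(2) for j in range(2))

# Inner scales: the true result needs the scale applied inside the sum.
X_c = [[channel_scale[k] * x for k, x in enumerate(row)] for row in X_q]
y_chan = matmul(X_c, W_q)

# If some post-GEMM rescale r[t] * c[j] could turn acc into y_chan, the
# elementwise ratio matrix would have rank one and this gap would be zero.
ratio = [[y_chan[t][j] / acc[t][j] for j in range(2)] for t in range(2)]
rank_one_gap = abs(ratio[0][0] * ratio[1][1] - ratio[0][1] * ratio[1][0])

print("outer scales, max difference from reference:", max_diff)
print("inner scales, rank-one gap:", rank_one_gap)
```

The first difference is zero up to floating-point rounding, which is why scales on the outer dimensions are compatible with integer GEMM. The nonzero gap in the second case means the inner-dimension scales cannot be applied as a cheap post-processing step on the accumulators.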