Activation Quantization

Activation quantization compresses the dynamic, continuously changing input vectors of a neural network layer alongside the static model weights.

Source: mortalapps.com
TL;DR
  • Activation quantization compresses the dynamic, continuously changing input vectors of a neural network layer alongside the static model weights.
  • The core purpose is to move matrix multiplications beyond the memory-bandwidth savings of weight-only quantization and into genuine compute acceleration, by running them natively on integer (INT8) or FP8 Tensor Cores.
  • The primary optimization difficulty is managing the massive, systematic outliers uniquely present in activation channels, which destroy naive quantization scales.
  • The critical engineering insight is that hardware GEMM kernels can only apply scales that factor out of the dot-product sum, which rules out scaling along the inner reduction dimension and makes offline mathematical smoothing necessary to suppress outliers.

Why This Matters

While weight-only quantization sharply reduces memory-bandwidth pressure, it does not accelerate the arithmetic itself (FLOPs): the weights are dequantized and the matrix multiplication still runs in higher precision, typically FP16 or BF16. To use the compute density of integer or FP8 Tensor Cores for large prefill operations and high-throughput batching, both the weights and the activations must be quantized (W8A8 or W4A4). Without activation quantization, these compute-bound workloads stay capped at higher-precision throughput, leaving the latest generations of hardware underutilized.

Core Intuition

Model weights are fixed after training; their distribution is relatively stable, centered near zero, and predictable, making them easy to compress. Activations, however, change with every input. In models exceeding roughly 6 billion parameters they also exhibit structural anomalies: certain hidden channels consistently output values up to 100 times larger than the average. If you compress an entire tensor into 256 discrete bins (8-bit) using a single global scale, the largest outlier dictates the range. For example, with an outlier 100 times the typical magnitude, the INT8 step size is about 100/127 of a typical value, so typical values round to 0 or ±1. Consequently, the "normal" data points collapse into the few bins around zero, erasing most of the information the layer carries.
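The collapse described above can be reproduced in a few lines. The following is a minimal NumPy sketch, not a production quantizer; the tensor shape, the choice of channel 7 as the outlier channel, and the helper name `quantize_per_tensor_int8` are all illustrative assumptions.

```python
import numpy as np

def quantize_per_tensor_int8(x):
    """Symmetric per-tensor INT8: a single scale shared by the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
# Hypothetical activations: 16 tokens x 64 channels with unit-variance values.
x = rng.normal(0.0, 1.0, size=(16, 64)).astype(np.float32)
x[:, 7] *= 100.0  # assumed outlier channel, roughly 100x larger than the rest

q, scale = quantize_per_tensor_int8(x)

normal = np.delete(q, 7, axis=1)             # every channel except the outlier
levels_used = np.unique(normal).size         # distinct INT8 codes the normal channels occupy
zero_fraction = float(np.mean(normal == 0))  # share of normal values rounded to exactly 0

print(f"scale={scale:.3f}, levels used by normal channels={levels_used}, "
      f"fraction quantized to zero={zero_fraction:.2f}")
```

Of the 255 available codes, the non-outlier channels end up using only a handful clustered around zero, which is the information loss the paragraph describes.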

Technical Deep Dive

For a standard linear layer computation $Y = XW$:

$X \in \mathbb{R}^{T \times C_{in}}$ represents the input activations across $T$ tokens and $C_{in}$ channels.

$W \in \mathbb{R}^{C_{in} \times C_{out}}$ represents the weights.

If a specific channel in $X$ contains an extreme outlier, standard per-token quantization (scaling each row of $X$) fails because every row includes that channel, so the outlier dominates the scale for each token and crushes the remaining channels. Conversely, per-channel activation quantization (scaling the columns of $X$) is mathematically sound but cannot be mapped onto an integer GEMM. cuBLAS GEMM kernels require the scale multipliers to lie on the outer dimensions of the matrix multiplication (the rows of $X$ and the columns of $W$) so they can be factored out of the dot-product sum. A scale that varies along the inner reduction dimension would have to be applied to each term before accumulation, which breaks the integer accumulation path.
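The factoring argument can be checked numerically. The sketch below uses NumPy integer matrices as a stand-in for an INT8 GEMM; the sizes and the scale names (`s_tok`, `s_out`, `s_in`) are hypothetical. It shows that per-token and per-output-channel scales can be applied once after the integer accumulation, while a per-input-channel scale cannot be recovered by a single rescale of the accumulator.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C_in, C_out = 4, 8, 3  # hypothetical sizes

# Already-quantized operands, stored as int32 so the product accumulates exactly.
Xq = rng.integers(-127, 128, size=(T, C_in), dtype=np.int32)
Wq = rng.integers(-127, 128, size=(C_in, C_out), dtype=np.int32)

s_tok = rng.uniform(0.01, 0.1, size=(T, 1))      # one scale per row of X (per token)
s_out = rng.uniform(0.01, 0.1, size=(1, C_out))  # one scale per column of W (per output channel)

# Outer-dimension scales: integer GEMM first, one rescale of the output afterwards.
acc = Xq @ Wq                         # pure integer accumulation
y_fast = acc * s_tok * s_out
y_ref = (Xq * s_tok) @ (Wq * s_out)   # dequantize first, then a float GEMM
outer_scales_factor_out = np.allclose(y_fast, y_ref)

# Inner-dimension scale: one scale per input channel, i.e. per column of X.
s_in = rng.uniform(0.01, 0.1, size=(1, C_in))
y_inner = (Xq * s_in) @ Wq            # the scale differs for every term of the sum
# A single post-GEMM factor (here the mean scale, as one candidate) does not reproduce it.
inner_scale_factors_out = np.allclose(y_inner, acc * s_in.mean())

print(outer_scales_factor_out, inner_scale_factors_out)
```

The first check succeeds because `s_tok[i] * s_out[j]` is constant across the sum over the inner index, so it can be pulled outside. The per-input-channel scale changes with the inner index and has to stay inside the sum.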

Key Takeaways

  • Weights are statistically stable; activations are highly dynamic and outlier-heavy.
  • With a single runtime per-tensor activation scale, outliers set the range and normal values collapse toward zero.
  • GEMM kernels cannot apply per-channel activation scales, because those scales lie on the inner reduction dimension.
  • Dynamic outlier extraction (FP16 routing) causes severe latency regressions.
  • True W8A8 acceleration requires offline outlier smoothing or aggressive clipping.
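
As an illustration of what offline outlier smoothing can look like, the sketch below divides each activation channel by a per-channel factor and folds the same factor into the weights, which leaves the layer output unchanged. The specific factor used here follows the SmoothQuant-style formula; the text above does not name a particular method, so treat the formula, the `alpha` value, and the synthetic data as assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64))              # hypothetical calibration activations
X[:, 7] *= 100.0                           # assumed outlier channel
W = rng.normal(scale=0.05, size=(64, 32))  # hypothetical weights, shape (C_in, C_out)

alpha = 0.5                                # assumed migration strength
act_max = np.abs(X).max(axis=0)            # per-input-channel activation range
w_max = np.abs(W).max(axis=1)              # per-input-channel weight range
s = act_max**alpha / w_max**(1.0 - alpha)  # per-input-channel smoothing factor

X_s = X / s           # activations with outlier channels divided down
W_s = W * s[:, None]  # the same factor folded into the weights offline

# The product is mathematically unchanged: (X / s) @ (s * W) == X @ W.
same_output = np.allclose(X @ W, X_s @ W_s)

def range_ratio(a):
    """Largest per-channel range divided by the median per-channel range."""
    m = np.abs(a).max(axis=0)
    return float(m.max() / np.median(m))

ratio_before = range_ratio(X)
ratio_after = range_ratio(X_s)
print(same_output, round(ratio_before, 1), round(ratio_after, 1))
```

Because the factor is computed from calibration data and folded into the weights ahead of time, nothing varies along the inner dimension at runtime, so the smoothed activations can then be quantized with the per-token scales that GEMM kernels support.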