Hadamard Outlier Mitigation
Source: mortalapps.com
- Methods like QuaRot and SpinQuant fold orthogonal (rotation) matrices into the model weights offline and apply matching rotations to the activations (partly fused into preceding weights, partly computed online), spreading massive outliers across all channels.
- The core purpose is to suppress systematic activation outliers. This makes extreme low-bit quantization viable without accuracy collapse, for example W4A4KV4 (4-bit weights, activations, and KV cache).
- The primary tools are structured Walsh-Hadamard matrices (QuaRot) or learned rotations on the Stiefel manifold (SpinQuant). Both flatten the per-channel magnitude distribution while keeping the network mathematically equivalent.
- The essential engineering insight is that a random rotation makes a vector's coordinates behave approximately as if they were independent and identically distributed (i.i.d.). This flattens the distribution without altering the vector's Euclidean norm.
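The following minimal pure-Python sketch shows why flattening matters for W4A4. It quantizes one activation vector with a single outlier channel to 4 bits twice: once directly, and once in a Hadamard-rotated basis.

The sketch makes these assumptions:
- Quantization is symmetric per-tensor absmax INT4, with integers in [-7, 7].
- The rotation is a normalized Sylvester Hadamard matrix.
- The helper names are hypothetical.

Production methods use finer-grained scales and fused GPU kernels.

```python
import math
import random

def normalized_hadamard(n):
    """Sylvester Hadamard matrix scaled by 1/sqrt(n) (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        # H_2k = [[H_k, H_k], [H_k, -H_k]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def fake_quant_int4(x):
    """Hypothetical helper: symmetric absmax quantization to ints in [-7, 7], then dequantize."""
    scale = max(abs(v) for v in x) / 7
    return [max(-7, min(7, round(v / scale))) * scale for v in x]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

random.seed(0)
n = 256
x = [random.gauss(0.0, 1.0) for _ in range(n)]
x[3] = 100.0  # one channel carries a massive outlier

h = normalized_hadamard(n)
err_plain = mse(x, fake_quant_int4(x))

# Quantize in the rotated basis, then rotate back. The Sylvester matrix is symmetric
# and orthogonal, so H^T = H and H @ H = I.
x_back = matvec(h, fake_quant_int4(matvec(h, x)))
err_rot = mse(x, x_back)

print(f"4-bit MSE, no rotation:       {err_plain:.4f}")
print(f"4-bit MSE, Hadamard rotation: {err_rot:.4f}")
```

Without rotation, the outlier forces a scale so large that every ordinary channel rounds to zero. After rotation, the outlier's energy is spread over all 256 coordinates, so the scale shrinks and the reconstruction error drops substantially in this toy setting.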
Why This Matters
SmoothQuant effectively enables 8-bit deployment (W8A8), but it falters at 4-bit (W4A4). At 4-bit precision, even the residual "smoothed" outliers inflate the quantization scale enough to cause severe quantization noise. QuaRot and SpinQuant instead transform the network's coordinate space with orthogonal rotations, which suppresses these outliers far more thoroughly. This algebraic intervention lets frontier models such as LLaMA-3 run with weights, activations, and the KV cache all quantized to 4 bits, bringing them within reach of consumer-grade hardware.
Core Intuition
Imagine a large activation matrix in which one specific column contains an enormous spike (an outlier). If you rotate this matrix in high-dimensional space, that single localized spike is projected, or smeared, across all columns. Rotation preserves Euclidean norms, so the total magnitude of the matrix (its Frobenius norm) stays exactly the same, but no single column carries a spike anymore. The data becomes far more uniform and tightly bounded. It therefore fits into coarse 4-bit quantization bins with much less error.
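The smearing effect and norm preservation can be checked directly. This minimal sketch rotates a vector whose only nonzero entry is a spike of 8.0. It assumes a normalized Sylvester Hadamard matrix as the rotation, and the helper names are illustrative.

```python
import math

def normalized_hadamard(n):
    """Sylvester construction H_2k = [[H_k, H_k], [H_k, -H_k]], scaled by 1/sqrt(n)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def rotate(h, x):
    return [sum(a * b for a, b in zip(row, x)) for row in h]

def l2(x):
    return math.sqrt(sum(v * v for v in x))

n = 8
x = [0.0] * n
x[5] = 8.0  # a single spiking channel
x_rot = rotate(normalized_hadamard(n), x)

print("max |x| :", max(abs(v) for v in x))      # 8.0
print("max |Hx|:", max(abs(v) for v in x_rot))  # 8/sqrt(8), about 2.83
print("norms   :", l2(x), l2(x_rot))            # both 8, up to rounding
```

Every rotated coordinate has magnitude $8/\sqrt{8} = 2\sqrt{2}$. The peak therefore shrinks by a factor of $\sqrt{n}$, while the Euclidean norm stays at 8.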
Technical Deep Dive
QuaRot relies on highly structured Hadamard matrices. A normalized Hadamard matrix $H \in \{\pm 1/\sqrt{n}\}^{n \times n}$ is orthogonal, meaning $H H^\top = I$. For $n$ a power of two, the fast Walsh-Hadamard transform computes $Hx$ in $O(n \log n)$ operations. It uses only additions and subtractions, plus a single global $1/\sqrt{n}$ scale, with no general floating-point multiplications. This makes applying the matrix online at inference time exceptionally cheap in hardware.
SpinQuant was developed to address a key weakness of QuaRot's random rotations: different random orthogonal rotations produce high variance in downstream zero-shot accuracy. SpinQuant instead learns the rotation matrices on the Stiefel manifold, the set of matrices with orthonormal columns (for square matrices, the orthogonal matrices). It uses Cayley SGD to directly minimize the loss of the quantized network during a calibration phase.
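A minimal pure-Python sketch of the iterative fast Walsh-Hadamard transform illustrates the $O(n \log n)$, multiplication-free claim. The function names are hypothetical and this is not QuaRot's CUDA kernel. The sketch assumes the input length is a power of two and checks the result against an explicit Sylvester Hadamard matrix.

```python
import math

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform in natural (Sylvester) order.

    log2(n) stages of n/2 add/subtract butterflies each: O(n log n) work and
    no multiplications.
    """
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b  # butterfly: one add, one subtract
        h *= 2
    return y

def fwht_normalized(x):
    # The single 1/sqrt(n) factor is what makes the transform orthogonal.
    s = 1.0 / math.sqrt(len(x))
    return [v * s for v in fwht(x)]

def dense_hadamard(n):
    """Explicit +/-1 Sylvester Hadamard matrix, used only to check fwht."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
dense = [sum(a * b for a, b in zip(row, x)) for row in dense_hadamard(len(x))]
print(fwht(x) == dense)  # True: same result as the O(n^2) matrix-vector product
# The normalized transform is its own inverse, so applying it twice recovers x
# (up to floating-point rounding).
print(fwht_normalized(fwht_normalized(x)))
```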
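Because a rotation matrix $Q$ is orthogonal ($Q Q^\top = I$), it can be inserted into a linear layer without changing the layer's result: $XW = (XQ)(Q^\top W)$. The counter-rotated weights $Q^\top W$ are computed once, offline. The final output remains mathematically identical, but the intermediate state $XQ$ that gets quantized is largely free of outliers [48].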
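This minimal sketch checks the invariance $XW = (XQ)(Q^\top W)$ numerically for an orthogonal matrix $Q$. It makes these assumptions:
- The layer is a plain linear layer $Y = XW$, with tokens as rows of $X$.
- $Q$ is a normalized 4x4 Hadamard matrix.
- The values of `X` and `W` are arbitrary and purely illustrative.

Real QuaRot and SpinQuant pipelines also propagate rotations through normalization layers, attention, and the KV cache, which is not shown here.

```python
import math

def normalized_hadamard(n):
    """Sylvester Hadamard matrix scaled by 1/sqrt(n); orthogonal for n a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Toy activations X: 2 tokens x 4 channels; channel 1 carries a massive outlier.
X = [[0.2, 40.0, -0.5, 0.1],
     [-0.3, 38.0, 0.4, 0.6]]
# Toy weights W: 4 input channels x 3 output features.
W = [[0.1, -0.2, 0.3],
     [0.05, 0.1, -0.1],
     [-0.4, 0.2, 0.0],
     [0.3, -0.1, 0.2]]

Q = normalized_hadamard(4)
X_rot = matmul(X, Q)             # rotated activations: what would be quantized
W_rot = matmul(transpose(Q), W)  # counter-rotated weights: computable once, offline

Y = matmul(X, W)
Y_rot = matmul(X_rot, W_rot)

max_diff = max(abs(a - b) for ra, rb in zip(Y, Y_rot) for a, b in zip(ra, rb))
peak_before = max(abs(v) for row in X for v in row)
peak_after = max(abs(v) for row in X_rot for v in row)
print("max |XW - (XQ)(Q^T W)|:", max_diff)  # ~0, up to floating-point rounding
print("peak |X| vs peak |XQ| :", peak_before, peak_after)
```

Because $Q^\top W$ can be computed ahead of time, the only remaining runtime work is producing $XQ$. That step can be fused into an earlier layer or computed with a fast Hadamard transform.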