Quantization

MR-GPTQ Runtime Optimization

Published June 1, 2026 · By MortalApps · 5 min read · ~840 words

TL;DR

Micro-Rotated-GPTQ (MR-GPTQ) is an advanced evolution of the GPTQ algorithm tailored to counteract the severe accuracy degradation inherent in ultra-low precision FP4 (4-bit floating point) formats.
The core purpose is to make formats like NVFP4 and MXFP4 mathematically viable for production deployment without suffering the massive accuracy collapse seen when transitioning from INT4.
The primary optimization fuses block-wise Hadamard rotations with a format-specific scale search directly into the GPTQ pipeline.
The essential engineering insight is that standard global rotational invariance fails under FP4 due to micro-scale saturation; MR-GPTQ instead injects localized micro-rotations to flatten variance exactly within the hardware's strict block dimensions.

Why This Matters

Although the Blackwell architecture natively supports MXFP4 and NVFP4, empirical research shows that FP4 is not an automatic upgrade over INT4. Direct post-training quantization (PTQ) into FP4 suffers severe accuracy drops on large language models because the non-uniform bin spacing of floating-point formats amplifies the effect of outliers. MR-GPTQ closes much of this gap, recovering up to 96.1% of the original FP16 accuracy, which makes ultra-low precision FP4 hardware acceleration practical for production.

Core Intuition

Hadamard rotations are orthogonal transformations that spread extreme outliers across many coordinates, flattening a tensor's distribution so it is easier to quantize. However, applying one global rotation to an entire matrix ignores the structure of FP4 formats, which compute their scaling factors locally over micro-blocks of 16 (NVFP4) or 32 (MXFP4) elements. If variance isn't smoothed within those blocks, the scales still saturate. MR-GPTQ aligns the rotation with the hardware's micro-scaling block dimensions so that the variance within every micro-block is flattened, yielding a tight FP4 scale for each block.
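To make the intuition concrete, here is a minimal pure-Python sketch of a block-wise Hadamard rotation. It is an illustration under stated assumptions, not the MR-GPTQ kernels: the helper names (`fwht`, `rotate_blockwise`, `peak_to_rms`) and the toy data are hypothetical. It rotates one 16-element block containing a single outlier, then checks that rotating both an activation vector and a weight vector leaves their dot product unchanged, which is what allows the rotation to be applied before quantization.

```python
import math

def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(block)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of two"
    out = list(block)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = math.sqrt(n)  # 1/sqrt(n) scaling makes the transform orthonormal
    return [v / norm for v in out]

def rotate_blockwise(values, block_size):
    """Apply an independent Hadamard rotation to each contiguous block."""
    assert len(values) % block_size == 0
    rotated = []
    for start in range(0, len(values), block_size):
        rotated.extend(fwht(values[start:start + block_size]))
    return rotated

def peak_to_rms(block):
    """Ratio of the largest magnitude to the RMS; 1.0 means perfectly flat."""
    rms = math.sqrt(sum(v * v for v in block) / len(block))
    return max(abs(v) for v in block) / rms

# Toy data (hypothetical): one 16-element block with a single large outlier.
block = [0.1] * 16
block[5] = 8.0
rotated = rotate_blockwise(block, block_size=16)

# Rotating activations and weights with the same blocks preserves dot products.
weights = [0.3 * ((i % 5) - 2) for i in range(16)]
rotated_weights = rotate_blockwise(weights, block_size=16)
dot_before = sum(a * b for a, b in zip(block, weights))
dot_after = sum(a * b for a, b in zip(rotated, rotated_weights))

print(peak_to_rms(block), peak_to_rms(rotated))
print(dot_before, dot_after)
```

In this toy case the peak-to-RMS ratio falls from about 4 to just above 1, so a scale chosen from the block maximum no longer has to stretch to cover one extreme value. Choosing `block_size` equal to the format's scaling group (16 or 32) is the alignment the paragraph above describes.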

Technical Deep Dive

For a linear layer containing quantized weights $W$ and activations $X$, MR-GPTQ introduces block-wise Hadamard rotations $H_k$. These are block-diagonal matrices built from $k \times k$ Hadamard blocks, where $k$ is a power of two strictly aligned with the hardware's block size. Mathematically, the operation manifests as:

$$Y \approx Q(X H_k)\, Q(W H_k)^\top$$

where $Q$ is the highly constrained quantization function. MR-GPTQ employs format-specialized strategies to navigate different hardware limits:

MXFP4 (block size 32, E8M0 scale): Due to the wider block size and the restrictive power-of-two-only scale, this format requires aggressive micro-rotation to mitigate local variance.

NVFP4 (block size 16, E4M3 scale): Given its tighter block size and floating-point scale, MR-GPTQ runs a targeted scale search inside the GPTQ error compensation loop, tuning each block's scale to minimize quantization error.
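The difference between the two scale types can be sketched in a few lines of pure Python. This is a simplified illustration, not the MR-GPTQ implementation: the function names, the candidate factors, and the "smallest power of two that avoids clipping" rule are assumptions made for the example, the searched scale is not rounded to E4M3, and the search runs on a standalone block rather than inside a GPTQ loop. The same 16-value toy block is used for both rules so the errors are comparable, even though real MXFP4 groups 32 values.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign bit is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_e2m1(block, scale):
    """Divide by scale, snap to the nearest E2M1 magnitude, multiply back."""
    out = []
    for v in block:
        mag = min(abs(v) / scale, E2M1_GRID[-1])  # saturate at the largest code
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))  # ties go low
        out.append(math.copysign(nearest * scale, v))
    return out

def sq_error(block, quantized):
    return sum((a - b) ** 2 for a, b in zip(block, quantized))

def power_of_two_scale(block):
    """E8M0-style scale: smallest power of two that avoids clipping (assumed rule)."""
    absmax = max(abs(v) for v in block)
    assert absmax > 0, "sketch assumes a non-zero block"
    return 2.0 ** math.ceil(math.log2(absmax / E2M1_GRID[-1]))

def searched_scale(block, factors=(0.7, 0.8, 0.9, 1.0, 1.1)):
    """Grid search around absmax / 6, keeping the lowest-error candidate.

    The factors are hypothetical; a real NVFP4 scale would also be
    rounded to E4M3, which this sketch omits.
    """
    base = max(abs(v) for v in block) / E2M1_GRID[-1]
    return min((base * f for f in factors),
               key=lambda s: sq_error(block, quantize_e2m1(block, s)))

# Toy 16-value block (hypothetical data).
block = [0.9, -1.3, 2.2, 0.4, -3.1, 1.7, 0.05, -0.6,
         2.9, -2.4, 1.1, 0.3, -1.8, 3.7, -0.2, 0.75]

mx_scale = power_of_two_scale(block)
nv_scale = searched_scale(block)
absmax_scale = max(abs(v) for v in block) / E2M1_GRID[-1]

err_mx = sq_error(block, quantize_e2m1(block, mx_scale))
err_nv = sq_error(block, quantize_e2m1(block, nv_scale))
err_absmax = sq_error(block, quantize_e2m1(block, absmax_scale))

print(mx_scale, nv_scale)
print(err_mx, err_nv, err_absmax)
```

Because the factor 1.0 is among the candidates, the searched scale can never do worse than the plain absmax scale on this block. The power-of-two rule has no such freedom, which is one way to see why the text above leans on rotation for MXFP4 and on scale search for NVFP4.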

Key Takeaways

Without specialized handling, FP4 quantization loses accuracy relative to INT4.

MR-GPTQ fuses micro-block Hadamard rotations ($H_k$) precisely matched to hardware block limits.

Incorporates format-specific (E4M3 vs E8M0) scale search into the compensation loop.

Dynamic inverse rotations are applied via zero-overhead CUTLASS epilogue fusions.

Effectively closes the accuracy gap between vendor-agnostic MXFP4 and proprietary NVFP4.
