Quantization

NVFP4 and Blackwell FP4 Systems

NVFP4 is NVIDIA's proprietary 4-bit floating-point format designed specifically for the Blackwell GPU architecture.

Published June 1, 2026 · By MortalApps · 5 min read · ~955 words

TL;DR

NVFP4 is NVIDIA's proprietary 4-bit floating-point format designed specifically for the Blackwell GPU architecture.
Its purpose is to reach 4-bit model density while limiting the severe accuracy degradation typical of standard integer or global-scale 4-bit formats.
The primary optimization is a two-level micro-block scaling scheme in which each 16-element block carries its own high-precision E4M3 scale.
The key engineering insight is that a block of only 16 neighboring elements rarely contains large variance, so the limited dynamic range of a 4-bit format can capture the local data distribution accurately.


Why This Matters

As the AI industry hits the memory capacity limits of a single physical node, moving inference from 8-bit to 4-bit becomes essential for serving trillion-parameter agents natively. NVIDIA's Blackwell architecture reallocates silicon accordingly, sharply reducing FP64 performance (the FP64 to FP32 ratio dropped from 1:2 to 1:64) and dedicating a large share of die area to NVFP4 Tensor Cores. This transition enables an NVL72 rack to deliver […],440 PFLOPS of dense NVFP4 inference compute, or up to […],880 PFLOPS with 2:4 structured sparsity, across 72 B200 GPUs.

Core Intuition

Four bits provide only 16 code points to cover a neural network's dynamic range. If a single global scale factor is applied to thousands of parameters, one outlier forces the scale to grow so large that most of the remaining, ordinary parameters round to zero, wiping out the information they carry. NVFP4 addresses this by segmenting the data into "micro-blocks" of just 16 elements. Because every group of 16 values gets its own high-precision scale factor, the scale tracks the local variance, and after scaling the block's values fit within the [-6, 6] representable range of the 4-bit format.
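The outlier effect described above can be reproduced with a small pure-Python simulation. This is a minimal sketch under stated assumptions, not NVIDIA's quantizer: the helper names (`quantize_e2m1`, `fake_quantize`) and the toy weights are hypothetical, the scale is kept as an ordinary Python float rather than rounded to E4M3, and each scale is chosen so that the block maximum maps to 6.

```python
# Magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def quantize_e2m1(x, scale):
    """Round x/scale to the nearest E2M1 value, then scale back."""
    if scale == 0.0:
        return 0.0
    mag = min(abs(x) / scale, 6.0)  # clamp to the largest E2M1 magnitude
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return (nearest if x >= 0 else -nearest) * scale

def fake_quantize(values, block_size):
    """Quantize with one scale per block (assumption: scale = block max / 6)."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 6.0
        out.extend(quantize_e2m1(v, scale) for v in block)
    return out

def count_zeros(values):
    return sum(1 for v in values if v == 0.0)

# Hypothetical toy data: 64 small weights with a single large outlier.
weights = [0.01 * ((i % 7) + 1) for i in range(64)]
weights[5] = 8.0

# One scale for everything versus one scale per 16 elements.
zeros_global = count_zeros(fake_quantize(weights, block_size=len(weights)))
zeros_block = count_zeros(fake_quantize(weights, block_size=16))
print("rounded to zero with a global scale:", zeros_global)
print("rounded to zero with 16-element blocks:", zeros_block)
```

With this toy input, the global scale rounds every weight except the outlier to zero, while block scaling confines the damage to the 15 neighbors that share the outlier's block.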

Technical Deep Dive

The NVFP4 format structure is built upon an E2M1 baseline, possessing 1 sign bit, 2 exponent bits, and 1 mantissa bit. It introduces two major architectural deviations from other 4-bit specifications:

Block Size 16: Elements are grouped into chunks of 16 along the contiguous reduction dimension, providing highly localized adaptation compared to broader block formats.

E4M3 Scale Format: Rather than using simple integer scales, NVFP4 utilizes an 8-bit E4M3 scale for every block. This high-precision scale effectively balances dynamic range and precision at the block level.
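To make the two bit layouts concrete, here is a hedged sketch that decodes a 4-bit E2M1 code and an 8-bit E4M3 scale, then combines them to dequantize a block. The function names are hypothetical, the sign bit is assumed to be the most significant bit, E4M3 is assumed to follow the common FP8 convention (bias 7, no infinities, a single NaN pattern per sign), and `tensor_scale` stands in for the second, per-tensor level of scaling. How codes and scales are packed in memory is not covered here.

```python
def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (assumed layout: sign | 2 exp bits | 1 mantissa bit, bias 1)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11
    man = code & 1
    if exp == 0:
        # Subnormal: no implicit leading one, so the only values are 0 and 0.5.
        return sign * man * 0.5
    return sign * (1.0 + man / 2.0) * 2.0 ** (exp - 1)

def decode_e4m3(byte):
    """Decode an 8-bit E4M3 value (assumed layout: sign | 4 exp bits | 3 mantissa bits, bias 7)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0b1111
    man = byte & 0b111
    if exp == 0b1111 and man == 0b111:
        return float("nan")  # the only NaN pattern; this variant has no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def dequantize_block(codes, scale_byte, tensor_scale=1.0):
    """Reconstruct real values: element * per-block scale * per-tensor scale."""
    block_scale = decode_e4m3(scale_byte)
    return [decode_e2m1(c) * block_scale * tensor_scale for c in codes]

# The eight non-negative E2M1 values, in code order.
magnitudes = [decode_e2m1(c) for c in range(8)]
print(magnitudes)

# Three codes sharing one block scale of 0.5 (E4M3 byte 0x30).
print(dequantize_block([0b0111, 0b1111, 0b0010], 0x30))
```

The point of the sketch is the multiplication in `dequantize_block`: each stored element is only meaningful together with its block's scale, which is why the scale format and block size matter as much as the 4-bit element format.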

Feature	FP4 (Standard)	MXFP4 (OCP)	NVFP4 (NVIDIA)
Format Structure	E2M1	E2M1	E2M1
Block Size	None (Per-tensor)	32	16
Scale Format	Software FP32	E8M0	E4M3
Hardware Scaling	No	Yes	Yes
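The block sizes and scale widths in the table translate directly into storage cost, because each block's scale is amortized over its elements. The arithmetic below is a small illustration with hypothetical helper names. It ignores any per-tensor scale, padding, and packing overhead, and the parameter count is an arbitrary example, not a figure from a real model.

```python
def effective_bits_per_element(element_bits, scale_bits, block_size):
    """Element bits plus the block scale amortized over the block."""
    return element_bits + scale_bits / block_size

def weight_bytes(num_params, bits_per_element):
    """Approximate storage for num_params weights at the given effective width."""
    return num_params * bits_per_element / 8

mxfp4_bits = effective_bits_per_element(4, scale_bits=8, block_size=32)
nvfp4_bits = effective_bits_per_element(4, scale_bits=8, block_size=16)
print("MXFP4 bits per element:", mxfp4_bits)
print("NVFP4 bits per element:", nvfp4_bits)

# Hypothetical example: one trillion parameters.
num_params = 10 ** 12
print("NVFP4 weight storage (GB):", weight_bytes(num_params, nvfp4_bits) / 1e9)
```

Under these assumptions, halving the block size from 32 to 16 raises the cost from 4.25 to 4.5 bits per element, a modest price for the finer-grained scaling.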

This deep hardware-level integration within Blackwell's fifth-generation Tensor Cores allows NVFP4 to perform accurate math with a significantly lower risk of accuracy collapse on massive models.

Key Takeaways

NVFP4 utilizes the E2M1 structure, whose representable values span -6 to 6.

It employs a two-level micro-block scaling strategy: a fine-grained scale per block, combined with a coarser per-tensor scale.

Scale factors are determined for blocks of exactly 16 elements.

The scale format is an 8-bit E4M3 floating-point value.

Hardware-native scaling directly inside Blackwell Tensor Cores eliminates software dequantization latency overheads.
