Online Softmax Computation
Standard Softmax algorithms require loading the full row of data twice to compute the maximum value (for stability) and the sum of exponentials, creating
Source: mortalapps.com- Standard Softmax algorithms require loading the full row of data twice to compute the maximum value (for stability) and the sum of exponentials, creating massive memory overhead.
- Online Softmax computes the operation incrementally in distinct blocks, allowing the memory complexity of the algorithm to drop exponentially from
to
.
- It utilizes a specific mathematical scaling factor,
, to algorithmically update running sums and outputs whenever a new block introduces a higher maximum value.
- This precise mathematical breakthrough is the foundational engine driving the FlashAttention algorithm, enabling infinite context LLMs.
Why This Matters
In standard Transformer architectures, computing Attention mathematically requires generating a massive matrix (Sequence Length
Sequence Length) representing the interaction scores between all possible tokens. For a sequence length of 100K tokens, this natively requires storing 10 billion FP16 elements (~20GB) just to hold the intermediate softmax state. This makes long-context models physically impossible to run inside fast SRAM. Online Softmax mathematically restructures the operation so the massive
matrix is never fully materialized in HBM, revolutionizing LLM efficiency and making modern context windows viable.
Core Intuition
Calculating standard Softmax is conceptually like trying to grade a test on a complex curve. The grader must first look at every single student's score in the entire school to find the absolute maximum. Then, they must look at everyone's score again to sum the adjusted scores, and finally look a third time to assign the final grade. If you have 1 million students, you physically cannot fit them all in the grading room (SRAM). Online Softmax lets the grader process students precisely as they walk through the door. The grader keeps a "running maximum" and a "running sum." If a new student walks in with a score higher than the current maximum, the grader applies a mathematical discount (a scale factor) to the running sum and output to retroactively correct them. This elegantly updates the grading history without ever needing to see the old students again.
Technical Deep Dive
Standard Softmax strictly requires numerical stabilization due to the threat of exponential overflow (where computing results in inf). It relies on the mathematical property of shift-invariance 10:

where
and .
To process calculations in sub-blocks (tiles) without losing mathematical correctness, the algorithm maintains active running statistics in registers: (maximum),
(denominator sum), and
(unnormalized output).
When transitioning calculation from Block A (featuring an old max ) to Block B (featuring a new local max
):
New Global Max: Evaluate
.
Rescaling Factor: The exact algebraic relationship between exponentials shifted by different maximums is exactly the factor . Because
is guaranteed to be
, this scale factor is always
, structurally preventing overflow.
State Update: We multiply the previous states to logically reflect the new maximum baseline:


Accumulate: Compute the exponentials specifically for the new block using , add them to
, and accumulate the new weighted sum into
. Finally, divide
by
to yield the mathematically exact normalized output.