← Infrastructure CUDA
Infrastructure

Online Softmax Computation

Standard Softmax algorithms require loading the full row of data twice to compute the maximum value (for stability) and the sum of exponentials, creating

Source: mortalapps.com
TL;DR
  • Standard Softmax algorithms require loading the full row of data twice to compute the maximum value (for stability) and the sum of exponentials, creating massive memory overhead.
  • Online Softmax computes the operation incrementally in distinct blocks, allowing the memory complexity of the algorithm to drop exponentially from to .
  • It utilizes a specific mathematical scaling factor, , to algorithmically update running sums and outputs whenever a new block introduces a higher maximum value.
  • This precise mathematical breakthrough is the foundational engine driving the FlashAttention algorithm, enabling infinite context LLMs.

Why This Matters

In standard Transformer architectures, computing Attention mathematically requires generating a massive matrix (Sequence Length Sequence Length) representing the interaction scores between all possible tokens. For a sequence length of 100K tokens, this natively requires storing 10 billion FP16 elements (~20GB) just to hold the intermediate softmax state. This makes long-context models physically impossible to run inside fast SRAM. Online Softmax mathematically restructures the operation so the massive matrix is never fully materialized in HBM, revolutionizing LLM efficiency and making modern context windows viable.

Core Intuition

Calculating standard Softmax is conceptually like trying to grade a test on a complex curve. The grader must first look at every single student's score in the entire school to find the absolute maximum. Then, they must look at everyone's score again to sum the adjusted scores, and finally look a third time to assign the final grade. If you have 1 million students, you physically cannot fit them all in the grading room (SRAM). Online Softmax lets the grader process students precisely as they walk through the door. The grader keeps a "running maximum" and a "running sum." If a new student walks in with a score higher than the current maximum, the grader applies a mathematical discount (a scale factor) to the running sum and output to retroactively correct them. This elegantly updates the grading history without ever needing to see the old students again.

Technical Deep Dive

Standard Softmax strictly requires numerical stabilization due to the threat of exponential overflow (where computing results in inf). It relies on the mathematical property of shift-invariance 10:

where and .

To process calculations in sub-blocks (tiles) without losing mathematical correctness, the algorithm maintains active running statistics in registers: (maximum), (denominator sum), and (unnormalized output).

When transitioning calculation from Block A (featuring an old max ) to Block B (featuring a new local max ):

New Global Max: Evaluate .

Rescaling Factor: The exact algebraic relationship between exponentials shifted by different maximums is exactly the factor . Because is guaranteed to be , this scale factor is always , structurally preventing overflow.

State Update: We multiply the previous states to logically reflect the new maximum baseline:

Accumulate: Compute the exponentials specifically for the new block using , add them to , and accumulate the new weighted sum into . Finally, divide by to yield the mathematically exact normalized output.

Key Takeaways

Online Softmax fundamentally allows the computation of global statistics incrementally across entirely independent memory blocks.
The mathematical scaling factor perfectly updates previous running sums to the new maximum, maintaining strict mathematical equivalence to standard Softmax.
Because the maximum value monotonically increases over time, the scaling factor is guaranteed to always be , ensuring absolute numerical stability.
This elegant mathematical trick is the definitive engine driving FlashAttention, entirely eliminating the catastrophic memory bottleneck of transformers.