Sliding Window Attention

Sliding Window Attention (SWA) restricts a token's attention solely to a fixed-size window of preceding tokens, rather than the entire context history.

Source: mortalapps.com
TL;DR
  • Sliding Window Attention (SWA) restricts a token's attention solely to a fixed-size window of preceding tokens, rather than the entire context history.
  • It uses a rolling buffer for the KV cache, capping cache memory at O(W) for a window of W tokens, regardless of the sequence length.
  • SWA leverages the "receptive field" concept: through stacked transformer layers, high-level tokens indirectly aggregate information far beyond the window size.
  • Popularized by architectures like Mistral 7B, it enables processing arbitrarily long text streams at a fixed per-token computational and memory cost.

Why This Matters

Full attention costs O(N²) time and O(N) KV-cache memory per request for a sequence of N tokens. For agents parsing endless streams of logs or reading entire codebases, N grows without bound, and the KV cache eventually exhausts available memory. By enforcing a hard maximum on the KV cache via a window of W tokens (Mistral 7B, for example, uses W = 4096), SWA decouples per-token compute and memory from the total sequence length. Once the window is saturated the cache stops growing, so context length alone can no longer cause an out-of-memory (OOM) error, and streams of unbounded length can be processed.
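To make the memory argument concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values) and the window of 4096 are illustrative assumptions chosen to resemble a Mistral-7B-sized model; the helper names `kv_cache_bytes` and `cached_tokens` are hypothetical, and the figures cover a single sequence (batch size 1).

```python
def kv_cache_bytes(num_cached, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV-cache size for one sequence (assumed model shape)."""
    # Factor 2: one Key tensor and one Value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_cached


def cached_tokens(seq_len, window=None):
    """Tokens held in the cache: all of them, or at most `window` under SWA."""
    return seq_len if window is None else min(seq_len, window)


WINDOW = 4096  # assumed window size W

for seq_len in (4_096, 32_768, 1_000_000):
    full_mib = kv_cache_bytes(cached_tokens(seq_len)) / 2**20
    swa_mib = kv_cache_bytes(cached_tokens(seq_len, WINDOW)) / 2**20
    print(f"N={seq_len:>9,}  full={full_mib:>9,.0f} MiB  SWA={swa_mib:>5,.0f} MiB")
```

Under these assumptions the full-attention cache grows linearly with N, while the SWA cache plateaus as soon as N reaches the window size.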

Core Intuition

When reading a book, to understand a word in a later chapter, you rarely need to explicitly cross-reference a specific word from chapter 1. You only need the local context (the current paragraph) and the high-level plot summary. SWA enforces this structurally. A token only "sees" the previous W tokens. However, because transformers have many layers, a token at Layer 2 sees W tokens from Layer 1, and each of those Layer 1 tokens saw W tokens from Layer 0. Thus, the effective "receptive field" expands linearly with depth, allowing the top layer to indirectly access information up to roughly W × L tokens back, where L is the number of layers, without holding it all in memory.
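The layer-by-layer growth of the receptive field can be checked with a small simulation. This is a minimal sketch under one stated assumption: the window of W positions includes the current token, so each layer extends the reach by W − 1 positions. The function name `reachable_positions` is hypothetical.

```python
def reachable_positions(pos, window, num_layers):
    """Input positions that can influence `pos` after `num_layers` SWA layers."""
    frontier = {pos}
    for _ in range(num_layers):
        expanded = set()
        for p in frontier:
            # One layer lets position p read positions p-window+1 .. p
            # of the layer below (clamped at the start of the sequence).
            expanded.update(range(max(0, p - window + 1), p + 1))
        frontier = expanded
    return frontier


window, pos = 4, 31
for num_layers in (1, 2, 3):
    reach = reachable_positions(pos, window, num_layers)
    span = pos - min(reach)
    print(f"layers={num_layers}: earliest position={min(reach)}, span={span}")
    # The span grows linearly with depth.
    assert span == num_layers * (window - 1)
```

With a realistic window and dozens of layers, the same linear growth yields a reach of many tens of thousands of tokens, even though each layer only ever looks at W positions.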

Technical Deep Dive

During standard attention, the KV cache grows with every generated token. In SWA, the system allocates a Rolling Buffer Cache of fixed size W. The slot for token i is computed using modulo arithmetic: cache_position = i % W. When token W + 1 is generated, its Key and Value overwrite the Key and Value of token 1 in the physical array, since both map to slot 1. The self-attention operation uses a BlockDiagonalCausalMask (as provided by xFormers), and the queries attend only to the valid entries in the rolling buffer. Mathematically, a token at position i of layer k attends to the hidden states at the W most recent positions, max(0, i − W + 1) through i, at layer k−1. Consequently, after k layers information can travel across roughly k × W positions, so the top layer accesses information from a far wider context than any individual layer's window.
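The rolling buffer described above can be sketched in a few lines. This is an illustrative toy, not the code of any particular inference engine: the class name `RollingKVCache` and its methods are hypothetical, keys and values are plain strings instead of tensors, and positions are 0-based, so token W overwrites token 0.

```python
class RollingKVCache:
    """Fixed-size KV cache: the token at position i lives in slot i % window."""

    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.positions = [None] * window  # absolute position held by each slot
        self.next_pos = 0                 # position of the next token to append

    def append(self, key, value):
        """Store a new token, overwriting the oldest entry once the buffer is full."""
        slot = self.next_pos % self.window
        self.keys[slot] = key
        self.values[slot] = value
        self.positions[slot] = self.next_pos
        self.next_pos += 1
        return slot

    def visible(self):
        """Return the valid (position, key, value) entries, oldest first."""
        entries = [
            (p, k, v)
            for p, k, v in zip(self.positions, self.keys, self.values)
            if p is not None
        ]
        return sorted(entries, key=lambda entry: entry[0])


cache = RollingKVCache(window=4)
for i in range(6):
    cache.append(f"k{i}", f"v{i}")

print(cache.keys)                            # physical layout of the buffer
print([p for p, _, _ in cache.visible()])    # logical window, oldest first
```

After six appends with a window of four, tokens 4 and 5 have overwritten tokens 0 and 1 in slots 0 and 1, and the next query would attend only to positions 2 through 5. Tracking the absolute position per slot is one simple way to recover the logical order; a real implementation would more likely rotate or mask tensor indices.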

Key Takeaways

  • Sliding Window Attention strictly caps attention to the last W tokens.
  • The KV cache is implemented as a fixed-size Rolling Buffer using modulo arithmetic (index % W).
  • Tokens in higher layers access distant history through the expanding receptive field of deep transformer layers.
  • It delivers strictly O(W) memory consumption and O(W) generation time per token.