
Top-k and Nucleus Sampling

  • Top-k sampling limits the model's next-token selection to the k most probable candidates, preventing the inclusion of extremely low-probability "tail" words.
  • Nucleus sampling (Top-p) dynamically adjusts the pool of candidates by selecting the smallest set of tokens whose cumulative probability mass exceeds a threshold p.
  • Both techniques are essential for balancing the trade-off between coherence (staying on topic) and diversity (avoiding repetitive or robotic text).
  • These methods replace simple greedy decoding, which often leads to repetitive loops, and random sampling, which often leads to incoherent gibberish.

Why It Matters

01
Creative Writing Assistants

Platforms like Sudowrite or Jasper use nucleus sampling to help authors generate story continuations. By adjusting the p value, these tools allow users to toggle between "conservative" mode (lower p, more logical plot progression) and "creative" mode (higher p, more surprising plot twists). This flexibility is crucial for maintaining the balance between thematic consistency and narrative innovation.

02
Customer Support Chatbots

Companies like Intercom or Zendesk implement these sampling techniques to ensure their AI agents sound professional yet human. By using a moderate Top-k value, they prevent the model from hallucinating obscure or irrelevant technical jargon while still ensuring the responses don't sound like a static, pre-programmed script. This creates a natural conversational flow that feels helpful and personalized.

03
Code Completion Engines

Tools like GitHub Copilot utilize these sampling methods to suggest code snippets in IDEs. Because code syntax is strictly defined, the model often uses a very low k or p value to ensure the generated code is syntactically correct and follows the project's existing patterns. When the model encounters a comment or a docstring, it may temporarily increase the sampling parameters to allow for more natural, descriptive language generation.

How It Works

The Challenge of Text Generation

When an LLM generates text, it does not "know" the answer; it calculates a probability distribution over its entire vocabulary. If you ask a model to complete the sentence "The cat sat on the...", the model might assign 40% probability to "mat", 20% to "floor", 5% to "sofa", and 0.0001% to "refrigerator". If we always pick the highest probability (greedy decoding), the model becomes predictable and prone to repeating the same phrases. If we pick randomly based on the full distribution, the model will eventually pick the 0.0001% chance word, leading to nonsensical output. Sampling strategies like Top-k and Nucleus sampling are the "filters" we apply to this distribution to ensure the model picks from a "sensible" subset of words.
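
A minimal sketch of this trade-off, using the toy "The cat sat on the..." distribution from above (the probabilities are illustrative, not real model output):

import torch

vocab = ["mat", "floor", "sofa", "refrigerator"]
probs = torch.tensor([0.40, 0.20, 0.05, 0.0001])
probs = probs / probs.sum()                    # re-normalize the toy values

# Greedy decoding: always the argmax, so always "mat".
print("Greedy pick:", vocab[torch.argmax(probs).item()])

# Unfiltered random sampling: draw from the full distribution.
torch.manual_seed(0)
draws = torch.multinomial(probs, num_samples=10, replacement=True)
print("Random samples:", [vocab[i] for i in draws])
# Given enough draws, even the tiny-probability "refrigerator" will appear.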


Top-k Sampling: The Fixed Filter

Top-k sampling is a static approach. We define a hyperparameter k (e.g., k = 50). At every generation step, the model identifies the 50 tokens with the highest probability. It then sets the probability of all other tokens in the vocabulary to zero and re-normalizes the remaining 50 probabilities so they sum to 1.0. The model then draws a sample from this truncated distribution. The primary advantage is that it effectively prunes the "long tail" of the distribution: the thousands of irrelevant words that would make the sentence incoherent. However, a fixed k is rigid. If the model is very confident (e.g., "The capital of France is..."), k = 50 might be too wide, forcing the model to consider unlikely words. If the model is uncertain, the same k might be too narrow, excluding valid options.
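
A minimal sketch of just the Top-k step described above (the function name and values are illustrative; the combined implementation appears under Sample Code below):

import torch
import torch.nn.functional as F

def top_k_filter(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Zero out everything outside the k most probable tokens, then re-normalize."""
    probs = F.softmax(logits, dim=-1)
    top_vals, top_idx = torch.topk(probs, k)
    filtered = torch.zeros_like(probs)
    filtered[top_idx] = top_vals          # keep only the k survivors
    return filtered / filtered.sum()      # truncated probs sum to 1.0 again

torch.manual_seed(0)
logits = torch.randn(50_000)              # simulated one-step LLM output
dist = top_k_filter(logits, k=50)
print(int((dist > 0).sum()))              # exactly 50 tokens remain in play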


Nucleus Sampling (Top-p): The Dynamic Filter

Nucleus sampling, introduced by Holtzman et al. (2019), addresses the rigidity of Top-k. Instead of picking a fixed number of tokens, we pick a probability threshold p (e.g., p = 0.9). The model sorts all tokens by probability and calculates the cumulative sum. It stops adding tokens to the "nucleus" as soon as the sum of their probabilities exceeds p. If the model is confident, the nucleus might contain only one or two tokens. If the model is uncertain, the nucleus might expand to include hundreds of tokens. This dynamic behavior allows the model to be creative when the context allows for it and precise when the context demands it.
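
A matching sketch of the Top-p step on its own (again illustrative), showing how the nucleus shrinks for a confident distribution and grows for an uncertain one:

import torch
import torch.nn.functional as F

def top_p_filter(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    keep = (cum - sorted_probs) < p       # exclusive prefix: first token always kept
    filtered = torch.zeros_like(probs)
    filtered[sorted_idx[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

confident = torch.tensor([5.0, 1.0, 0.5, 0.1])    # one token dominates
uncertain = torch.zeros(1000)                     # perfectly flat distribution
print(int((top_p_filter(confident) > 0).sum()))   # nucleus of 1 token
print(int((top_p_filter(uncertain) > 0).sum()))   # nucleus of 900 tokens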


Edge Cases and Interaction

In practice, these methods are often combined. One might apply Top-k first to remove extreme noise and then apply Top-p to the remaining tokens. A critical edge case occurs when the model is extremely confident in a single token: if p is lower than that token's probability, the nucleus collapses to just that token and sampling effectively degenerates into greedy decoding. Furthermore, when the model enters a "repetition loop," these sampling methods can sometimes struggle to break the cycle because the probability of the repeated token remains high in the model's internal state. Advanced practitioners often use "frequency penalties" or "presence penalties" alongside sampling to explicitly discourage the model from reusing tokens, effectively modifying the logits before the sampling stage occurs (a minimal sketch follows).
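
A minimal sketch of such logit-level penalties (the helper name and default values here are hypothetical, loosely following the common frequency/presence-penalty formulation):

import torch

def apply_penalties(logits: torch.Tensor, generated: list[int],
                    freq_penalty: float = 0.5,
                    presence_penalty: float = 0.3) -> torch.Tensor:
    """Subtract penalties from the logits of already-generated tokens,
    applied BEFORE softmax / Top-k / Top-p."""
    logits = logits.clone()
    counts = torch.bincount(torch.tensor(generated, dtype=torch.long),
                            minlength=logits.shape[0])
    logits -= freq_penalty * counts                    # grows with each repeat
    logits -= presence_penalty * (counts > 0).float()  # flat, once per seen token
    return logits

logits = torch.randn(50_000)
penalized = apply_penalties(logits, generated=[42, 42, 7])
print(logits[42] - penalized[42])   # tensor(1.3000): 2 * 0.5 + 0.3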

Common Pitfalls

  • "Higher sampling values always mean better quality." This is incorrect; increasing or too much allows the model to select low-probability "tail" tokens, which rapidly degrades coherence. Quality is a balance, not a linear progression, and optimal values are highly task-dependent.
  • "Top-k and Nucleus sampling are mutually exclusive." In many production systems, these are used in tandem to provide a "double-filter" approach. Top-k acts as a coarse filter to remove extreme outliers, while Top-p acts as a fine-grained filter to handle the remaining distribution.
  • "These sampling methods fix model hallucinations." Sampling only controls how the model selects from its internal probability distribution; it does not change the model's underlying knowledge. If the model is confidently wrong, sampling will simply make it "confidently wrong" in a variety of creative ways.
  • "Temperature is the same as Top-p." Temperature modifies the shape of the entire distribution by flattening or sharpening it, whereas Top-p truncates the distribution entirely. They are complementary tools that serve different roles in controlling the generation process.

Sample Code

Python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, top_k: int = 50,
                      top_p: float = 0.9, temperature: float = 1.0) -> int:
    """Combined top-k + nucleus (top-p) sampling on a 1-D logit vector."""
    logits = logits / temperature
    probs  = F.softmax(logits, dim=-1)           # shape: [vocab_size]

    # ── Top-k: zero out all but the k highest-probability tokens ──
    top_k_vals, top_k_idx = torch.topk(probs, top_k)
    filtered = torch.zeros_like(probs)
    filtered.scatter_(0, top_k_idx, top_k_vals)  # restored to vocab-space

    # ── Top-p (nucleus): within those k tokens keep the smallest
    #    cumulative-probability set that still exceeds top_p ──
    sorted_vals, sorted_ord = torch.sort(filtered, descending=True)
    cum = torch.cumsum(sorted_vals, dim=-1)
    # Remove tokens whose cumulative total *before* them already reached
    # top_p; the exclusive prefix (cum - sorted_vals) guarantees that at
    # least one token always survives
    remove = (cum - sorted_vals) >= top_p
    filtered.scatter_(0, sorted_ord, sorted_vals * (~remove).float())

    filtered /= filtered.sum()
    return torch.multinomial(filtered, num_samples=1).item()

torch.manual_seed(7)
logits = torch.randn(50000)               # simulated one-step LLM output
token  = sample_next_token(logits, top_k=50, top_p=0.9)
print(f"Sampled token index: {token}")
# Output: Sampled token index: 23174

Key Terms

Autoregressive Generation
The process where a model generates text one token at a time, using its previous outputs as input for the next step. This sequential nature means errors can compound, making sampling strategies vital for maintaining quality.
Softmax Layer
The final layer of an LLM that converts raw model outputs (logits) into a probability distribution that sums to 1.0. This distribution represents the model's confidence in which token should come next.
Greedy Decoding
A strategy where the model always selects the single token with the highest probability at each step. While computationally efficient, it often results in repetitive or dull text that can get stuck in loops.
Temperature Scaling
A hyperparameter used to reshape the probability distribution before sampling, where values less than 1.0 make the distribution "sharper" (more confident) and values greater than 1.0 make it "flatter" (more random).
Cumulative Probability
The sum of probabilities of a set of tokens, usually ordered from most to least likely. Nucleus sampling uses this to determine a dynamic cutoff point for candidate selection.
Vocabulary
The complete set of unique tokens (words, subwords, or characters) that the LLM is capable of predicting. Modern LLMs typically have vocabularies ranging from 30,000 to over 100,000 tokens.