
Temperature Scaling in Inference

  • Temperature scaling is a post-processing technique used to calibrate the confidence of neural network predictions.
  • It adjusts the sharpness of the probability distribution by dividing the logits by a scalar parameter before applying the softmax function.
  • High temperature values (T > 1) flatten the distribution, increasing entropy and making the model less confident.
  • Low temperature values (T < 1) sharpen the distribution, concentrating probability mass on the most likely tokens.
  • It is a critical tool for aligning model-assigned probabilities with actual empirical accuracy in classification and generation tasks.
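The effect described in the bullets above is easy to see numerically. A minimal sketch (plain NumPy, toy logits) showing how dividing the logits by T changes the entropy of the resulting distribution:

```python
import numpy as np

def temperature_softmax(logits, temperature):
    # Divide logits by T, then apply a numerically stable softmax
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; higher means a flatter distribution
    return -np.sum(p * np.log(p))

logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    p = temperature_softmax(logits, t)
    print(f"T={t}: probs={p.round(3)}, entropy={entropy(p):.3f}")
```

Running this shows the entropy rising monotonically as T increases, matching the "flattening" and "sharpening" behavior described above.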

Why It Matters

01
Medical diagnostics

In the domain of medical diagnostics, AI systems are often used to triage patient data or suggest potential conditions. Because the cost of a false positive or negative is extremely high, these systems must be well-calibrated so that a "high confidence" score actually correlates with a high probability of correctness. By applying temperature scaling to the output of clinical NLP models, developers ensure that the system flags cases for human review only when the model is genuinely uncertain, thereby optimizing the workload of medical professionals.
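In settings like this, the temperature is typically fitted on a held-out validation set by minimizing negative log-likelihood. The following is a minimal sketch under stated assumptions: the validation logits and labels are synthetic stand-ins (a real system would use a frozen clinical model's outputs), and a simple grid search replaces a proper optimizer.

```python
import numpy as np

def nll(temperature, logits, labels):
    # Average negative log-likelihood of the true labels under scaled softmax
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Synthetic validation set: informative but deliberately overconfident logits
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(500, 3)) * 5.0
val_labels = (val_logits + rng.normal(size=(500, 3)) * 3.0).argmax(axis=1)

# Grid search over candidate temperatures; pick the one with lowest NLL
temps = np.linspace(0.25, 5.0, 200)
losses = np.array([nll(t, val_logits, val_labels) for t in temps])
best_t = temps[losses.argmin()]
print(f"Fitted temperature: {best_t:.2f}")
```

Note that only the scalar temperature is fitted; the model's weights (here, the logits) are never touched, which is what makes this a post-hoc calibration step.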

02
Financial sentiment analysis

Financial institutions utilize LLMs to analyze sentiment in market news and social media to inform trading strategies. In this context, the model's confidence in a "bullish" or "bearish" sentiment is used to weight the size of a trade. If a model is overconfident, it might trigger large, risky trades based on noisy data. Temperature scaling is used here to dampen the model's confidence during periods of high market volatility, preventing the automated system from overreacting to ambiguous information.

03
Content moderation

Content moderation platforms, such as those used by social media giants like Meta or Discord, employ classifiers to detect toxic language. These systems often face the challenge of "distribution shift," where the language used by users changes over time. Temperature scaling is frequently applied as a post-hoc calibration step to ensure that the threshold for "flagging" content remains consistent even as the underlying data distribution drifts. This helps maintain a stable user experience by preventing the system from becoming overly aggressive or permissive as it encounters new slang or evolving patterns of speech.
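To make the thresholding idea concrete, here is a toy sketch. The classifier logits and the flagging threshold are hypothetical; the point is only that recalibrating with a temperature can change whether a borderline case crosses the threshold.

```python
import numpy as np

def temperature_softmax(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

FLAG_THRESHOLD = 0.90  # hypothetical moderation threshold

# Hypothetical toxicity-classifier logits for [toxic, not_toxic]
logits = [3.0, 0.0]

for t in (1.0, 1.8):
    p_toxic = temperature_softmax(logits, t)[0]
    print(f"T={t}: p(toxic)={p_toxic:.2f} -> flag={p_toxic > FLAG_THRESHOLD}")
```

At T=1.0 the raw probability clears the threshold, while after recalibration at T=1.8 the same logits fall below it; in practice the temperature would be refitted as the data distribution drifts.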

How it Works

The Intuition of Probability Control

When we interact with a Large Language Model, we are essentially asking it to predict the next token in a sequence. The model generates a list of scores for every possible word in its vocabulary. These raw scores, or logits, are then passed through a Softmax function to turn them into probabilities. However, these raw probabilities are often misleading. A model might be "overconfident," assigning a 99% probability to a word that is actually incorrect. Temperature scaling is the "volume knob" for this confidence. By introducing a single scalar value, T, we can manipulate the shape of the probability distribution without changing the underlying model weights.


Why Models Need Calibration

Modern neural networks, especially deep Transformers, are notoriously poorly calibrated. Because they are trained using cross-entropy loss, they are incentivized to be as confident as possible to minimize error. This leads to a phenomenon where the model's confidence does not reflect its actual accuracy. If you ask a model to generate text, a high temperature makes the model "creative" or "random" by spreading probability across many words. A low temperature makes the model "deterministic" or "conservative" by focusing almost entirely on the single most likely word. Understanding this allows practitioners to balance the trade-off between diversity and precision.
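The diversity/precision trade-off can be sketched with a toy sampler. The five-word vocabulary and the logits below are hypothetical, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = np.array(["the", "a", "cat", "dog", "runs"])
logits = np.array([4.0, 3.5, 1.0, 0.5, 0.2])

def token_probs(logits, temperature):
    # Temperature-scaled softmax over the toy vocabulary
    z = logits / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_tokens(logits, temperature, n=15):
    # Draw n tokens from the scaled distribution
    return rng.choice(vocab, size=n, p=token_probs(logits, temperature))

print("T=0.3:", " ".join(sample_tokens(logits, 0.3)))  # almost entirely "the"/"a"
print("T=1.5:", " ".join(sample_tokens(logits, 1.5)))  # noticeably more varied
```

Low temperature concentrates nearly all probability mass on the top one or two tokens, while high temperature spreads it across the vocabulary, which is exactly the "conservative" versus "creative" behavior described above.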


The Mechanics of Softmax Manipulation

The standard Softmax function is defined as softmax(z_i) = exp(z_i) / Σ_j exp(z_j), where z_i is the logit for token i. When we introduce temperature, the formula becomes softmax_T(z_i) = exp(z_i / T) / Σ_j exp(z_j / T). When T = 1, the distribution remains unchanged. As T → ∞, each exponent z_i / T approaches zero, and every exp(z_i / T) becomes 1. Consequently, the probability distribution becomes uniform: every word becomes equally likely, representing maximum uncertainty. Conversely, as T → 0, the largest logit dominates the exponentiation, effectively turning the output into a "hard" decision (a one-hot vector). This mechanism is not just for text generation; it is a standard technique in uncertainty estimation and out-of-distribution detection, allowing systems to signal when they are unsure of their own output.
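Both limits can be checked numerically. A minimal sketch using a very large and a very small temperature:

```python
import numpy as np

def temperature_softmax(logits, temperature):
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]

# As T grows large, the distribution approaches uniform (maximum entropy)
print(temperature_softmax(logits, 100.0))  # ~[0.34, 0.33, 0.33]

# As T approaches zero, it approaches a one-hot vector on the largest logit
print(temperature_softmax(logits, 0.01))   # ~[1.0, 0.0, 0.0]
```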

Common Pitfalls

  • Mistaking temperature for a training parameter: Many learners believe temperature is a weight that needs to be updated via backpropagation. In reality, temperature scaling is almost always a post-processing step applied to a frozen model, usually optimized on a separate validation set.
  • Assuming temperature changes the model's knowledge: Temperature does not change the underlying logic or the "intelligence" of the model; it only changes the distribution of the output. The model's internal representation of the data remains identical regardless of the temperature used during inference.
  • Ignoring numerical stability: Beginners often implement softmax by simply exponentiating the logits. This leads to NaN errors due to floating-point overflow; one must always subtract the maximum logit value before exponentiating to keep the values within a stable range.
  • Over-tuning the temperature: Some practitioners try to find a "perfect" temperature for every single input. Temperature is typically a global hyperparameter or a task-specific constant; trying to optimize it per-token often leads to overfitting on the validation set and poor generalization.
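The numerical-stability pitfall can be demonstrated directly. A minimal sketch with deliberately large logits, comparing the naive and the stabilized implementation:

```python
import numpy as np

logits = np.array([1000.0, 999.0, 998.0])

# Naive softmax: np.exp(1000) overflows to inf, and inf / inf yields nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()

# Stable softmax: subtracting the max logit leaves the result mathematically
# unchanged but keeps every exponent at or below zero
shifted = logits - logits.max()
stable = np.exp(shifted) / np.exp(shifted).sum()

print("naive: ", naive)   # [nan nan nan]
print("stable:", stable)  # ~[0.665, 0.245, 0.090]
```

The subtraction is safe because softmax is invariant to adding a constant to all logits; it cancels in the ratio.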

Sample Code

Python
import numpy as np

def softmax(logits, temperature=1.0):
    """
    Applies temperature-scaled softmax to a vector of logits.
    """
    # Divide logits by temperature to adjust sharpness
    scaled_logits = np.array(logits) / temperature
    
    # Subtract max for numerical stability (prevents overflow)
    shifted_logits = scaled_logits - np.max(scaled_logits)
    
    exp_logits = np.exp(shifted_logits)
    return exp_logits / np.sum(exp_logits)

# Example: A model predicting probabilities for 3 tokens
logits = [2.0, 1.0, 0.1]

# Low temperature: Sharp, confident output
print(f"T=0.5: {softmax(logits, temperature=0.5)}")
# Output: [0.86, 0.12, 0.02]

# High temperature: Flat, uncertain output
print(f"T=2.0: {softmax(logits, temperature=2.0)}")
# Output: [0.50, 0.30, 0.19]

Key Terms

Logits
The raw, unnormalized output scores produced by the final layer of a neural network before the activation function is applied. These values represent the model's internal "evidence" for each class or token, ranging from negative to positive infinity.
Softmax
A mathematical function that converts a vector of real numbers into a probability distribution of possible outcomes. It ensures that all output values are between 0 and 1 and sum exactly to 1, making them interpretable as probabilities.
Calibration
The property where the predicted probability of a model matches the empirical frequency of the outcome. A perfectly calibrated model that predicts a 70% probability for a class should be correct exactly 70% of the time.
Entropy
A measure of uncertainty or randomness within a probability distribution. High entropy indicates a "flat" distribution where many outcomes are considered equally likely, while low entropy indicates a "peaky" distribution where one outcome dominates.
Overconfidence
A common failure mode in deep learning where a model assigns high probability scores to incorrect predictions. This often occurs because modern neural networks are trained to minimize cross-entropy loss, which encourages the model to push logits toward extreme values.
Inference
The process of using a trained machine learning model to make predictions on new, unseen data. During this stage, the model parameters are frozen, and we only manipulate input data and hyperparameters like temperature to control the output behavior.