RLHF and PPO Alignment
- RLHF (Reinforcement Learning from Human Feedback) aligns Large Language Models (LLMs) with human values by using human preferences to train a reward model.
- PPO (Proximal Policy Optimization) is the standard reinforcement learning algorithm used to update the LLM policy based on the reward model's feedback.
- The alignment process transforms a base model trained on raw text prediction into a helpful, honest, and harmless assistant.
- Stability in training is maintained by constraining policy updates to prevent the model from drifting too far from its original, stable distribution.
Why It Matters
OpenAI uses RLHF extensively to train models like ChatGPT to follow instructions and maintain safety boundaries. By collecting human preferences on thousands of conversations, they train a reward model that penalizes toxic or harmful outputs. This allows the model to act as a helpful assistant while refusing to generate dangerous content, such as instructions for illegal acts.
Anthropic applies a technique called Constitutional AI, which is a variation of RLHF, to align their Claude models. Instead of relying solely on human feedback, they use a "constitution"—a set of written principles—to guide the reward model. This allows for more scalable and transparent alignment, ensuring the model adheres to specific ethical guidelines defined by the developers.
Google utilizes alignment techniques for their Gemini models to ensure helpfulness in complex reasoning tasks. By using RLHF, they can fine-tune the model to prioritize accuracy and citation in information-retrieval scenarios. This helps reduce hallucinations by rewarding the model when it provides verifiable facts and penalizing it when it makes unsupported claims.
How It Works
The Motivation for Alignment
Large Language Models are trained on massive datasets scraped from the internet, which contain a mix of helpful information, toxic content, biases, and incoherent text. A model trained solely on next-token prediction learns to mimic the distribution of its training data, including its flaws. If you ask a base model a question, it might complete the sentence in a way that mimics a forum post rather than answering the question directly. Alignment is the process of refining these models so they prioritize helpfulness, honesty, and harmlessness.
The RLHF Pipeline
The RLHF pipeline typically consists of three distinct stages. First, we perform Supervised Fine-Tuning (SFT) on a curated dataset of high-quality instruction-response pairs. Second, we train a Reward Model (RM) by showing humans multiple model-generated responses to the same prompt and asking them to rank the responses from best to worst; this ranking data is used to train a model that predicts a scalar reward for any given output. Third, we use Reinforcement Learning, specifically PPO, to optimize the SFT model against the Reward Model.
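As a rough illustration of the second stage, the snippet below sketches the pairwise ranking loss commonly used to train the reward model (a Bradley-Terry style objective). The function name and the toy scores are purely illustrative, not part of any particular library.

import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores, rejected_scores):
    # The reward model should score the human-preferred response above the rejected one;
    # -log(sigmoid(margin)) shrinks as that margin grows.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy scalar scores for three preference pairs (in practice, reward-model outputs)
chosen = torch.tensor([1.2, 0.3, 0.9])
rejected = torch.tensor([0.4, 0.5, -0.2])
print(reward_ranking_loss(chosen, rejected))  # smaller when chosen consistently outscores rejected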
The Dynamics of PPO
PPO is the engine of the alignment process. In this setup, the "Actor" is the LLM being trained, and the "Critic" is a value function that estimates the expected reward of a state. The goal is to maximize the reward while keeping the model's output distribution close to the original SFT model. If the model drifts too far, it might start producing gibberish or fall into "reward hacking," exploiting the reward model by generating repetitive, high-scoring patterns that are not actually helpful. The KL divergence penalty acts as a "tether," ensuring the model remains a coherent language generator while learning to satisfy the reward model.
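A minimal sketch of that tether, assuming the reward model scores the whole response once and the per-token penalty is estimated from log-probabilities under the trained policy and a frozen SFT reference; the function name and the beta value are illustrative.

import torch

def penalized_rewards(rm_score, policy_log_probs, ref_log_probs, beta=0.1):
    # Per-token estimate of how far the policy has drifted from the SFT reference
    kl_estimate = policy_log_probs - ref_log_probs   # (B, T)
    rewards = -beta * kl_estimate                    # discourage drift at every token
    rewards[:, -1] += rm_score                       # reward-model score added at the final token
    return rewards

policy_lp = torch.randn(2, 6) * 0.1 - 2.0   # toy log-probs from the policy being trained
ref_lp = torch.randn(2, 6) * 0.1 - 2.0      # toy log-probs from the frozen SFT reference
rm_score = torch.tensor([0.7, -0.2])        # one scalar reward-model score per response
print(penalized_rewards(rm_score, policy_lp, ref_lp).shape)  # torch.Size([2, 6])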
Edge Cases and Challenges
One major challenge is "Reward Hacking." If the reward model is imperfect, the LLM may find ways to maximize the score without actually providing a good answer. For example, if the reward model prefers long answers, the LLM might learn to be overly verbose. Another issue is "Alignment Tax," where the model becomes so focused on being safe or polite that it loses its creative or reasoning capabilities. Balancing these trade-offs requires careful tuning of the KL penalty and the reward scaling.
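One common mitigation, sketched below under the assumption that length bias is the failure mode being targeted, is to whiten the reward-model scores within each batch and subtract a small length penalty. The function name and coefficients are illustrative, not recommended values.

import torch

def shaped_rewards(raw_rewards, response_lengths, length_penalty=0.01, eps=1e-8):
    # Whitening keeps the reward scale stable across batches, and the length
    # penalty blunts the "longer is always better" exploit.
    shaped = raw_rewards - length_penalty * response_lengths
    return (shaped - shaped.mean()) / (shaped.std() + eps)

raw = torch.tensor([2.1, 1.8, 2.4, 0.5])             # reward-model scores
lengths = torch.tensor([400.0, 120.0, 600.0, 90.0])  # response lengths in tokens
print(shaped_rewards(raw, lengths))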
Common Pitfalls
- "RLHF replaces the need for high-quality SFT data." This is false; RLHF is a fine-tuning step that builds upon a strong foundation. If the base model is poor or the SFT data is low-quality, RLHF cannot "fix" the model's fundamental lack of knowledge.
- "The Reward Model is a ground-truth oracle." The reward model is only as good as the human labelers who trained it. If the labelers have biases, the reward model will learn those biases, and the LLM will subsequently adopt them.
- "PPO is the only way to do RLHF." While PPO is the industry standard, other methods like DPO (Direct Preference Optimization) are gaining popularity. DPO simplifies the process by optimizing the policy directly on preference data without needing a separate reward model or complex RL training (see the sketch after this list).
- "Alignment makes the model smarter." Alignment does not increase the model's raw reasoning capacity or knowledge base. It merely changes the model's behavior to be more aligned with user expectations, often at the cost of some diversity in the output.
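For contrast with the PPO setup, here is a minimal sketch of the DPO objective, assuming sequence-level log-probabilities for the chosen and rejected responses under both the trained policy and a frozen reference model; the function name and beta value are illustrative.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    # Widen the policy's margin on preferred responses relative to the frozen
    # reference, with no separate reward model and no RL rollout loop.
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy sequence log-probs for a batch of two preference pairs
pol_c, pol_r = torch.tensor([-40.0, -55.0]), torch.tensor([-48.0, -52.0])
ref_c, ref_r = torch.tensor([-42.0, -54.0]), torch.tensor([-47.0, -53.0])
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))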
Sample Code
import torch
import torch.nn as nn
import torch.nn.functional as F
# Minimal policy: maps token IDs to logits over vocab
class TinyPolicy(nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.head = nn.Linear(dim, vocab)

    def forward(self, token_ids):
        return self.head(self.emb(token_ids))  # (B, T, vocab)

def ppo_step(model, token_ids, old_log_probs, advantages, clip_epsilon=0.2):
    logits = model(token_ids)  # (B, T, vocab)
    # Log-prob of the tokens that were actually sampled
    current_log_probs = F.log_softmax(logits, dim=-1)
    current_log_probs = current_log_probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
    # Probability ratio between the updated policy and the policy that generated the data
    ratio = torch.exp(current_log_probs - old_log_probs)
    # Clipped surrogate objective: cap how much a single update can move the policy
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    loss = -torch.min(surr1, surr2).mean()
    return loss
torch.manual_seed(0)
policy = TinyPolicy()
token_ids = torch.randint(0, 100, (4, 8)) # batch=4, seq_len=8
old_lp = torch.randn(4, 8) * 0.1 - 2.0 # simulated old log-probs
advantages = torch.tensor([0.5, -0.1, 0.8, 0.3]).unsqueeze(1).expand(-1, 8)
loss = ppo_step(policy, token_ids, old_lp, advantages)
print(f"PPO loss: {loss.item():.4f}")
# Prints a scalar surrogate loss for this toy batch; the specific value is not meaningful here.
# (Over repeated updates, minimizing it pushes probability mass toward high-advantage tokens.)