Policy Gradient Methods
- Policy Gradient methods optimize the agent's behavior directly by adjusting policy parameters to increase the probability of high-reward actions.
- Unlike value-based methods (like Q-Learning), these approaches can handle continuous action spaces and stochastic policies naturally.
- The core mechanism involves calculating the gradient of the expected return and performing gradient ascent to improve the policy.
- Basic Policy Gradient algorithms are on-policy, meaning they learn from data collected by the current version of the policy; reusing stale data requires corrections such as importance sampling.
- Variance reduction techniques, such as baselines and advantage estimation, are essential for stable and efficient training.
Why It Matters
Companies like Boston Dynamics or research labs use policy gradient methods to train quadrupedal robots to walk over uneven terrain. Because walking involves continuous control of many motors, value-based methods are often insufficient. Policy gradients allow the robot to learn smooth, fluid movements by directly optimizing the probability of motor torques that maintain balance while moving forward.
Hedge funds and quantitative trading firms utilize reinforcement learning to optimize asset allocation in real-time. The action space involves continuous weights for different assets, and the reward is the risk-adjusted return. Policy gradient methods are preferred here because they can handle the high-dimensional, continuous nature of portfolio rebalancing while accounting for transaction costs and market volatility.
Large-scale platforms like Netflix or YouTube use RL to optimize long-term user engagement rather than just immediate clicks. By treating the recommendation sequence as a policy, these systems learn to suggest content that maximizes the user's total watch time over a session. Policy gradient methods are particularly useful here because they can model the stochastic nature of user preferences and adapt to changing trends in real-time.
How It Works
The Intuition: Learning by Trial and Error
Imagine you are learning to ride a bicycle. You don't start by calculating the exact physics of every muscle movement. Instead, you try a movement, observe the result (did you fall or stay upright?), and adjust your behavior. If a specific movement leads to a positive outcome (staying upright), you are more likely to repeat that movement in the future. Policy Gradient methods formalize this "trial and error" process. Instead of trying to calculate the "value" of every possible state-action pair, we directly optimize the strategy (the policy) to increase the likelihood of actions that lead to higher rewards.
From Value-Based to Policy-Based
Value-based methods, such as Q-Learning, focus on estimating the value of states or state-action pairs. The agent then chooses the action with the highest estimated value. While effective, this approach struggles in environments with continuous action spaces (e.g., controlling a robotic arm with infinitely many possible joint angles), because finding the maximizing action over a continuous set at every step is computationally intractable. Policy Gradient methods bypass this by parameterizing the policy directly, usually with a neural network. We represent the policy as π_θ(a|s), where θ denotes the weights of the network. By adjusting θ, we can shift the probability distribution over actions to favor those that yield higher returns.
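For concreteness, here is a minimal sketch of such a parameterized policy for a continuous action space (PyTorch; the GaussianPolicy class name and layer sizes are illustrative assumptions, not part of the original text). The network outputs the mean of a Gaussian over actions, and sampling from that distribution yields both an action and its log-probability.

import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    # Hypothetical illustration: maps a state to a Gaussian over continuous actions.
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned, state-independent std

    def forward(self, state):
        dist = torch.distributions.Normal(self.mean(state), self.log_std.exp())
        action = dist.sample()
        # The log-probability is the quantity the policy gradient update differentiates.
        return action, dist.log_prob(action).sum(-1)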
The Policy Gradient Theorem
The fundamental challenge in RL is that the environment is often unknown, and we only have access to samples of trajectories. The Policy Gradient Theorem provides a way to calculate the gradient of the expected return without knowing the underlying dynamics of the environment. It states that the gradient of the expected return with respect to the policy parameters is proportional to the expected value of the gradient of the log-probability of the action, multiplied by the return. This is a profound result: it tells us that we can improve our policy simply by observing the rewards we get and updating our parameters in the direction that makes successful actions more probable.
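In the notation introduced above, the REINFORCE form of this result can be written as

∇_θ J(θ) = E_{τ ~ π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) · G_t ],

where J(θ) is the expected return of policy π_θ and G_t is the return from time step t onward. In practice the expectation is estimated by sampling trajectories and averaging log π_θ(a_t | s_t) · G_t, which is exactly what the REINFORCE update in the sample code below does.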
Handling Variance and Stability
One major issue with basic Policy Gradient methods (like REINFORCE) is high variance. Because the return depends on a long sequence of stochastic actions, the gradient estimate can vary wildly between episodes. To combat this, we introduce a "baseline": a state-dependent quantity (typically an estimate of the state value V(s)) that is subtracted from the return, reducing variance without introducing bias. Furthermore, modern approaches often use the "Advantage" function, A(s, a) = Q(s, a) - V(s), which measures how much better an action is than the average action in that state. By weighting updates with the advantage rather than the raw return, we stabilize the learning process significantly.
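As a concrete illustration, the sketch below assumes a separate value network acting as the baseline (the helper name actor_critic_losses is hypothetical): the policy loss weights log-probabilities by the advantage, while the baseline is regressed toward the observed returns.

import torch
import torch.nn.functional as F

def actor_critic_losses(log_probs, returns, values):
    # `values` are baseline predictions V(s) from a separate value network.
    advantages = returns - values.detach()           # baseline subtraction reduces variance
    policy_loss = -(log_probs * advantages).sum()    # advantage-weighted policy gradient
    value_loss = F.mse_loss(values, returns)         # fit the baseline to observed returns
    return policy_loss, value_loss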
Common Pitfalls
- Policy Gradients are always better than Q-Learning: This is false; while they handle continuous action spaces better, they are often less sample-efficient and have higher variance. Q-Learning is often more stable in discrete, low-dimensional environments.
- The gradient update is deterministic: In reality, the gradient is estimated from a sample of trajectories, making it a stochastic estimate. Learners often forget that they need many trajectories to get a reliable estimate of the true gradient.
- You can freely reuse old data to update the policy: Because policy gradients are on-policy, using data collected by a previous version of the policy introduces bias. To reuse old data you need importance-sampling corrections (as PPO does with its clipped probability ratio) or a genuinely off-policy algorithm such as SAC; see the sketch after this list.
- The baseline must be perfect: A baseline (like the value function) does not need to be accurate to provide a benefit. Even a simple moving average of rewards can significantly reduce variance and improve convergence speed.
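To make the importance-sampling correction concrete, here is a minimal sketch (the helper name is hypothetical; the clipped form follows the PPO surrogate objective) of how log-probabilities under the data-collecting policy and the current policy yield a reweighting ratio:

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    # Importance ratio pi_new(a|s) / pi_old(a|s), computed in log space for stability.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    # PPO-style clipping keeps updates close to the policy that collected the data.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()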
Sample Code
import torch
import torch.nn as nn
import torch.optim as optim

# Simple Policy Network
class PolicyNet(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(PolicyNet, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        return self.fc(x)

# REINFORCE update step
def update_policy(optimizer, log_probs, rewards, gamma=0.99):
    # log_probs: list of scalar tensors log pi(a_t|s_t); rewards: list of floats
    discounted_rewards = []
    R = 0
    for r in reversed(rewards):
        R = r + gamma * R
        discounted_rewards.insert(0, R)
    discounted_rewards = torch.tensor(discounted_rewards)
    # Normalize for stability
    discounted_rewards = (discounted_rewards - discounted_rewards.mean()) / (discounted_rewards.std() + 1e-9)
    policy_loss = []
    for log_prob, R in zip(log_probs, discounted_rewards):
        policy_loss.append(-log_prob * R)
    optimizer.zero_grad()
    loss = torch.stack(policy_loss).sum()
    loss.backward()
    optimizer.step()

# Output: Policy updated based on trajectory returns.
# Example usage (conceptual):
# net = PolicyNet(4, 2)
# optimizer = optim.Adam(net.parameters(), lr=0.01)
# update_policy(optimizer, log_probs, rewards)
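Expanding the conceptual usage above into a runnable training loop requires an environment. The sketch below assumes the gymnasium package and its CartPole-v1 environment (neither is part of the original snippet) and reuses PolicyNet and update_policy as defined above.

import gymnasium as gym
import torch
import torch.optim as optim
from torch.distributions import Categorical

env = gym.make("CartPole-v1")               # assumed environment: 4-dim state, 2 actions
net = PolicyNet(state_dim=4, action_dim=2)
optimizer = optim.Adam(net.parameters(), lr=0.01)

for episode in range(500):
    state, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        probs = net(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)            # action distribution given by the policy
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    update_policy(optimizer, log_probs, rewards)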