Model-Based and Model-Free Control
- Model-Free RL learns policies directly through trial and error, making it simple to implement and cheap at decision time, but highly sample-inefficient.
- Model-Based RL builds a predictive internal map of the environment, allowing for planning and foresight at the cost of higher complexity.
- The choice between them depends on the trade-off between the cost of gathering real-world data and the computational budget for simulation.
- Modern research increasingly focuses on "Model-Based RL with learned dynamics" to bridge the gap between planning and reactive behavior.
Why It Matters
1. Autonomous Driving (Waymo/Tesla): Autonomous vehicles use model-based planning to predict the trajectories of other cars and pedestrians. By maintaining a world model, the car can simulate potential collision scenarios and choose a path that minimizes risk, which is safer than a purely reactive model-free controller.
2. Industrial Robotics (DeepMind/Fanuc): In robotic manipulation, such as picking up delicate objects, model-based control is used to simulate the physics of the gripper and the object. This allows the robot to "plan" the force and angle of the grasp before moving, significantly reducing the wear and tear on the hardware and the number of failed attempts.
3. Financial Trading (Quantitative Hedge Funds): Algorithms often use model-based reinforcement learning to simulate market dynamics under different economic conditions. By building a model of how assets correlate, the agent can test "what-if" strategies for portfolio rebalancing without risking capital in live, high-frequency trading environments.
How It Works
The Intuition: Map vs. Instinct
Imagine you are learning to play a complex video game. A Model-Free approach is like learning by pure muscle memory. You press buttons, observe the screen, and if you win, you reinforce those button presses. You don't know why the game reacts the way it does; you just know that "pressing X in this situation usually leads to a win." You are building a direct link between the visual input and the motor output.
A Model-Based approach is like playing with a strategy guide or a simulator. Before you make a move, you mentally simulate: "If I jump now, I will land on that platform, and then I can reach the power-up." You are building an internal model of the game's physics and logic. You use this model to plan your path. If the game changes (e.g., gravity increases), you update your internal model and your planning changes accordingly, whereas the Model-Free agent would have to relearn its muscle memory from scratch.
Model-Free Control: The Reactive Approach
Model-Free RL (e.g., Q-Learning, PPO, SAC) ignores the underlying mechanics of the environment. The agent treats the environment as a "black box." It observes a state s, takes an action a, and receives a reward r and a new state s'. The agent updates its policy or value function based solely on these transitions.
The primary advantage is simplicity. You never have to model the environment's dynamics explicitly. If the environment is highly stochastic or non-linear, modeling it might be impossible or computationally prohibitive. However, the downside is that these agents are notoriously "data-hungry." They require millions of interactions to learn even simple tasks because they have no "foresight." They only learn what works through repeated failure.
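For concreteness, the canonical model-free update is the tabular Q-learning rule (the same rule implemented in the Sample Code section at the end of this article):

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right]$$

Everything on the right-hand side comes from a single observed transition (s, a, r, s'); the agent never consults a model of how s' was produced.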
Model-Based Control: The Planning Approach
Model-Based RL (e.g., Dyna-Q, AlphaZero, World Models) attempts to learn the transition function P(s' | s, a) and the reward function R(s, a). Once the agent has a model, it can generate "imaginary" experiences. It can perform planning by running simulations in its head.
This is incredibly powerful for scenarios where real-world interaction is expensive or dangerous. For example, in robotics, you cannot afford to have a robot fall over a thousand times to learn how to walk. Instead, the robot learns a model of its own joints and the floor, then uses that model to plan a stable gait. The challenge here is "model bias." If your internal model is slightly wrong, the errors compound during planning, leading the agent to make disastrous decisions based on a faulty simulation.
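To make the planning step concrete, here is a minimal tabular sketch. The corridor environment, real_step, and every variable name below are hypothetical, invented purely for illustration: the agent first fits a transition-and-reward model from real interaction, then runs value iteration entirely inside that model.

import numpy as np

# Toy 1-D corridor (hypothetical): 6 states, actions 0 = Left / 1 = Right, +1 reward at state 5.
def real_step(s, a):
    s_next = min(s + 1, 5) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 5 else 0.0)

# 1) Learn a model of the dynamics from real interaction (here: probe each (s, a) pair once).
model = {(s, a): real_step(s, a) for s in range(6) for a in range(2)}  # (s, a) -> (s_next, r)

# 2) Plan entirely inside the learned model: value iteration over imagined transitions.
gamma, V = 0.9, np.zeros(6)
for _ in range(100):
    for s in range(6):
        V[s] = max(model[(s, a)][1] + gamma * V[model[(s, a)][0]] for a in range(2))

# 3) Extract a policy by one-step look-ahead in the model; no further real interaction needed.
policy = [max(range(2), key=lambda a: model[(s, a)][1] + gamma * V[model[(s, a)][0]])
          for s in range(6)]
print(policy)  # [1, 1, 1, 1, 1, 1] -- the planner chooses Right everywhere

If real_step were stochastic or the probed transitions were recorded incorrectly, the same planning loop would confidently optimize against a biased model, which is exactly the "model bias" failure mode described above.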
The Hybrid Frontier
In practice, the line is blurring. Many modern algorithms use "learned models" to augment model-free learning. For instance, an agent might use a model to generate synthetic data to train its policy (Dyna-style), or it might use a model to provide a "look-ahead" feature in a value function. This allows for the efficiency of planning with the robustness of reactive learning.
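A minimal Dyna-style sketch might look like the following; the toy corridor and every name in it are again hypothetical rather than taken from a specific library. Each real step feeds both the direct model-free update and a small budget of "imagined" updates replayed from the learned model.

import random
import numpy as np

def corridor_step(s, a):  # hypothetical toy environment: 5 states, 0 = Left / 1 = Right, +1 at state 4
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == 4 else 0.0)

Q = np.zeros((5, 2))
model = {}                                   # (s, a) -> (r, s_next), learned from real experience
alpha, gamma, planning_steps = 0.1, 0.9, 10

def q_update(s, a, r, s_next):               # the usual model-free TD update
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

s = 0
for t in range(2000):
    a = random.randrange(2) if random.random() < 0.2 else int(Q[s].argmax())
    s_next, r = corridor_step(s, a)          # 1) real experience
    q_update(s, a, r, s_next)                # 2) direct (model-free) learning
    model[(s, a)] = (r, s_next)              # 3) model learning
    for _ in range(planning_steps):          # 4) planning: replay imagined transitions from the model
        ps, pa = random.choice(list(model.keys()))
        pr, ps_next = model[(ps, pa)]
        q_update(ps, pa, pr, ps_next)
    s = 0 if s_next == 4 else s_next         # restart at the left end once the goal is reached

The planning loop squeezes roughly ten updates out of every real interaction, which is the efficiency argument for Dyna-style hybrids.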
Common Pitfalls
- "Model-Based is always better than Model-Free." This is false; if the environment is too complex to model accurately, the model will be biased, leading to worse performance than a simple model-free agent. Model-based methods often struggle with high-dimensional, chaotic environments where accurate prediction is nearly impossible.
- "Model-Free agents cannot plan." While they don't use a learned model, some model-free agents use "search" or "look-ahead" based on the environment's actual dynamics (like Monte Carlo Tree Search in AlphaGo). The distinction is whether the agent learns the model or uses the actual environment as the model.
- "Model-Based RL is only for robotics." While common in robotics, it is widely used in games, supply chain optimization, and any domain where a simulator exists. If you have a simulator, you are effectively using a model-based approach, even if you didn't learn the model yourself.
- "You must choose one or the other." Modern research, such as the "World Models" paper by Ha and Schmidhuber, shows that the most effective agents often combine both. They use a model to imagine the future while using model-free methods to refine the policy based on those imaginations.
Sample Code
import numpy as np

# Simple GridWorld: 5x5 grid, start at (0, 0), goal at (4, 4)
q_table = np.zeros((5, 5, 4))  # 5x5 states, 4 actions (Up, Down, Left, Right)
alpha, gamma, epsilon = 0.1, 0.9, 0.2
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # (row, col) deltas for Up, Down, Left, Right

def get_action(state):  # epsilon-greedy: explore with probability epsilon, else act greedily
    if np.random.rand() < epsilon:
        return np.random.randint(4)
    return np.argmax(q_table[state[0], state[1]])

def env_step(state, action):  # assumed minimal dynamics: moves clipped to the board, +1 at the goal
    r = min(max(state[0] + moves[action][0], 0), 4)
    c = min(max(state[1] + moves[action][1], 0), 4)
    return (r, c), (1.0 if (r, c) == (4, 4) else 0.0)

# Model-Free Update (Q-Learning)
def update_q(s, a, r, s_next):
    best_next_a = np.argmax(q_table[s_next[0], s_next[1]])
    td_target = r + gamma * q_table[s_next[0], s_next[1], best_next_a]
    q_table[s[0], s[1], a] += alpha * (td_target - q_table[s[0], s[1], a])

for episode in range(1000):
    s = (0, 0)
    while s != (4, 4):
        a = get_action(s)
        s_next, r = env_step(s, a)
        update_q(s, a, r, s_next)
        s = s_next

# Sample Output: after 1000 episodes the Q-table converges on the path to (4,4);
# e.g. Q[(3,4), Down] and Q[(4,3), Right] end up with the highest values.