Reinforcement Learning

Fundamentals of Reinforcement Learning

  • Reinforcement Learning (RL) is a paradigm where an agent learns to make decisions by interacting with an environment to maximize cumulative rewards.
  • The core cycle consists of an agent observing a state, taking an action, receiving a reward, and transitioning to a new state.
  • Unlike supervised learning, RL does not rely on labeled data but instead learns through trial-and-error and the exploration-exploitation trade-off.
  • The goal is to find an optimal policy—a strategy that maps states to actions—to achieve long-term success rather than immediate gratification.
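The observe-act-reward cycle described above can be sketched as a short loop. `ToyEnv` below is a hypothetical two-state environment invented purely for illustration, driven here by a random policy:

```python
import random

class ToyEnv:
    """Hypothetical 2-state environment: reaching state 1 yields a reward."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 moves to the goal state; anything else stays at the start.
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.0
        done = self.state == 1
        return self.state, reward, done

env = ToyEnv()
state = env.reset()
total_reward = 0.0
for _ in range(10):                     # the core cycle: observe -> act -> reward
    action = random.choice([0, 1])      # a random policy, for illustration only
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print("episode return:", total_reward)
```

A real agent would replace `random.choice` with a learned policy that improves as rewards accumulate.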

Why It Matters

01
Robotics and Industrial Automation

Companies like Boston Dynamics and various manufacturing firms use RL to train robots for complex manipulation tasks, such as picking and placing objects in unstructured environments. Instead of hard-coding every movement, the robot learns to adjust its grip and force based on real-time sensor feedback, significantly increasing efficiency and adaptability.

02
Recommendation Systems

Platforms like Netflix and YouTube utilize RL to optimize content delivery. By treating the user's interaction (clicks, watch time) as a reward signal, the RL agent learns to sequence recommendations that maximize long-term user engagement rather than just suggesting items similar to what was watched last.

03
Autonomous Vehicles

Companies like Waymo and Tesla explore RL to handle complex driving scenarios, such as merging into heavy traffic or navigating intersections. The agent learns to predict the behavior of other drivers and pedestrians and to choose actions that prioritize safety and smooth traffic flow, training over thousands of simulated miles before being deployed on the road.

How It Works

The Intuition of Learning by Doing

Reinforcement Learning is fundamentally different from other machine learning paradigms. In supervised learning, a model is provided with a "ground truth" label for every input. In RL, there is no teacher telling the agent exactly what to do. Instead, the agent is like a child learning to walk: it tries a movement, falls, feels the "negative reward" of the fall, and adjusts its muscles to try a different movement next time. This process of trial and error is the heartbeat of RL. The agent must balance exploration (trying new, unknown actions to see if they yield better results) and exploitation (using known actions that have yielded high rewards in the past).


The Markov Decision Process (MDP)

To formalize this interaction, we use the Markov Decision Process. An MDP assumes the "Markov Property," which states that the future depends only on the current state and the current action, not on the history of how the agent arrived there. If you are playing chess, the current board configuration contains all the information you need to decide your next move; the sequence of moves that led to this board state is irrelevant to the decision at hand. This simplification allows us to model complex decision-making problems mathematically.
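An MDP can be written down explicitly as the tuple (states, actions, transition probabilities, rewards, discount factor). The tiny two-state chain below is a made-up example; the Markov Property shows up in the fact that `P` is indexed only by the current state and action, never by history:

```python
# A hypothetical 2-state MDP as explicit (S, A, P, R, gamma) data.
states = [0, 1]
actions = ["stay", "go"]

# P[s][a] -> list of (probability, next_state) pairs.
# Markov Property: transitions depend only on the current (s, a).
P = {
    0: {"stay": [(1.0, 0)], "go": [(0.9, 1), (0.1, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 1)]},
}

# R[s][a] -> immediate expected reward for taking a in s.
R = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 0.0, "go": 0.0},
}
gamma = 0.95  # discount factor

# Sanity check: probabilities out of each (s, a) must sum to 1.
for s in states:
    for a in actions:
        assert abs(sum(p for p, _ in P[s][a]) - 1.0) < 1e-9
print("valid MDP")
```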


The Exploration-Exploitation Dilemma

One of the most critical challenges in RL is the exploration-exploitation trade-off. If an agent only exploits—choosing the action it knows currently gives the highest reward—it may get stuck in a "local optimum," never discovering a much better strategy that exists elsewhere in the state space. Conversely, if an agent only explores, it will never capitalize on the knowledge it has gained, resulting in poor performance. Strategies like ε-greedy (where the agent chooses a random action with probability ε) are common ways to force the agent to keep exploring while still favoring known high-reward paths.
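An ε-greedy rule fits in a few lines. The sketch below also decays ε over time, a common refinement so the agent explores heavily early on and exploits later; the particular schedule values (1.0 down to a floor of 0.05, decay 0.995) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
q = np.array([0.1, 0.5, 0.2])           # action 1 currently looks best
greedy_picks = 0
for step in range(1000):
    a = epsilon_greedy(q, epsilon)
    greedy_picks += (a == 1)
    epsilon = max(eps_min, epsilon * eps_decay)   # anneal toward the floor
print("greedy fraction:", greedy_picks / 1000, "final epsilon:", epsilon)
```

After 1000 decay steps ε has bottomed out at the 0.05 floor, so the agent never stops exploring entirely.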


Credit Assignment and Delayed Rewards

A significant difficulty in RL is the "credit assignment problem." Imagine a game of chess where you make a brilliant move early in the game, but you don't win until 50 moves later. Which move deserves the credit for the win? Because rewards are often delayed, the agent must learn to associate current actions with future outcomes. This is handled through the concept of discounting, where future rewards are multiplied by a factor (gamma) to signify that immediate rewards are often more certain and valuable than distant ones.
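Discounting has a direct numerical form: the return from time t is G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + …, which can be computed efficiently by sweeping the reward sequence from the back. The 50-move chess analogy above becomes a sparse-reward episode:

```python
def discounted_return(rewards, gamma=0.95):
    """G = r_0 + gamma*r_1 + gamma^2*r_2 + ...  (accumulated right-to-left)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A sparse-reward episode: 49 moves of silence, then a win worth +1.
rewards = [0.0] * 49 + [1.0]
print(round(discounted_return(rewards), 4))  # 0.95**49, roughly 0.081
```

The distant win is worth only about 0.08 from the first move's perspective, which is exactly how γ encodes "immediate rewards are more certain and valuable than distant ones."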

Common Pitfalls

  • "RL is just supervised learning with a twist." Many learners think RL is simply supervised learning where the labels are rewards. This is incorrect because the agent's actions directly influence the data it sees next, creating a feedback loop that does not exist in supervised learning.
  • "The agent learns everything instantly." Beginners often expect the agent to find the optimal strategy after a few iterations. In reality, RL requires thousands or millions of interactions to converge, especially in complex, high-dimensional environments.
  • "Rewards must be frequent." Some believe the agent needs a reward for every action to learn effectively. Actually, RL is specifically designed to handle "sparse rewards," where the agent might only receive a signal after a long sequence of correct decisions.
  • "Exploration is always bad." Learners often try to minimize exploration as quickly as possible to get "good" results. However, premature convergence to a sub-optimal policy is a common failure mode; maintaining a healthy level of exploration is vital for finding the global optimum.

Sample Code

Python
import numpy as np

# 4x4 grid world: states 0-15, goal=15, hole=5 (penalty)
# Actions: 0=Up, 1=Down, 2=Left, 3=Right
GRID = 4

def env_step(state, action):
    row, col = divmod(state, GRID)
    if action == 0: row = max(row - 1, 0)           # Up
    elif action == 1: row = min(row + 1, GRID - 1)  # Down
    elif action == 2: col = max(col - 1, 0)          # Left
    elif action == 3: col = min(col + 1, GRID - 1)  # Right
    next_state = row * GRID + col
    reward = 1.0 if next_state == 15 else (-1.0 if next_state == 5 else 0.0)
    return next_state, reward

q_table = np.zeros((16, 4))
learning_rate = 0.1
discount_factor = 0.95
epsilon = 0.2

def choose_action(state):
    if np.random.uniform(0, 1) < epsilon:
        return np.random.randint(0, 4)
    return np.argmax(q_table[state])

np.random.seed(42)
for episode in range(1000):
    state = 0
    for _ in range(100):                              # max steps per episode
        action = choose_action(state)
        next_state, reward = env_step(state, action)
        best_next = np.argmax(q_table[next_state])
        td_target = reward + discount_factor * q_table[next_state, best_next]
        q_table[state, action] += learning_rate * (td_target - q_table[state, action])
        state = next_state
        if state == 15: break

print("Optimal action per state:", np.argmax(q_table, axis=1))
# The learned greedy actions route each state toward the goal (15)
# while steering around the penalty state (5).

Key Terms

Agent
The decision-making entity that interacts with the environment to achieve a specific goal. It perceives the current state and selects an action based on its internal policy.
Environment
The external world or system with which the agent interacts. It responds to the agent's actions by providing a new state and a numerical reward signal.
State ($S_t$)
A comprehensive representation of the environment at a specific time step $t$. It contains all the information necessary for the agent to make an informed decision about its next move.
Action ($A_t$)
A choice made by the agent from a set of possible moves within the environment. These actions lead to transitions between states and influence the rewards received.
Reward ($R_t$)
A scalar feedback signal provided by the environment after an action is taken. It serves as the primary mechanism for the agent to evaluate the quality of its actions.
Policy ($\pi$)
A strategy or mapping from states to probabilities of selecting each possible action. The objective of RL is to find an optimal policy that maximizes the expected cumulative reward over time.
Value Function ($V(s)$)
A prediction of the total expected future reward an agent can accumulate starting from a specific state. It helps the agent evaluate how "good" it is to be in a particular situation.
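A value function under a fixed policy can be computed by iterative policy evaluation. The snippet below uses a made-up three-state chain (states 0 and 1 always step right; state 2 is terminal, and stepping into it from state 1 pays a reward of 1):

```python
gamma = 0.95
V = [0.0, 0.0, 0.0]                  # state 2 is terminal, so V[2] stays 0
rewards = {1: 1.0}                   # reward for stepping from state 1 into the goal
for _ in range(100):                 # sweep until the values stop changing
    for s in (0, 1):
        V[s] = rewards.get(s, 0.0) + gamma * V[s + 1]
print([round(v, 3) for v in V])      # prints [0.95, 1.0, 0.0]
```

States closer to the goal have higher values, which is exactly the "how good is it to be here" signal the definition describes.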