Partially Observable Markov Decision Processes
- POMDPs extend standard MDPs by assuming the agent cannot directly observe the full state of the environment.
- The agent must maintain a "belief state," a probability distribution over all possible states, to make informed decisions.
- Solving POMDPs is computationally expensive (PSPACE-complete) because the belief space is continuous and high-dimensional.
- Modern approaches leverage Deep Reinforcement Learning, specifically Recurrent Neural Networks, to implicitly track history and approximate belief states.
Why It Matters
Autonomous driving is a classic POMDP application. A self-driving car cannot "see" the intent of a pedestrian or the state of an intersection hidden behind a truck. The vehicle must maintain a belief state about the likelihood of various hidden hazards, adjusting its speed and trajectory based on the probability of these hidden states rather than just the immediate visual input.
In financial algorithmic trading, the market state is inherently hidden. A trader observes price movements and volume (observations), but the underlying "market regime" (e.g., bull vs. bear, high vs. low volatility) is a hidden state. Trading agents use POMDP frameworks to infer the current regime and adjust their risk exposure accordingly, treating the market as a partially observable system.
Healthcare diagnostic systems utilize POMDPs to manage patient care under uncertainty. A patient’s true physiological state is often hidden, and diagnostic tests provide noisy observations. By modeling the patient as a POMDP, clinical decision-support systems can recommend the most informative tests or treatments that maximize the expected health outcome while minimizing the cost of uncertainty.
How It Works
The Intuition of Partial Observability
In a standard Markov Decision Process (MDP), we assume the agent has "God-like" access to the environment. If you are playing a game of Chess, you see every piece on the board; the state is fully observable. However, consider a game of Poker. You see your cards and the community cards, but you do not know your opponent's hand. You are operating under partial observability. A Partially Observable Markov Decision Process (POMDP) is the mathematical framework we use to model these scenarios where the agent’s sensors are limited, noisy, or incomplete.
The core challenge is that the current observation does not tell the whole story. If a robot is navigating a dark room, seeing a wall in front of it doesn't tell the robot exactly where it is in the room. It only tells the robot that it is near a wall. To make a good decision, the robot must remember where it was a moment ago and how it moved. This reliance on history is the defining characteristic of POMDPs.
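To see this concretely, consider the small corridor model that reappears in the Sample Code section at the end of this article: both ends of the corridor look like a wall, so a single "wall" observation cannot distinguish them. The sketch below (the observation matrix and the uniform prior are illustrative choices, not a canonical benchmark) shows the ambiguity directly:

import numpy as np
# Both ends of a 3-cell corridor emit "wall" (obs=0) with probability 0.9,
# so one "wall" reading cannot say which end the agent is at.
O = np.array([
    [0.9, 0.1],  # s=0: mostly "wall"
    [0.1, 0.9],  # s=1: mostly "open"
    [0.9, 0.1],  # s=2: mostly "wall"
])
prior = np.array([1/3, 1/3, 1/3])  # no prior idea of location
posterior = O[:, 0] * prior        # condition on seeing "wall"
posterior /= posterior.sum()
print(np.round(posterior, 3))      # [0.474 0.053 0.474]: still torn between the two ends

Only by folding in its movement history can the agent break this tie.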
The Belief State Mechanism
Since the agent cannot know the true state, it must maintain a "belief." A belief state is a probability distribution over the set of all possible states. If the agent believes there is a 70% chance it is in the kitchen and a 30% chance it is in the living room, that distribution is its belief state.
When the agent takes an action and receives a new observation, it updates its belief state using Bayes' Rule. This update process is the engine of a POMDP agent. It takes the previous belief, incorporates the transition model (how the world moves), and filters it through the observation model (how the world looks). Over time, if the agent collects enough data, its belief state might collapse into a single high-probability state, effectively "solving" the uncertainty.
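Written out explicitly (notation varies across textbooks, but this is the standard Bayes filter form), the update is:

b'(s') ∝ P(o | s') · Σ_s P(s' | s, a) · b(s)

followed by normalization so the new belief sums to one. The first factor is the observation model, the sum is the transition model applied to the old belief, and this is exactly the line implemented in the Sample Code section below.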
Complexity and Computational Limits
The primary hurdle in POMDPs is the "Curse of Dimensionality." Because the belief state is a continuous distribution over a discrete (or continuous) state space, the space of possible beliefs is infinite. Finding an optimal policy is PSPACE-complete even in the finite-horizon case, meaning it is significantly harder than solving standard MDPs.
To handle this, practitioners often use point-based value iteration or approximate the belief state using deep neural networks. By using Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) units, we allow the agent to learn an internal representation of the history. This internal representation acts as a "latent belief state," allowing the agent to perform well in complex environments without explicitly calculating the Bayesian update at every step. This shift from explicit probability tracking to implicit latent tracking is the foundation of modern Deep RL for POMDPs.
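As a rough illustration of this idea, the sketch below uses PyTorch (the class name RecurrentPolicy and all layer sizes are illustrative assumptions, not a specific published architecture): the LSTM's hidden state acts as the latent belief, summarizing the observation history so the policy head never needs an explicit Bayesian update.

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim=4, hidden_dim=32, n_actions=2):
        super().__init__()
        # The LSTM hidden state plays the role of a latent belief state
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim), a history of observations
        out, hidden = self.lstm(obs_seq, hidden)
        logits = self.head(out)   # action scores at every timestep
        return logits, hidden     # carry `hidden` across interaction steps

policy = RecurrentPolicy()
obs_history = torch.randn(1, 10, 4)  # one episode, 10 timesteps of observations
logits, hidden = policy(obs_history)
print(logits.shape)                  # torch.Size([1, 10, 2])

During deployment, the returned hidden tuple is passed back in on the next step, so the agent's "memory" persists across the episode.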
Common Pitfalls
- Confusing POMDPs with Stochastic MDPs: Learners often think that because an environment is stochastic (random), it is a POMDP. However, if the state is fully visible, it is just a standard MDP; partial observability specifically refers to the inability to see the state, not just the randomness of transitions.
- Assuming Belief States are always Gaussian: Many assume that belief states must be represented as Gaussian distributions (like in a Kalman Filter). In reality, belief states can be multi-modal or arbitrary distributions, especially in complex environments where the agent could be in one of several distinct locations.
- Ignoring the History Dependency: A common mistake is trying to map observations directly to actions without a memory mechanism. Without memory or a belief update, the agent loses the Markov property, leading to poor performance in environments where the current observation is ambiguous (see the sketch after this list).
- Underestimating the Computational Cost: Students often attempt to solve large-scale POMDPs using exact value iteration. Because the belief space is continuous, exact methods are only feasible for tiny problems; real-world applications almost always require function approximation or sampling-based methods.
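For the history-dependency pitfall above, the cheapest fix is often to stack recent observations rather than learn a full recurrent model. The sketch below (ObservationStacker and the window size k are illustrative choices, in the spirit of the frame stacking used by Atari-playing agents) shows the idea:

from collections import deque
import numpy as np

class ObservationStacker:
    # Feed the policy the last k observations instead of only the current one,
    # restoring enough history to disambiguate aliased observations in simple tasks.
    def __init__(self, obs_dim, k=4):
        self.frames = deque([np.zeros(obs_dim)] * k, maxlen=k)

    def push(self, obs):
        self.frames.append(np.asarray(obs, dtype=float))
        return np.concatenate(self.frames)  # stacked input for the policy

stacker = ObservationStacker(obs_dim=2, k=3)
print(stacker.push([0.9, 0.1]).shape)  # (6,): current frame plus two past frames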
Sample Code
import numpy as np

# 1D corridor POMDP — states 0,1,2 | actions 0=move_right, 1=stay
# Observations: 0=wall, 1=open
class SimplePOMDP:
    def __init__(self):
        # T[a, s, s']: P(s' | s, a)
        self.T = np.array([
            [[0.2, 0.8, 0.0],   # a=0 from s=0: 80% → s=1
             [0.0, 0.2, 0.8],   # a=0 from s=1: 80% → s=2
             [0.0, 0.0, 1.0]],  # a=0 from s=2: wall, stay
            [[1.0, 0.0, 0.0],   # a=1 from s=0: stay
             [0.0, 1.0, 0.0],   # a=1 from s=1: stay
             [0.0, 0.0, 1.0]],  # a=1 from s=2: stay
        ])
        # O[s, o]: P(obs | s)
        self.O = np.array([
            [0.9, 0.1],  # s=0: wall (obs=0) likely
            [0.1, 0.9],  # s=1: open (obs=1) likely
            [0.9, 0.1],  # s=2: wall (obs=0) likely
        ])
        self.belief = np.array([1/3, 1/3, 1/3])

    def update_belief(self, action, observation):
        # Bayesian update: b'(s') ∝ O[s', o] * Σ_s T[a, s, s'] * b(s)
        new_belief = np.array([
            self.O[sp, observation] * sum(
                self.T[action, s, sp] * self.belief[s] for s in range(3))
            for sp in range(3)
        ])
        total = new_belief.sum()
        self.belief = new_belief / total if total > 0 else self.belief
        return self.belief

env = SimplePOMDP()
new_b = env.update_belief(action=0, observation=0)  # moved right, saw wall
print(f"Updated Belief: {np.round(new_b, 4)}")
# Output: Updated Belief: [0.0947 0.0526 0.8526]
# High probability in s=2 — agent likely moved to the far wall