Bellman Expectation & Optimality Equations

How value functions relate to rewards, transitions, and optimal control.

What you’ll learn

  • Bellman expectation equations for $ v^\pi $ and $ q^\pi $
  • Bellman operators $T^\pi$ and their contraction property
  • Bellman optimality equations for $ v^* $ and $ q^* $
  • How to solve small MDPs via linear systems and iterative updates
  • How these equations are the backbone of Dynamic Programming, Monte Carlo, and TD

The Bellman equations are the “physics laws” of RL value functions.

Code:

import numpy as np
import matplotlib.pyplot as plt

np.set_printoptions(precision=3, suppress=True)

1. Bellman Expectation Equation — State-Value

For a fixed policy $ \pi $, the state-value function is

\[v^\pi(s) = \mathbb{E}_\pi \big[ G_t \mid S_t = s \big], \quad G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}.\]

Using the Markov property and unrolling one step:

\[\begin{aligned} v^\pi(s) &= \mathbb{E}_\pi \big[ R_{t+1} + \gamma G_{t+1} \mid S_t = s \big] \\ &= \mathbb{E}_\pi \big[ R_{t+1} + \gamma v^\pi(S_{t+1}) \mid S_t = s \big]. \end{aligned}\]

In terms of the MDP dynamics:

\[v^\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big( r + \gamma v^\pi(s') \big).\]

For finite state spaces, we can write this in vector form:

\[v^\pi = r^\pi + \gamma P^\pi v^\pi,\]

where:

  • $ v^\pi \in \mathbb{R}^{|\mathcal{S}|} $ is the value vector
  • $ r^\pi \in \mathbb{R}^{|\mathcal{S}|} $ with $ r^\pi(s) = \mathbb{E}[R_{t+1} \mid S_t=s, \pi] $
  • $ P^\pi $ is the transition matrix induced by policy $ \pi $

So,

\[(\mathbf{I} - \gamma P^\pi) v^\pi = r^\pi \quad \Rightarrow \quad v^\pi = (\mathbf{I} - \gamma P^\pi)^{-1} r^\pi,\]

for small MDPs where matrix inversion is practical.

Code:

# Tiny 3-state MDP: s0, s1, terminal s2
STATE_NAMES = {0: "s0", 1: "s1", 2: "terminal"}
ACTION_NAMES = {0: "Left", 1: "Right"}

gamma = 0.9

# Transition model P(s' | s, a) with fixed rewards r(s,a,s')
P_env = {
    (0, 0): [(1.0, 0, 0.0)],   # s0, Left  -> s0, r=0
    (0, 1): [(1.0, 1, +1.0)],  # s0, Right -> s1, r=+1
    (1, 0): [(1.0, 0, +0.5)],  # s1, Left  -> s0, r=+0.5
    (1, 1): [(1.0, 2, +2.0)],  # s1, Right -> s2, r=+2
    # terminal is absorbing
    (2, 0): [(1.0, 2, 0.0)],
    (2, 1): [(1.0, 2, 0.0)],
}

num_states = 3
num_actions = 2

# Deterministic policy π: always take Right (1) in s0,s1
policy_right = {0: 1, 1: 1, 2: 0}  # action in each state

def build_P_pi_and_r_pi(policy):
    P_pi = np.zeros((num_states, num_states))
    r_pi = np.zeros(num_states)
    for s in range(num_states):
        for a in range(num_actions):
            pi_sa = 1.0 if policy[s] == a else 0.0
            for p, s_next, r in P_env[(s, a)]:
                P_pi[s, s_next] += pi_sa * p
                r_pi[s] += pi_sa * p * r
    return P_pi, r_pi

P_pi, r_pi = build_P_pi_and_r_pi(policy_right)

print("P^π (policy 'go Right'):\n", P_pi)
print("\nr^π:\n", r_pi)

Output:

P^π (policy 'go Right'):
 [[0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]]

r^π:
 [1. 2. 0.]

Code:

# Treat state 2 as terminal with v(terminal) defined by the equation as well.
# Here, it's an absorbing state with zero reward onward, so the Bellman equation still holds.

I = np.eye(num_states)
v_pi_closed = np.linalg.solve(I - gamma * P_pi, r_pi)

print("Closed-form v^π (go Right):", np.round(v_pi_closed, 3))

Output:

Closed-form v^π (go Right): [2.8 2.  0. ]
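A quick sanity check: the solved vector should be an exact fixed point of the Bellman expectation equation, i.e. the residual $ v^\pi - (r^\pi + \gamma P^\pi v^\pi) $ should vanish up to floating point. The sketch below restates $ P^\pi $ and $ r^\pi $ for the "go Right" policy so it runs standalone:

```python
import numpy as np

gamma = 0.9
# P^π and r^π for the "go Right" policy on the tiny MDP (restated locally)
P_pi = np.array([[0., 1., 0.],
                 [0., 0., 1.],
                 [0., 0., 1.]])
r_pi = np.array([1., 2., 0.])

# Solve (I - γ P^π) v = r^π, then check the Bellman residual
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
residual = v_pi - (r_pi + gamma * P_pi @ v_pi)

print(np.round(v_pi, 3))                 # [2.8 2.  0. ]
print(np.max(np.abs(residual)) < 1e-12)  # True
```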

2. Bellman Operator & Iterative Policy Evaluation

Define the Bellman expectation operator for policy $ \pi $:

\[(T^\pi v)(s) = \mathbb{E}_\pi \big[ R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s \big].\]

In vector form:

\[T^\pi v = r^\pi + \gamma P^\pi v.\]

The Bellman expectation equation says:

\[v^\pi = T^\pi v^\pi.\]

For $ \gamma \in [0,1) $, $ T^\pi $ is a contraction mapping under the max-norm:

\[\| T^\pi v - T^\pi w \|_\infty \le \gamma \| v - w \|_\infty,\]

which guarantees that repeated application converges:

\[v_{k+1} = T^\pi v_k \quad \Rightarrow \quad v_k \to v^\pi.\]

This fixed-point iteration is the core of Dynamic Programming (Policy Evaluation), and the same contraction argument underlies the convergence of TD learning.
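The contraction inequality can be checked numerically: for any two value vectors $v, w$, one application of $T^\pi$ shrinks their max-norm distance by at least a factor $\gamma$. A standalone sketch, restating $P^\pi$ and $r^\pi$ for the "go Right" policy:

```python
import numpy as np

gamma = 0.9
# P^π and r^π for the "go Right" policy on the tiny MDP (restated locally)
P_pi = np.array([[0., 1., 0.],
                 [0., 0., 1.],
                 [0., 0., 1.]])
r_pi = np.array([1., 2., 0.])

def T_pi(v):
    """Bellman expectation operator in vector form: T^π v = r^π + γ P^π v."""
    return r_pi + gamma * (P_pi @ v)

rng = np.random.default_rng(0)
for _ in range(5):
    v, w = rng.normal(size=3), rng.normal(size=3)
    lhs = np.max(np.abs(T_pi(v) - T_pi(w)))   # ||T^π v - T^π w||_∞
    rhs = gamma * np.max(np.abs(v - w))       # γ ||v - w||_∞
    assert lhs <= rhs + 1e-12
print("contraction inequality holds on random pairs")
```

The inequality holds because each row of $P^\pi$ is a probability distribution, so averaging can only shrink max-norm differences.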

Code:

def bellman_T_pi(v, r_pi, P_pi, gamma):
    """Apply Bellman expectation operator T^π to value vector v."""
    return r_pi + gamma * (P_pi @ v)

def iterative_policy_evaluation(r_pi, P_pi, gamma, num_iters=30, v_init=None):
    if v_init is None:
        v = np.zeros(P_pi.shape[0])
    else:
        v = v_init.copy()
    history = [v.copy()]
    for _ in range(num_iters):
        v = bellman_T_pi(v, r_pi, P_pi, gamma)
        history.append(v.copy())
    return np.array(history)

v_hist = iterative_policy_evaluation(r_pi, P_pi, gamma, num_iters=25)
print("Last iterate v_25:", np.round(v_hist[-1], 3))
print("Closed-form v^π  :", np.round(v_pi_closed, 3))

Output:

Last iterate v_25: [2.8 2.  0. ]
Closed-form v^π  : [2.8 2.  0. ]

Code:

iters = np.arange(v_hist.shape[0])
errors = np.linalg.norm(v_hist - v_pi_closed, ord=np.inf, axis=1)

plt.figure(figsize=(6,3.5))
plt.plot(iters, errors, marker="o")
plt.yscale("log")
plt.xlabel("iteration k")
plt.ylabel(r"$\|v_k - v^\pi\|_\infty$")
plt.title("Iterative Policy Evaluation — Bellman Expectation Convergence (log scale)")
plt.tight_layout()
plt.show()

plt.figure(figsize=(6,3.5))
for s in range(num_states):
    plt.plot(iters, v_hist[:, s], marker="o", label=f"v_k({STATE_NAMES[s]})")
plt.axhline(v_pi_closed[0], color="C0", linestyle="--", alpha=0.5)
plt.axhline(v_pi_closed[1], color="C1", linestyle="--", alpha=0.5)
plt.axhline(v_pi_closed[2], color="C2", linestyle="--", alpha=0.5)
plt.xlabel("iteration k")
plt.ylabel("value")
plt.title("Iterative Policy Evaluation — per-state values")
plt.legend()
plt.tight_layout()
plt.show()


3. Bellman Expectation Equation — Action-Value

The action-value function under policy $ \pi $:

\[q^\pi(s,a) = \mathbb{E}_\pi \big[ G_t \mid S_t = s, A_t = a \big].\]

Bellman expectation equation:

\[q^\pi(s,a) = \sum_{s',r} p(s', r \mid s,a)\,\Big( r + \gamma \sum_{a'} \pi(a' \mid s') q^\pi(s',a') \Big).\]

Relationship between $ v^\pi $ and $ q^\pi $:

\[v^\pi(s) = \sum_a \pi(a \mid s)\,q^\pi(s,a).\]

Later, algorithms such as SARSA and Expected SARSA implement exactly these relationships in sample-based form.
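These relationships can be verified numerically on the tiny MDP from above. The sketch below (restating the MDP locally so it runs standalone) computes $ q^\pi $ for the "go Right" policy by backing up from the closed-form $ v^\pi $, and confirms $ v^\pi(s) = \sum_a \pi(a \mid s)\, q^\pi(s,a) $:

```python
import numpy as np

# Tiny 3-state MDP restated locally (same as above): (s, a) -> [(prob, s', r)]
P_env = {
    (0, 0): [(1.0, 0, 0.0)], (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.5)], (1, 1): [(1.0, 2, 2.0)],
    (2, 0): [(1.0, 2, 0.0)], (2, 1): [(1.0, 2, 0.0)],
}
gamma = 0.9
policy_right = {0: 1, 1: 1, 2: 0}  # deterministic: action per state

# Closed-form v^π, as earlier: solve (I - γ P^π) v = r^π
P_pi = np.zeros((3, 3)); r_pi = np.zeros(3)
for s in range(3):
    for p, s_next, r in P_env[(s, policy_right[s])]:
        P_pi[s, s_next] += p
        r_pi[s] += p * r
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# One-step backup: q^π(s,a) = Σ p(s',r|s,a) (r + γ v^π(s'))
q_pi = np.zeros((3, 2))
for (s, a), outcomes in P_env.items():
    for p, s_next, r in outcomes:
        q_pi[s, a] += p * (r + gamma * v_pi[s_next])

# For a deterministic π, v^π(s) = q^π(s, π(s))
v_from_q = np.array([q_pi[s, policy_right[s]] for s in range(3)])
print(np.round(q_pi, 3))
print(np.allclose(v_pi, v_from_q))  # True
```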

4. Bellman Optimality — Searching for the Best Policy

We define the optimal value functions:

  • Optimal state-value:

    \[v^*(s) = \max_\pi v^\pi(s).\]
  • Optimal action-value:

    \[q^*(s,a) = \max_\pi q^\pi(s,a).\]

These satisfy the Bellman optimality equations:

\[v^*(s) = \max_a \sum_{s',r} p(s', r \mid s,a)\,\big( r + \gamma v^*(s') \big),\]

\[q^*(s,a) = \sum_{s',r} p(s', r \mid s,a)\,\big( r + \gamma \max_{a'} q^*(s', a') \big).\]

Given $ v^* $, we can extract a greedy optimal policy:

\[\pi^*(s) \in \arg\max_a \sum_{s',r} p(s',r \mid s,a)\,\big( r + \gamma v^*(s') \big).\]

These equations are at the heart of:

  • Dynamic Programming: Policy Iteration & Value Iteration
  • Value-Based RL: Q-Learning, DQN, etc.

In this notebook, we'll compute $ v^* $ and extract $ \pi^* $ for our tiny MDP by repeatedly applying the optimality operator (value iteration).

Code:

def bellman_optimality_operator(v, gamma):
    """(T v)(s) = max_a E[r + γ v(s')] over actions."""
    v_new = np.zeros_like(v)
    for s in range(num_states):
        # Evaluate each action's expected return
        q_vals = []
        for a in range(num_actions):
            total = 0.0
            for p, s_next, r in P_env[(s, a)]:
                total += p * (r + gamma * v[s_next])
            q_vals.append(total)
        v_new[s] = max(q_vals)
    return v_new

def value_iteration(gamma, num_iters=50):
    v = np.zeros(num_states)
    history = [v.copy()]
    for _ in range(num_iters):
        v = bellman_optimality_operator(v, gamma)
        history.append(v.copy())
    return np.array(history)

v_opt_hist = value_iteration(gamma, num_iters=25)
v_star = v_opt_hist[-1]

print("Approx optimal v*:", np.round(v_star, 3))

Output:

Approx optimal v*: [7.156 6.94  0.   ]

Code:

def greedy_policy_from_v(v, gamma):
    """Compute greedy policy w.r.t. given value function."""
    policy = {}
    for s in range(num_states):
        q_vals = []
        for a in range(num_actions):
            total = 0.0
            for p, s_next, r in P_env[(s, a)]:
                total += p * (r + gamma * v[s_next])
            q_vals.append(total)
        best_a = int(np.argmax(q_vals))
        policy[s] = best_a
    return policy

pi_star = greedy_policy_from_v(v_star, gamma)

print("Greedy policy π* (indices):", pi_star)
print("Greedy policy π* (named):", {STATE_NAMES[s]: ACTION_NAMES[a] for s, a in pi_star.items()})

Output:

Greedy policy π* (indices): {0: 1, 1: 0, 2: 0}
Greedy policy π* (named): {'s0': 'Right', 's1': 'Left', 'terminal': 'Left'}
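As a cross-check, the $q$-form of the optimality equation can be iterated directly (a standalone sketch with the MDP restated locally); the greedy policy then falls out as an argmax over actions, without consulting the dynamics at decision time:

```python
import numpy as np

# Tiny MDP restated locally: (s, a) -> [(prob, s', r)]
P_env = {
    (0, 0): [(1.0, 0, 0.0)], (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(1.0, 0, 0.5)], (1, 1): [(1.0, 2, 2.0)],
    (2, 0): [(1.0, 2, 0.0)], (2, 1): [(1.0, 2, 0.0)],
}
gamma = 0.9

# Q-value iteration: q(s,a) <- Σ p(s',r|s,a) (r + γ max_a' q(s',a'))
q = np.zeros((3, 2))
for _ in range(200):  # plenty of sweeps for tight convergence
    q_new = np.zeros_like(q)
    for (s, a), outcomes in P_env.items():
        for p, s_next, r in outcomes:
            q_new[s, a] += p * (r + gamma * q[s_next].max())
    q = q_new

print(np.round(q, 3))
print("greedy:", {s: int(np.argmax(q[s])) for s in range(3)})  # {0: 1, 1: 0, 2: 0}
```

The greedy actions agree with the policy extracted from $ v^* $ above, and $ \max_a q^*(s,a) $ recovers $ v^*(s) $ (here the exact limit is $ v^*(s_0) = 1.45/0.19 \approx 7.632 $, which the 25-iteration estimate above is still approaching).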

Code:

iters_opt = np.arange(v_opt_hist.shape[0])

plt.figure(figsize=(6,3.5))
for s in range(num_states):
    plt.plot(iters_opt, v_opt_hist[:, s], marker="o", label=f"v_k({STATE_NAMES[s]})")
plt.xlabel("iteration k")
plt.ylabel("value")
plt.title("Bellman Optimality — Value Iteration on Tiny MDP")
plt.legend()
plt.tight_layout()
plt.show()


Key Takeaways

  • Bellman expectation equations relate $ v^\pi $ to immediate rewards and the next-state values:

    \[v^\pi = r^\pi + \gamma P^\pi v^\pi.\]
  • The Bellman operator $T^\pi$ is a contraction, so iterating $ v_{k+1} = T^\pi v_k $ converges to $ v^\pi $.
  • Bellman optimality equations define $ v^*, q^* $ via a max over actions and form the basis of control:

    \[v^*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v^*(S_{t+1}) \mid s,a ].\]
  • Small MDPs can be solved exactly via linear systems (policy evaluation) or via value iteration (control).
  • Dynamic Programming, Monte Carlo, and Temporal-Difference methods are all different ways to approximate these fixed points in increasingly realistic settings.

Next: 13_dp_policy_evaluation.ipynb → formalizing iterative policy evaluation and connecting it to Monte Carlo & TD.