PyTorch Basics

From tensors and autograd to training your first MLP — the essentials for deep RL.

What you’ll learn

  • torch.Tensor and autograd (automatic differentiation)
  • Building models with nn.Module / nn.Sequential
  • Losses, optimizers (SGD, Adam), and training loops
  • Classification & regression mini‑examples
  • Plotting learning curves

Policies and value functions in deep RL are just neural networks trained with these fundamentals (e.g., PPO actor & critic).

Code:

import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

Output:

Using device: mps

1. Autograd in a Nutshell

PyTorch’s autograd system automatically computes derivatives of tensor operations using reverse-mode differentiation, the same principle behind backpropagation.

When a tensor is created with

x = torch.tensor([...], requires_grad=True)

PyTorch records all operations on it in a computation graph. When you call .backward(), it traverses this graph in reverse to compute gradients efficiently.

Mathematical Intuition

For a scalar function

\[y = f(x_1, x_2, \dots, x_n),\]

autograd computes

\[\frac{\partial y}{\partial x_i}\]

for each parameter via the chain rule, summing over the intermediate values $z_j$ that depend on $x_i$:

\[\frac{\partial y}{\partial x_i} = \sum_j \frac{\partial y}{\partial z_j} \frac{\partial z_j}{\partial x_i}.\]

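Applied from the output backward, this is reverse-mode differentiation: a single backward pass yields the gradient with respect to every input. That is ideal when the number of outputs ≪ number of parameters, which is exactly the case for a scalar loss in a neural network.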
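As a quick check of the chain rule, the sketch below compares autograd's gradient with a hand-derived one for $y = \sum_j \sin(x_j^2)$. The function and the names `z` and `manual_grad` are illustrative choices, not part of the original notebook.

```python
import torch

# Illustrative function: z_j = x_j^2 (intermediate), y = sum_j sin(z_j)
x = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
z = x ** 2
y = torch.sin(z).sum()

# Reverse-mode pass: fills x.grad with dy/dx
y.backward()

# Chain rule by hand: dy/dx_i = cos(z_i) * 2 * x_i
with torch.no_grad():
    manual_grad = torch.cos(x ** 2) * 2 * x

grads_match = torch.allclose(x.grad, manual_grad)
print("autograd:", x.grad)
print("manual:  ", manual_grad)
print("match:", grads_match)
```

Here each $x_i$ feeds only one intermediate $z_i$, so the sum in the chain rule reduces to a single term per input.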

RL Connection

  • Policy networks, value functions, and Q-networks all rely on autograd for computing gradients such as

    \[\nabla_\theta J(\pi_\theta) = \mathbb{E}_\pi \big[\nabla_\theta \log \pi_\theta(a \mid s) \, G_t\big].\]
  • Autograd automates gradient computation for loss functions like:
    • Policy gradient loss
    • Temporal-difference (TD) error
    • Actor-critic combined objectives
  • This makes it easy to experiment and debug custom RL algorithms without manual gradient derivations.
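
To make the policy gradient concrete, here is a minimal sketch of a REINFORCE-style surrogate loss. The network sizes, the random states, and the placeholder returns are all assumptions made for illustration; a real agent would collect them from an environment.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical setup: 4-dimensional states, 3 discrete actions, batch of 5 steps
policy = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 3))
states = torch.randn(5, 4)
returns = torch.tensor([1.0, 0.5, -0.2, 2.0, 0.0])  # placeholder returns G_t

# Sample actions from pi_theta(. | s) and compute their log-probabilities
dist = Categorical(logits=policy(states))
actions = dist.sample()
log_probs = dist.log_prob(actions)  # log pi_theta(a_t | s_t), shape (5,)

# Surrogate loss: its gradient is a sample estimate of -grad J
pg_loss = -(log_probs * returns).mean()
pg_loss.backward()

all_have_grads = all(p.grad is not None for p in policy.parameters())
print(f"loss: {pg_loss.item():.4f}")
print("all parameters received gradients:", all_have_grads)
```

The minus sign appears because optimizers minimize, while the policy gradient is an ascent direction for $J$.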

Code:

# Define tensors with gradient tracking enabled
x = torch.tensor([2.0, -1.0], requires_grad=True)
w = torch.tensor([1.5, -3.0], requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)

# Forward pass: simple linear model y = w·x + b
y = torch.dot(w, x) + b

# Backward pass: compute gradients dy/dx, dy/dw, dy/db
y.backward()

print(f"Output y: {y.item():.3f}")
print("dy/dx:", x.grad)
print("dy/dw:", w.grad)
print("dy/db:", b.grad)

Output:

Output y: 6.500
dy/dx: tensor([ 1.5000, -3.0000])
dy/dw: tensor([ 2., -1.])
dy/db: tensor(1.)
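
One detail matters for the training loops later on: `.backward()` adds into `.grad` rather than overwriting it, which is why those loops call `opt.zero_grad()` before each backward pass. The small sketch below shows the effect on a toy tensor; the values are arbitrary.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(3w))/dw = [3, 3]
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without resetting: gradients are added, giving [6, 6]
(3 * w).sum().backward()
accumulated = w.grad.clone()

# Clear the stored gradient by hand, the job opt.zero_grad() does
# for every parameter an optimizer manages
w.grad = None
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```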

2. A Tiny MLP for Classification

A Multi-Layer Perceptron (MLP) is a feedforward neural network composed of linear layers and nonlinear activations. It learns a function $ f_\theta: \mathbb{R}^n \to \mathbb{R}^k $ by minimizing a loss (e.g., cross-entropy for classification).

Structure

For an input $x \in \mathbb{R}^n$:

\[h_1 = \sigma(W_1 x + b_1), \quad \hat{y} = \text{softmax}(W_2 h_1 + b_2)\]

where:

  • $W_i, b_i$ are learnable parameters,
  • $\sigma(\cdot)$ is a nonlinear activation (e.g., ReLU, tanh),
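  • The softmax maps logits to class probabilities. In the code for this section, the model returns raw logits and nn.CrossEntropyLoss applies the log-softmax internally.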

The model learns by minimizing:

\[\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^N y_i^\top \log \hat{y}_i\]

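using stochastic gradient descent or Adam, where $y_i$ is the one-hot label of sample $i$ and $\hat{y}_i$ is the predicted probability vector.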
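
The loss formula can be checked against PyTorch's built-in implementation. The sketch below uses made-up logits and labels (6 samples, 3 classes) and compares `nn.CrossEntropyLoss` with the formula written out by hand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(6, 3)                   # 6 samples, 3 classes (illustrative)
targets = torch.tensor([0, 2, 1, 1, 0, 2])   # integer class labels

# Built-in: takes raw logits and integer labels
builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual: one-hot y_i, log y_hat_i = log_softmax(logits_i),
# then -(1/N) * sum_i y_i^T log y_hat_i
y_onehot = F.one_hot(targets, num_classes=3).float()
log_y_hat = torch.log_softmax(logits, dim=1)
manual = -(y_onehot * log_y_hat).sum(dim=1).mean()

losses_match = torch.allclose(builtin, manual)
print(f"builtin: {builtin.item():.6f}  manual: {manual.item():.6f}")
print("match:", losses_match)
```

Using `log_softmax` instead of `log(softmax(...))` is the numerically safer way to write the same quantity.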

Intuition

Each hidden layer re-represents its input so that the classes become easier to separate with a linear boundary in the final layer.
Nonlinear activations allow the network to represent complex decision boundaries.

RL Connection

  • Policy networks and Q-networks in RL are often MLPs mapping states to:
    • Action probabilities (policy output via softmax)
    • Value estimates $ V_\theta(s) $
  • The same forward-backward mechanics apply:
    • Forward pass → compute action probabilities or value estimates.
    • Backward pass → compute gradient and update parameters.
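
As a sketch of how these two outputs can share one network, the hypothetical `ActorCritic` module below uses a common hidden layer with a softmax policy head and a scalar value head. The class name, layer sizes, and input dimensions are assumptions for illustration; many other layouts (for example, fully separate networks) are equally valid.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Hypothetical two-headed MLP: shared trunk, policy head, value head."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 32):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)           # V_theta(s)

    def forward(self, obs):
        h = self.trunk(obs)
        probs = torch.softmax(self.policy_head(h), dim=-1)  # action probabilities
        value = self.value_head(h).squeeze(-1)              # one value per state
        return probs, value

torch.manual_seed(0)
net = ActorCritic(obs_dim=4, n_actions=2)
obs = torch.randn(8, 4)  # batch of 8 made-up observations
probs, value = net(obs)
print(probs.shape, value.shape)
```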

Code:

# Reproducibility
np.random.seed(0)
torch.manual_seed(0)

# Data: two Gaussian blobs
n_per = 300
cov = 0.3 * np.eye(2)
X0 = np.random.multivariate_normal([0, 0], cov, size=n_per)
X1 = np.random.multivariate_normal([2, 2], cov, size=n_per)
X = np.vstack([X0, X1]).astype(np.float32)
y = np.concatenate([np.zeros(n_per), np.ones(n_per)]).astype(np.int64)

# Convert to tensors and build loader
X_t = torch.from_numpy(X)
y_t = torch.from_numpy(y)
ds = TensorDataset(X_t, y_t)
dl = DataLoader(ds, batch_size=64, shuffle=True)

# Model
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 32),
            nn.ReLU(),
            nn.Linear(32, 2)
        )
    def forward(self, x):
        return self.net(x)

print("Using device: ", device)
model = MLP().to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

# Train
loss_hist = []
for epoch in range(30):
    model.train()
    running = 0.0
    for xb, yb in dl:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        logits = model(xb)
        loss = criterion(logits, yb)
        loss.backward()
        opt.step()
        running += loss.item() * len(xb)
    loss_hist.append(running / len(ds))

# Evaluate accuracy
model.eval()
with torch.no_grad():
    logits = model(X_t.to(device))
    y_pred = torch.argmax(logits, dim=1).cpu().numpy()
acc = (y_pred == y).mean()
print(f"Training accuracy: {acc:.4f}")

# Plot loss
plt.figure(figsize=(6,3.5))
plt.plot(loss_hist)
plt.title("MLP training loss")
plt.xlabel("epoch"); plt.ylabel("loss")
plt.tight_layout(); plt.show()

# Decision boundary visualization
pad = 0.5
x_min, x_max = X[:,0].min() - pad, X[:,0].max() + pad
y_min, y_max = X[:,1].min() - pad, X[:,1].max() + pad
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                     np.linspace(y_min, y_max, 300))
grid = np.c_[xx.ravel(), yy.ravel()].astype(np.float32)

with torch.no_grad():
    logits_grid = model(torch.from_numpy(grid).to(device))
    pred_grid = torch.argmax(logits_grid, dim=1).cpu().numpy().reshape(xx.shape)

plt.figure(figsize=(5.6,4.8))
plt.contourf(xx, yy, pred_grid, alpha=0.25, levels=[-0.5,0.5,1.5])
plt.scatter(X0[:,0], X0[:,1], s=16, label="class 0", edgecolor="k")
plt.scatter(X1[:,0], X1[:,1], s=16, label="class 1", edgecolor="k")
plt.title("MLP decision boundary")
plt.xlabel("x1"); plt.ylabel("x2"); plt.legend()
plt.tight_layout(); plt.show()

Output:

Using device:  mps
Training accuracy: 1.0000

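[Figure: MLP training loss (loss vs. epoch)]

[Figure: MLP decision boundary (x1 vs. x2, class 0 and class 1)]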

3. Regression with MSE

In regression, the goal is to predict a continuous target $ y \in \mathbb{R} $ from input features $ x \in \mathbb{R}^n $. A neural model (or any differentiable function) $ f_\theta(x) $ is trained by minimizing the Mean Squared Error (MSE):

\[\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \big(f_\theta(x_i) - y_i\big)^2\]

Here:

  • $ f_\theta(x) $ → predicted value
  • $ y_i $ → ground truth
  • The negative gradient w.r.t. the parameters $ \theta $ gives the direction that locally reduces the prediction error most steeply.

Intuition

The model learns to fit a regression line (or surface) through the data that minimizes squared deviations between predictions and actual values.
Unlike classification, the output layer here is linear (no softmax or sigmoid).
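
For a purely linear model the MSE gradient has a closed form, $\frac{2}{N} X^\top (Xw - y)$, which makes a convenient sanity check. The sketch below uses random data and a bias-free linear model, both chosen only for illustration, and compares `nn.MSELoss` and autograd against the hand-written versions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative linear model f(x) = X @ w with N = 50 samples, 3 features
X = torch.randn(50, 3)
y = torch.randn(50)
w = torch.randn(3, requires_grad=True)

pred = X @ w
loss = nn.MSELoss()(pred, y)

# Same quantity written out: (1/N) * sum_i (f(x_i) - y_i)^2
manual_loss = ((pred - y) ** 2).mean()

loss.backward()

# Analytic gradient for the linear case: (2/N) * X^T (Xw - y)
with torch.no_grad():
    analytic_grad = (2.0 / X.shape[0]) * (X.T @ (X @ w - y))

loss_match = torch.allclose(loss, manual_loss)
grad_match = torch.allclose(w.grad, analytic_grad, atol=1e-5)
print("loss match:", loss_match, " grad match:", grad_match)
```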

RL Connection

  • In value function approximation, the critic predicts expected returns $ V_\theta(s) $.
  • The training signal comes from bootstrapped targets, e.g.:

    \[\mathcal{L}(\theta) = \mathbb{E}\big[(r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t))^2\big]\]
  • This is the same MSE objective, but the targets are non-stationary: they depend on the current value estimates and so shift as learning progresses.
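
A minimal sketch of one such critic update follows. The critic architecture, the random batch of transitions, the `done` mask, and the hyperparameters are all assumptions for illustration. The bootstrapped target is computed under `torch.no_grad()`, a common choice that treats it as a fixed regression label for this step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical critic and a made-up batch of transitions (s, r, s', done)
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
s = torch.randn(16, 4)
s_next = torch.randn(16, 4)
r = torch.randn(16)
done = torch.zeros(16)  # would be 1.0 where the episode ended at s'
gamma = 0.99

# Bootstrapped target r + gamma * V(s'), computed without gradient tracking
with torch.no_grad():
    td_target = r + gamma * (1.0 - done) * critic(s_next).squeeze(-1)

# Regress V(s) toward the target with the usual MSE loss
v = critic(s).squeeze(-1)
td_loss = F.mse_loss(v, td_target)

opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt.zero_grad()
td_loss.backward()
opt.step()
print(f"TD loss: {td_loss.item():.4f}")
```

The loop body is the same zero_grad / backward / step pattern used for supervised regression; only the way the target is produced differs.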

Understanding regression with MSE provides the mathematical and coding foundation for critic updates, Q-learning, and actor-critic algorithms in deep reinforcement learning.

Code:

# Synthetic data: y = 2x - 3 + noise
xs = np.linspace(-2, 2, 200, dtype=np.float32).reshape(-1, 1)
ys = (2 * xs - 3 + 0.3 * np.random.randn(*xs.shape)).astype(np.float32)

Xt = torch.from_numpy(xs).to(device)
Yt = torch.from_numpy(ys).to(device)

# Model: small MLP (easily fits the linear trend)
reg = nn.Sequential(
    nn.Linear(1, 32),
    nn.ReLU(),
    nn.Linear(32, 1)
).to(device)
print("Using device: ", device)

opt = torch.optim.SGD(reg.parameters(), lr=1e-2, momentum=0.9)
mse = nn.MSELoss()

# Train
losses = []
for epoch in range(200):
    opt.zero_grad()
    pred = reg(Xt)
    loss = mse(pred, Yt)
    loss.backward()
    opt.step()
    losses.append(loss.item())

# Metrics: MSE & R^2 on full dataset
with torch.no_grad():
    yp = reg(Xt).cpu().numpy()
mse_val = float(np.mean((ys - yp)**2))
ss_res = float(np.sum((ys - yp)**2))
ss_tot = float(np.sum((ys - ys.mean())**2))
r2 = 1.0 - ss_res / ss_tot
print(f"MSE: {mse_val:.4f}   R^2: {r2:.4f}")

# Plot training loss
plt.figure(figsize=(6, 3.5))
plt.plot(losses)
plt.title("Regression — Training Loss (MSE)")
plt.xlabel("epoch"); plt.ylabel("MSE")
plt.tight_layout(); plt.show()

# Visualize fit
plt.figure(figsize=(6, 3.5))
plt.plot(xs, ys, ".", alpha=0.55, label="data")
plt.plot(xs, yp, "-", label="prediction")
plt.title("Regression Fit: y ≈ f(x)")
plt.xlabel("x"); plt.ylabel("y")
plt.legend(); plt.tight_layout(); plt.show()

Output:

Using device:  mps
MSE: 0.0905   R^2: 0.9838

[Figure: regression training loss (MSE vs. epoch)]

[Figure: regression fit, data and prediction (y vs. x)]

Key Takeaways

  • Autograd — PyTorch automatically computes gradients via backpropagation.
  • nn.Module & Optimizers — Simplify model definition and parameter updates.
  • Foundation for RL — Core of policy, value, and world model training in deep RL.

Next: 05_gymnasium_basics.ipynb → explore RL environments and agent–environment loops.