APS360

Applied Fundamentals of Deep Learning

Winter 2026  ◊  Final Exam Review

Artificial Neural Networks, Part I

Week 2  •  Foundations of Training

Training a neural network is an iterative optimization process. At its core, the network makes a prediction (forward pass), measures how wrong it was (loss computation), calculates how each weight contributed to the error (backward pass), and then adjusts the weights to reduce that error (weight update). This cycle repeats until the network converges to a satisfactory level of performance.

The Training Loop

Every training iteration follows four steps:

  1. Forward Pass: Input data flows through the network layer by layer, producing a prediction y_hat.
  2. Loss Computation: A loss function measures the discrepancy between the prediction y_hat and the true target y.
  3. Backward Pass (Backpropagation): Gradients of the loss with respect to every weight are computed using the chain rule.
  4. Weight Update: Weights are adjusted in the direction that reduces the loss, scaled by the learning rate.
Training loop diagram showing forward pass, loss computation, backward pass, and weight update
Fig. 2.1 — The neural network training loop: forward pass, loss, backward pass, update

Loss Functions

The choice of loss function depends on the task type:

Mean Squared Error (MSE) — Regression

MSE = (1/n) ∑ (y_i - ŷ_i)²

MSE penalizes large errors quadratically. Suitable for continuous outputs (e.g., predicting housing prices, temperature).

Mean Squared Error loss function diagram
Fig. 2.2 — Mean Squared Error for regression tasks

Cross-Entropy Loss — Classification

CE = -∑ y_i · log(ŷ_i)

Cross-entropy measures the divergence between the predicted probability distribution and the true distribution. It is the standard loss for multi-class classification problems.

Cross-entropy loss function diagram
Fig. 2.3 — Cross-Entropy Loss for classification tasks

Binary Cross-Entropy — Two-Class Problems

BCE = -[y · log(ŷ) + (1 - y) · log(1 - ŷ)]

Used for binary classification (e.g., spam vs not spam, real vs fake). The output is a single sigmoid probability.

Softmax Function

The softmax function converts raw logits (unnormalized scores) into a valid probability distribution where all values sum to 1:

softmax(x_i) = e^(x_i) / ∑_k e^(x_k)

Each output is in the range (0, 1), and the outputs are mutually exclusive. Softmax is applied as the final layer in multi-class classification networks.

Softmax function transforming logits to probabilities
Fig. 2.4 — Softmax normalizes logits into a probability distribution
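The softmax formula above is easy to verify numerically (logit values chosen for illustration):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])   # raw, unnormalized scores
probs = torch.softmax(logits, dim=0)      # exponentiate and normalize

print(probs)        # tensor([0.6590, 0.2424, 0.0986])
print(probs.sum())  # sums to 1
```

Note that the largest logit gets the largest probability, and the outputs always form a valid distribution regardless of the scale of the inputs.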

One-Hot Encoding

For categorical labels, one-hot encoding converts each class into a binary vector with a single 1:

Cat  = [1, 0, 0]
Dog  = [0, 1, 0]
Bird = [0, 0, 1]

This enables cross-entropy loss to compare the predicted probability vector against the true label vector.
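With a one-hot target, only the true class contributes to the cross-entropy sum. A small check (predicted probabilities are made up):

```python
import torch

# True label "Dog" as a one-hot vector, and a predicted distribution
y = torch.tensor([0.0, 1.0, 0.0])        # [Cat, Dog, Bird]
y_hat = torch.tensor([0.1, 0.7, 0.2])    # model's softmax output

# CE = -∑ y_i · log(ŷ_i); only the true-class term survives
ce = -(y * torch.log(y_hat)).sum()
print(ce)   # -log(0.7) ≈ 0.3567
```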

Gradient Descent

Gradient descent is the optimization algorithm that adjusts weights to minimize the loss function:

w^(t+1) = w^t - γ · ∂E/∂w

Where γ (gamma) is the learning rate, controlling the step size. The gradient ∂E/∂w indicates the direction of steepest ascent, so we move in the negative direction.
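The update rule can be sketched on a one-dimensional loss (toy example, values chosen for illustration):

```python
# Minimize E(w) = (w - 3)^2 by hand; dE/dw = 2(w - 3)
w = 0.0
gamma = 0.1                  # learning rate

for _ in range(100):
    grad = 2 * (w - 3)       # gradient points toward steepest ascent
    w = w - gamma * grad     # step in the negative gradient direction

print(round(w, 4))           # 3.0 -- converges to the minimum
```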

Backpropagation

Backpropagation computes gradients efficiently using the chain rule of calculus. For a composition of functions f(g(h(x))), the derivative is:

df/dx = (df/dg) · (dg/dh) · (dh/dx)

Starting from the loss, gradients flow backward through each layer, allowing every weight in the network to be updated proportionally to its contribution to the error.
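This chain-rule bookkeeping is exactly what PyTorch's autograd automates. A minimal check with h(x) = x², g(h) = 3h, f(g) = g + 1:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
h = x ** 2          # dh/dx = 2x = 4 at x=2
g = 3 * h           # dg/dh = 3
f = g + 1           # df/dg = 1

f.backward()        # chain rule: df/dx = 1 · 3 · 2x
print(x.grad)       # tensor(12.)
```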

Activation Functions

Activation functions introduce non-linearity, allowing networks to learn complex patterns:

Function   | Formula                                   | Range   | Notes
Sigmoid    | σ(x) = 1 / (1 + e^(-x))                   | (0, 1)  | Output interpretable as probability; suffers vanishing gradients
Tanh       | tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | Zero-centered; still suffers vanishing gradients
ReLU       | f(x) = max(0, x)                          | [0, ∞)  | Most popular; solves vanishing gradient; can have "dead neurons"
Leaky ReLU | f(x) = max(0.01x, x)                      | (-∞, ∞) | Small slope for negatives; prevents dead neurons

Key Insight

Vanishing Gradient Problem: Sigmoid and Tanh squash inputs into small ranges. When many layers are stacked, gradients shrink exponentially during backpropagation (multiplying many small numbers). ReLU solves this because its gradient is either 0 or 1, allowing gradients to flow unchanged through active neurons.

# PyTorch: Simple Neural Network
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10)   # raw logits -- no Softmax here:
)                       # nn.CrossEntropyLoss applies log-softmax internally

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training step
optimizer.zero_grad()       # Clear gradients from the previous step
output = model(x)           # Forward pass (logits)
loss = criterion(output, y) # Loss
loss.backward()             # Backward pass
optimizer.step()            # Weight update
◊ ◊ ◊

Artificial Neural Networks, Part II

Week 3  •  Hyperparameters, Optimization & Regularization

While model parameters (weights and biases) are learned during training, hyperparameters are set before training begins and govern the learning process itself. Choosing the right hyperparameters is critical: they determine the architecture, the optimization dynamics, and ultimately whether the network generalizes well to unseen data.

Hyperparameters vs Parameters

Parameters (weights, biases) are optimized during the inner training loop via gradient descent. Hyperparameters are set in the outer loop and include the learning rate, batch size, number of layers and units per layer, choice of optimizer and activation function, and regularization strength:

Hyperparameter optimization flow diagram showing inner and outer loops
Fig. 3.1 — Hyperparameter optimization: the outer loop wraps around the training inner loop

Hyperparameter Search

Two main strategies for finding good hyperparameters: grid search, which evaluates every combination on a fixed grid, and random search, which samples combinations at random.

Grid search versus random search comparison
Fig. 3.2 — Grid search vs random search: random search covers more unique values per hyperparameter

Optimizers

Stochastic Gradient Descent (SGD)

Updates weights using a single training sample (or mini-batch) at a time. Noisier than batch gradient descent but computationally cheaper and can escape local minima.

SGD, mini-batch, and batch gradient descent comparison
Fig. 3.3 — SGD (1 sample) vs mini-batch (n samples) vs batch (all samples) gradient descent
Variant       | Batch Size       | Pros                                   | Cons
SGD           | 1                | Fast updates, can escape local minima  | Very noisy, high variance
Mini-batch GD | n (e.g., 32-256) | Balanced noise/stability, GPU efficient | Requires tuning batch size
Batch GD      | All samples      | Stable convergence, low variance       | Slow, memory expensive, stuck in local minima

SGD with Momentum

Momentum accumulates a "velocity" from past gradients to smooth out oscillations and accelerate convergence:

v_t = λ · v_(t-1) - γ · ∂E/∂w
w_(t+1) = w_t + v_t

Where λ is the momentum coefficient (typically 0.9). Think of a ball rolling down a hill — it builds up speed and can roll past small bumps.

SGD with momentum diagram showing velocity accumulation
Fig. 3.4 — Momentum dampens oscillations and accelerates convergence in consistent gradient directions

Adam Optimizer

Adam (Adaptive Moment Estimation) is the most commonly used optimizer. It combines momentum with adaptive per-parameter learning rates:

Adam optimizer combining momentum and adaptive learning rates
Fig. 3.5 — Adam: adaptive learning rates with momentum for each parameter

Key Insight

When in doubt, use Adam. It works well with default hyperparameters (lr=0.001, β1=0.9, β2=0.999) and requires minimal tuning. SGD with momentum can sometimes generalize better but requires more careful learning rate scheduling.

Learning Rate

The learning rate is arguably the most important hyperparameter:

Effect of learning rate size on training
Fig. 3.6 — Learning rate too small (slow), too large (diverge), and appropriate
Appropriate learning rate showing smooth convergence
Fig. 3.7 — An appropriately chosen learning rate yields smooth loss decrease

Learning Rate Schedules: Reduce the learning rate during training for fine-grained convergence. Common schedules include step decay (drop the rate by a factor every N epochs), exponential decay, and cosine annealing.

Batch Size Tradeoffs

Batch size affects both training dynamics and generalization: small batches give noisy gradient estimates that can act as a mild regularizer, while large batches give stable gradients and better hardware utilization but may generalize slightly worse.

Normalization

Input Normalization (Standardization): Scale inputs to zero mean and unit variance. Helps optimization by making the loss landscape more spherical.

x_normalized = (x - μ) / σ

Batch Normalization: Normalize activations within each mini-batch at each layer. Reduces internal covariate shift, allows higher learning rates, acts as mild regularization.
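The standardization formula above can be checked directly on a toy feature matrix (values below are made up):

```python
import torch

# 3 samples, 2 features on very different scales
x = torch.tensor([[10.0, 200.0],
                  [20.0, 400.0],
                  [30.0, 600.0]])

# x_normalized = (x - μ) / σ, per feature (column)
mu = x.mean(dim=0)
sigma = x.std(dim=0, unbiased=False)
x_norm = (x - mu) / sigma

print(x_norm.mean(dim=0))                  # ~0 per feature
print(x_norm.std(dim=0, unbiased=False))   # ~1 per feature
```

Batch normalization applies the same idea to the activations of each layer, using per-mini-batch statistics plus learned scale and shift parameters.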

Regularization

Techniques to prevent overfitting (the model memorizing training data) include L2 regularization (weight decay), dropout, early stopping, and data augmentation:

Evaluation Strategy

Data is split into three sets: training (fit the weights), validation (tune hyperparameters and detect overfitting), and test (final, unbiased performance estimate).

Overfitting: low training loss, high validation loss. The model memorizes the training data.
Underfitting: high training loss, high validation loss. The model is too simple.

# PyTorch: Typical training setup with regularization
model = MyNetwork()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(num_epochs):
    model.train()
    for x_batch, y_batch in train_loader:
        output = model(x_batch)
        loss = criterion(output, y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val)
    # Early stopping: if val_loss increases for N epochs, stop
◊ ◊ ◊

Convolutional Neural Networks, Part I

Week 4  •  Convolution, Filters & Feature Detection

Applying fully connected networks to images is impractical. A 256×256 color image has 196,608 input features; connecting each to even 1,000 hidden neurons yields nearly 200 million parameters in the first layer alone. Convolutional Neural Networks solve this by exploiting spatial structure: local connectivity, weight sharing, and translation equivariance.

Why Not Fully Connected?

Fully connected layers scale poorly with image size, ignore spatial locality (nearby pixels are strongly correlated), and lack translation invariance: a pattern learned in one corner of the image must be relearned for every other position.

The Convolution Operation

A convolution slides a small filter (kernel) over the input, computing element-wise products and summing the results at each position. The filter acts as a feature detector.

2D convolution calculation showing kernel sliding over input
Fig. 4.1 — 2D convolution: kernel slides over input, element-wise multiply and sum
Sliding convolution animation showing kernel moving across the image
Fig. 4.2 — Convolution sliding: the kernel moves across the image computing dot products

Output Size Formula

Output size = ⌊(n - f + 2p) / s⌋ + 1

Where n = input size, f = filter size, p = padding, s = stride. The floor handles strides that do not divide the input evenly.
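The output-size formula is easy to check in code (floor division handles strides that do not divide evenly):

```python
def conv_output_size(n, f, p=0, s=1):
    """Output size = floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(32, 3, p=1, s=1))  # 32: "same" padding preserves size
print(conv_output_size(32, 3, p=0, s=2))  # 15: stride 2 roughly halves
print(conv_output_size(28, 5))            # 24: 5x5 kernel, no padding
```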

Filter Types

Different kernels detect different features:

Averaging (blur) filter example
Fig. 4.3 — Averaging filter produces a blurred output
Filter           | Kernel                                      | Detects
Averaging        | [1/9, 1/9, 1/9; 1/9, 1/9, 1/9; 1/9, 1/9, 1/9] | Blurring / smoothing
Sobel Vertical   | [1, 0, -1; 2, 0, -2; 1, 0, -1]              | Vertical edges
Sobel Horizontal | [1, 2, 1; 0, 0, 0; -1, -2, -1]              | Horizontal edges
Laplacian        | [0, 1, 0; 1, -4, 1; 0, 1, 0]                | Blobs / edges (second derivative)
Edge detection filter applied to an image
Fig. 4.4 — Sobel edge detection: vertical and horizontal edges highlighted

Stride and Padding

Stride: How many pixels the kernel moves between positions. Stride 1 = maximum overlap. Stride 2 = halves spatial dimensions.

Padding: Adding zeros around the border to control output size.

Multiple Channels

For RGB images (H × W × 3), the filter also has 3 channels. The convolution computes element-wise products across all channels and sums everything into a single 2D output. Multiple filters produce multiple output channels (feature maps).

Learned CNN kernels for face detection
Fig. 4.5 — CNNs learn filters automatically: these learned kernels detect facial features

Key Insight

CNNs learn filters. Instead of hand-designing edge detectors or blob detectors, CNNs learn optimal filters through backpropagation. The network discovers what features matter for the task. Early layers learn simple edges, deeper layers learn complex patterns like textures and object parts.

# PyTorch: Convolution layers
import torch.nn as nn

# Input: batch of RGB images (B, 3, 32, 32)
conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
# Output: (B, 16, 32, 32) -- "same" padding preserves spatial dims

conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=2, padding=0)
# Output: (B, 32, 15, 15) -- stride=2 halves, no padding shrinks further

# Output size = floor((32 - 3 + 2*0) / 2) + 1 = floor(14.5) + 1 = 15
◊ ◊ ◊

Convolutional Neural Networks, Part II

Week 5  •  Architectures, Transfer Learning & Visualization

A complete CNN architecture consists of two parts: the encoder (feature extractor) that learns hierarchical representations through convolutional layers, and the classifier head that maps learned features to output predictions. Understanding landmark architectures and how to reuse them through transfer learning is essential for modern deep learning practice.

CNN Architecture Pipeline

Input → [Conv + ReLU → Pooling]* → Flatten → FC → Softmax

The convolutional blocks form the encoder (feature learning). The fully connected layers form the classifier head. Pooling (max or average) reduces spatial dimensions progressively.

Complete CNN pipeline from input to classification
Fig. 5.1 — CNN pipeline: encoder extracts features, classifier maps them to predictions
Multi-channel convolution producing a single value
Fig. 5.2 — Multi-channel convolution: all channels contribute to one output value
Multiple filters producing multiple feature maps
Fig. 5.3 — Multiple filters: each filter produces one feature map, stacked to form output volume

Visualizing CNN Features

What do CNNs actually learn? Visualization reveals a hierarchy:

Feature hierarchy: edges to textures to object parts
Fig. 5.4 — Feature hierarchy in deep CNNs: simple to complex representations

Saliency Maps

Saliency maps compute the gradient of the output class score with respect to the input image pixels. High-gradient regions indicate which pixels most influenced the classification decision.

Saliency map showing important regions for classification
Fig. 5.5 — Saliency maps: bright regions are most important for the model's prediction

Landmark Architectures

LeNet-5 (LeCun, 1989/1998)

The first successful CNN. Designed for handwritten digit recognition (MNIST). 7 layers: 2 convolutional + 2 pooling + 3 fully connected. Input: 32×32×1.

LeNet-5 architecture diagram
Fig. 5.6 — LeNet-5: the architecture that started it all

AlexNet (Krizhevsky et al., 2012)

The network that launched the deep learning revolution by winning ILSVRC 2012 with a ~10% improvement over the runner-up. Key innovations: ReLU activations, dropout regularization, aggressive data augmentation, and training on GPUs.

AlexNet architecture showing 8 layers
Fig. 5.7 — AlexNet architecture: the catalyst for modern deep learning

VGGNet (Simonyan & Zisserman, 2014)

Showed that depth matters. Uses only 3×3 filters throughout. VGG-16 (16 layers) and VGG-19 (19 layers) demonstrated that stacking small filters is more effective than using large ones.

ResNet (He et al., 2015)

Introduced skip connections (residual connections) that add the input of a block directly to its output:

output = F(x) + x

This solves the degradation problem (deeper networks performing worse than shallower ones) by allowing gradients to flow directly through skip connections. Enables training networks with 100+ layers (ResNet-50, ResNet-101, ResNet-152).

Key Insight

Skip connections are revolutionary. Without them, very deep networks suffer from the degradation problem — not because of overfitting, but because optimization becomes too difficult. Residual connections let the network learn "corrections" to the identity mapping, making it easy for layers to learn the identity function if that's optimal.
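A minimal residual block implementing output = F(x) + x might look like the following sketch (assumes the channel count is unchanged; real ResNet blocks also have downsampling variants with a projection on the skip path):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # skip connection: F(x) + x

block = ResidualBlock(16)
x = torch.randn(1, 16, 8, 8)
print(block(x).shape)   # torch.Size([1, 16, 8, 8]) -- shape preserved
```

If the convolutions learn F(x) ≈ 0, the block reduces to the identity, which is why adding such blocks never has to hurt a shallower network.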

Transfer Learning

Instead of training from scratch, use a pre-trained model (e.g., trained on ImageNet with millions of images) and adapt it to your task:

# PyTorch: Transfer Learning with ResNet
import torchvision.models as models
import torch.nn as nn

# Load pre-trained ResNet-18
model = models.resnet18(pretrained=True)  # newer torchvision: weights=models.ResNet18_Weights.DEFAULT

# Freeze encoder
for param in model.parameters():
    param.requires_grad = False

# Replace classifier head
model.fc = nn.Linear(512, num_classes)  # Only this layer trains

# For fine-tuning, unfreeze later layers:
# for param in model.layer4.parameters():
#     param.requires_grad = True

Data Augmentation

Artificially increase training data by applying random transformations:

# PyTorch: Data Augmentation
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
◊ ◊ ◊

Unsupervised Learning

Week 6  •  Autoencoders & Variational Autoencoders

Supervised learning requires labeled data, which is expensive and time-consuming to collect. Unsupervised learning discovers structure and patterns in data without labels. Autoencoders learn compressed representations by reconstructing their own input, while Variational Autoencoders extend this idea to generate new, realistic data by learning smooth, continuous latent spaces.

Autoencoders

An autoencoder has two parts: an encoder that compresses the input into a low-dimensional latent representation (the bottleneck), and a decoder that reconstructs the input from that representation.

The hourglass shape forces the network to learn the most important features of the data — it cannot simply copy the input through the bottleneck.

Autoencoder architecture with encoder, bottleneck, and decoder
Fig. 6.1 — Autoencoder architecture: input is compressed and then reconstructed
Autoencoder bottleneck forcing feature learning
Fig. 6.2 — The bottleneck forces the network to learn meaningful compressed representations

Training Autoencoders

The loss function is reconstruction loss — MSE between input and output:

L = (1/n) ∑ (x_i - x̂_i)²

Crucially, the input IS the target. The network learns an identity function, but constrained by the bottleneck to capture only the most salient features.

Stacked Autoencoders

Multiple hidden layers in both encoder and decoder, typically symmetrical. Deeper autoencoders can learn more complex, hierarchical representations.

Stacked autoencoder with multiple layers
Fig. 6.3 — Stacked autoencoder: deeper architecture for richer representations

Denoising Autoencoders

Add Gaussian noise to the input but train to reconstruct the clean original. This prevents the network from learning a trivial identity mapping and forces it to capture robust features.

Denoising autoencoder: noisy input, clean target
Fig. 6.4 — Denoising autoencoder: corrupted input, clean reconstruction target
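One denoising training step might look like the following sketch (the model, loss, and optimizer are assumed to exist; noise_std is an illustrative choice):

```python
import torch

def denoising_step(model, x_clean, criterion, optimizer, noise_std=0.3):
    # Corrupt the input with Gaussian noise...
    x_noisy = x_clean + noise_std * torch.randn_like(x_clean)
    x_recon = model(x_noisy)            # ...reconstruct from the noisy version
    loss = criterion(x_recon, x_clean)  # ...but the target is the CLEAN input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the noise differs every step, the network cannot memorize a fixed input-to-output mapping and must learn features that survive corruption.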

Applications of Autoencoders

Common applications include dimensionality reduction, denoising, anomaly detection (inputs with high reconstruction error are flagged as outliers), and image generation via latent-space interpolation.

Generating Images via Interpolation

Encode two images to get their latent vectors, linearly interpolate between them, and decode the intermediate points to produce smooth transitions between images.

Interpolation in latent space between two images
Fig. 6.5 — Interpolation: smooth transitions by decoding latent space points between two images
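A sketch of the procedure (encoder and decoder are assumed to be the two halves of a trained autoencoder):

```python
import torch

def interpolate(encoder, decoder, img_a, img_b, steps=8):
    z_a, z_b = encoder(img_a), encoder(img_b)
    frames = []
    for alpha in torch.linspace(0, 1, steps):
        z = (1 - alpha) * z_a + alpha * z_b   # linear blend in latent space
        frames.append(decoder(z))             # decode the intermediate point
    return frames
```

The first and last frames reproduce the two inputs; the middle frames are novel images the network never saw during training.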

The Problem with Standard Autoencoders

Standard autoencoders produce latent spaces that can be disjoint and non-continuous. Random sampling from such a space produces garbage output because there are "holes" where no training data maps.

Disjoint latent space problem in standard autoencoders
Fig. 6.6 — Problem: standard AE latent space is discontinuous — random sampling fails

Variational Autoencoders (VAE)

VAEs solve the generation problem by making the latent space smooth and continuous: instead of mapping each input to a single latent point, the encoder outputs a distribution over latent vectors (a mean μ and variance σ²), and a sample z is drawn from that distribution before decoding.

VAE Loss Function

L = Reconstruction Loss + KL Divergence
L = MSE(x, x̂) + D_KL(q(z|x) || p(z))

The reconstruction loss ensures the output looks like the input. The KL divergence regularizes the latent space to be smooth and close to N(0, 1), enabling meaningful interpolation and random sampling.

Reparameterization Trick

Sampling is a non-differentiable operation. The reparameterization trick makes it differentiable:

z = μ + σ · ε,   where ε ~ N(0, 1)

The randomness is isolated in ε (which doesn't depend on parameters), so gradients can flow through μ and σ during backpropagation.

Key Insight

VAE = Autoencoder + Probabilistic Latent Space. The key innovation is forcing the encoder to output distributions and regularizing them with KL divergence. This creates a smooth, continuous latent space where nearby points decode to similar outputs, enabling both generation (random sampling) and meaningful interpolation.

# PyTorch: VAE
class VAE(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU())
        self.fc_mu = nn.Linear(128, latent_dim)
        self.fc_logvar = nn.Linear(128, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid())

    def reparameterize(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)       # epsilon ~ N(0,1)
        return mu + std * eps             # z = mu + sigma * eps

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = self.reparameterize(mu, logvar)
        return self.decoder(z), mu, logvar

# VAE Loss
def vae_loss(x_recon, x, mu, logvar):
    recon = nn.functional.mse_loss(x_recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
◊ ◊ ◊

Recurrent Neural Networks, Part I

Week 7  •  Word Embeddings: word2vec & GloVe

Before we can process text with neural networks, we need to represent words as numbers. One-hot encoding creates sparse, high-dimensional vectors with no notion of similarity. Word embeddings solve this by learning dense, low-dimensional vectors where semantically similar words are close together in the vector space.

The Problem with One-Hot Encoding

With a vocabulary of 50,000 words, each word is a 50,000-dimensional vector with a single 1. Problems: the vectors are sparse and extremely high-dimensional, every pair of words is equally distant (no notion of similarity), and the encoding carries no semantic information.

Word Embeddings

Learned dense vectors (typically 50-300 dimensions) where semantically similar words are close together and directions in the space can capture relationships (e.g., king - man + woman ≈ queen).

word2vec

A family of architectures for learning word embeddings from text. Two main variants: CBOW and Skip-Gram.

word2vec CBOW vs SkipGram architectures
Fig. 7.1 — word2vec: CBOW predicts target from context; Skip-Gram predicts context from target

Skip-Gram

Given a center (target) word, predict the surrounding context words within a window. The training creates (center, context) pairs:

Skip-Gram training pairs from a sentence
Fig. 7.2 — Skip-Gram: generating training pairs from a sliding window
Skip-Gram architecture with input, hidden, and output layers
Fig. 7.3 — Skip-Gram neural network architecture

After training, only the encoder (input-to-hidden) weights are kept. These weights ARE the word embeddings.

After training, output layer is discarded
Fig. 7.4 — After training, the output layer is discarded; hidden weights become the embeddings

CBOW (Continuous Bag of Words)

The reverse: given the context words, predict the center word. Generally faster to train, better for frequent words.

CBOW architecture predicting center word from context
Fig. 7.5 — CBOW: context words predict the center word
CBOW vs SkipGram comparison table
Fig. 7.6 — Comparison: CBOW is faster, Skip-Gram is better for rare words
Property   | CBOW            | Skip-Gram
Input      | Context words   | Center word
Output     | Center word     | Context words
Speed      | Faster training | Slower training
Rare words | Worse           | Better

GloVe (Global Vectors)

Unlike word2vec (which uses local context windows), GloVe uses global co-occurrence statistics. It constructs a co-occurrence matrix from the entire corpus and learns embeddings where:

w_i · w_j ≈ log(X_ij)

Where X_ij is the co-occurrence count of words i and j. The inner product of word vectors approximates the logarithm of their co-occurrence frequency.

Distance Measures

Key Insight

Cosine similarity is preferred for word embeddings because we care about the direction (semantic meaning) of vectors, not their magnitude. Two vectors pointing in the same direction are similar regardless of their length. Euclidean distance can be misleading when vectors have different magnitudes.

# PyTorch: Using pre-trained embeddings
import torch.nn as nn

# Embedding layer: lookup table mapping word indices to vectors
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=300)

# Using pre-trained GloVe
# embedding.weight = nn.Parameter(glove_vectors)
# embedding.weight.requires_grad = False  # Freeze if using as fixed features

# Cosine similarity
cos_sim = nn.CosineSimilarity(dim=1)
similarity = cos_sim(embedding(word1_idx), embedding(word2_idx))
◊ ◊ ◊

Recurrent Neural Networks, Part II

Week 8  •  RNNs, LSTMs & GRUs

Many real-world problems involve sequential data: time series, natural language, audio, video. Unlike feedforward networks that process fixed-size inputs, Recurrent Neural Networks maintain a hidden state that captures information from previous time steps, enabling them to handle variable-length sequences. However, standard RNNs struggle with long-range dependencies, leading to the development of gated architectures like LSTM and GRU.

RNN Architecture

An RNN applies the same neural network at each time step, maintaining a hidden state that carries information forward:

h_t = σ_h(W_h · x_t + U_h · h_(t-1) + b_h)
y_t = σ_y(W_y · h_t + b_y)

The hidden state h_t is a function of both the current input x_t and the previous hidden state h_(t-1). Weights are shared across all time steps.

RNN cell diagram showing input, hidden state, and output
Fig. 8.1 — RNN cell: same weights applied at each time step with recurrent hidden state
Unrolled RNN showing shared weights across time steps
Fig. 8.2 — Unrolled RNN: same network applied repeatedly, hidden state carries context forward
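The recurrence above can be written out directly, with the same weights reused at every time step (toy dimensions, random weights):

```python
import torch

torch.manual_seed(0)
W_h = torch.randn(4, 3)    # input-to-hidden weights
U_h = torch.randn(4, 4)    # hidden-to-hidden weights (shared across time)
b_h = torch.zeros(4)

h = torch.zeros(4)                    # initial hidden state h_0
for x_t in torch.randn(5, 3):         # sequence of 5 input vectors
    h = torch.tanh(W_h @ x_t + U_h @ h + b_h)   # h_t from x_t and h_(t-1)

print(h.shape)   # torch.Size([4]) -- final state summarizes the sequence
```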

RNN Types by Input/Output

Many-to-one RNN for sequence classification
Fig. 8.3 — Sequence-level prediction: use the last hidden state for classification
Many-to-many RNN for token-level predictions
Fig. 8.4 — Token-level prediction: each hidden state produces an output
Type                       | Input        | Output       | Example
Many-to-One                | Sequence     | Single label | Sentiment analysis, text classification
One-to-Many                | Single input | Sequence     | Image captioning, music generation
Many-to-Many (same length) | Sequence     | Sequence     | POS tagging, named entity recognition
Many-to-Many (diff length) | Sequence     | Sequence     | Machine translation (encoder-decoder)

Vanishing and Exploding Gradients

During backpropagation through time, gradients are repeatedly multiplied by the recurrent weight matrix W_h. Ignoring the inputs and the nonlinearity, the hidden state after t steps behaves like:

h_t ≈ (W_h)^t · h_0

If the largest eigenvalue of W_h is greater than 1, gradients explode. If less than 1, gradients vanish. This means standard RNNs cannot effectively learn long-range dependencies.

Vanishing gradients in RNNs
Fig. 8.5 — Vanishing gradients: earlier time steps receive negligible gradient signal
Mathematical explanation of vanishing and exploding gradients
Fig. 8.6 — Mathematical basis: repeated matrix multiplication causes exponential growth or decay
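The exponential growth or decay is easy to demonstrate with a toy diagonal W_h (scales chosen for illustration):

```python
import torch

# Repeated multiplication by W_h: behavior depends on the largest eigenvalue
h0 = torch.ones(2)

for scale, label in [(0.5, "vanishes:"), (1.5, "explodes:")]:
    W = scale * torch.eye(2)        # eigenvalues are exactly `scale`
    h = h0.clone()
    for _ in range(20):             # 20 "time steps"
        h = W @ h
    print(label, h.norm().item())   # 0.5^20 ≈ 1e-6 vs 1.5^20 ≈ 3.3e3
```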

Solutions: gradient clipping caps the gradient norm to prevent explosion, while gated architectures (LSTM, GRU) create additive update paths that mitigate vanishing.

LSTM (Long Short-Term Memory)

LSTMs solve vanishing gradients by introducing a separate cell state (long-term memory) alongside the hidden state (short-term memory). Three gates control information flow:

LSTM overview showing cell state and three gates
Fig. 8.7 — LSTM architecture: cell state (top conveyor belt) with three control gates

1. Forget Gate

Decides what to forget from the cell state:

f_t = σ(W_f · [h_(t-1), x_t] + b_f)
LSTM forget gate diagram
Fig. 8.8 — Forget gate: sigmoid output (0 to 1) controls how much of past cell state to retain

2. Input Gate

Decides what new information to add:

i_t = σ(W_i · [h_(t-1), x_t] + b_i)
C̃_t = tanh(W_C · [h_(t-1), x_t] + b_C)
LSTM input gate diagram
Fig. 8.9 — Input gate: determines how much of the new candidate information to store

3. Cell State Update

C_t = f_t · C_(t-1) + i_t · C̃_t

Old cell state, selectively forgotten, plus new candidate values, selectively written.

LSTM cell state update
Fig. 8.10 — Cell update: combines forgotten past with new input

4. Output Gate

Decides what to output as hidden state:

o_t = σ(W_o · [h_(t-1), x_t] + b_o)
h_t = o_t · tanh(C_t)
LSTM output gate diagram
Fig. 8.11 — Output gate: filtered cell state becomes the new hidden state

Key Insight

Why LSTMs work: The cell state acts as a "conveyor belt" where information flows with only minor linear interactions (multiply by forget gate, add new info). Unlike the hidden state in vanilla RNNs (which undergoes matrix multiplication + nonlinearity at every step), the cell state update is additive. This makes it much easier for gradients to flow backward through many time steps without vanishing.

GRU (Gated Recurrent Unit)

A simplified version of LSTM with two gates (instead of three). Combines the forget and input gates into a single update gate, and merges cell state and hidden state:

GRU architecture with update and reset gates
Fig. 8.12 — GRU: simpler than LSTM with similar performance. Update gate + reset gate
Property    | LSTM                             | GRU
Gates       | 3 (forget, input, output)        | 2 (update, reset)
States      | Cell state + hidden state        | Hidden state only
Parameters  | More (heavier)                   | Fewer (lighter)
Performance | Slightly better on long sequences | Comparable, trains faster

# PyTorch: LSTM for sequence classification
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)           # (B, T, embed_dim)
        output, (h_n, c_n) = self.lstm(embedded) # h_n: last hidden state
        logits = self.fc(h_n.squeeze(0))       # Use last hidden state
        return logits

# GRU: simply replace nn.LSTM with nn.GRU
# self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
# output, h_n = self.gru(embedded)  # No cell state returned
◊ ◊ ◊

Generative Adversarial Networks

Week 9  •  GANs, DCGAN & Adversarial Training

While discriminative models learn the boundary between classes (p(y|x)), generative models learn the underlying data distribution itself (p(x)). Generative Adversarial Networks frame this learning problem as a game between two competing networks: a generator that creates fake data and a discriminator that tries to distinguish real from fake.

Generative vs. Discriminative Models

Discriminative vs generative model comparison
Fig. 9.1 — Discriminative models learn decision boundaries; generative models learn data distributions

Families of Generative Models

Major families include autoregressive models (generate data one element at a time), Variational Autoencoders (explicitly model a latent distribution, as in Week 6), and GANs (learn the distribution implicitly through an adversarial game).

Why Not Just Autoencoders?

Autoencoders with MSE loss produce blurry images. MSE penalizes pixel-level differences, causing the model to predict the average pixel value (hedging its bets). GANs avoid this because the discriminator provides a more sophisticated, adversarial loss signal.

GAN Architecture

GAN architecture: generator and discriminator in adversarial setup
Fig. 9.2 — GAN architecture: generator creates fake data, discriminator classifies real vs fake
Generator and discriminator input/output specifications
Fig. 9.3 — Generator: noise → fake image. Discriminator: image → real/fake probability

The MinMax Game

min_G max_D V(D, G) = E_(x~p_data)[log D(x)] + E_(z~p_z)[log(1 - D(G(z)))]

The discriminator D maximizes V (classifying real and fake correctly), while the generator G minimizes it (fooling D).

GAN Training Algorithm

  1. Train Discriminator: Sample real images and fake images (from G). Train D to classify them correctly using BCE loss.
  2. Train Generator: Generate fake images, pass through D. Train G to maximize D's probability of classifying fakes as real. Do NOT update D during this step.
  3. Alternate between steps 1 and 2.

Conditional vs. Unconditional

An unconditional GAN generates samples from noise alone. A conditional GAN additionally feeds a label y to both the generator and the discriminator, allowing control over which class of output is produced.

Training Instabilities

GAN training is notoriously unstable. Common failure modes include mode collapse (the generator produces only a few distinct outputs), oscillation without convergence, and vanishing generator gradients when the discriminator becomes too strong.

DCGAN (Deep Convolutional GAN)

Uses convolutional layers for both G and D. Key design principles: replace pooling with strided convolutions (discriminator) and transposed convolutions (generator), use batch normalization in both networks, use ReLU in the generator (Tanh on the output) and LeakyReLU in the discriminator, and remove fully connected hidden layers.

PyTorch DCGAN discriminator code
Fig. 9.4 — DCGAN discriminator: Conv2d layers with LeakyReLU and BatchNorm
PyTorch DCGAN generator code
Fig. 9.5 — DCGAN generator: ConvTranspose2d layers with ReLU and BatchNorm

Key Insight

GANs learn through competition. The generator never sees real data directly — it only receives gradient signals from the discriminator. As the discriminator gets better at detecting fakes, the generator must produce more realistic outputs to fool it. This adversarial dynamic pushes both networks to improve, but makes training inherently unstable compared to standard supervised learning.

# PyTorch: Simple GAN Training Loop
import torch
import torch.nn as nn

criterion = nn.BCELoss()
optim_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
optim_G = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))

for epoch in range(num_epochs):
    for real_images, _ in dataloader:
        batch_size = real_images.size(0)
        real_labels = torch.ones(batch_size, 1)
        fake_labels = torch.zeros(batch_size, 1)

        # --- Train Discriminator ---
        z = torch.randn(batch_size, latent_dim)
        fake_images = G(z).detach()          # Don't update G here

        loss_D = criterion(D(real_images), real_labels) + \
                 criterion(D(fake_images), fake_labels)
        optim_D.zero_grad()
        loss_D.backward()
        optim_D.step()

        # --- Train Generator ---
        z = torch.randn(batch_size, latent_dim)
        fake_images = G(z)
        loss_G = criterion(D(fake_images), real_labels)  # Fool D
        optim_G.zero_grad()
        loss_G.backward()
        optim_G.step()
◊ ◊ ◊

Transformers

Week 10  •  Attention, Self-Attention & the Transformer Architecture

Recurrent architectures process sequences one step at a time, creating an inherent bottleneck for parallelization and making it difficult to capture long-range dependencies. The Transformer architecture, introduced in "Attention Is All You Need" (2017), replaces recurrence entirely with attention mechanisms, enabling massive parallelization and direct connections between any two positions in a sequence.

RNN Limitations

Comparison of RNN, LSTM, and GRU architectures
Fig. 10.1 — RNN vs LSTM vs GRU: all process sequentially, limiting parallelization

Attention Mechanism

Attention allows the model to focus on different parts of the input with different weights, rather than compressing everything into a single vector.

Attention heatmaps showing which input words are attended to
Fig. 10.2 — Attention heatmaps: the model learns to focus on relevant parts of the input

Simple Attention

For each position, compute a weighted sum of all positions' representations:

  1. Compute a score for each input position (via FC network, dot product, etc.)
  2. Normalize scores with softmax to get attention weights α
  3. Compute context vector: c_i = ∑ α_ij · h_j
Attention mechanism in encoder-decoder RNNs
Fig. 10.3 — Attention in RNN encoder-decoder: decoder attends to all encoder hidden states

Attention Score Methods

Method               Formula
Dot product          score(a, b) = a^T b
Cosine similarity    score(a, b) = (a^T b) / (||a|| ||b||)
Bilinear             score(a, b) = a^T W b
MLP / Additive       score(a, b) = v^T tanh(W [a; b])
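The four scoring methods and the weighted-sum step can be sketched as follows. This is a toy illustration: W, W2, v, the hidden states H, and the query s are random tensors standing in for learned parameters and real activations.

```python
import torch
import torch.nn.functional as F

d = 4
# Illustrative random stand-ins for learned parameters:
W = torch.randn(d, d)        # bilinear weight
W2 = torch.randn(d, 2 * d)   # MLP weight
v = torch.randn(d)           # MLP projection vector

def dot_score(a, b):
    return a @ b

def cosine_score(a, b):
    return (a @ b) / (a.norm() * b.norm())

def bilinear_score(a, b):
    return a @ W @ b

def mlp_score(a, b):
    return v @ torch.tanh(W2 @ torch.cat([a, b]))

# Attention: score a query s against each hidden state h_j,
# softmax into weights alpha, then take the weighted sum.
H = torch.randn(5, d)   # five encoder hidden states
s = torch.randn(d)      # decoder query
scores = torch.stack([dot_score(s, h) for h in H])
alpha = F.softmax(scores, dim=0)   # attention weights, sum to 1
context = alpha @ H                # c = sum_j alpha_j * h_j
```

Any of the four score functions can be swapped into the `scores` line; only the scoring changes, while the softmax and weighted sum stay the same.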

Self-Attention in Transformers

Self-attention computes attention between all positions within the same sequence. Each token looks at every other token (including itself) to gather context.

Three learned linear projections transform each input into:

  - Query (Q): what this token is looking for in other tokens
  - Key (K): what this token offers to be matched against
  - Value (V): the content this token contributes once attended to

Self-attention with Query, Key, Value matrices
Fig. 10.4 — Self-attention: each input is projected into Q, K, V via learned weight matrices

Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(Q K^T / √d_k) · V

The scale factor √d_k prevents the dot products from becoming too large (which would push softmax into saturated regions with near-zero gradients).

Scaled dot-product attention computation
Fig. 10.5 — Scaled dot-product attention: QK^T gives attention scores, scaled, softmaxed, applied to V

Multi-Head Attention

Instead of a single attention function, split Q, K, V into h parallel "heads." Each head learns different attention patterns (e.g., syntactic, semantic, positional):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W_O
where head_i = Attention(Q · W_Q^i, K · W_K^i, V · W_V^i)
Multi-head attention with parallel attention heads
Fig. 10.6 — Multi-head attention: parallel attention sub-spaces concatenated and projected
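The head-splitting above can be sketched directly with tensor reshapes. This is a simplified single-sequence version (no batching, no masking), with the projection matrices W_q, W_k, W_v, W_o given as pre-initialized random tensors rather than trained weights.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)
    return torch.softmax(scores, dim=-1) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (seq, d_model). Split d_model into h heads of size d_k each."""
    seq, d_model = X.shape
    d_k = d_model // h
    # Project, then reshape into per-head tensors of shape (h, seq, d_k)
    Q = (X @ W_q).view(seq, h, d_k).transpose(0, 1)
    K = (X @ W_k).view(seq, h, d_k).transpose(0, 1)
    V = (X @ W_v).view(seq, h, d_k).transpose(0, 1)
    heads = scaled_dot_product_attention(Q, K, V)   # batched over heads
    # Concat(head_1, ..., head_h), then project with W_o
    concat = heads.transpose(0, 1).reshape(seq, d_model)
    return concat @ W_o

X = torch.randn(10, 64)
W_q, W_k, W_v, W_o = (torch.randn(64, 64) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h=8)  # shape (10, 64)
```

Note that the per-head projections W_Q^i, W_K^i, W_V^i from the formula are realized here as slices of one big projection, which is how most implementations handle them.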

Transformer Encoder Block

Each encoder block consists of:

  1. Multi-Head Self-Attention
  2. Add & Layer Normalization (residual connection)
  3. Position-wise Feed-Forward Network: FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
  4. Add & Layer Normalization (residual connection)

This block is repeated N times (typically N=6 in the original Transformer).

Transformer encoder block architecture
Fig. 10.7 — Transformer encoder block: self-attention + FFN, each with residual connections and layer norm

Positional Encoding

Transformers have no recurrence, so they have no inherent notion of position. Positional encodings are added to the input embeddings:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Each position gets a unique encoding. The sinusoidal pattern allows the model to generalize to sequence lengths not seen during training.

Positional encoding visualization
Fig. 10.8 — Positional encoding: sinusoidal functions provide unique position signals
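A possible implementation of the sinusoidal scheme above (assumes d_model is even):

```python
import torch

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)           # (d_model/2,)
    angles = pos / (10000 ** (i / d_model))                      # broadcast
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(50, 128)
# Usage: x = token_embeddings + pe[:seq_len]
```

At position 0 every sine dimension is 0 and every cosine dimension is 1; further positions trace out sinusoids of geometrically increasing wavelength across the embedding dimensions.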

Key Insight

RNN vs. Transformer: RNNs are O(n) sequential steps; information from position 1 must pass through every intermediate position to reach position n. Transformers connect every position to every other position directly in O(1), with O(n^2) total attention computations that can all happen in parallel. This makes Transformers dramatically faster to train and better at capturing long-range dependencies.

Transformer Variants

BERT (Bidirectional Encoder Representations from Transformers)

Encoder-only transformer. Pre-trained with Masked Language Modeling (randomly mask 15% of tokens, predict them). Bidirectional — attends to both left and right context. Fine-tuned for downstream NLP tasks (classification, NER, QA).

GPT (Generative Pre-trained Transformer)

Decoder-only transformer. Uses masked (causal) self-attention — each position can only attend to earlier positions (autoregressive generation). Pre-trained with next-token prediction. Generates text left-to-right.

Vision Transformer (ViT)

Applies the transformer to images: split the image into fixed-size patches (e.g., 16×16), flatten each patch, linearly project, add positional encoding, and process as a sequence of tokens with a standard transformer encoder.
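The patch-splitting step can be sketched with a single Conv2d whose kernel size equals its stride, which performs the split + flatten + linear projection in one operation. The sizes below (224×224 input, 16×16 patches, d_model = 768) follow the common ViT-Base configuration.

```python
import torch
import torch.nn as nn

# (224 / 16)^2 = 196 patch tokens per image.
patch_size, d_model = 16, 768
patch_embed = nn.Conv2d(3, d_model,
                        kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)
x = patch_embed(img)                    # (1, 768, 14, 14)
tokens = x.flatten(2).transpose(1, 2)   # (1, 196, 768) sequence of tokens
```

After adding positional encodings (and, in the original ViT, a learnable class token), `tokens` is processed by a standard transformer encoder exactly as a text sequence would be.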

# PyTorch: Transformer Encoder
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=512,    # Embedding dimension
    nhead=8,        # Number of attention heads
    dim_feedforward=2048,
    dropout=0.1
)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Self-attention from scratch
def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, V)
◊ ◊ ◊

Graph Neural Networks

Week 11  •  Message Passing, GCN & GAT

CNNs excel on grid-structured data (images) and RNNs on sequential data (text), but many real-world problems have non-Euclidean structure: molecular graphs, social networks, 3D meshes, knowledge graphs. Graph Neural Networks extend deep learning to arbitrary graph structures, learning representations that respect the topology of the data.

Motivation

CNN for grids, RNN for sequences, GNN for graphs
Fig. 11.1 — Different data structures require different architectures: grids, sequences, graphs

Graph Definitions

A graph G = (V, E, X) consists of:

  - V: a set of nodes (vertices)
  - E: a set of edges, each connecting a pair of nodes
  - X: a feature matrix with one feature vector x_v per node

Graph definition with nodes, edges, and features
Fig. 11.2 — Graph: nodes connected by edges, each node has a feature vector

Adjacency Matrix

A square matrix A where a_ij = 1 if there is an edge between nodes i and j, 0 otherwise. For undirected graphs, A is symmetric.

Adjacency matrix representation of a graph
Fig. 11.3 — Adjacency matrix: binary representation of graph connectivity

Degree

The degree d(i) of a node is the number of edges connected to it: d(i) = ∑_j a_ij
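Both definitions in a small worked example, using a hypothetical 4-node graph:

```python
import torch

# Undirected 4-node graph with edges (0,1), (0,2), (1,2), (2,3)
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = torch.zeros(4, 4)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # undirected graph => symmetric A

degrees = A.sum(dim=1)         # d(i) = sum_j a_ij
# degrees = [2., 2., 3., 1.]: node 2 touches three edges, node 3 only one
```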

Order Invariance

Graphs are order-invariant: the same graph can be represented with n! different node orderings. A valid GNN must produce the same output regardless of how nodes are numbered.

Same graph with different node orderings
Fig. 11.4 — Order invariance: same graph, different node labelings — GNN must give same result

Key Insight

Transformers and Graphs: A transformer without positional encoding is equivalent to a fully-connected graph where every node attends to every other node with learned edge weights (attention scores). Graphs generalize this: instead of full connectivity, only neighboring nodes exchange information.

Message Passing

The core operation in GNNs. For each node, at each layer:

  1. Aggregate: Collect embeddings from all neighbor nodes
  2. Combine: Merge aggregated neighbor information with the node's own embedding
  3. Update: Apply a transformation (e.g., linear layer + activation)
h_v^(k) = COMBINE(h_v^(k-1), AGGREGATE({h_u^(k-1) : u ∈ N(v)}))
Message passing: neighbors send information to central node
Fig. 11.5 — Message passing: each node aggregates information from its neighbors
Message passing formula and computation
Fig. 11.6 — Message passing formula: aggregate + combine + update
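A minimal sketch of one message-passing layer, using mean aggregation and concatenation as the COMBINE step (one of several valid choices for each operation):

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """AGGREGATE = mean over neighbours,
    COMBINE = concatenate with the node's own embedding,
    UPDATE = linear transform + ReLU."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.update = nn.Linear(2 * in_dim, out_dim)

    def forward(self, H, A):
        # H: (num_nodes, in_dim), A: (num_nodes, num_nodes) adjacency
        deg = A.sum(dim=1, keepdim=True).clamp(min=1)
        aggregated = (A @ H) / deg            # mean of neighbour embeddings
        combined = torch.cat([H, aggregated], dim=1)
        return torch.relu(self.update(combined))

layer = MessagePassingLayer(in_dim=16, out_dim=8)
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])   # path graph 0-1-2
H = torch.randn(3, 16)
out = layer(H, A)                  # shape (3, 8)
```

Because `A @ H` sums neighbour rows and the division by degree averages them, the result is identical under any renumbering of the nodes, which is exactly the order-invariance requirement discussed below.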

Aggregation functions must be order-invariant (permutation invariant): sum, mean, and max all qualify because they give the same result for any ordering of the neighbors, whereas concatenation or an RNN over the neighbors would not.

GNN Tasks

Node classification task in GNN
Fig. 11.7 — Node classification: predict a label for each node (e.g., atom type in a molecule)
Graph classification task in GNN
Fig. 11.8 — Graph classification: predict a label for the entire graph (e.g., molecule toxicity)
Link prediction task in GNN
Fig. 11.9 — Link prediction: predict whether an edge should exist between two nodes

Graph-Level Readout (Pooling)

For graph-level tasks, aggregate all node embeddings into a single graph embedding:

h_G = READOUT({h_v^(K) | v ∈ G})

Common readout functions: sum, mean, max over all node embeddings.

Readout/pooling aggregating node embeddings into graph embedding
Fig. 11.10 — Readout: all node embeddings aggregated into a single graph-level vector
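The three common readouts in code, with a quick check of permutation invariance (H here is a random stand-in for learned node embeddings):

```python
import torch

H = torch.randn(7, 16)             # final embeddings of a 7-node graph
h_sum = H.sum(dim=0)               # sum readout
h_mean = H.mean(dim=0)             # mean readout
h_max = H.max(dim=0).values        # max readout

# Relabelling the nodes (permuting rows of H) leaves the readout unchanged.
perm = torch.randperm(7)
assert torch.allclose(H[perm].max(dim=0).values, h_max)
```

Sum preserves graph size information, mean normalizes it away, and max keeps only the strongest activation per dimension; which is best depends on the task.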

Graph Convolutional Networks (GCN)

The simplest GNN. Each layer computes:

H^(l+1) = ReLU(A · H^(l) · W^(l) + b^(l))

Limitations of naive GCN:

  - No self-loops: the diagonal of A is zero, so a node's own features from the previous layer are discarded
  - No normalization: high-degree nodes accumulate much larger sums, so activation scales depend on degree and can grow across layers

GCN formula with normalization
Fig. 11.11 — GCN with normalization: degree-normalized adjacency with self-loops

Normalized GCN

H^(l+1) = ReLU(D̂^(-1/2) · Â · D̂^(-1/2) · H^(l) · W^(l))

Where Â = A + I (adjacency with self-loops) and D̂ is the degree matrix of Â. The symmetric normalization ensures consistent scaling regardless of node degree.
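The normalized propagation rule as a function. This is a dense-matrix sketch for clarity; practical implementations such as PyTorch Geometric use sparse edge-index operations instead.

```python
import torch

def normalized_gcn_layer(A, H, W):
    """H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + torch.eye(A.size(0))          # add self-loops
    d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)   # degrees of A_hat, D^{-1/2}
    D_inv_sqrt = torch.diag(d_inv_sqrt)
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

A = torch.tensor([[0., 1.],
                  [1., 0.]])   # two connected nodes
H = torch.randn(2, 4)
W = torch.randn(4, 8)
out = normalized_gcn_layer(A, H, W)   # shape (2, 8)
```

Because self-loops guarantee every row of Â has degree at least 1, the inverse square root is always well defined.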

Graph Attention Networks (GAT)

Instead of treating all neighbors equally (GCN) or using simple mean/sum aggregation, GAT uses attention to learn a different importance weight α_ij for each neighbor:

α_ij = softmax_j( LeakyReLU( a^T [W·h_i ; W·h_j] ) )
h_i' = σ( ∑_{j ∈ N(i)} α_ij · W·h_j )

where a and W are learned parameters, [·;·] denotes concatenation, and the softmax normalizes over the neighbors of node i.

# PyTorch Geometric: GCN
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class GCN(nn.Module):
    def __init__(self, in_features, hidden_dim, num_classes):
        super().__init__()
        self.conv1 = GCNConv(in_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = torch.relu(x)
        x = self.conv2(x, edge_index)
        return x

# For graph classification, add global pooling:
# from torch_geometric.nn import global_mean_pool
# graph_embedding = global_mean_pool(x, batch)
# output = self.classifier(graph_embedding)
♦ ♦ ♦ ♦ ♦

Final Exam Review

20 Questions  •  ~100 Points Total

Define the vanishing gradient problem. Which activation functions are most susceptible to it, and how does ReLU address it?

What is the difference between a hyperparameter and a parameter in neural network training? Give two examples of each.

Explain why random search is generally preferred over grid search for hyperparameter optimization.

A convolutional layer receives an input of size 32 × 32 × 3 and applies 16 filters of size 5 × 5 with stride 1 and padding 2. (a) What is the output size? (b) How many trainable parameters does this layer have (including biases)?

What is the purpose of pooling layers in a CNN? Name two common types.

Compare and contrast AlexNet, VGGNet, and ResNet. For each, state (a) the key architectural innovation, and (b) why it was significant.

Explain the difference between feature extraction and fine-tuning in transfer learning. When would you choose one over the other?

Explain how a Variational Autoencoder (VAE) differs from a standard autoencoder. What is the reparameterization trick and why is it necessary?

What is the difference between Skip-Gram and CBOW in word2vec? Which is better for rare words and why?

Why is cosine similarity preferred over Euclidean distance for comparing word embeddings?

Draw or describe the LSTM cell architecture. Name all three gates and explain the role of each. How does the cell state address the vanishing gradient problem?

What is mode collapse in GAN training? Why does it happen and what strategies can mitigate it?

Explain the Scaled Dot-Product Self-Attention mechanism step by step. What are Q, K, and V? Why is the scaling factor √d_k used? Write the full formula.

What is Multi-Head Attention? Why is it better than single-head attention? Write the formula.

Compare BERT and GPT. How do they differ in architecture, pre-training objective, and use cases?

Why do Transformers need positional encoding? What happens if you remove it? Describe the sinusoidal encoding scheme.

Explain the message-passing framework in GNNs. What three operations does each layer perform? Why must the aggregation function be order-invariant?

Compare GCN and GAT. What problem does the naive GCN (H = ReLU(AXW)) have, and how is it fixed? How does GAT improve upon GCN?

You are building a system to classify molecules as toxic or non-toxic. Each molecule is represented as a graph where atoms are nodes and bonds are edges. (a) What type of GNN task is this? (b) Describe the full pipeline from input graph to classification output. (c) What aggregation function would you use and why? (d) How would you handle the fact that different molecules have different numbers of atoms?

You need to build a model to generate realistic face images conditioned on attributes (e.g., "male, smiling, brown hair"). (a) Compare using a VAE vs a conditional GAN for this task — list advantages and disadvantages of each. (b) Describe the architecture and training procedure for the conditional GAN approach. (c) What is one common failure mode and how would you detect it?

♦ ♦ ♦ ♦ ♦

Exam Checklist

Track your preparation

Training loop: forward, loss, backward, update
Loss functions: MSE, CE, BCE
Activation functions & vanishing gradients
Hyperparameters vs parameters
SGD, momentum, Adam optimizer
Regularization: L1, L2, dropout, early stopping
Convolution operation & output size formula
Filter types: blur, edge, blob detection
Stride, padding, multi-channel convolution
CNN architectures: LeNet, AlexNet, VGG, ResNet
Transfer learning: feature extraction vs fine-tuning
Data augmentation & saliency maps
Autoencoders: architecture, training, denoising
VAE: KL divergence, reparameterization trick
Latent space interpolation & generation
word2vec: Skip-Gram vs CBOW
GloVe & co-occurrence matrix
Cosine similarity vs Euclidean distance
RNN architecture & unrolling
LSTM: forget, input, output gates & cell state
GRU vs LSTM comparison
GAN: generator, discriminator, minmax game
DCGAN architecture & ConvTranspose2d
Training instabilities: mode collapse, non-convergence
Self-attention: Q, K, V & scaled dot product
Multi-head attention & transformer encoder block
BERT vs GPT vs ViT
Graph representation: adjacency matrix, degree
Message passing & readout (graph pooling)
GCN normalization & GAT attention