Ryan Han

Variational Autoencoders

2026-05-23T00:00:00+00:00

Addresses how to train probabilistic latent-variable models efficiently when the posterior and the marginal likelihood are intractable, and introduces the variational autoencoder (VAE)

Background

Latent Variable $z$

A latent variable $z$ is a hidden factor that explains observed data
- “Hidden” means not explicitly given to the model as an observed variable
For an image dataset, the observed variable is the raw pixels of image $x$
However, the image may have hidden explanatory factors that determine what we see including lighting, camera angle, zoom level, background, etc.
The model compresses these hidden factors into the latent variable $z$
For example with the MNIST image dataset, if $x$ is a handwritten image of a “3”, then the latent variable $z$ might encode things like digit identity, rotation, writing style, etc.
Latent variables matter because, instead of memorizing pixels, the model learns what underlying causes could have produced the image

Prior over Latent Variables $p(z)$

The prior defines what latent vectors are considered likely before seeing the data
In VAEs, the prior is usually a standard Gaussian: $p(z) = \mathcal{N}(0, I)$
- A Gaussian prior is smooth, continuous, and easy to sample from
- Furthermore, after training, nearby latent points tend to decode into similar outputs

Decoder / Generative Model $p_\theta(x \mid z)$

This is the neural network that generates data
$\theta$ represents the parameters (weights and biases) of the model
$p_{\theta}(x \mid z)$ represents the probability of generating data $x$, given the latent vector $z$, modeled by a network with parameters $\theta$
This is a generative model because we can generate new data in two steps
- Sample a latent vector $z \sim p(z)$
- Decode the latent vector $x \sim p_{\theta}(x \mid z)$
  - The latent vector is fed into the generative model, which outputs a distribution over possible outputs from which $x$ is sampled
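
The two generation steps above can be sketched in a few lines. This is a minimal illustration rather than a real model: the decoder is a single linear layer with a sigmoid, the pixels are assumed to be binary (so $p_{\theta}(x \mid z)$ is a product of Bernoulli distributions), and the weights are random placeholders instead of trained parameters. The names `sample_prior`, `decode`, and `sample_data` are made up for this example.

```python
import math
import random

def sample_prior(latent_dim, rng):
    """Step 1: draw z ~ N(0, I), one independent standard normal per dimension."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

def decode(z, weights, biases):
    """Toy stand-in for the decoder network: one linear layer plus a sigmoid.

    Returns Bernoulli means, i.e. the parameters of p_theta(x | z) for binary pixels.
    """
    means = []
    for row, b in zip(weights, biases):
        logit = sum(w * zi for w, zi in zip(row, z)) + b
        means.append(1.0 / (1.0 + math.exp(-logit)))
    return means

def sample_data(means, rng):
    """Step 2: draw x ~ p_theta(x | z) by sampling each pixel from its Bernoulli mean."""
    return [1 if rng.random() < m else 0 for m in means]

rng = random.Random(0)
latent_dim, data_dim = 2, 4
# Hypothetical "trained" parameters; random values are used only for illustration.
weights = [[rng.uniform(-1, 1) for _ in range(latent_dim)] for _ in range(data_dim)]
biases = [0.0] * data_dim

z = sample_prior(latent_dim, rng)
x = sample_data(decode(z, weights, biases), rng)
print("z =", z)
print("x =", x)
```

Note that the decoder returns the parameters of a distribution, not the data itself; $x$ only appears after the final sampling step.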

Normal Autoencoders

Normal autoencoders only learn $x \to z \to x$ by minimizing reconstruction error
However, this does not guarantee a meaningful latent space
In a meaningful latent space, nearby latent points correspond to similar data e.g. moving in one direction rotates the face or increases the smile
For VAEs, you want to sample from a latent space $z \sim p(z)$ and then decode to generate new data
In a normal autoencoder, randomly chosen latent vectors usually decode to nonsense because the encoder only uses tiny isolated regions of the latent space, so the decoder was never trained on most of it
A regular autoencoder learns deterministic encoding and decoding but CANNOT answer how likely an image is to be generated by a model

Mental Model

A VAE can be considered as:

Encoder: compress input into a Gaussian distribution
Sample: pick a latent vector from the Gaussian distribution
Decoder: reconstruct input from the sampled latent vector
Loss: Balance reconstruction accuracy and latent space regularity

Intractability

In generative modeling, the observed data $x$ is generated by hidden latent variables $z$
Given training data, we want to find the parameter $\theta$ that maximizes the probability of our data
To compute the probability of an observed image $x$, we must consider all possible latent variables that could have produced it

$$ p_{\theta}(x) = \int p_{\theta}(x \mid z) p_{\theta}(z) \, dz $$

However, this integral is intractable because $z$ is high dimensional, $p_\theta(x \mid z)$ is parameterized by a neural network, and there is no algebraic structure to exploit for a closed-form solution
We would like to calculate the posterior $p_{\theta}(z \mid x)$ but by Bayes’ rule, $p_{\theta}(z \mid x) = \frac{p_{\theta}(x \mid z)p(z)}{p_\theta(x)}$, and the denominator is intractable
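
To make the integral concrete, here is a sketch on a toy model where the answer is known. The model is an assumption chosen for illustration, not something from the paper: $z$ is one-dimensional, $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$, which gives $p(x) = \mathcal{N}(0, 2)$ in closed form. The integral is approximated by averaging $p(x \mid z)$ over samples from the prior.

```python
import math
import random

def normal_pdf(v, mean, var):
    """Density of a one-dimensional Gaussian N(mean, var) evaluated at v."""
    return math.exp(-0.5 * (v - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_likelihood_mc(x, num_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x | z)] by averaging over prior samples."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)  (toy assumption)
        total += normal_pdf(x, z, 1.0)   # p(x | z) = N(x; z, 1)  (toy assumption)
    return total / num_samples

rng = random.Random(0)
x = 1.0
estimate = marginal_likelihood_mc(x, 100_000, rng)
exact = normal_pdf(x, 0.0, 2.0)  # closed form, available only because the model is linear-Gaussian
print(estimate, exact)
```

In one dimension the estimate is close to the exact value. With a neural-network decoder and a high-dimensional $z$ there is no closed form to compare against, and almost every prior sample contributes a negligible $p_\theta(x \mid z)$, so this naive estimator needs far too many samples to be practical.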

Variational Inference

Since we can’t calculate the true posterior $p_{\theta}(z \mid x)$, we can approximate it with a second distribution $q_{\phi}(z \mid x)$, parameterized by a neural network (the encoder)
- True posterior: $p_{\theta}(z \mid x)$ (unknown, complex)
- Approximate posterior: $q_{\phi}(z \mid x)$ (known, usually Gaussian, predicted by a neural network)

Evidence Lower Bound (ELBO)

We want to find the parameters $\theta$ which maximize the probability of our data $p_\theta(x)$

$$ p_{\theta}(x) = \int p_{\theta}(x \mid z) p_{\theta}(z) \, dz $$

$$ p_{\theta}(x) = \int p_{\theta}(x, z)dz $$

Since log is a monotonic function, maximizing $\log p_\theta(x)$ leaves the optimum unchanged while making the math and the optimization much easier

$$ \log p_{\theta}(x) = \log \int p_{\theta}(x, z)dz $$

Multiply and divide by $q_{\phi}(z \mid x)$

$$ \log p_{\theta}(x) = \log \int q_{\phi}(z \mid x) \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}dz $$

We know that for any probability density $q(z)$

$$ \mathbb{E}_{q(z)}[f(z)] = \int q(z)f(z)dz $$

Then we can rewrite our integral as an expectation

$$ \int q_{\phi}(z \mid x) \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}dz = \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] $$

$$ p_{\theta}(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] $$

$$ \log p_{\theta}(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\left[\frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] $$

$$ f(z) = \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)} $$

$$ \log p_{\theta}(x) = \log \mathbb{E}_{q_\phi(z \mid x)}[f(z)] $$

Since log is concave, we can use Jensen’s inequality

$$ \log \mathbb{E}[f(z)] \geq \mathbb{E}[\log f(z)] $$

$$ \log \mathbb{E}_{q_\phi(z \mid x)}[f(z)] \geq \mathbb{E}_{q_\phi(z \mid x)}[\log f(z)] $$

$$ \log p_{\theta}(x) \geq \mathbb{E}_{q_\phi(z \mid x)}[\log f(z)] $$

$$ \log p_{\theta}(x) \geq \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] $$

We call this the evidence lower bound (ELBO)

$$ \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z \mid x)}\right] $$

While $\log p_\theta(x)$ is intractable, $\mathcal{L}(x; \theta, \phi)$ is a lower bound that we can estimate by sampling from $q_\phi(z \mid x)$ and then optimize
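
The bound can be checked numerically on a toy model. The setup below is an illustrative assumption: $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$ in one dimension, so that $\log p(x)$ and the true posterior $\mathcal{N}(x/2, 1/2)$ are known exactly. The function `elbo_mc` is a made-up name; it estimates the expectation by averaging over samples from a Gaussian $q$.

```python
import math
import random

def log_normal_pdf(v, mean, var):
    """Log density of a one-dimensional Gaussian N(mean, var) at v."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (v - mean) ** 2 / var

def elbo_mc(x, q_mean, q_var, num_samples, rng):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z | x)] with q = N(q_mean, q_var)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        # Toy model (assumption): p(z) = N(0, 1), p(x | z) = N(z, 1)
        log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
        total += log_joint - log_normal_pdf(z, q_mean, q_var)
    return total / num_samples

rng = random.Random(0)
x = 1.0
log_px = log_normal_pdf(x, 0.0, 2.0)           # exact log p(x) for the toy model
loose = elbo_mc(x, 0.0, 1.0, 50_000, rng)      # q is a poor guess at the posterior
tight = elbo_mc(x, x / 2.0, 0.5, 1_000, rng)   # q equals the true posterior N(x/2, 1/2)
print(log_px, loose, tight)
```

With a poorly chosen $q$ the estimate sits clearly below $\log p(x)$. When $q$ matches the true posterior, the ratio inside the log equals $p(x)$ for every sample, so the bound is tight.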

KL Divergence

We have the true posterior $p_{\theta}(z \mid x)$ and the approximate posterior $q_{\phi}(z \mid x)$
By definition

$$ D_{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] $$

KL divergence is always nonnegative: $KL \geq 0$
Bayes’ rule says

$$ p_\theta(z | x) = \frac{p_\theta(x,z)}{p_\theta(x)} $$

Then

$$ KL = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{q_\phi(z \mid x)p_\theta(x)}{p_\theta(x,z)}\right] $$

$$ KL = \mathbb{E}_{q_\phi(z \mid x)} [\log q_\phi(z \mid x) + \log p_\theta(x) - \log p_\theta(x,z)] $$

Since $\log p_\theta(x)$ does not depend on $z$

$$ \mathbb{E}_{q_\phi(z \mid x)} [\log p_\theta(x)] = \log p_\theta(x) $$

$$ KL = \log p_\theta(x) + \mathbb{E}_{q_\phi(z \mid x)} [\log q_\phi(z \mid x)] - \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x,z)] $$

Rearranging

$$ \log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x,z)] - \mathbb{E}_{q_\phi(z \mid x)} [\log q_\phi(z \mid x)] + KL $$

$$ \log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z \mid x)}\right] + KL $$

$$ \log p_\theta(x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z \mid x)}\right] + D_{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) $$

The first term is exactly the ELBO!

$$ \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z \mid x)}\right] $$

$$ \log p_\theta(x) = \mathcal{L}(x; \theta, \phi) + D_{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) $$

Because the KL divergence is always nonnegative, the ELBO is automatically a lower bound

$$ \mathcal{L}(x; \theta, \phi) \leq \log p_\theta(x) $$

And the gap between the ELBO and the true log-likelihood is exactly the error in the posterior!

$$ D_{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) = 0 \to \log p_\theta(x) = \mathcal{L}(x; \theta, \phi) $$

So maximizing the ELBO does two things simultaneously:
- 1) Pushes up the data likelihood, since the ELBO is a lower bound on it
- 2) Makes the encoder approximate the true posterior, since this shrinks the KL gap
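
The decomposition $\log p_\theta(x) = \text{ELBO} + KL$ can be verified in closed form on a toy model. As an assumption for illustration, take $p(z) = \mathcal{N}(0, 1)$ and $p(x \mid z) = \mathcal{N}(z, 1)$ in one dimension, with approximate posterior $q(z \mid x) = \mathcal{N}(m, s^2)$. Then the evidence is $\mathcal{N}(x; 0, 2)$, the true posterior is $\mathcal{N}(x/2, 1/2)$, and every term is an elementary Gaussian expression. The function names are made up for this sketch.

```python
import math

def log_evidence(x):
    """log p(x) for the toy model, where p(x) = N(x; 0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x ** 2 / 4.0

def elbo(x, m, s2):
    """Closed-form ELBO for q(z | x) = N(m, s2) under the toy model."""
    expected_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((x - m) ** 2 + s2)
    expected_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m ** 2 + s2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)   # -E_q[log q]
    return expected_log_lik + expected_log_prior + entropy

def kl_to_posterior(x, m, s2):
    """KL(N(m, s2) || N(x/2, 1/2)), the gap between q and the true posterior."""
    post_mean, post_var = x / 2.0, 0.5
    return 0.5 * (math.log(post_var / s2) + (s2 + (m - post_mean) ** 2) / post_var - 1.0)

x = 1.0
for m, s2 in [(0.0, 1.0), (0.3, 0.7), (0.5, 0.5)]:
    gap = log_evidence(x) - elbo(x, m, s2)
    print(m, s2, gap, kl_to_posterior(x, m, s2))   # the two columns agree
```

The last setting uses the true posterior as $q$, so both the gap and the KL term are zero there.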

VAE Loss

$$ \log p_\theta(x) = \mathcal{L}(x; \theta, \phi) + D_{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) $$

This form is still unusable because the KL term contains the true posterior, which in turn depends on the intractable $p_\theta(x)$

$$ p_\theta(z \mid x) = \frac{p_\theta(x \mid z)p(z)}{p_\theta(x)} $$

$$ p_\theta(x) = \int p_{\theta}(x \mid z)p(z)dz $$

So we’ll expand the KL term

$$ D_{KL}(q_\phi(z \mid x) \| p_\theta(z \mid x)) = \mathbb{E}_{q_\phi} \left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)} \right] $$

$$ KL = \mathbb{E}_{q_\phi} \left[\log \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)} \right] $$

From Bayes’ rule

$$ p_\theta(z \mid x) = \frac{p_\theta(x,z)}{p_\theta(x)} $$

Substitute back into the KL equation

$$ KL = \mathbb{E}_{q_\phi} \left[\log \frac{q_\phi(z \mid x)p_\theta(x)}{p_\theta(x,z)} \right] $$

$$ KL = \mathbb{E}_{q_\phi} [\log q_\phi(z \mid x) + \log p_\theta(x) - \log p_\theta(x,z)] $$

$$ KL = \mathbb{E}_{q_\phi} [\log q_\phi(z \mid x)] - \mathbb{E}_{q_\phi} [\log p_\theta(x,z)] + \log p_\theta(x) $$

Recall from earlier

$$ \log p_\theta(x) = ELBO + KL $$

$$ \log p_\theta(x) = ELBO + \mathbb{E}_{q_\phi} [\log q_\phi(z \mid x)] - \mathbb{E}_{q_\phi} [\log p_\theta(x,z)] + \log p_\theta(x) $$

Rearrange for ELBO

$$ ELBO = \mathbb{E}_{q_\phi} [\log p_\theta(x,z)] - \mathbb{E}_{q_\phi} [\log q_\phi(z \mid x)] $$

$$ p_\theta(x,z) = p_\theta(x \mid z)p_\theta(z) $$

$$ ELBO = \mathbb{E}_{q_\phi} [\log p_\theta(x \mid z)p_\theta(z)] - \mathbb{E}_{q_\phi} [\log q_\phi(z \mid x)] $$

$$ ELBO = \mathbb{E}_{q_\phi} [\log p_\theta(x \mid z)] + \mathbb{E}_{q_\phi} [\log p_\theta(z)] - \mathbb{E}_{q_\phi} [\log q_\phi(z \mid x)] $$

$$ ELBO = \mathbb{E}_{q_\phi} [\log p_\theta(x \mid z)] + \mathbb{E}_{q_\phi} \left[\log \frac{p_\theta(z)}{q_\phi(z \mid x)}\right] $$

From the definition of KL divergence

$$ \mathbb{E}_{q_\phi} \left[\log \frac{p_\theta(z)}{q_\phi(z \mid x)}\right] = -D_{KL}(q_\phi(z \mid x) \| p(z)) $$

$$ ELBO = \mathbb{E}_{q_\phi(z \mid x)} [\log p_\theta(x \mid z)] - D_{KL}(q_\phi(z \mid x) \| p(z)) $$
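
In practice the negative of this expression is used as the training loss. The sketch below makes several assumptions that the derivation itself does not fix: the encoder outputs a diagonal Gaussian described by `mu` and `log_var`, the prior is $\mathcal{N}(0, I)$ (so the KL term has the closed form $-\frac{1}{2}\sum_j (1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$), the pixels are binary with a Bernoulli decoder, and the expectation is approximated with a single decoded sample. Function names are illustrative.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))

def bernoulli_log_likelihood(x, x_mean, eps=1e-7):
    """log p_theta(x | z) for binary pixels x given decoder means x_mean."""
    total = 0.0
    for xi, pi in zip(x, x_mean):
        pi = min(max(pi, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += xi * math.log(pi) + (1 - xi) * math.log(1.0 - pi)
    return total

def negative_elbo(x, x_mean, mu, log_var):
    """Loss to minimize: reconstruction error plus the KL regularizer."""
    return -bernoulli_log_likelihood(x, x_mean) + kl_to_standard_normal(mu, log_var)

# The encoder output matches the prior exactly, so only the reconstruction term remains.
print(negative_elbo([1, 0], [0.5, 0.5], [0.0], [0.0]))
```

The first term rewards accurate reconstructions, and the second pulls each encoded distribution toward the prior, which is what keeps the latent space regular enough to sample from.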

Reparameterization Trick

The encoder’s job is to output two vectors: a mean $\mu$ and a standard deviation $\sigma$, in essence describing a cloud of probability in latent space in which the image lives
We sample a point $z$ from that probability cloud
The decoder takes $z$ and tries to rebuild the image
To steer the reconstruction in the right direction, gradients need to be propagated backwards
However, the encoder didn’t output $z$ directly, only $\mu$ and $\sigma$
If we treat sampling $z$ from $\mathcal{N}(\mu, \sigma^2)$ as a black box operation, there is no differentiable path from $z$ back to $\mu$ and $\sigma$, which breaks the chain rule
Instead of sampling $z$ directly from $\mathcal{N}(\mu, \sigma^2)$, we express $z$ as a deterministic transformation of noise $\epsilon$:

$$ z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) $$

Now $z$ is a function of parameters $(\mu, \sigma)$ and fixed noise $\epsilon$, so we can take gradients with respect to $\mu$ and $\sigma$ while treating $\epsilon$ as a constant
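
A small sketch of the trick, with a finite-difference check standing in for backpropagation. The values of `mu` and `sigma` are arbitrary placeholders for encoder outputs, and `reparameterize` is a made-up helper name.

```python
import random

def reparameterize(mu, sigma, eps):
    """z = mu + sigma * eps, applied element-wise."""
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]

rng = random.Random(0)
mu, sigma = [0.5, -1.0], [2.0, 0.1]       # placeholder encoder outputs
eps = [rng.gauss(0.0, 1.0) for _ in mu]   # the only source of randomness
z = reparameterize(mu, sigma, eps)

# With eps held fixed, z is differentiable in mu and sigma:
# dz/dmu = 1 and dz/dsigma = eps. A finite-difference check confirms this.
h = 1e-6
dz_dmu = (reparameterize([mu[0] + h, mu[1]], sigma, eps)[0] - z[0]) / h
dz_dsigma = (reparameterize(mu, [sigma[0] + h, sigma[1]], eps)[0] - z[0]) / h
print(dz_dmu, dz_dsigma, eps[0])
```

The randomness is moved into an input that has no parameters, so the gradient can pass through the deterministic part of the computation to the encoder.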

Architecture

Input: Data point $x$
Encoder (Recognition Model): Neural network outputs parameters $\mu$ and $\log(\sigma^2)$
Latent Space: Apply reparameterization trick $z = \mu + \sigma \odot \epsilon$
Decoder (Generative Model): Neural network takes $z$ and outputs parameters to reconstruct $x$ (e.g. pixels)
Loss: Calculate ELBO and backpropagate to update weights in both encoder and decoder
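
The five steps can be traced in a toy forward pass. This is only a sketch under strong simplifications: the encoder and decoder are single linear layers instead of deep networks, the pixels are assumed binary with a Bernoulli decoder, and the backward pass is left out (in practice an automatic differentiation framework would handle it). All names and sizes are illustrative.

```python
import math
import random

def linear(weights, biases, inputs):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (placeholder initialization)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def vae_forward(x, params, rng):
    """One forward pass; returns the reconstruction means and the negative ELBO."""
    # Encoder: x -> (mu, log sigma^2)
    mu = linear(*params["enc_mu"], x)
    log_var = linear(*params["enc_log_var"], x)
    # Reparameterization: z = mu + sigma * eps, with sigma = exp(0.5 * log sigma^2)
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    # Decoder: z -> Bernoulli mean for each pixel
    x_mean = [1.0 / (1.0 + math.exp(-a)) for a in linear(*params["dec"], z)]
    # Loss: reconstruction term plus KL(q(z | x) || N(0, I))
    recon = -sum(xi * math.log(p) + (1 - xi) * math.log(1.0 - p) for xi, p in zip(x, x_mean))
    kl = -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
    return x_mean, recon + kl

rng = random.Random(0)
data_dim, latent_dim = 4, 2
params = {
    "enc_mu": init_layer(latent_dim, data_dim, rng),
    "enc_log_var": init_layer(latent_dim, data_dim, rng),
    "dec": init_layer(data_dim, latent_dim, rng),
}
x = [1, 0, 1, 1]
x_mean, loss = vae_forward(x, params, rng)
print(x_mean, loss)
```

Predicting $\log(\sigma^2)$ rather than $\sigma$ lets the encoder output any real number while the standard deviation stays positive after exponentiation.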

Residual Networks

2026-05-20T00:00:00+00:00

Deep Residual Learning for Image Recognition

Learning residual functions significantly improves quality when training deep neural networks

Overview

Deep convolutional neural networks have led to major breakthroughs in image classification, in large part due to increasing the depth of the network
However, there are two issues when simply stacking more layers
- Vanishing gradients
- Degradation
Vanishing Gradients
- Vanishing gradients refer to when the gradients of earlier layers become very close to zero, leading to updates of the weights also becoming very close to zero
- This leads to:
  - Stagnant learning in the earlier layers, which are crucial for learning fundamental features (edges, patterns, etc.)
  - Poor feature extraction and learning in the deeper layers since they build upon features learned in the early layers
- The vanishing gradient problem often arises when using certain activation functions like the sigmoid function or hyperbolic tangent (tanh) function
- For large positive or negative inputs, the sigmoid function “saturates” near 0 or 1, where its derivative is a very small positive number
- For large positive or negative inputs, the tanh function “saturates” near -1 or 1, where its derivative is a very small positive number
- During backpropagation, the local gradients are continually multiplied together, shrinking the values towards zero
- Vanishing gradients are largely solved by techniques like:
  - Rectified Linear Unit (ReLU) activation function whose derivative is 1 for all positive inputs
    - $\mathrm{ReLU}(x) = \max(0, x)$
  - Normalized initialization and intermediate normalization layers
  - Residual networks (this paper)
Degradation
- As the depth of the network increases, accuracy saturates and then degrades rapidly
- Unexpectedly, this is not caused by overfitting: adding more layers leads to higher training error, not just higher test error
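
The shrinking product of local gradients described under vanishing gradients can be illustrated with a deliberately simplified chain of scalar layers. The sketch ignores the weight terms and assumes every layer sees the same pre-activation value, so it only shows the contribution of the activation function. The function names are made up.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chained_gradient(local_grad, pre_activation, depth):
    """Product of `depth` identical local derivatives, as in a chain of scalar layers."""
    return local_grad(pre_activation) ** depth

for depth in (5, 10, 20):
    print(depth,
          chained_gradient(sigmoid_grad, 0.0, depth),   # best case for the sigmoid
          chained_gradient(relu_grad, 1.0, depth))      # stays at 1.0
```

Even at the sigmoid's most favorable input, the product falls below $10^{-11}$ after 20 layers, while the ReLU chain passes the gradient through unchanged for positive inputs.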

Residual Networks

Instead of hoping a few stacked layers directly fit a desired underlying mapping, we let these layers fit a residual mapping
- Formally, let the underlying desired mapping be $\mathcal{H}(x)$
- Then the stacked nonlinear layers fit a mapping of $\mathcal{F}(x) \coloneqq \mathcal{H}(x) - x$
- The original mapping is now recast into $\mathcal{F}(x) + x$
The hypothesis is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping
In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers
For the implementation, we use shortcut connections which simply perform an identity mapping, and their outputs are added to the outputs of the stacked layers
Identity shortcut connections add neither extra parameters nor computational complexity
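
A minimal sketch of a residual block, assuming two fully-connected layers with a ReLU in between as the stacked layers $\mathcal{F}$. Biases, normalization, and the activation after the addition are omitted to keep the shortcut itself visible, so this is not the exact block used in the paper.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, inputs):
    """Bias-free dense layer: one output per weight row."""
    return [sum(w * a for w, a in zip(row, inputs)) for row in weights]

def residual_block(x, w1, w2):
    """y = F(x) + x, where F is two linear layers with a ReLU in between."""
    residual = linear(w2, relu(linear(w1, x)))
    return [r + xi for r, xi in zip(residual, x)]   # identity shortcut: plain addition

x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block(x, zeros, zeros))  # equals x: a zero residual gives the identity mapping
```

Driving the weights to zero is enough to recover the identity, whereas a plain stack of the same layers would have to learn weights that reproduce its input through the nonlinearity.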

U-Net

2026-05-19T00:00:00+00:00

U-Net: Convolutional Networks for Biomedical Image Segmentation

The U-Net is an architecture for convolutional neural networks consisting of an encoder and decoder with skip connections

Overview

Previous architectures of convolutional networks failed to preserve high-resolution details
In a standard encoder-decoder network, data is compressed into a bottleneck where spatial information is lost in favor of high-level semantic meaning
The breakthrough with this paper was the introduction of skip connections, which skip the bottleneck by concatenating feature maps from the encoding path directly to the feature maps in the decoding path
- Preserves high-resolution information of the input image from the encoding path, so the model no longer has to guess about high-resolution details in the decoding path
- Enables fusion of features so the model can leverage both high-level and low-level information
- Improves gradient flow by propagating gradients from output layer back to earlier layers

Encoder

Repeated application of:
- Two $3 \times 3$ convolutions, each followed by ReLU
- $2 \times 2$ max pooling operation with stride 2 for downsampling
- Double the number of feature maps at each downsampling step

Decoder

Repeated application of:
- $2 \times 2$ transposed convolution with stride 2 for upsampling
- Concatenation with cropped feature maps from contracting path
- Channel reduction via 3D filters of dimensions $W \times H \times C$ (width, height, channels)
  - Act as weighted combinations of features from multiple channels
- Two $3 \times 3$ convolutions, each followed by ReLU

Breakdown

Concretely in the encoding path of the image above:
- Start with a grayscale image $572 \times 572 \times 1$
- Apply 64 filters of dimensions $3 \times 3 \times 1$ (& ReLU) $\to 570 \times 570 \times 64$
- Apply 64 filters of dimensions $3 \times 3 \times 64$ (& ReLU) $\to 568 \times 568 \times 64$
- Apply $2 \times 2$ max pooling with stride 2 $\to 284 \times 284 \times 64$
- Apply 128 filters of dimensions $3 \times 3 \times 64$ (& ReLU) $\to 282 \times 282 \times 128$
Concretely in the decoding path of the image above:
- Start at the bottleneck with a feature map of dimensions $28 \times 28 \times 1024$
- Apply 512 transposed filters of dimensions $2 \times 2 \times 1024 \to 56 \times 56 \times 512$
- Concatenate a cropped feature map of $56 \times 56 \times 512 \to 56 \times 56 \times 1024$
- Apply 512 filters of dimensions $3 \times 3 \times 1024$ (& ReLU) $\to 54 \times 54 \times 512$
- Apply 512 filters of dimensions $3 \times 3 \times 512$ (& ReLU) $\to 52 \times 52 \times 512$
- Apply 256 transposed filters of dimensions $2 \times 2 \times 512 \to 104 \times 104 \times 256$
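
The spatial sizes in the breakdown follow from three simple rules, which can be traced in code. The sketch assumes unpadded $3 \times 3$ convolutions, $2 \times 2$ pooling and up-convolution with stride 2, and four downsampling and upsampling steps as in the original architecture. Channel counts and the final $1 \times 1$ convolution are left out, and the function names are made up.

```python
def conv3x3(size):
    """Unpadded 3x3 convolution: each spatial dimension shrinks by 2."""
    return size - 2

def max_pool2x2(size):
    """2x2 max pooling with stride 2 halves the spatial size."""
    return size // 2

def up_conv2x2(size):
    """2x2 transposed convolution with stride 2 doubles the spatial size."""
    return size * 2

def unet_sizes(input_size, depth=4):
    """Trace the spatial size through `depth` down steps, the bottleneck, and `depth` up steps."""
    size, skips = input_size, []
    for _ in range(depth):
        size = conv3x3(conv3x3(size))
        skips.append(size)          # feature map that is later cropped and concatenated
        size = max_pool2x2(size)
    size = conv3x3(conv3x3(size))   # bottleneck
    bottleneck = size
    for _ in range(depth):
        size = conv3x3(conv3x3(up_conv2x2(size)))
    return skips, bottleneck, size

print(unet_sizes(572))
```

The skip feature maps (568, 280, 136, 64) are larger than the decoder maps they join (392, 200, 104, 56), which is why they are cropped before concatenation.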

Summary

Encoding path:
- Spatial dimensions decrease $\to$ loses precise locations & gains global context
- Channel dimensions increase $\to$ gains complex concept detection
Decoding path:
- Spatial dimensions increase $\to$ recovers spatial resolution
- Channel dimensions decrease $\to$ compresses abstract concepts back into pixels

Convolutional Neural Networks

2026-05-17T00:00:00+00:00

An Introduction to Convolutional Neural Networks

Convolutional neural networks (CNNs) are a type of neural network that slide filters across the input to detect patterns in images, video, audio, etc.

Advantages

Reduces the number of input nodes (less computation and training time)
Tolerates small shifts of an image (less overfitting)
Takes advantage of local spatial correlations in images

Layers

Convolutional layers: convolve a kernel with the input image
Pooling layers: downsample the spatial dimensionality
Fully-connected layers: traditional neural network architecture

Convolutional Layer

Extracts features from an input image using filters (or kernels):
- Small in spatial dimensionality, with dimensions $F \times F \times D_i$, and overlaid on top of the image
  - $D_i = 3$ for an RGB input image
  - $D_i = N$ when the input comes from a previous layer with $N$ filters
- Slid across the input, where each output value is the dot product of the filter and the input patch it covers, plus a bias term
- Feature maps are then typically fed through a ReLU function: $f(x) = \max(x,0)$
The weights and bias of the kernel are adjusted via backpropagation during training
Each cell in a feature map therefore corresponds to a group of neighboring pixels and the network will learn kernels that fire when they see a specific feature at a given spatial position
Convolutional layers can significantly reduce the complexity of a model by tuning three hyperparameters:
- Depth: Number of different kernels convolved across the input volume
- Stride: How many pixels the filter shifts over the input volume at each step of the convolution
- Padding: Extra rows and columns of zeros added to the border of the input image or feature map
  - Prevents the feature map from shrinking too rapidly (for deeper networks)
  - Preserves information from pixels on the edge
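
The sliding dot product, stride, and padding can be sketched with a naive implementation. It assumes a single input channel and a single square filter for brevity, and it applies ReLU directly to the output. The spatial output size follows the standard formula $(W - F + 2P)/S + 1$ for input size $W$, filter size $F$, padding $P$, and stride $S$. The function names are made up.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv2d(image, kernel, bias=0.0, stride=1, padding=0):
    """Naive single-channel 2D convolution (cross-correlation) followed by ReLU."""
    f = len(kernel)
    h, w = len(image), len(image[0])
    # Zero-pad the border of the input.
    padded = [[0.0] * (w + 2 * padding) for _ in range(h + 2 * padding)]
    for r in range(h):
        for c in range(w):
            padded[r + padding][c + padding] = image[r][c]
    out_h = conv_output_size(h, f, stride, padding)
    out_w = conv_output_size(w, f, stride, padding)
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = bias
            for a in range(f):          # dot product of the filter and the covered patch
                for b in range(f):
                    acc += padded[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(max(acc, 0.0))   # ReLU
        out.append(row)
    return out

# A vertical-edge filter fires only where the image steps from dark (0) to bright (1).
image = [[0, 0, 1, 1] for _ in range(4)]
edge_kernel = [[-1, 1], [-1, 1]]
print(conv2d(image, edge_kernel))
```

The same filter weights are reused at every position, which is what makes the layer tolerant to shifts and cheap in parameters compared with a fully-connected layer.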

Pooling Layer

Reduces the dimensionality of the representation by computing an aggregated value over local regions of each feature map
- In max pooling, take the largest value within each window of the feature map
The outputs of the final pooling layer are typically flattened from 2D feature maps into a single vector, concatenated across feature maps, and fed into a traditional fully-connected network
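
Max pooling and the flattening step can be sketched as follows, assuming a $2 \times 2$ window with stride 2 by default and feature maps stored as nested lists. The function names are made up.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window of a 2D feature map."""
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            row.append(max(window))
        out.append(row)
    return out

def flatten(feature_maps):
    """Concatenate several 2D feature maps into one vector for a fully-connected layer."""
    return [v for fmap in feature_maps for row in fmap for v in row]

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool2d(fmap)
print(pooled)                               # [[4, 2], [2, 8]]
print(flatten([pooled, [[1, 0], [0, 1]]]))  # [4, 2, 2, 8, 1, 0, 0, 1]
```

Each $2 \times 2$ window collapses to one value, so the spatial size halves while the strongest response in each region is kept.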