Variational Autoencoder
The Variational Autoencoder (VAE) is a generative model that combines the autoencoder with variational inference. It was one of the first deep learning models capable of generating high-quality new data samples.
1. Autoencoder
An autoencoder is a type of neural network that is trained to reconstruct its input. It consists of two main components: an encoder that compresses the input into a lower-dimensional representation in \(\mathbb{R}^d\), and a decoder that reconstructs the original input from this representation. The goal of training an autoencoder is to minimize the difference between the input and the reconstructed output.
It is important to note that traditional autoencoders are mainly used for dimensionality reduction and denoising. While they can learn meaningful latent representations of input data, they are not designed to generate high-quality new data samples by simply sampling a point from the \(\mathbb{R}^d\) latent space and passing it through the decoder.
The training loss of the AE is as follows:
\[\mathcal{L}_{AE} = \mathbb{E}_{x \sim p(x)} \left[ \| x - \theta(\phi(x)) \|^2 \right],\]where \(\phi\) and \(\theta\) denote the encoder and decoder networks, respectively.
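As a concrete illustration, here is a minimal sketch of such an autoencoder in PyTorch. The flattened 784-dimensional input (e.g., MNIST-like images), the hidden width, and the latent dimension \(d = 32\) are illustrative assumptions rather than part of the formulation above.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):  # sizes are illustrative assumptions
        super().__init__()
        # Encoder phi: compresses x into a d-dimensional latent code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder theta: reconstructs x from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Reconstruction loss L_AE = E[ ||x - theta(phi(x))||^2 ], on a dummy batch.
model = AutoEncoder()
x = torch.randn(16, 784)
loss = ((x - model(x)) ** 2).sum(dim=1).mean()
```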
2. Variational Autoencoder
Variational Autoencoder (VAE) is an extension of the traditional autoencoder that incorporates variational inference. The key idea behind VAE is to constrain the latent space to follow a specific distribution, typically a multivariate Gaussian distribution. This allows for better sampling and generation of new data points.
The training loss of the VAE is as follows:
\[\mathcal{L}_{VAE} = - [\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))],\]where \(q_{\phi}(z|x)\) is the encoder, \(p_\theta(x|z)\) represents the decoder, and \(p(z)\) is the prior distribution (usually a standard Gaussian).
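Written out in code, this loss is simply the sum of a reconstruction term and a KL term. The sketch below is one common instantiation, assuming a Bernoulli decoder (so the reconstruction term becomes a binary cross-entropy) and the closed-form KL divergence for Gaussians derived in Section 2.1 below; \(\mu\) and \(\log \sigma^2\) are the encoder outputs.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, logvar):
    """Negative ELBO: reconstruction term + KL(q_phi(z|x) || N(0, I)).

    Assumes a Bernoulli decoder, so -E[log p_theta(x|z)] is a binary
    cross-entropy; mu and logvar are the encoder outputs for the batch.
    """
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL between N(mu, sigma^2) and N(0, I), summed over
    # latent dimensions (see the derivation in Section 2.1).
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return recon + kl
```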
2.1 Understanding VAE (Take 1, Intuition)
Since the traditional autoencoder does not impose any constraints on the latent space, if we want to sample a vector from the latent \(\mathbb{R}^d\) space, the vector will be arbitrary and may not correspond to any meaningful data point. This can lead to poor generalization and unrealistic samples when generating new data. In other words, an autoencoder is primarily a compression model rather than a generative model.
So, the key innovation of VAE is to introduce a probabilistic constraint into the latent space representation. It allows us to sample from the latent space in a meaningful way, ensuring that the generated samples are more realistic and diverse.
To implement this idea, a VAE must address two main challenges:
1) It should parameterize the latent space distribution directly, with the parameters (mean and variance) learned from the data.
2) It should constrain the learned latent distribution to be close to the prior distribution (usually a standard Gaussian).
For the first challenge, VAE uses a neural network (the encoder) to learn the parameters of the latent space distribution. Specifically, for each input data point \(x\), the encoder outputs the mean \(\mu_\phi(x)\) and variance \(\sigma^2_\phi(x)\) of the latent variable \(z \sim \mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))\).
This allows us to sample \(z\) from the learned distribution \(q_\phi(z|x) = \mathcal{N}(z; \mu_\phi(x), \sigma^2_\phi(x))\). Note that, because the sampling operation is not differentiable, the VAE uses the well-known reparameterization trick to enable backpropagation through the sampling process: it samples from a standard Gaussian distribution and then transforms the sample using the learned mean and standard deviation:
\[z = \mu_\phi(x) + \sigma_\phi(x) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).\]With this parameterization, \(\frac{\partial{\mathcal{L}_{VAE}}}{\partial{\phi}} = \frac{\partial{\mathcal{L}_{VAE}}}{\partial{z}} \cdot \frac{\partial{z}}{\partial{\phi}}\), and both factors are analytically tractable, allowing for efficient gradient computation.
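A minimal sketch of the encoder and the reparameterization step in PyTorch is shown below. Predicting \(\log \sigma^2_\phi(x)\) instead of \(\sigma^2_\phi(x)\) is a common numerical convenience, and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):  # illustrative sizes
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.fc_mu = nn.Linear(256, latent_dim)      # mu_phi(x)
        self.fc_logvar = nn.Linear(256, latent_dim)  # log sigma^2_phi(x)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mu(h), self.fc_logvar(h)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in eps,
    # so gradients flow through mu and sigma during backpropagation.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps
```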
For the second challenge, VAE uses the Kullback-Leibler (KL) divergence to measure the difference between the learned latent space distribution \(q_\phi(z|x)\) and the prior distribution \(p(z)=\mathcal{N}(0, I)\). The KL divergence term in the VAE loss function encourages the learned distribution to be close to the prior, ensuring that the latent space is well-structured and allows for meaningful sampling. The KL divergence can be computed analytically for Gaussian distributions, leading to a closed-form expression:
\[\begin{aligned} D_{KL}(q_\phi(z|x) || p(z)) &= \int_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p(z)} dz \\ &= \int_z p_{\mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))}(z) \cdot \log \frac{\frac{1}{\sqrt{2\pi \sigma^2_\phi(x)}} \exp\left(-\frac{(z - \mu_\phi(x))^2}{2\sigma^2_\phi(x)}\right)}{\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right)} dz \\ & = \int_z p_{\mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))}(z) \cdot \frac{1}{2} \cdot \left[ -\log \sigma^2_\phi(x) - \frac{(z - \mu_\phi(x))^2}{\sigma^2_\phi(x)} + z^2 \right] dz \\ &= \frac{1}{2} \cdot \left[ -\log \sigma^2_\phi(x) \cdot \int_z p_{\mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))}(z) dz + \int_z p_{\mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))}(z) z^2 dz - \int_z p_{\mathcal{N}(\mu_\phi(x), \sigma^2_\phi(x))}(z) \frac{(z - \mu_\phi(x))^2}{\sigma^2_\phi(x)} dz \right] \\ &= \frac{1}{2} \cdot \left( -\log{\sigma^2_\phi(x)} + \mathbb{E}_{z \sim q_\phi(z|x)} [z^2] - \mathbb{E}_{z \sim q_\phi(z|x)}[\frac{(z - \mu_\phi(x))^2}{\sigma^2_\phi(x)}] \right)\\ &= \frac{1}{2} \cdot \left( -\log{\sigma^2_\phi(x)} + (\sigma^2_\phi(x) + \mu_\phi^2(x)) - 1 \right ) \end{aligned}\]In summary, compared to the reconstruction loss of a standard autoencoder, the VAE loss introduces an additional regularization term. This regularization encourages the learned latent space distribution to align with the chosen prior (typically a standard Gaussian).
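The closed-form expression is easy to sanity-check numerically. The snippet below is a sketch that assumes PyTorch's torch.distributions and sums over the latent dimensions, since the derivation above is written per dimension.

```python
import torch
from torch.distributions import Normal, kl_divergence

mu = torch.randn(32)      # example mu_phi(x) for a single input
logvar = torch.randn(32)  # example log sigma^2_phi(x)
std = torch.exp(0.5 * logvar)

# Closed-form KL from the derivation, summed over the d latent dimensions.
kl_closed_form = 0.5 * torch.sum(-logvar + logvar.exp() + mu.pow(2) - 1.0)

# The same quantity computed with torch.distributions.
kl_reference = kl_divergence(Normal(mu, std), Normal(0.0, 1.0)).sum()

assert torch.allclose(kl_closed_form, kl_reference, atol=1e-5)
```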
The first term in the VAE loss function acts as a reconstruction loss—if we assume \(p(x|z)\) is a Gaussian with mean \(\theta(z)\) and fixed variance, maximizing the likelihood is equivalent to minimizing the reconstruction error, just like in a regular autoencoder. The second term, the KL divergence, serves as a regularizer that shapes the latent space, ensuring that samples drawn from the prior distribution can be decoded into realistic data. This combination enables VAEs to generate new, meaningful samples rather than simply compressing and reconstructing input data.
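Once trained, generation is exactly the sampling procedure that this regularization makes possible: draw \(z\) from the prior and pass it through the decoder. A minimal sketch, assuming a decoder with a sigmoid output (e.g., pixel intensities in \([0, 1]\)) and the illustrative dimensions used earlier:

```python
import torch
import torch.nn as nn

# Illustrative decoder p_theta(x|z); the architecture is an assumption.
decoder = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

# Sample from the prior p(z) = N(0, I) and decode into data space.
with torch.no_grad():
    z = torch.randn(8, 32)  # 8 latent samples drawn from the prior
    x_new = decoder(z)      # 8 generated samples
```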
2.2 Understanding VAE (Take 2, ELBO)
From the generative perspective, a generative model aims to maximize the log-likelihood \(\log p_\theta(x)\) of the observed data.
By introducing a variational distribution \(q_\phi(z|x)\), we can derive the ELBO (Evidence Lower Bound), a tractable lower bound on this log-likelihood:
\[\begin{aligned} \log p_\theta(x) &= \mathbb{E}_{q_\phi(z|x)} \left[\log (\frac{p_\theta(x,z)}{p_\theta(z|x)})\right] \\ & = \mathbb{E}_{q_\phi(z|x)} \left[\log (\frac{p_\theta(x,z)}{q_\phi(z|x)} \cdot \frac{q_\phi(z|x)}{p_\theta(z|x)})\right] \\ & = \mathbb{E}_{q_\phi(z|x)} \left[\log (\frac{p_\theta(x,z)}{q_\phi(z|x)})\right] + \mathbb{E}_{q_\phi(z|x)} \left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\ &= \underbrace{\mathbb{E}_{q_\phi(z|x)} \left[\log (\frac{p_\theta(x,z)}{q_\phi(z|x)})\right]}_{\text{ELBO}} + \underbrace{D_{KL}(q_\phi(z|x) || p_\theta(z|x))}_{\text{KL Divergence}}\\ &\ge \text{ELBO}, \end{aligned}\]where the ELBO can be further expressed as:
\[\begin{aligned} \text{ELBO} &= \mathbb{E}_{q_\phi(z|x)} \left[\log (\frac{p_\theta(x,z)}{q_\phi(z|x)})\right] \\ &= \mathbb{E}_{q_\phi(z|x)} \left[\log (\frac{p_\theta(x|z)\cdot p(z)}{q_\phi(z|x)})\right] \\ &= \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right ] - \mathbb{E}_{q_\phi(z|x)} \left[\log (\frac{q_\phi(z|x)}{p(z)})\right]\\ &= \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right ] - D_{KL}(q_\phi(z|x) || p(z)). \end{aligned}\]We can see that maximizing the ELBO is equivalent to minimizing the VAE loss:
\[\mathcal{L}_{VAE} = -\text{ELBO} = -\mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right ] + D_{KL}(q_\phi(z|x) || p(z)).\]Note that, unlike the EM algorithm, which alternates between tightening the bound and maximizing it, the VAE simultaneously learns to approximate the posterior and maximize the likelihood through a single, joint optimization process.
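Concretely, this joint optimization is a single gradient step on the negative ELBO with respect to both \(\phi\) and \(\theta\) at once. The sketch below assumes the Encoder, decoder, reparameterize, and vae_loss pieces defined in the earlier snippets, and data_loader is a placeholder for a real data loader.

```python
import torch

encoder = Encoder()  # q_phi(z|x), defined in the earlier sketch
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

for x in data_loader:                        # placeholder data loader
    mu, logvar = encoder(x)                  # amortized posterior parameters
    z = reparameterize(mu, logvar)           # differentiable sample from q_phi(z|x)
    x_recon = decoder(z)                     # p_theta(x|z)
    loss = vae_loss(x, x_recon, mu, logvar)  # negative ELBO
    optimizer.zero_grad()
    loss.backward()                          # gradients w.r.t. both phi and theta
    optimizer.step()                         # one joint update; no E-step/M-step alternation
```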