On Conditional Variational Autoencoders
In this blog we go through the concept of the CVAE and its applications. We will look at the probabilistic framework behind the model, its loss function, and how training and inference of a CVAE are conducted. The blog starts by revisiting the concepts of the autoencoder and the variational autoencoder, to introduce the motivation for conditional variational autoencoders. Finally, we discuss the relationship (similarities and differences) between the CVAE and the GAN.
I. Autoencoders and Variational Autoencoders Revisited
A. Autoencoders
Recall that an autoencoder (AE) is a type of unsupervised learning framework that learns a latent representation (z) of input data (x). AEs are broadly used for feature extraction (representation learning) and data compression.
The goal of an AE is to reconstruct the input with high fidelity, so the loss function is typically defined as a distance (e.g., the squared error) between the original data and its reconstruction.
During training the encoder and decoder are jointly trained: input (x) -> [encoder] -> latent variable (z) -> [decoder] -> output (x̂); during inference, typically only the encoder is used, for tasks such as feature extraction.
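The pipeline above can be sketched in a few lines. This is a deliberately minimal setup, not a practical AE: a linear encoder/decoder on synthetic data, with hand-derived gradients, just to show the encode -> decode -> reconstruction-loss loop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in R^4 lying near a 2-D subspace, so a 2-D latent
# code can reconstruct them well.
basis = rng.normal(size=(2, 4))
x = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 4))

# Hypothetical minimal setup: linear encoder/decoder (real AEs use
# nonlinear networks, but the training loop has the same shape).
W_enc = 0.1 * rng.normal(size=(4, 2))   # x -> z
W_dec = 0.1 * rng.normal(size=(2, 4))   # z -> x_hat

def reconstruction_loss(x, W_enc, W_dec):
    x_hat = (x @ W_enc) @ W_dec
    return np.mean((x - x_hat) ** 2)

initial = reconstruction_loss(x, W_enc, W_dec)
lr = 0.05
for _ in range(500):
    z = x @ W_enc                        # encode
    err = z @ W_dec - x                  # decode and compare (dL/dx_hat up to a constant)
    W_dec -= lr * (z.T @ err) / len(x)   # joint gradient step on the decoder ...
    W_enc -= lr * (x.T @ (err @ W_dec.T)) / len(x)   # ... and the encoder
final = reconstruction_loss(x, W_enc, W_dec)
print(f"loss: {initial:.4f} -> {final:.4f}")   # reconstruction error shrinks
```

With both weight matrices trained jointly against the same reconstruction error, the loss drops as the latent code learns to capture the 2-D structure of the data.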
B. Variational Autoencoders
AEs do not assume that inputs are sampled from a certain distribution; that is, they do not formulate the problem from a probabilistic perspective. This missing assumption limits the application of AEs: they cannot be used to generate new data, because we don’t know which latent variables would decode to “meaningful” data. Hence the question arises: can we improve AEs so that the decoder part can serve as a “data generator”?
The answer is yes, and the solution is the Variational Autoencoder (VAE). Unlike AEs, which assume a deterministic encoding relationship between input and latent variable, the VAE assumes the inputs follow a certain distribution \(p(x)\), and that the latent space also follows a distribution \(p(z)\). Hence the encoder and decoder networks’ tasks become learning the conditional pdfs \(p(z \vert x)\) and \(p(x \vert z)\).
Let’s begin the discussion of training and inference with a simple and classical case: assume \(p(z \vert x)\) follows a Gaussian distribution. In this case, the encoder’s task is to learn the parameters of the Gaussian: the mean \(\mu\) and covariance \(\Sigma\). Moreover, we use a prior: we assume that \(p(z)\) follows the standard normal distribution \(\mathcal{N}(0, I)\). (A question that may arise here is: is it safe to make this prior assumption? Would it limit the learning ability of the encoder network?)
During training, the encoder produces an estimate of the mean and covariance, and a latent variable is sampled from the estimated distribution. This latent variable is then fed to the decoder network. Because naive sampling is not differentiable, the sample is usually drawn via the reparameterization trick, \(z = \mu + \sigma \odot \epsilon\) with \(\epsilon \sim \mathcal{N}(0, I)\), so that gradients can flow back to the encoder.
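That sampling step can be sketched with the reparameterization trick, \(z = \mu + \sigma \odot \epsilon\): the randomness is moved into \(\epsilon\), so \(\mu\) and \(\log \sigma^2\) stay differentiable. The encoder outputs below are made-up constants, only the sampling mechanics matter here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).

    The sampling noise lives in eps alone, so gradients can flow
    through mu and log_var during backpropagation.
    """
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical encoder outputs for a batch of 10000 inputs, 2-D latent.
mu = np.array([1.0, -2.0])
log_var = np.array([0.0, np.log(0.25)])   # variances 1.0 and 0.25
z = sample_latent(np.tile(mu, (10000, 1)), np.tile(log_var, (10000, 1)), rng)
print(z.mean(axis=0), z.var(axis=0))      # ≈ mu and exp(log_var)
```

The empirical mean and variance of the samples match the encoder's estimates, confirming that the trick samples from the intended Gaussian.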
The loss function should have a regularizer term in addition to the reconstruction penalty used in AEs: we use the KL divergence between the approximate posterior \(q(z \vert x)\) and the prior over z, which measures how much information is lost when using q to represent that prior and encourages the latent codes to stay close to the Gaussian prior [2].
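For the Gaussian case above, this KL term has the standard closed form \(\frac{1}{2}\sum_i \left(\sigma_i^2 + \mu_i^2 - 1 - \log \sigma_i^2\right)\) for a diagonal Gaussian against \(\mathcal{N}(0, I)\); a quick sketch:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    the regularizer term in the VAE loss."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# The KL is zero exactly when q already equals the prior ...
print(kl_to_standard_normal(np.zeros(2), np.zeros(2)))   # → 0.0
# ... and grows as the encoder's output drifts away from N(0, I).
print(kl_to_standard_normal(np.array([1.0, -1.0]),
                            np.log(np.array([0.5, 2.0]))))   # → 1.25 (up to rounding)
```

During training this term is simply added to the reconstruction loss, pulling every encoded distribution toward the prior so that sampling from \(\mathcal{N}(0, I)\) at generation time lands in regions the decoder has seen.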
During inference (for a generator network like the VAE, we rather call this procedure “generation”), we simply sample from the prior distribution \(p(z)\) (here, \(\mathcal{N}(0, I)\)) and use the decoder to generate meaningful data x.
Introducing the probabilistic perspective is practically a huge step forward: AEs can merely be used for representation learning, while VAEs establish the paradigm of the “generator network”, meaning that deep learning models gain the ability to create new data without explicit input (the latent variable is randomly sampled from the prior during generation).
Small Meditation: Why would VAEs work?
The decoder network can be regarded as performing Bayesian inference:
\[p_{\theta}(x \vert z) = \frac{p(x)\, q_{\phi}(z \vert x)}{p_{\theta}(z)}, \qquad p_{\theta}(z) = \int p(x)\, q_{\phi}(z \vert x)\, dx,\] which is a standard posterior in Bayesian statistics. So, to some extent, it is interpretable which functions and processes the encoder/decoder networks are trying to fit. Moreover, one can argue that the VAE is implicitly learning the latent distribution \(p(z)\), because \(p(z)\) is tied to the encoder and the data distribution through the expression above.
II. Conditional Variational Autoencoders
Now that we have VAEs, what is their limitation? The problem is that a VAE cannot generate data on demand: for example, once a VAE decoder has learned to generate human-like handwritten digits, how can one guide the model to generate the handwritten version of a particular digit (say, 5)? VAEs cannot do this task: they can only generate samples, without knowing what they are generating.
Intuitively, the desired task is a supervised learning task, but unlike tasks such as classification, the roles of input and output are reversed: the desired label is Y (much of the literature calls the “label” the “condition” and denotes it C; just keep in mind that this Y is equivalent to that C, and is a random variable correlated with each X), the encoder learns its latent representation Z, and the generator network (the decoder) generates the desired data X based on Z. That is the high-level idea of how a Conditional VAE (CVAE) performs generation. Let’s look at the details.
In a CVAE, we can regard there as being two encoders: encoder I and encoder II. Encoder I tries to learn the latent representation of the label Y and data X jointly, \(p(z \vert x, y)\), which is a pdf as in the VAE; encoder II tries to learn the conditional prior \(p(z \vert y)\). The decoder learns \(p(x \vert z, y)\).
During training, the inputs X and Y are both known data fed to encoder I; for the decoder, Z is sampled from encoder I’s output \(p(z \vert x, y)\), with the KL regularizer pulling it toward the conditional prior \(p(z \vert y)\), and Y is the same as the input to the encoder. (Is the conditional prior, encoder II, predefined, hard-coded, obtained some other way, or trained simultaneously with encoder I? In Sohn et al. [3] the prior network is trained jointly, though a fixed prior \(p(z) = \mathcal{N}(0, I)\) independent of Y is a common simplification.)
During generation, we first use encoder II to sample a conditioned Z based on Y, and then use the decoder to generate X from Z and Y.
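The generation procedure can be sketched structurally. Everything below is a stand-in: the decoder is a single random linear map rather than a trained network, and the prior is simplified to \(\mathcal{N}(0, I)\) instead of a learned \(p(z \vert y)\); only the wiring of z and the one-hot condition y matters.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM, NUM_CLASSES, DATA_DIM = 8, 10, 784   # e.g. MNIST-sized output

# Stand-in decoder: a random linear map. A trained CVAE decoder would be
# a neural network, but it consumes the same concatenated [z, y] input.
W_dec = 0.1 * rng.normal(size=(LATENT_DIM + NUM_CLASSES, DATA_DIM))

def generate(label, rng, num_samples=1):
    """CVAE-style generation: sample z from the prior, condition on y.

    The prior is simplified here to N(0, I); in the general formulation
    z would instead be drawn from the learned conditional prior p(z | y).
    """
    y = np.zeros((num_samples, NUM_CLASSES))
    y[:, label] = 1.0                              # one-hot condition
    z = rng.normal(size=(num_samples, LATENT_DIM)) # sample latent variable
    return np.concatenate([z, y], axis=1) @ W_dec  # decode conditioned on y

x = generate(label=5, rng=rng, num_samples=3)
print(x.shape)   # → (3, 784)
```

The key structural point is the concatenation: the condition y enters the decoder alongside z, so the same latent sample decodes to different outputs under different labels.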
What is extra for CVAE on top of VAE?
Let’s examine what the CVAE adds to the VAE:
- The encoder is conditioned jointly on both X and Y, while the VAE encoder is conditioned only on X;
- The prior is itself a model of a distribution (a predefined network or a simple predefined rule), rather than a fixed \(\mathcal{N}(0, I)\);
- The decoder represents a pdf of X conditioned on both Z (which at generation time comes from the conditional prior, i.e., encoder II) and Y.
III. CVAE and GAN
What is “variational”?
Starting From the Calculus of Variations and the Lagrangian Method
How does Bayesian inference augment autoencoders?
Recall the simplest machine learning problem: regression with a linear model. There, formulating a maximum likelihood estimation problem is equivalent to least-squares estimation, with the loss function being the squared norm; maximum a posteriori (MAP) estimation introduces a regularizer term into the loss function. The variational AE can be understood by analogy: introducing a Bayesian model into the AE framework likewise leads to a regularizer term (the KL divergence) in the loss function. The high-level idea is that if we give a probabilistic assumption to the noise/latent variable, Bayesian inference can be used to do better fitting/generation.
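This regression analogy can be checked numerically. A sketch on synthetic data: under a Gaussian prior \(w \sim \mathcal{N}(0, \tau^2 I)\), maximizing the posterior is exactly ridge regression, i.e., least squares plus an L2 regularizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (made up for illustration).
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=50)

lam = 2.0   # noise_var / prior_var; the regularization strength

# MAP estimate under a Gaussian prior on w: maximizing the posterior
# minimizes  ||y - Xw||^2 + lam * ||w||^2  (ridge regression), whose
# closed-form solution is:
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Plain MLE / least squares (the lam = 0 case) for comparison:
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(w_mle, w_map)
# The prior acts as a regularizer, shrinking the MAP solution toward zero.
```

The prior term plays the same role as the KL divergence in the VAE loss: both are regularizers that come directly from writing the objective as a posterior rather than a bare likelihood.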
References
- [1] Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
- [2] Understanding Conditional Variational Autoencoders.
- [3] Learning Structured Output Representation using Deep Conditional Generative Models, Kihyuk Sohn, Honglak Lee, and Xinchen Yan.
- [4] Tutorial on Variational Autoencoders, Carl Doersch.
- [5] [VAE paper] Auto-Encoding Variational Bayes, Diederik P. Kingma and Max Welling.
- [6] [CVAE paper] Semi-supervised Learning with Deep Generative Models, Diederik P. Kingma et al.