
- All of these generative models ultimately derive from Maximum Likelihood, at least implicitly.
- The variational autoencoder sits in the “explicit” part of the tree. Remember that with the VAE we had a clear loss function (the reconstruction loss)? With GANs we do not have one anymore. Instead, we now have two competing loss functions, which we will cover in a lot more depth later. As a result, the system does not have a single analytical solution.
The key idea is that we are moving away from explicit and tractable density estimation into the territory of implicit, though still direct, approaches to training.
Evaluation
The two most commonly used and accepted metrics for statistically evaluating the quality of generated samples are the Inception Score (IS) and the Frechet Inception Distance (FID). The advantage of these two scores is that they have been extensively validated as correlating with at least some desirable property, such as the visual appeal or “realism” of an image. The Inception Score was designed solely around the idea that the samples should be recognisable, but it has also been shown to correlate with human intuition about what constitutes a real image, as validated by workers on Amazon Mechanical Turk.
Inception Score
- The generated samples should recognisably look like *some* real thing, e.g. buckets or cows. This means the samples look real and that we are able to generate samples of items from our dataset.
- The generated samples should be varied and should ideally contain all of the classes that were represented in the original dataset.
Although we might have further requirements of our generative model, this is a good start. The paper that first introduced the Inception Score (IS) extensively validated the metric and confirmed that it indeed correlates with human perception of what constitutes a high-quality sample. The metric has since become very popular in the GAN research community.
Technically, computing the IS involves an exponentiated Kullback-Leibler (KL) divergence: for each generated sample, we take the KL divergence between the conditional label distribution the Inception network assigns to that sample and the marginal label distribution over all generated samples, then average and exponentiate. More broadly, the KL divergence and the Jensen-Shannon divergence are generally regarded as what GANs are ultimately trying to minimize. Both are measures of how different two distributions are in a high-dimensional space, and there are some neat proofs connecting these divergences to the min-max formulation of the GAN.
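As a minimal sketch, assuming we already have the softmax class probabilities from a pretrained classifier (classically the Inception network) for each generated sample, the score could be computed along these lines; the function name and arguments are illustrative:

```python
import numpy as np

def inception_score(preds, eps=1e-16):
    """Inception Score from class probabilities.

    preds: array of shape (n_samples, n_classes), the softmax outputs
    of a pretrained classifier for each generated sample.
    """
    # Marginal label distribution p(y), averaged over all samples.
    p_y = np.mean(preds, axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for every individual sample x.
    kl = np.sum(preds * (np.log(preds + eps) - np.log(p_y + eps)), axis=1)
    # Exponentiate the average KL divergence to get the score.
    return float(np.exp(np.mean(kl)))
```

Note how the score rewards exactly the two properties listed above: confident per-sample predictions (sharp p(y|x)) and a spread-out marginal p(y) both increase the average KL term.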
Frechet Inception Distance
The next problem to solve is the lack of variety in the examples. Frequently, GANs only learn a handful of images for each class. The FID improves on the Inception Score by being more robust to noise and by allowing us to detect intra-class sample omissions. This is important because, if we accept the IS's baseline, only ever producing one type of cat technically satisfies the cat-being-generated-sometimes requirement, but it does not actually do what we want, e.g. if we had multiple breeds of cats represented. Furthermore, we want the GAN to output samples that present a cat from more than one angle and, in general, images that are quite distinct.
The mathematics of the FID are again quite complex, but the intuition is that we are looking for a generated distribution of samples that minimizes the amount of modification we have to apply to the generated distribution to make it look like the distribution of the true data.
The FID is calculated by running a number of images through the Inception network (“embedding them”). In practice this means we compare intermediate representations: we summarize each set of embeddings, the real and the generated one, by its mean and covariance, and then measure how far apart these two distributions are.
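A rough sketch of that computation, assuming the embeddings (e.g. the 2048-dimensional pool3 activations of the Inception network) have already been extracted; the function name is again illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between two sets of Inception embeddings.

    feats_real, feats_gen: arrays of shape (n_samples, n_features).
    """
    # Fit a Gaussian to each set of embeddings: mean and covariance.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the product of covariances; discard any
    # small imaginary component caused by numerical error.
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Lower is better here: a FID of zero would mean the two Gaussians fitted to the embeddings are identical in both mean and covariance.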
To abstract away from images: if we have a domain with well-understood classifiers, we can use their predictions as a measure of whether a particular sample looks realistic. To summarise, the FID is just a way of abstracting away from a human evaluator, allowing us to reason statistically, in terms of distributions.