Paper Daily: An Introduction to Image Synthesis with Generative Adversarial Nets

Among the many applications of GANs, image synthesis is the most well studied, and research in this area has already demonstrated the great potential of GANs for the task. In this paper, the authors provide a taxonomy of methods used in image synthesis, review different models for text-to-image synthesis and image-to-image translation, and discuss some evaluation metrics as well as possible future research directions in image synthesis with GANs.


The main goal of this paper is to provide an overview of the methods used in image synthesis with GANs and to point out the strengths and weaknesses of current methods. The paper classifies the main approaches to image synthesis into three types: direct methods, hierarchical methods, and iterative methods.

GAN Preliminaries

A Generative Adversarial Net (GAN) consists of two separate neural networks: a generator G that takes a random noise vector z and outputs synthetic data G(z), and a discriminator D that takes an input x or G(z) and outputs a probability D(x) or D(G(z)) indicating whether the input is synthetic or drawn from the true data distribution.

Both the generator and the discriminator can be arbitrary neural networks. The first GAN uses fully connected layers as its building blocks. Later, DCGAN proposed using fully convolutional neural networks, which achieve better performance, and since then convolution and transposed convolution layers have become the core components of many GAN models.

The original way to train the generator and discriminator is to form a two-player min-max game in which the generator G tries to generate realistic data to fool the discriminator, while the discriminator D tries to distinguish between real and synthetic data. The value function to be optimized is shown in Equation 1, where p_{data}(x) denotes the true data distribution and p_z(z) denotes the noise distribution.
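
Written out, and assuming the standard formulation from the original GAN paper, Equation 1 reads:

```latex
\min_G \max_D V(D, G) =
    \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
\tag{1}
```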

However, when the discriminator is trained much better than the generator, D can reject the samples from G with confidence close to 1, so the loss \log(1 - D(G(z))) saturates and G receives a near-zero gradient from which it cannot learn.
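
The saturation can be checked numerically. The sketch below is only an illustration and rests on an assumption not stated in the text: that D ends in a sigmoid, so D(G(z)) = sigmoid(a) for some pre-activation a (called `logit` here, a hypothetical name). Under that assumption the derivative of \log(1 - D(G(z))) with respect to a is -sigmoid(a), which shrinks toward zero as D becomes more confident that the sample is fake.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic function; assumed here to be the last layer of D."""
    return 1.0 / (1.0 + math.exp(-a))


def saturating_loss_grad(logit: float) -> float:
    """Derivative of log(1 - sigmoid(a)) with respect to a, which is -sigmoid(a)."""
    return -sigmoid(logit)


# More negative logits mean D is more confident that G(z) is synthetic.
for logit in (0.0, -2.0, -5.0, -10.0):
    d_of_gz = sigmoid(logit)
    loss = math.log(1.0 - d_of_gz)
    grad = saturating_loss_grad(logit)
    print(f"D(G(z))={d_of_gz:.5f}  loss={loss:.5f}  grad={grad:.5f}")
```

As the logit decreases, both the loss and its gradient approach zero, which is the sense in which G stops receiving a useful learning signal.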

Conditional GAN

In the original GAN, we have no control over what is generated, since the output depends only on random noise. However, we can add a conditional input c to the random noise z so that the generated image is defined by G(c,z). Typically, the conditional input vector c is concatenated with the noise vector z, and the resulting vector is fed into the generator just as the noise vector is in the original GAN. We can also apply other data augmentation to c and z. The conditional input c can be, for example, the class of the image, attributes of an object, or an embedding of a text description of the image we want to generate.
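
The concatenation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the one-hot class encoding, the dimensions (10 classes, 100 noise values), and the helper names `one_hot` and `generator_input` are all assumptions chosen for the example.

```python
import random
from typing import List


def one_hot(label: int, num_classes: int) -> List[float]:
    """Encode a class label as a one-hot conditional vector c."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def generator_input(label: int, num_classes: int, noise_dim: int,
                    rng: random.Random) -> List[float]:
    """Build the generator input by concatenating c with a Gaussian noise vector z."""
    c = one_hot(label, num_classes)
    z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return c + z  # list concatenation: first the condition, then the noise


rng = random.Random(0)
g_in = generator_input(label=3, num_classes=10, noise_dim=100, rng=rng)
print(len(g_in))  # 110
```

The generator itself is unchanged apart from its input width, which grows from the noise dimension to the noise dimension plus the size of c.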

GAN with Auxiliary Classifier

In order to feed in more side information and to allow for semi-supervised learning, one can add a task-specific auxiliary classifier to the discriminator, so that the model is optimized on the original task as well as the additional task. The architecture of this method is illustrated in Figure 2, where C is the auxiliary classifier. Adding auxiliary classifiers allows us to use pre-trained models (e.g. image classifiers trained on ImageNet), and experiments in AC-GAN demonstrate that this method can help generate sharper images as well as alleviate the mode collapse problem. Using auxiliary classifiers can also help in applications such as text-to-image synthesis and image-to-image translation.
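
One way to read "optimized on the original task as well as the additional task" is as a sum of two negative log-likelihood terms on the discriminator side: one for the real/synthetic decision and one for the class predicted by C. The sketch below assumes that reading, in the spirit of AC-GAN; the function names and the example probabilities are hypothetical.

```python
import math
from typing import Sequence


def source_nll(p_real: float, is_real: bool) -> float:
    """Negative log-likelihood of the real/synthetic decision made by D."""
    return -math.log(p_real if is_real else 1.0 - p_real)


def class_nll(class_probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the true class under the auxiliary classifier C."""
    return -math.log(class_probs[label])


def discriminator_loss(p_real: float, class_probs: Sequence[float],
                       label: int, is_real: bool) -> float:
    """Combined loss for one sample: original task plus auxiliary task."""
    return source_nll(p_real, is_real) + class_nll(class_probs, label)


# One real sample of class 1: D says 0.9 real, C puts 0.7 on the correct class.
loss = discriminator_loss(p_real=0.9, class_probs=[0.1, 0.7, 0.2],
                          label=1, is_real=True)
print(round(loss, 4))
```

In practice the two terms may be weighted differently; equal weighting is used here only to keep the example short.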

GAN with Encoder

Although a GAN can transform a noise vector z into a synthetic data sample G(z), it does not allow the inverse transformation. If we treat the noise distribution as a latent feature space for data samples, a GAN lacks the ability to map a data sample x to a latent feature z. To allow such a mapping, two concurrent works, BiGAN and ALI, propose adding an encoder E to the original GAN framework, as shown in Figure 3.

Let \Omega_x be the data space and \Omega_z be the latent feature space. The encoder E takes x \in \Omega_x as input and produces a feature vector E(x) \in \Omega_z as output. The discriminator D is modified to take both a data sample and a feature vector as input and to calculate P(Y|x, z), where Y = 1 indicates that the sample is real and Y = 0 means that the data is generated by G.

The objective is thus defined as:
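
Written in the notation above, and assuming the standard BiGAN/ALI formulation with D(x, z) standing for P(Y = 1 | x, z), this reads:

```latex
\min_{G, E} \max_D V(D, E, G) =
    \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x, E(x)) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z), z) \right) \right]
```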

GAN with Variational Auto-Encoder

VAE-GAN proposes combining a Variational Auto-Encoder (VAE) with a GAN to exploit the benefits of both: a GAN can generate sharp images but often misses some modes, while images produced by a VAE are blurry but show large variety. The architecture of VAE-GAN is shown in Figure 4.

The VAE part regularizes the encoder E by imposing a normal prior on the latent variable (e.g., z \sim N(0,1)), and the VAE loss term is defined as:
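
Assuming the standard VAE formulation, with q(z|x) the distribution given by the encoder E, p(x|z) the likelihood given by the decoder, and p(z) the prior (these symbols are not given in the text above), the loss reads:

```latex
\mathcal{L}_{VAE} =
  - \mathbb{E}_{z \sim q(z|x)} \left[ \log p(x|z) \right]
  + D_{KL} \left( q(z|x) \,\|\, p(z) \right)
```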

VAE-GAN also proposes expressing the reconstruction loss of the VAE in terms of the discriminator D. Let D_l(x) denote the representation of x at the l-th layer of the discriminator; a Gaussian observation model can then be defined as:
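
Following the usual VAE-GAN formulation (an assumption, since the equation itself is not given in the text), this can be written as:

```latex
p(D_l(x) \mid z) = \mathcal{N} \left( D_l(x) \mid D_l(\bar{x}), I \right)
```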

where \bar{x} \sim G(z) is a sample from the generator, and I is the identity matrix. So the new VAE loss is:
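
Replacing the pixel-wise likelihood with the observation model on discriminator features, and writing q(z|x) for the encoder distribution and p(z) for the prior (notation assumed here, following the standard VAE-GAN formulation), this reads:

```latex
\mathcal{L}_{VAE} =
  - \mathbb{E}_{z \sim q(z|x)} \left[ \log p(D_l(x) \mid z) \right]
  + D_{KL} \left( q(z|x) \,\|\, p(z) \right)
```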

which is then combined with the GAN loss defined in Equation 1. Experiments demonstrate that VAE-GAN can generate better images than a VAE or a GAN alone.
