In this chapter, we want to provide you with a hands-on tutorial for building a Progressive GAN (aka PGGAN or ProGAN) using TensorFlow and the newly released TensorFlow Hub (TFHub). The Progressive GAN is a cutting-edge technique, published at ICLR 2018, that has managed to generate full-HD photo-realistic images and to smoothly combine any of the previously generated images.
After reading this chapter, the reader will be able to implement all the key improvements of the progressive GAN. These four innovations are:
- progressively growing and smoothly fading in higher resolution layers,
- minibatch standard deviation,
- equalized learning rate, and
- pixel-wise feature normalization.
We will also take a brief look at latent space interpolation.

They grow up so fast
NVIDIA research recently released a paper that managed to blow many previous state-of-the-art results out of the water: Progressive Growing of GANs for Improved Quality, Stability, and Variation. This paper features four fundamental innovations on what we have seen before, so let's walk through them in order.
Progressive growing & smoothing in of higher resolution layers
In technical terms, we are going from low-resolution convolutional layers to many high-resolution ones as we train. The reason is to train the early layers first, before introducing a higher resolution, where it is harder to navigate the loss space. So we go from something simple (e.g., a 4×4 layer trained for several steps) to something more complex (e.g., a 1024×1024 layer trained for several epochs):

The problem in this scenario is that even when introducing one layer at a time (e.g., going from 4×4 to 8×8), we still introduce a massive shock to the training system. What the authors do instead is smoothly fade in those layers, as in the figure below.

So let’s load up ye olde, trusty machine learning libraries and get cracking.
[cce_python]
import numpy as np
import tensorflow as tf
from keras import backend as K
[/cce_python]
In code, the progressive smoothing in of a new layer may look something like this:
[cce_python]
def upscale_layer(layer, upscale_factor):
    '''
    Upscales layer (tensor) by upscale_factor (int),
    where the tensor is [group, height, width, channels]
    '''
    height = layer.get_shape()[1]
    width = layer.get_shape()[2]
    size = (upscale_factor * height, upscale_factor * width)
    upscaled_layer = tf.image.resize_nearest_neighbor(layer, size)
    return upscaled_layer

def smoothly_merge_last_layer(list_of_layers, alpha):
    '''
    Smoothly merges in a layer based on a threshold value alpha.
    This function assumes that all layers are already in RGB.
    This is the function for the Generator.
    :list_of_layers : items should be tensors ordered by size
    :alpha : float in (0, 1)
    '''
    # Hint!
    # If you are using pure TensorFlow rather than Keras, always remember scope
    last_fully_trained_layer = list_of_layers[-2]
    # now we have the originally trained layer
    last_layer_upscaled = upscale_layer(last_fully_trained_layer, 2)
    # this is the newly added layer, not yet fully trained
    larger_native_layer = list_of_layers[-1]
    # This makes sure we can run the merging code
    assert larger_native_layer.get_shape() == last_layer_upscaled.get_shape()
    # This code block should take advantage of broadcasting
    new_layer = (1 - alpha) * last_layer_upscaled + alpha * larger_native_layer
    return new_layer
[/cce_python]
Minibatch standard deviation
Minibatch standard deviation gives the Discriminator a measure of how varied a minibatch is, which helps combat mode collapse. The exact procedure is as follows (see the code sketch after the list):
- First, we compute the standard deviation across all the images in the batch, to get a single "image" with standard deviations for each pixel and each channel.
- Subsequently, we compute the standard deviation across all channels, to get a single feature map, or matrix, of standard deviations for each pixel.
- Finally, we compute the standard deviation across all pixels to get a single scalar value.
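Following these three steps, a minimal sketch might look like the code below. The function name `minibatch_std_scalar` is our own, and it returns only the scalar; in the paper this value is then replicated and appended to the minibatch as an extra constant feature map for the Discriminator.
[cce_python]
def minibatch_std_scalar(layer):
    '''
    Illustrative sketch of the three steps above.
    layer : tensor of shape [batch, height, width, channels]
    '''
    # 1. Standard deviation across the batch: a single "image" of
    #    per-pixel, per-channel standard deviations
    per_pixel_std = K.std(layer, axis=0)              # [height, width, channels]
    # 2. Standard deviation across the channels: a single feature map
    per_channel_std = K.std(per_pixel_std, axis=-1)   # [height, width]
    # 3. Standard deviation across all pixels: a single scalar
    scalar_std = K.std(per_channel_std)
    return scalar_std
[/cce_python]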
Equalized learning rate
Equalized learning rate is probably one of those deep learning dark-art techniques whose purpose is not immediately clear to anyone.
Furthermore, there are many nuances to equalized learning rate that require a solid understanding not only of the implementation of RMSProp or Adam (whichever optimizer is used) but also of weight initialization.
Explanation: We want to make sure that if any parameters need to take bigger steps to reach the optimum, because they tend to vary more, they can do that. The authors use a simple standard normal initialization and then scale the weights per layer at run time. Some of you may be thinking that Adam already does this: yes, Adam allows different learning rates for different parameters, but there is a catch. Adam adjusts the backpropagated gradient by the estimated standard deviation of the parameter, which ensures that the scale of that parameter is independent of the update. In other words, Adam uses different effective learning rates in different directions, but it does not always take into account the dynamic range, that is, how much a dimension or feature tends to vary over given minibatches. As some point out, this seems quite similar to weight initialization.
[cce_python]
def equalize_learning_rate(shape, gain, fan_in=None):
    '''
    This adjusts the weights of every layer by the constant from
    He's initializer so that we adjust for the variance in the
    dynamic range in different features.
    shape  : shape of tensor (layer): [kernel, kernel, height, feature_maps]
    gain   : typically sqrt(2)
    fan_in : adjustment for the number of incoming connections
             as per Xavier's / He's initialization
    '''
    # Default value is the product of all the shape dimensions minus the
    # feature maps dim: this gives us the number of incoming connections
    # per neuron
    if fan_in is None:
        fan_in = np.prod(shape[:-1])
    # This uses He's initialization constant
    std = gain / np.sqrt(fan_in)
    # Creates a constant out of the adjustment
    wscale = K.constant(std, name='wscale', dtype=np.float32)
    # Creates the weights and then uses broadcasting to apply the adjustment
    adjusted_weights = tf.get_variable('layer', shape=shape,
                                       initializer=tf.initializers.random_normal()) * wscale
    return adjusted_weights
[/cce_python]
Pixel-wise feature normalization
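As a quick usage sketch (the shape values below are illustrative, not taken from the paper's architecture), the helper above could be applied to a 3×3 convolutional kernel like this:
[cce_python]
# Hypothetical 3x3 convolution with 128 incoming and 128 outgoing feature maps
kernel_shape = (3, 3, 128, 128)
scaled_weights = equalize_learning_rate(kernel_shape, gain=np.sqrt(2))
print(scaled_weights.get_shape())   # TensorShape of (3, 3, 128, 128)
[/cce_python]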
Note that most networks so far use some form of normalization, typically either batch normalization or a virtual version of this technique.

Pixel normalization simply normalizes the activation magnitude at each layer just before the input is fed into the next layer.

The exact description is: each pixel's feature vector a_(x,y) is normalized as b_(x,y) = a_(x,y) / sqrt( (1/N) * Σ_j (a_(x,y)^j)² + ε ), where N is the number of feature maps, ε = 10⁻⁸, and a_(x,y) and b_(x,y) are the original and normalized feature vectors at pixel (x, y).

Essentially, this formula normalizes each pixel's feature vector by dividing it by the expression under the square root.
The last thing to note is that this term is applied only to the Generator, because the explosion in activation magnitudes leads to an arms race only if both networks participate.
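In code, a minimal sketch of this normalization (using the same Keras backend import as before; the function name is our own) might look like the following:
[cce_python]
def pixelwise_feat_norm(inputs, epsilon=1e-8):
    '''
    Normalizes each pixel's feature vector across the channel dimension,
    following the formula above.
    inputs  : tensor of shape [batch, height, width, channels]
    epsilon : small constant for numerical stability
    '''
    # Square root of the mean of squared activations over the feature maps,
    # plus a small epsilon
    normalization_constant = K.sqrt(K.mean(inputs ** 2, axis=-1, keepdims=True) + epsilon)
    return inputs / normalization_constant
[/cce_python]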
