Progressing with GANs

In this chapter, we want to provide you with a hands-on tutorial for building a Progressive GAN (aka PGGAN or ProGAN) using TensorFlow and the newly released TensorFlow Hub (TFHub). The Progressive GAN is a cutting-edge technique that was published at ICLR 2018 and has managed to generate full-HD, photo-realistic images and can smoothly combine any of the previously generated images.

After reading this chapter, you will be able to implement all the key improvements of the Progressive GAN. These four innovations are:

  1. progressively growing and smoothly fading in higher resolution layers,
  2. minibatch standard deviation,
  3. equalized learning rate, and
  4. pixel-wise feature normalization.

Latent space interpolation

They grow up so fast

NVIDIA Research has recently released a paper that has managed to blow many previous state-of-the-art results out of the water: Progressive Growing of GANs for Improved Quality, Stability, and Variation. This paper features four fundamental innovations on what we have seen before, so let’s walk through them in order.

Progressive growing and smooth fading-in of higher-resolution layers

In technical terms, we go from a few low-resolution convolutional layers to many high-resolution ones as we train. The reason is to train the early layers first before introducing a higher resolution, where it is harder to navigate the loss space. So we go from something simple (e.g., 4×4 trained for several steps) to something more complex (e.g., 1024×1024 trained for several epochs).
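To make the growth concrete, here is a minimal sketch in plain Python. It assumes the resolution simply doubles at every stage, from 4×4 up to 1024×1024; the function name `resolution_schedule` is made up for illustration and is not part of any library.

```python
def resolution_schedule(start=4, end=1024):
    """Return the square resolutions visited during training.

    Assumption: the resolution doubles at every stage.
    """
    resolutions = []
    res = start
    while res <= end:
        resolutions.append(res)
        res *= 2
    return resolutions


schedule = resolution_schedule()
print(schedule)
print(len(schedule) - 1, "doublings between the first and the last stage")
```

Under that assumption, getting from 4×4 to 1024×1024 takes eight doublings, which is why adding each new resolution gently matters so much.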

The problem in this scenario is that even when we introduce just one layer at a time (e.g., going from 4×4 to 8×8), we are still delivering a massive shock to the training system. What the authors do instead is fade those layers in smoothly, as in the figure below.

So let’s load up ye olde, trusty machine learning libraries and get cracking.

import numpy as np
import tensorflow as tf
import keras as K

In code, progressive smoothing may look something like this:

def upscale_layer(layer, upscale_factor):
    '''
    Upscales layer (tensor) by the factor (int) where
    the tensor is [group, height, width, channels]
    '''
    # Static spatial dimensions as plain Python ints
    height, width = layer.get_shape().as_list()[1:3]
    size = (upscale_factor * height, upscale_factor * width)
    # TF 1.x API; in TF 2.x use tf.image.resize(layer, size, method='nearest')
    upscaled_layer = tf.image.resize_nearest_neighbor(layer, size)
    return upscaled_layer

def smoothly_merge_last_layer(list_of_layers, alpha):
    '''
    Smoothly merges in a layer based on a threshold value alpha.
    This function assumes that all layers are already in RGB.
    This is the function for the Generator.
    :list_of_layers : items should be tensors ordered by size
    :alpha : float in (0, 1)
    '''
    # Hint!
    # If you are using pure TensorFlow rather than Keras, always remember scope

    # this is the last layer that has already been fully trained
    last_fully_trained_layer = list_of_layers[-2]
    # upscale it so that it matches the resolution of the new layer
    last_layer_upscaled = upscale_layer(last_fully_trained_layer, 2)

    # this is the newly added layer not yet fully trained
    larger_native_layer = list_of_layers[-1]

    # This makes sure we can run the merging code
    assert larger_native_layer.get_shape() == last_layer_upscaled.get_shape()

    # This code block should take advantage of broadcasting
    new_layer = (1 - alpha) * last_layer_upscaled + larger_native_layer * alpha

    return new_layer
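If you want to see the blending arithmetic without building a TensorFlow graph, here is a small NumPy sketch of the same idea. The helper names `upscale_nearest` and `fade_in` are invented for this example, and the all-zeros and all-ones arrays are only stand-ins for real generator outputs.

```python
import numpy as np


def upscale_nearest(images, factor):
    """Nearest-neighbour upscaling of a [batch, height, width, channels] array."""
    return images.repeat(factor, axis=1).repeat(factor, axis=2)


def fade_in(old_images, new_images, alpha):
    """Blend the upscaled old output with the output of the newly added layer."""
    upscaled = upscale_nearest(old_images, 2)
    assert upscaled.shape == new_images.shape
    return (1.0 - alpha) * upscaled + alpha * new_images


old = np.zeros((1, 4, 4, 3), dtype=np.float32)  # stand-in for the trained 4x4 RGB output
new = np.ones((1, 8, 8, 3), dtype=np.float32)   # stand-in for the untrained 8x8 RGB output

for alpha in (0.0, 0.5, 1.0):
    merged = fade_in(old, new, alpha)
    print(alpha, merged.shape, float(merged.mean()))
```

With alpha at 0 the result is just the upscaled old output, and with alpha at 1 it is entirely the new layer, so ramping alpha up during training hands control over gradually.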

Minibatch standard deviation

The exact procedure is as follows:

  1. We first compute the standard deviation across all the images in the batch, to get a single “image” with a standard deviation for each pixel of each channel.
  2. Subsequently, we average these standard deviations across all channels, to get a single feature map (matrix) with one value per pixel.
  3. Finally, we average over all pixels to get a single scalar value, which is then replicated and appended to the layer as an extra feature map.

def minibatch_std_layer(layer, group_size=4):
    '''
    Will calculate minibatch standard deviation for a layer.
    Will do so under a pre-specified tf-scope with Keras.
    Assumes layer is a float32 data type. Else needs validation/casting.
    Note: there is a more efficient way to do this in Keras, but just for
    clarity and alignment with major implementations (for understanding)
    this was done more explicitly. Try this as an exercise.
    '''
    # Hint!
    # If you are using pure TensorFlow rather than Keras, always remember scope
    # minibatch size must be divisible by (or <=) group_size
    group_size = K.backend.minimum(group_size, tf.shape(layer)[0])

    # just getting some shape information so that we can use
    # them as shorthand as well as to ensure defaults
    s = list(K.backend.int_shape(layer))
    s[0] = tf.shape(layer)[0]

    # Reshaping so that we operate on the level of the minibatch
    # in this code we assume the layer to be:
    # [Group (G), Minibatch (M), Height (H), Width (W), Channel (C)]
    # but be careful: different implementations use the Theano-specific
    # (channels-first) order instead
    minibatch = K.backend.reshape(layer, (group_size, -1, s[1], s[2], s[3]))

    # Center the mean over the group [G, M, H, W, C]
    minibatch -= tf.reduce_mean(minibatch, axis=0, keepdims=True)
    # Calculate the variance of the group [M, H, W, C]
    minibatch = tf.reduce_mean(K.backend.square(minibatch), axis=0)
    # Calculate the standard deviation over the group [M, H, W, C]
    minibatch = K.backend.sqrt(minibatch + 1e-8)
    # Take average over feature maps and pixels [M, 1, 1, 1]
    minibatch = tf.reduce_mean(minibatch, axis=[1, 2, 3], keepdims=True)
    # Replicate over the group and pixels [N, H, W, 1]
    minibatch = K.backend.tile(minibatch, [group_size, s[1], s[2], 1])
    # Append as a new feature map along the channel axis
    return K.backend.concatenate([layer, minibatch], axis=-1)
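As a sanity check, the same steps can be sketched in NumPy and tried on two toy batches. This is only an illustration under the channels-last assumption used above; `minibatch_std_numpy` and its `eps` parameter are names made up for this example.

```python
import numpy as np


def minibatch_std_numpy(x, group_size=4, eps=1e-8):
    """Append a minibatch-stddev feature map to x of shape [N, H, W, C]."""
    n, h, w, c = x.shape
    g = min(group_size, n)                      # N must be divisible by g
    y = x.reshape(g, -1, h, w, c)               # [G, M, H, W, C]
    y = y - y.mean(axis=0, keepdims=True)       # centre each group
    y = np.sqrt((y ** 2).mean(axis=0) + eps)    # [M, H, W, C] stddev over the group
    y = y.mean(axis=(1, 2, 3), keepdims=True)   # [M, 1, 1, 1] one scalar per group
    y = np.tile(y, (g, h, w, 1))                # [N, H, W, 1] replicate the scalar
    return np.concatenate([x, y], axis=-1)      # [N, H, W, C + 1]


# Four identical images: no variation inside the batch
identical = np.ones((4, 2, 2, 3))
# Four images that alternate between all zeros and all twos
varied = identical * np.array([0.0, 2.0, 0.0, 2.0]).reshape(4, 1, 1, 1)

print(minibatch_std_numpy(identical).shape)
print(float(minibatch_std_numpy(identical)[..., -1].mean()))
print(float(minibatch_std_numpy(varied)[..., -1].mean()))
```

The extra feature map is close to 0 for the identical batch and close to 1 for the varied one, which is exactly the signal the Discriminator can use to spot a Generator that produces too little variety.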

Equalized learning rate

Equalized learning rate is one of those deep learning dark-art techniques that is probably not fully clear to anyone.

Furthermore, equalized learning rate has many nuances that require a solid understanding not only of the implementation of RMSProp or Adam (the optimizer used here) but also of weight initialization.

Explanation: We want to make sure that any parameters that need to take bigger steps to reach the optimum, because they tend to vary more, can do so. The authors use a simple standard normal initialization and then scale the weights per layer at run-time. Some of you may be thinking that Adam already does that. Yes, Adam allows learning rates to differ between parameters, but there’s a catch. Adam normalizes each gradient update by its estimated standard deviation, which makes the update independent of the scale of that parameter. So Adam has different learning rates in different directions but does not always take into account the dynamic range, that is, how much a dimension or feature tends to vary over given minibatches. As some point out, this seems quite similar to weight initialization.

def equalize_learning_rate(shape, gain, fan_in=None):
    '''
    This adjusts the weights of every layer by the constant from He's initializer so
    that we adjust for the variance in the dynamic range in different features
    shape  : shape of tensor (layer): [kernel, kernel, input_feature_maps, feature_maps]
    gain   : typically sqrt(2)
    fan_in : adjustment for the number of incoming connections as per Xavier's / He's initialization
    '''
    # Default value is the product of all the shape dimensions except the feature maps dim;
    # this gives us the number of incoming connections per neuron
    if fan_in is None:
        fan_in = np.prod(shape[:-1])
    # This uses He's initialization constant
    std = gain / np.sqrt(fan_in)
    # creates a constant out of the adjustment
    wscale = K.backend.constant(std, name='wscale', dtype='float32')
    # creates standard-normal weights (TF 1.x API; tf.compat.v1.get_variable in TF 2.x)
    # and then uses broadcasting to apply the adjustment
    adjusted_weights = tf.get_variable('layer', shape=shape,
            initializer=tf.initializers.random_normal()) * wscale
    return adjusted_weights
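To get a feel for what the run-time scaling does, here is a rough NumPy sketch. The helper `he_scale` and the two example kernel shapes are assumptions made for illustration; only the constant itself, gain divided by the square root of fan_in, comes from the function above.

```python
import numpy as np


def he_scale(shape, gain=np.sqrt(2.0), fan_in=None):
    """Per-layer constant gain / sqrt(fan_in) from He's initializer."""
    if fan_in is None:
        fan_in = np.prod(shape[:-1])
    return gain / np.sqrt(fan_in)


rng = np.random.default_rng(0)
for shape in [(3, 3, 16, 32), (3, 3, 128, 128)]:
    stored = rng.standard_normal(shape)      # parameters as stored: N(0, 1) in every layer
    effective = stored * he_scale(shape)     # weights actually used in the forward pass
    print(shape, round(float(stored.std()), 2), float(effective.std()))
```

The stored parameters have roughly unit standard deviation in both layers, so the optimizer sees the same dynamic range everywhere, while the weights used in the forward pass still end up at the layer-specific scale that He's initialization would have given them.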

Pixel-wise feature normalization

Note that most networks so far use some form of normalization, typically either batch normalization or a virtual version of this technique.

Pixel normalization simply takes the activation magnitude at each layer and normalizes it just before the result is fed into the next layer.

The exact description is:
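Written out so that it matches the code at the end of this section (a reconstruction, with symbol names assumed from the Progressive GAN paper): a_{x,y} is the vector of activations at pixel (x, y), b_{x,y} is its normalized counterpart, N is the number of feature maps, and ε = 10^-8 keeps the division stable.

```latex
b_{x,y} = \frac{a_{x,y}}{\sqrt{\frac{1}{N} \sum_{j=0}^{N-1} \left( a_{x,y}^{j} \right)^{2} + \epsilon}}
```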

Essentially, this formula normalizes each activation, that is, divides it by the expression under the square root.

The last thing to note is that this term is only applied to the Generator as the explosion in the activation magnitudes only leads to an arms race if both networks participate.

def pixelwise_feat_norm(inputs, **kwargs):
    '''
    Uses pixelwise feature normalization as proposed by Krizhevsky et al. 2012.
    Returns the input normalized
    inputs : Keras / TF Layers
    '''
    # Root mean square over the feature maps (last axis) at each pixel
    normalization_constant = K.backend.sqrt(K.backend.mean(inputs**2, axis=-1, keepdims=True) + 1.0e-8)
    return inputs / normalization_constant
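A quick way to convince yourself that this keeps magnitudes in check is to try the same computation in NumPy on deliberately oversized activations. This is just a sketch: `pixelwise_feat_norm_numpy` is a made-up name, and the random input stands in for real Generator activations in channels-last layout.

```python
import numpy as np


def pixelwise_feat_norm_numpy(x, eps=1e-8):
    """Normalize each pixel's feature vector in x of shape [N, H, W, C]."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)


rng = np.random.default_rng(0)
activations = 100.0 * rng.standard_normal((2, 4, 4, 16))  # deliberately large magnitudes
normalized = pixelwise_feat_norm_numpy(activations)

# Mean squared activation per pixel, before and after normalization
print(float(np.mean(activations ** 2, axis=-1).mean()))
print(np.mean(normalized ** 2, axis=-1).round(3))
```

However large the incoming activations are, the mean squared activation at every pixel comes out at about 1 after normalization, so magnitudes cannot spiral out of control from layer to layer.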

Reference

TF-Hub generative image module
