In this paper, the authors examine a collection of training procedure and model architecture refinements that improve model accuracy but barely change computational complexity. Many of them are minor “tricks” like modifying the stride size of a particular convolution layer or adjusting the learning rate schedule. Collectively, however, they make a big difference.

## Training Procedures

### Baseline Training Procedure

The preprocessing pipelines for training and validation differ. During training, the authors perform the following steps one by one:

- Randomly sample an image and decode it into 32-bit floating point raw pixel values in [0, 255].
- Randomly crop a rectangular region whose aspect ratio is randomly sampled in [3/4, 4/3] and whose area is randomly sampled in [8%, 100%] of the image, then resize the cropped region into a 224-by-224 square image.
- Flip horizontally with 0.5 probability.
- Scale hue, saturation, and brightness with coefficients uniformly drawn from [0.6, 1.4].
- Add PCA noise with a coefficient sampled from a normal distribution [katex]N(0, 0.1)[/katex].
- Normalize RGB channels by subtracting 123.68, 116.779, 103.939 and dividing by 58.393, 57.12, 57.375, respectively.
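
The geometric and normalization steps in the list above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the uniform sampling of the aspect ratio, the retry limit, and the whole-image fallback are all assumptions. The resize to 224-by-224, the color jitter, and the PCA noise are left out because they need an image library or dataset statistics that the text does not give.

```python
import numpy as np

# Per-channel RGB mean and standard deviation quoted in the list above.
RGB_MEAN = np.array([123.68, 116.779, 103.939], dtype=np.float32)
RGB_STD = np.array([58.393, 57.12, 57.375], dtype=np.float32)


def sample_crop_box(height, width, rng, max_tries=10):
    """Sample (top, left, crop_h, crop_w): area in [8%, 100%], aspect ratio in [3/4, 4/3]."""
    for _ in range(max_tries):
        area = rng.uniform(0.08, 1.0) * height * width
        ratio = rng.uniform(3 / 4, 4 / 3)  # width / height, assumed uniform
        crop_w = int(round(np.sqrt(area * ratio)))
        crop_h = int(round(np.sqrt(area / ratio)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = int(rng.integers(0, height - crop_h + 1))
            left = int(rng.integers(0, width - crop_w + 1))
            return top, left, crop_h, crop_w
    # Fallback (an assumption, not from the paper): use the whole image.
    return 0, 0, height, width


def random_flip(image, rng):
    """Flip an H x W x 3 image left-to-right with probability 0.5."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image


def normalize(image):
    """Subtract the channel means and divide by the channel standard deviations."""
    return (image.astype(np.float32) - RGB_MEAN) / RGB_STD


rng = np.random.default_rng(0)
image = rng.uniform(0, 255, size=(300, 400, 3)).astype(np.float32)  # stand-in for a decoded image
top, left, crop_h, crop_w = sample_crop_box(300, 400, rng)
patch = normalize(random_flip(image[top:top + crop_h, left:left + crop_w], rng))
```

In a full pipeline the resize to 224-by-224 would sit between the crop and the flip.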

During validation, the authors resize each image’s shorter edge to 256 pixels while keeping its aspect ratio. Next, they crop out the 224-by-224 region in the center and normalize the RGB channels as in training.
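
The geometry of the validation path can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the actual pixel interpolation is left to whatever image library is in use, so the sketch computes just the target shape and the center crop.

```python
import numpy as np


def resized_shape(height, width, shorter_edge=256):
    """Return (height, width) after scaling so the shorter edge equals `shorter_edge`."""
    scale = shorter_edge / min(height, width)
    new_h = max(shorter_edge, round(height * scale))
    new_w = max(shorter_edge, round(width * scale))
    return new_h, new_w


def center_crop(image, size=224):
    """Cut the central size x size window out of an H x W x C array."""
    height, width = image.shape[:2]
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


# A 480 x 640 image keeps its aspect ratio when the shorter edge becomes 256.
new_h, new_w = resized_shape(480, 640)
resized = np.zeros((new_h, new_w, 3), dtype=np.float32)  # stand-in for the resized pixels
cropped = center_crop(resized)
```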

The weights of both convolutional and fully-connected layers are initialized with the Xavier algorithm. In particular, each parameter is set to a random value drawn uniformly from [katex][-a, a][/katex], where [katex]a = \sqrt{6/(d_{in} + d_{out})}[/katex]. Here [katex]d_{in}[/katex] and [katex]d_{out}[/katex] are the input and output channel sizes, respectively. All biases are initialized to 0. For batch normalization layers, [katex]\gamma[/katex] vectors are initialized to 1 and [katex]\beta[/katex] vectors to 0.
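
A minimal NumPy sketch of this initialization is shown below. It follows the definition in the text literally and treats the layer as a dense matrix; the function name and the layer sizes are illustrative assumptions. Deep learning libraries typically also fold the kernel size of a convolution into the fan-in and fan-out, which this sketch does not do.

```python
import numpy as np


def xavier_uniform(d_in, d_out, rng):
    """Draw a (d_out, d_in) weight matrix uniformly from [-a, a], a = sqrt(6 / (d_in + d_out))."""
    a = np.sqrt(6.0 / (d_in + d_out))
    return rng.uniform(-a, a, size=(d_out, d_in))


rng = np.random.default_rng(0)
weights = xavier_uniform(512, 1000, rng)   # e.g. a fully-connected layer (sizes are illustrative)
bias = np.zeros(1000)                      # biases start at 0
gamma, beta = np.ones(512), np.zeros(512)  # batch-norm scale starts at 1, shift at 0
```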

Nesterov Accelerated Gradient (NAG) descent is used for training.
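
For readers unfamiliar with NAG, the sketch below shows one common formulation of the update, in which the gradient is evaluated at a look-ahead point rather than at the current weights. The function name, the learning rate, the momentum value, and the toy objective are all assumptions made for illustration; they are not settings reported in this summary.

```python
import numpy as np


def nag_step(w, v, grad_fn, lr=0.1, momentum=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead point w + momentum * v."""
    lookahead = w + momentum * v
    v_new = momentum * v - lr * grad_fn(lookahead)
    return w + v_new, v_new


# Toy problem (hypothetical): minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(200):
    w, v = nag_step(w, v, grad_fn=lambda x: x)
```

On this toy quadratic the iterates shrink toward the minimizer at the origin; plain momentum SGD would differ only in evaluating the gradient at `w` instead of at the look-ahead point.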