Fast R-CNN

Complexity arises because detection requires the accurate localization of objects, creating two primary challenges.

  • First, numerous candidate object locations (often called “proposals”) must be processed.
  • Second, these candidates provide only rough localization that must be refined to achieve precise localization.

Solutions to these problems often compromise speed, accuracy, or simplicity.

The paper proposes a single-stage training algorithm that jointly learns to classify object proposals and refine their spatial locations.


  • At runtime, the detection network processes an image in 0.3s (excluding object proposal time).
  • It achieves top accuracy on PASCAL VOC 2012 with a mAP of 66% (vs. 62% for R-CNN).

Related works


The Region-based Convolutional Network method (R-CNN) achieves excellent object detection accuracy by using a deep ConvNet to classify object proposals. R-CNN, however, has notable drawbacks:

  1. Training is a multi-stage pipeline.
    • R-CNN first fine-tunes a ConvNet on object proposals using log loss.
    • Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax classifier learned by fine-tuning.
    • In the third training stage, bounding-box regressors are learned.
  2. Training is expensive in space and time.
    • For SVM and bounding-box regressor training, features are extracted from each object proposal in each image and written to disk.
  3. Object detection is slow.
    • At test time, features are extracted from each object proposal in each test image.

R-CNN is slow because it performs a ConvNet forward pass for each object proposal, without sharing computation.

Spatial pyramid pooling networks (SPPnet)

Spatial pyramid pooling networks (SPPnets) were proposed to speed up R-CNN by sharing computation.


  • The SPPnet method computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map.
  • Features are extracted for a proposal by max-pooling the portion of the feature map inside the proposal into a fixed-size output (e.g., 6 × 6).
  • Multiple output sizes are pooled and then concatenated, as in spatial pyramid pooling.
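
The pooling-and-concatenation step above can be sketched in plain Python for a single feature-map channel. This is a minimal illustration under stated assumptions, not SPPnet's actual code: the function names, the nested-list layout, the proposal format (r, c, h, w) in feature-map cells, the default pyramid levels, and the floor/ceil rule for bin boundaries are all chosen here for the example.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def pool_level(fmap, roi, n):
    """Max-pool the proposal region of a 2-D list into n * n values (row-major).

    roi is (r, c, h, w): top-left corner plus height and width, assumed to be
    given in feature-map cells and to lie inside fmap with h, w >= 1.
    """
    r, c, h, w = roi
    out = []
    for i in range(n):
        # Floor for the start and ceil for the end, so no bin is ever empty.
        rows = range(r + (i * h) // n, r + ceil_div((i + 1) * h, n))
        for j in range(n):
            cols = range(c + (j * w) // n, c + ceil_div((j + 1) * w, n))
            out.append(max(fmap[y][x] for y in rows for x in cols))
    return out


def spatial_pyramid_pool(fmap, roi, levels=(1, 2, 3, 6)):
    """Concatenate the pooled outputs of every pyramid level into one vector."""
    vec = []
    for n in levels:  # each level n yields n * n values
        vec.extend(pool_level(fmap, roi, n))
    return vec
```

Whatever the size of the proposal, the output length depends only on the chosen levels (here 1 + 4 + 9 + 36 values per channel), which is what lets proposals of different sizes feed the same fully connected layers.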


  • SPPnet accelerates R-CNN by 10 to 100 times at test time.
  • Training time is also reduced by 3 times due to faster proposal feature extraction.


  • Like R-CNN, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors.
  • Features are also written to disk.
  • But unlike R-CNN, the fine-tuning algorithm proposed for SPPnet cannot update the convolutional layers that precede the spatial pyramid pooling.
    • Unsurprisingly, this limitation (fixed convolutional layers) limits the accuracy of very deep networks.


Advantages of Fast R-CNN

  1. Higher detection quality (mAP) than R-CNN and SPPnet
  2. Training is single-stage, using a multi-task loss
  3. Training can update all network layers
  4. No disk storage is required for feature caching

Fast R-CNN architecture and training

  • A Fast R-CNN network takes as input an entire image and a set of object proposals.
  • The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map.
  • Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
  • Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:
    • one that produces softmax probability estimates over K object classes plus a catch-all “background” class, and
    • another layer that outputs four real-valued numbers for each of the K object classes.
    • Each set of 4 values encodes refined bounding-box positions for one of the K classes.
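
To make the shapes of the two sibling outputs concrete, here is a small sketch in plain Python. It only post-processes raw output values; the fully connected layers that produce them are omitted. The function names and the convention that index 0 is the background class are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(cls_scores, bbox_values, num_classes):
    """Interpret the two sibling outputs for one RoI.

    cls_scores:  K + 1 raw scores (index 0 = background, an assumed ordering).
    bbox_values: 4 * K real numbers, four per object class.
    """
    if len(cls_scores) != num_classes + 1:
        raise ValueError("expected K + 1 class scores")
    if len(bbox_values) != 4 * num_classes:
        raise ValueError("expected 4 * K box values")
    probs = softmax(cls_scores)
    # Group the flat list into one 4-value box refinement per class.
    boxes = [bbox_values[4 * k:4 * k + 4] for k in range(num_classes)]
    return probs, boxes
```

For a dataset with K = 20 object classes, for instance, the classification branch would emit 21 probabilities and the regression branch 80 numbers per RoI.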

The RoI pooling layer

  • The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI.
  • In this paper, an RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w).
  • RoI max pooling works by dividing the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell.
  • Pooling is applied independently to each feature map channel, as in standard max pooling.
    • The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level.
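
The RoI pooling procedure above can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: the function name `roi_max_pool`, the nested-list layout (channels, rows, columns), and the floor/ceil rule used to place sub-window boundaries are choices made here, and the RoI is assumed to be already expressed in feature-map cells.

```python
def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool every channel of a feature map inside one RoI to out_h x out_w.

    feature_map: list of channels, each a 2-D list indexed [row][col].
    roi: (r, c, h, w), top-left corner plus height and width, assumed to lie
         inside the feature map with h, w >= 1.
    """
    r, c, h, w = roi
    pooled = []
    for channel in feature_map:  # pooling is applied independently per channel
        grid = []
        for i in range(out_h):
            # Sub-window of approximate height h / out_h: floor for the start,
            # ceil for the end, so the window is never empty.
            r0 = r + (i * h) // out_h
            r1 = r + -(-((i + 1) * h) // out_h)
            row = []
            for j in range(out_w):
                c0 = c + (j * w) // out_w
                c1 = c + -(-((j + 1) * w) // out_w)
                row.append(max(channel[y][x]
                               for y in range(r0, r1)
                               for x in range(c0, c1)))
            grid.append(row)
        pooled.append(grid)
    return pooled
```

Because the output grid size is fixed by `out_h` and `out_w` rather than by the RoI, two RoIs of different shapes produce feature maps of identical size.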

Initializing from pre-trained networks

When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

  1. First, the last max pooling layer is replaced by a RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
  2. Second, the network’s last fully connected layer and softmax (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers described earlier (a fully connected layer and softmax over K + 1 categories and category-specific bounding-box regressors).
  3. Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.

Fine-tuning for detection

Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.

  • The root cause is that back-propagation through the SPP layer is highly inefficient when each training sample (i.e. RoI) comes from a different image, which is exactly how R-CNN and SPPnet networks are trained.
  • The inefficiency stems from the fact that each RoI may have a very large receptive field, often spanning the entire input image. Since the forward pass must process the entire receptive field, the training inputs are large (often the entire image).

The authors propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image.
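
The hierarchical sampling scheme can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the dictionary that maps image ids to RoI lists, and the uniform sampling of RoIs are assumptions made for the example.

```python
import random


def sample_minibatch(rois_by_image, num_images, batch_size, rng=random):
    """Sample N images first, then R / N RoIs from each sampled image.

    rois_by_image: dict mapping an image id to its list of RoIs (hypothetical layout).
    num_images:    N, the number of images per mini-batch.
    batch_size:    R, the total number of RoIs per mini-batch.
    rng:           the random module or a random.Random instance.
    """
    per_image = batch_size // num_images
    image_ids = rng.sample(sorted(rois_by_image), num_images)
    batch = []
    for image_id in image_ids:
        rois = rois_by_image[image_id]
        # Take fewer RoIs if an image does not have enough of them.
        chosen = rng.sample(rois, min(per_image, len(rois)))
        batch.extend((image_id, roi) for roi in chosen)
    return batch
```

With illustrative values N = 2 and R = 128, each mini-batch draws 64 RoIs from each of 2 images, so the convolutional feature map is computed only twice instead of once per RoI.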
