Faster R-CNN Summary

  • An object detection model
  • High accuracy but slow inference
    • Used when accuracy matters more than speed

Comparison with Other Object Detection Models

Object detection model accuracy vs. speed

Model Name    YOLOv3    SSD              Faster R-CNN     CenterNet
Accuracy      Low       Slightly low     Slightly high    High
Speed         Fast      Slightly fast    Slightly slow    Slow

History

R-CNN

  1. Detect approximately 2000 object-like regions using Selective Search
    • This part uses a hand-crafted algorithm
    • Identifies object-like regions based on low-level cues such as texture and color
  2. Crop the detected regions, resize them to a fixed size, and extract features using a CNN
  3. Classify with SVM and adjust BBOX positions with regression

Fast R-CNN

Problems with R-CNN

  • R-CNN requires training CNN, SVM, and BBOX regression separately
  • Slow execution time

Improvements

  • Introduced RoI pooling to eliminate the redundant CNN computation of running the network once per Selective Search proposal
    • The CNN runs once over the whole image; a pooling layer with variable-sized windows then extracts features for each region
    • This yields fixed-size feature vectors from regions of varying sizes
  • Unified CNN, SVM, and BBOX regression into a single model
  • The overall flow remains the same as R-CNN
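The RoI pooling idea above can be sketched in pure Python for a single-channel feature map. This is a minimal max-pooling illustration (no channels, no quantization subtleties); `roi_max_pool` is a hypothetical helper name.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map to a fixed out_h x out_w grid.

    feature_map: 2D list of floats (single channel, for illustration).
    roi: (y0, x0, y1, x1) in feature-map coordinates, end-exclusive.
    """
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        row = []
        # Split the RoI into roughly equal bins along each axis.
        ys, ye = y0 + i * h // out_h, y0 + (i + 1) * h // out_h
        for j in range(out_w):
            xs, xe = x0 + j * w // out_w, x0 + (j + 1) * w // out_w
            # Take the max inside each bin (bins cover at least one cell).
            row.append(max(feature_map[y][x]
                           for y in range(ys, max(ye, ys + 1))
                           for x in range(xs, max(xe, xs + 1))))
        pooled.append(row)
    return pooled
```

Regardless of the RoI's size, the output is always `out_h x out_w`, which is what lets a single fully connected head process proposals of any shape.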

Faster R-CNN

Problems with Fast R-CNN

  • Selective Search is slow
  • Selective Search (region proposal) is not learned

Improvements

  • Replaced Selective Search with a Region Proposal Network
    • A small conv layer slides over the feature maps; at each position, anchor boxes (BBOXes) with different aspect ratios and sizes are considered
    • Then, FC-like layers output object/non-object classification scores and BBOX position regressions
  • Achieved speedup and higher accuracy through fully end-to-end learning
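The anchor-generation step above can be sketched in pure Python. This is a minimal illustration, not the paper's exact configuration: `make_anchors` is a hypothetical helper name, and the scale/ratio values in the usage below are example numbers.

```python
def make_anchors(fh, fw, stride, scales, ratios):
    """Generate (cx, cy, w, h) anchor boxes at every feature-map cell.

    fh, fw: feature-map height/width; stride: input pixels per cell.
    scales: anchor side lengths; ratios: height/width aspect ratios.
    """
    anchors = []
    for y in range(fh):
        for x in range(fw):
            # Center of this cell in input-image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area s*s constant while varying aspect ratio.
                    w = s / r ** 0.5
                    h = s * r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors
```

With 2 scales and 3 ratios, each feature-map position produces 6 anchors; the RPN then scores and regresses every one of them.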

Similar Models (Details Omitted)

YOLO

  • Divides the image into a grid
  • Used for real-time applications

SSD

  • Uses features from different stages of convolution
  • Strong with low-resolution and small images

Cascade R-CNN

  • An improvement on Faster R-CNN
  • Gradually increases the IoU threshold
  • Good accuracy but demands significant computational resources
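Since Cascade R-CNN's stages are defined by increasing IoU thresholds, a minimal IoU computation makes the idea concrete. A sketch, not a library implementation; boxes are assumed to be `(x0, y0, x1, y1)` corner coordinates.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

A detection counts as a positive for a stage only when its IoU with a ground-truth box exceeds that stage's threshold, so later stages train on progressively better-localized boxes.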

Model Architecture

Paper figure

  1. Feed the image into convolutional layers to produce feature maps
    • image - conv_layers - feature_maps
  2. Apply an n×n (n=3 in the paper) conv layer to the feature maps, then use FC-like conv layers (kernel=1, stride=1 to emulate fully connected behavior) to output objectness scores and positions (x, y, w, h)
    • Region Proposal Network
  3. Compute and classify BBOXes from the feature maps (step 1) and proposed regions (step 2)
    • RoI pooling
    • classifier
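Step 2's "FC-like conv layers" work because a kernel=1, stride=1 convolution applies the same linear map independently at every spatial position. A minimal pure-Python sketch on nested lists (`conv1x1` is a hypothetical helper; the weights in the usage are arbitrary examples):

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution: the same linear map at every position.

    feature_map: [C_in][H][W] nested lists.
    weights: [C_out][C_in]; bias: [C_out].
    Returns [C_out][H][W].
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for wo, b in zip(weights, bias):
        # Each output channel is a per-position dot product over input channels.
        plane = [[b + sum(wo[c] * feature_map[c][y][x] for c in range(c_in))
                  for x in range(w)] for y in range(h)]
        out.append(plane)
    return out
```

This is how the RPN shares its classification and regression heads across all spatial positions without flattening the feature map.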

Key Contributions in the Paper

  • In the RPN, training directly would be dominated by negative samples, so positive and negative samples are drawn at up to a 1:1 ratio per minibatch
  • Joint training of this architecture yields an approximate solution rather than an exact one
    • This makes training faster
    • In the paper, the RPN and Fast R-CNN components were trained alternately
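The balanced sampling described above can be sketched as follows. A pure-Python illustration: `sample_minibatch` is a hypothetical helper, and padding with extra negatives when positives are scarce follows the paper's description.

```python
import random

def sample_minibatch(labels, batch_size, rng):
    """Sample anchor indices with up to a 1:1 positive/negative ratio.

    labels: 1 for positive anchors, 0 for negative anchors.
    If there are fewer positives than batch_size // 2, the minibatch is
    padded with extra negatives (assumes enough negatives are available).
    """
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    n_pos = min(len(pos), batch_size // 2)
    n_neg = batch_size - n_pos
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
```

Without this balancing, almost every anchor is background, and the classifier would trivially learn to predict "non-object" everywhere.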

What Is CenterNet?

  • There are two models named CenterNet; this refers to the "Objects as Points" version

History

  • Heatmap-based methods
    • Came after Faster R-CNN and similar models
    • Use heatmaps instead of anchors
  • CornerNet (Aug 2018) is the origin
    • Instead of anchor-to-BBOX regression, it learns top-left and bottom-right keypoints via heatmaps

Model Architecture

  • Instead of anchors, predicts the center of objects per class using heatmaps
  • Properties such as height, width, and class are regressed at each position
  1. Feed the image into a fully convolutional network to generate heatmaps
  2. Identify heatmap peaks (compared against the surrounding 8 positions) as object centers
  3. As needed, estimate object size, depth, orientation, pose, etc. from the feature vector at the object center
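Step 2's peak extraction (a cell counts as a center only if it beats all 8 of its neighbours) can be sketched in pure Python. `heatmap_peaks` and its threshold are hypothetical; real CenterNet implementations do this with a 3×3 max-pooling pass.

```python
def heatmap_peaks(hm, threshold=0.1):
    """Return (y, x, score) for cells that are local maxima over their
    8 neighbours and exceed the score threshold."""
    h, w = len(hm), len(hm[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = hm[y][x]
            if v < threshold:
                continue
            # All in-bounds neighbours in the 3x3 window, excluding the cell.
            neighbours = [hm[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))
                          if (ny, nx) != (y, x)]
            if all(v > n for n in neighbours):
                peaks.append((y, x, v))
    return peaks
```

Each surviving peak becomes one detected object center; width, height, and other properties are then read off the regression outputs at that position.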