Faster R-CNN Summary

  • An object detection model
  • High accuracy but slow inference
    • Used when accuracy matters more than speed

Comparison with Other Object Detection Models

Object detection model accuracy vs. speed

Model Name    YOLOv3    SSD              Faster R-CNN     CenterNet
Accuracy      Low       Slightly low     Slightly high    High
Speed         Fast      Slightly fast    Slightly slow    Slow

History

R-CNN

  1. Detect approximately 2000 object-like regions using Selective Search
    • This part uses a hand-crafted algorithm
    • Identifies object-like regions based on low-level cues such as texture and color
  2. Crop the detected regions, resize them to a fixed size, and extract features using a CNN
  3. Classify with SVM and adjust BBOX positions with regression

Fast R-CNN

Problems with R-CNN

  • R-CNN requires training CNN, SVM, and BBOX regression separately
  • Slow execution time

Improvements

  • Introduced RoI pooling to eliminate the redundant CNN computation of running the network once per Selective Search proposal
    • The CNN runs once over the whole image; a pooling layer with variable-sized windows then extracts features for each region
    • This yields fixed-size feature vectors from regions of varying sizes
  • Unified CNN, SVM, and BBOX regression into a single model
  • The overall flow remains the same as R-CNN
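The RoI pooling idea above can be sketched in pure Python for a single-channel feature map. This is a minimal max-pooling illustration (no channels, no quantization subtleties); `roi_max_pool` is a hypothetical helper name.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map to a fixed out_h x out_w grid.

    feature_map: 2D list of floats (single channel, for illustration).
    roi: (y0, x0, y1, x1) in feature-map coordinates, end-exclusive.
    """
    y0, x0, y1, x1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        row = []
        # Split the RoI into roughly equal bins along each axis.
        ys, ye = y0 + i * h // out_h, y0 + (i + 1) * h // out_h
        for j in range(out_w):
            xs, xe = x0 + j * w // out_w, x0 + (j + 1) * w // out_w
            # Take the max inside each bin (bins cover at least one cell).
            row.append(max(feature_map[y][x]
                           for y in range(ys, max(ye, ys + 1))
                           for x in range(xs, max(xe, xs + 1))))
        pooled.append(row)
    return pooled
```

Regardless of the RoI's size, the output is always `out_h x out_w`, which is what lets a single fully connected head process proposals of any shape.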

Faster R-CNN

Problems with Fast R-CNN

  • Selective Search is slow
  • Selective Search (region proposal) is not learned

Improvements

  • Replaced Selective Search with a Region Proposal Network
    • A small conv layer slides over the feature maps; at each position, anchor boxes (BBOXes) with different aspect ratios and sizes are considered
    • Then, FC-like layers output object/non-object classification scores and BBOX position regressions
  • Achieved speedup and higher accuracy through fully end-to-end learning
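The anchor-generation step above can be sketched in pure Python. This is a minimal illustration, not the paper's exact configuration: `make_anchors` is a hypothetical helper name, and the scale/ratio values in the usage below are example numbers.

```python
def make_anchors(fh, fw, stride, scales, ratios):
    """Generate (cx, cy, w, h) anchor boxes at every feature-map cell.

    fh, fw: feature-map height/width; stride: input pixels per cell.
    scales: anchor side lengths; ratios: height/width aspect ratios.
    """
    anchors = []
    for y in range(fh):
        for x in range(fw):
            # Center of this cell in input-image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area s*s constant while varying aspect ratio.
                    w = s / r ** 0.5
                    h = s * r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors
```

With 2 scales and 3 ratios, each feature-map position produces 6 anchors; the RPN then scores and regresses every one of them.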

Similar Models (Details Omitted)

YOLO

  • Divides the image into a grid
  • Used for real-time applications

SSD

  • Uses features from different stages of convolution
  • Strong with low-resolution and small images

Cascade R-CNN

  • An improvement on Faster R-CNN
  • Gradually increases the IoU threshold
  • Good accuracy but demands significant computational resources
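Since Cascade R-CNN's stages are defined by increasing IoU thresholds, a minimal IoU computation makes the idea concrete. A sketch, not a library implementation; boxes are assumed to be `(x0, y0, x1, y1)` corner coordinates.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    # Intersection rectangle (empty if the boxes do not overlap).
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

A detection counts as a positive for a stage only when its IoU with a ground-truth box exceeds that stage's threshold, so later stages train on progressively better-localized boxes.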

Model Architecture

Paper figure

  1. Feed the image into convolutional layers to produce feature maps
    • image - conv_layers - feature_maps
  2. Apply an n×n (n=3 in the paper) conv layer to the feature maps, then use FC-like conv layers (kernel=1, stride=1 to emulate fully connected behavior) to output objectness scores and positions (x, y, w, h)
    • Region Proposal Network
  3. Compute and classify BBOXes from the feature maps (step 1) and proposed regions (step 2)
    • RoI pooling
    • classifier
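Step 2's "FC-like conv layers" work because a kernel=1, stride=1 convolution applies the same linear map independently at every spatial position. A minimal pure-Python sketch on nested lists (`conv1x1` is a hypothetical helper; the weights in the usage are arbitrary examples):

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution: the same linear map at every position.

    feature_map: [C_in][H][W] nested lists.
    weights: [C_out][C_in]; bias: [C_out].
    Returns [C_out][H][W].
    """
    c_in = len(feature_map)
    h, w = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for wo, b in zip(weights, bias):
        # Each output channel is a per-position dot product over input channels.
        plane = [[b + sum(wo[c] * feature_map[c][y][x] for c in range(c_in))
                  for x in range(w)] for y in range(h)]
        out.append(plane)
    return out
```

This is how the RPN shares its classification and regression heads across all spatial positions without flattening the feature map.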

Key Contributions in the Paper

  • In the RPN, training directly would be dominated by negative samples, so positive and negative samples are drawn at up to a 1:1 ratio per minibatch
  • Joint training of this architecture yields an approximate solution rather than an exact one
    • This makes training faster
    • In the paper, the RPN and Fast R-CNN components were trained alternately
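The balanced sampling described above can be sketched as follows. A pure-Python illustration: `sample_minibatch` is a hypothetical helper, and padding with extra negatives when positives are scarce follows the paper's description.

```python
import random

def sample_minibatch(labels, batch_size, rng):
    """Sample anchor indices with up to a 1:1 positive/negative ratio.

    labels: 1 for positive anchors, 0 for negative anchors.
    If there are fewer positives than batch_size // 2, the minibatch is
    padded with extra negatives (assumes enough negatives are available).
    """
    pos = [i for i, lab in enumerate(labels) if lab == 1]
    neg = [i for i, lab in enumerate(labels) if lab == 0]
    n_pos = min(len(pos), batch_size // 2)
    n_neg = batch_size - n_pos
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
```

Without this balancing, almost every anchor is background, and the classifier would trivially learn to predict "non-object" everywhere.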

What Is CenterNet?

  • There are two models named CenterNet; this refers to the "Objects as Points" version

History

  • Heatmap-based methods
    • Came after Faster R-CNN and similar models
    • Use heatmaps instead of anchors
  • CornerNet (Aug 2018) is the origin
    • Instead of anchor-to-BBOX regression, it learns top-left and bottom-right keypoints via heatmaps

Model Architecture

  • Instead of anchors, predicts the center of objects per class using heatmaps
  • Properties such as height, width, and class are regressed at each position
  1. Feed the image into a fully convolutional network to generate heatmaps
  2. Identify heatmap peaks (compared against the surrounding 8 positions) as object centers
  3. As needed, estimate object size, depth, orientation, pose, etc. from the feature vector at the object center
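Step 2's peak extraction (a cell counts as a center only if it beats all 8 of its neighbours) can be sketched in pure Python. `heatmap_peaks` and its threshold are hypothetical; real CenterNet implementations do this with a 3×3 max-pooling pass.

```python
def heatmap_peaks(hm, threshold=0.1):
    """Return (y, x, score) for cells that are local maxima over their
    8 neighbours and exceed the score threshold."""
    h, w = len(hm), len(hm[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = hm[y][x]
            if v < threshold:
                continue
            # All in-bounds neighbours in the 3x3 window, excluding the cell.
            neighbours = [hm[ny][nx]
                          for ny in range(max(0, y - 1), min(h, y + 2))
                          for nx in range(max(0, x - 1), min(w, x + 2))
                          if (ny, nx) != (y, x)]
            if all(v > n for n in neighbours):
                peaks.append((y, x, v))
    return peaks
```

Each surviving peak becomes one detected object center; width, height, and other properties are then read off the regression outputs at that position.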