Notes of Deep Learning Lessons: L4-CNN-Object Detection

Paper Notes

Deep Learning

Publish Date: 2020-12-14

OverFeat

This paper shows that different tasks can be learned simultaneously using a single shared network.
It is the first to provide a clear explanation of how ConvNets can be used for localization and detection on ImageNet data.
The fixed multi-view voting used in earlier work (averaging a small set of crops) not only ignores some regions of the image, but is also computationally redundant when views overlap.
While the sliding window approach may be computationally prohibitive for certain types of models, it is inherently efficient in the case of ConvNets. Applied densely, it yields significantly more views for voting, which increases robustness while remaining computationally efficient.
The paper also shows that a multi-scale sliding window approach can be used for classification, localization and detection.
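
The efficiency claim can be checked on a toy example. The sketch below is only an illustration, not OverFeat itself: the 1-D "network" (`tiny_net`), its filters `K1` and `K2`, and the input signal are all made up for the demonstration. It compares cropping every window and running the network on each crop against a single dense pass over the whole input.

```python
def conv1d(signal, kernel):
    """Valid 1-D correlation: one output per position where the kernel fits."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def tiny_net(signal, k1, k2):
    """Two stacked valid convolutions with a ReLU in between (hypothetical model)."""
    hidden = [max(0.0, v) for v in conv1d(signal, k1)]
    return conv1d(hidden, k2)


K1 = [1.0, -2.0, 1.0]            # hypothetical first-layer filter
K2 = [0.5, 0.5, -1.0]            # hypothetical "classifier" filter
WINDOW = len(K1) + len(K2) - 1   # receptive field of one output: 5 samples

signal = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 1.0, 0.0]

# Naive sliding window: crop each 5-sample view and run the network on it.
naive = [tiny_net(signal[i:i + WINDOW], K1, K2)[0]
         for i in range(len(signal) - WINDOW + 1)]

# Dense evaluation: run the same network once over the full input.
dense = tiny_net(signal, K1, K2)

print(naive)
print(dense)
```

Both lists hold one score per window position and agree with each other. The per-window version evaluates 12 first-layer activations (3 per crop, 4 crops), while the dense pass evaluates each of the 6 distinct activations once, which is where the savings on overlapping views come from.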

YOLO

YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.
Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background.
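
To make the regression framing concrete, here is a rough sketch of how a ground-truth box could be mapped onto a grid-shaped target. The values S = 7, C = 20 and the 448x448 input follow the PASCAL VOC setup in the YOLO paper and B = 2 is the two boxes per cell; the helper names (`output_size`, `responsible_cell`, `encode_box`) are hypothetical and not taken from any library.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (assumed VOC setup)


def output_size(s, b, c):
    """Length of the flat prediction: each cell holds b boxes (x, y, w, h, conf) and c class scores."""
    return s * s * (b * 5 + c)


def responsible_cell(cx, cy, img_w, img_h, s=S):
    """Grid cell (row, col) that contains the box center."""
    col = min(int(cx * s / img_w), s - 1)
    row = min(int(cy * s / img_h), s - 1)
    return row, col


def encode_box(cx, cy, w, h, img_w, img_h, s=S):
    """Regression target: center offset inside the cell, size relative to the image."""
    row, col = responsible_cell(cx, cy, img_w, img_h, s)
    x_off = cx * s / img_w - col
    y_off = cy * s / img_h - row
    return row, col, (x_off, y_off, w / img_w, h / img_h)


print(output_size(S, B, C))                   # total numbers predicted per image
print(encode_box(300, 200, 100, 50, 448, 448))
print(responsible_cell(310, 210, 448, 448))   # a nearby center falls in the same cell
```

With these assumed settings the network regresses 7 x 7 x 30 = 1470 numbers per image, and two objects whose centers are only a few pixels apart are assigned to the same cell.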
Limitations of YOLO are as follows:

YOLO imposes strong spatial constraints on bounding box predictions, since each grid cell predicts only two boxes and can have only one class. This spatial constraint limits the number of nearby objects that the model can predict, so YOLO struggles with small objects that appear in groups, such as flocks of birds.
Since YOLO learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.
While YOLO trains on a loss function that approximates detection performance, the loss function treats errors the same in small bounding boxes and large bounding boxes. A small error in a large box is generally benign, but a small error in a small box has a much greater effect on IoU. YOLO’s main source of error is incorrect localization.
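
The last point is easy to verify numerically. The following sketch assumes boxes in `(x1, y1, x2, y2)` corner format and uses made-up box sizes; it applies the same 5-pixel horizontal shift to a large and a small box and compares the resulting IoU.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def shift(box, dx):
    """Move a box horizontally by dx pixels."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1, x2 + dx, y2)


large = (0, 0, 100, 100)  # illustrative 100x100 box
small = (0, 0, 10, 10)    # illustrative 10x10 box

iou_large = iou(large, shift(large, 5))  # 9500 / 10500
iou_small = iou(small, shift(small, 5))  # 50 / 150

print(round(iou_large, 3), round(iou_small, 3))
```

The same 5-pixel error leaves the large box at an IoU of about 0.905 but drops the small box to about 0.333, even though a plain coordinate error would count both mistakes equally.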

R-CNN

The R-CNN detection system consists of three modules:

The first generates category-independent region proposals. These proposals define the set of candidate bounding boxes available to the detector.
The second is a large convolutional neural network that extracts a fixed-length feature vector from each region.
The third is a set of class-specific linear SVMs.
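
The way the three components listed above fit together can be sketched with toy stand-ins. Everything here is hypothetical: `propose_regions` uses plain sliding boxes in place of a real proposal method, `extract_features` computes two hand-made statistics in place of a CNN, and the single "bright" SVM has invented weights. Region warping and non-maximum suppression are left out.

```python
def propose_regions(width, height, size, stride):
    """Stage 1 stand-in: category-independent candidate boxes (x1, y1, x2, y2)."""
    return [(x, y, x + size, y + size)
            for y in range(0, height - size + 1, stride)
            for x in range(0, width - size + 1, stride)]


def extract_features(image, box):
    """Stage 2 stand-in: a fixed-length (2-d) feature vector for any region."""
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    mean = sum(pixels) / len(pixels)
    return [mean, max(pixels) - min(pixels)]


def svm_scores(features, svms):
    """Stage 3: one linear SVM per class, score = w . f + b."""
    return {name: sum(w_i * f_i for w_i, f_i in zip(w, features)) + b
            for name, (w, b) in svms.items()}


def detect(image, svms, size=2, stride=2):
    """Run the three stages and keep regions whose best class score is positive."""
    detections = []
    for box in propose_regions(len(image[0]), len(image), size, stride):
        scores = svm_scores(extract_features(image, box), svms)
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            detections.append((box, best, scores[best]))
    return detections


image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
svms = {"bright": ([1.0, 0.0], -5.0)}  # made-up weights and bias

detections = detect(image, svms)
print(detections)
```

Note that the feature extractor runs once per proposal, which is the cost that Fast R-CNN later removes by sharing one feature map across all regions.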

Fast R-CNN

A Fast R-CNN network takes as input an entire image and a set of object proposals.
The network first processes the whole image with several convolutional and max pooling layers to produce a conv feature map.
Then, for each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers:

One produces softmax probability estimates over K object classes plus a catch-all “background” class.
The other outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
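
The RoI pooling step that feeds these two output layers can be sketched in a few lines. This is a simplified single-channel version under stated assumptions: the RoI is already expressed in feature-map cells as `(x1, y1, x2, y2)` with exclusive upper bounds, the bin boundaries use a floor/ceil split, and the function name `roi_max_pool` is hypothetical.

```python
import math


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into a fixed out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Rows covered by output bin i (always at least one row).
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            # Columns covered by output bin j (always at least one column).
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map whose value at (row, col) is row * 6 + col.
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]

square = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)  # 4x4 region
wide = roi_max_pool(feature_map, (1, 0, 6, 3), 2, 2)    # 5 wide, 3 tall region

print(square)
print(wide)
```

Regions of different sizes both come out as 2x2 grids, which is what lets the fully connected layers accept any proposal.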

Faster R-CNN

The paper introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.
Faster R-CNN merges the RPN and Fast R-CNN into a single network by sharing their convolutional features. Using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.
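
These notes do not spell out what the RPN predicts relative to; in the Faster R-CNN paper, the bounds at each position are regressed from a fixed set of reference boxes called anchors. The sketch below generates such anchors, assuming the paper's default of 3 scales and 3 aspect ratios with a feature stride of 16. The function name `generate_anchors` and the choice of placing each anchor at the center of its cell are assumptions made for illustration.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors in image pixels for every feature-map position."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell in image coordinates (assumed convention).
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)  # keeps area close to scale ** 2
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(2, 3)  # tiny 2x3 feature map
print(len(anchors))               # 9 anchors per position
print(anchors[1])                 # square 128x128 anchor at the first position
```

For each of these boxes the RPN would output an objectness score and four box offsets, so a feature map with H x W positions yields 9 x H x W candidate proposals under these assumed defaults.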