Notes on Deep Learning Lessons: L4-CNN - Case Studies of CNNs


LeNet

  • This paper was written in 1998; people didn't really use padding back then, only valid convolutions, so after applying each ConV layer the height and width shrink.
  • The architecture of LeNet is:
    ConV+pool->ConV+pool->FC->FC->output  
    
  • LeNet uses sigmoid & tanh, not ReLU.
  • The Graph Transformer Network (GTN) isn't widely used today.
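The shrinking effect of valid convolutions can be sketched with the output-size formula floor((n - f) / s) + 1. The numbers below follow the classic LeNet-5 layout (32x32 input, 5x5 filters, 2x2 pooling), used purely as an illustration:

```python
def valid_conv_out(n, f, s=1):
    """Output size of a valid (no-padding) convolution: floor((n - f) / s) + 1."""
    return (n - f) // s + 1

# LeNet-5-style shape trace on a 32x32 input:
n = 32
n = valid_conv_out(n, 5)      # ConV 5x5      -> 28
n = valid_conv_out(n, 2, 2)   # pool 2x2, s=2 -> 14
n = valid_conv_out(n, 5)      # ConV 5x5      -> 10
n = valid_conv_out(n, 2, 2)   # pool 2x2, s=2 -> 5
print(n)  # 5
```

With 'same' padding each ConV step would instead keep n unchanged, which is why later architectures stopped shrinking at every layer.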

AlexNet

  • AlexNet has a lot of similarities to LeNet, but is much bigger (LeNet has about 60 thousand parameters, while AlexNet has about 60 million).
  • AlexNet uses ReLU as its activation function.
  • It uses data augmentation and dropout to reduce overfitting. The first form of data augmentation consists of generating image translations and horizontal reflections; the second form consists of altering the intensities of the RGB channels in training images.
  • When this paper was written, GPUs were still a little slow, so it had a complicated way of training on 2 GPUs.
  • Local Response Normalization (LRN), emphasized in this paper, isn't really used much today.
  • It was really this paper that convinced much of the CV community to take a serious look at deep learning, and its influence reached beyond CV as well.
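The first form of augmentation can be sketched in a few lines of NumPy (a minimal illustration: the 256-to-224 crop size follows the paper, but the PCA-based RGB-intensity altering is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((256, 256, 3))  # a dummy 256x256 RGB training image

# Horizontal reflection: mirror the width axis.
flipped = img[:, ::-1, :]

# Image translation: take a random 224x224 crop of the 256x256 image.
top = rng.integers(0, 256 - 224 + 1)
left = rng.integers(0, 256 - 224 + 1)
crop = img[top:top + 224, left:left + 224, :]
print(crop.shape)  # (224, 224, 3)
```

Each training image thus yields many translated and mirrored variants, which is what lets the network see far more effective examples than the raw dataset contains.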

VGG

  • Instead of having so many hyperparameters, VGG uses a simpler network that focuses just on ConV layers (ConV: 3×3, s=1, same; MAX-POOL: 2×2, s=2). This really simplified the NN architecture.
  • The number of parameters is very large (138 million).
  • This paper reveals a systematic pattern: as you go deeper, the height and width go down (halving with each pooling layer) while the number of channels goes up (roughly doubling with each block).
  • VGG confirms the importance of depth in visual representations.
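The down-and-up pattern and the per-layer parameter count can be traced with simple arithmetic (the 224×224 input and 64→128→256→512 channel schedule follow VGG-16; this is a shape trace, not an implementation):

```python
def conv_params(f, c_in, c_out):
    """Parameters of one ConV layer: f*f*c_in weights per filter, plus one bias each."""
    return f * f * c_in * c_out + c_out

# 'same' 3x3 ConVs keep H/W; each MAX-POOL 2x2, s=2 halves H/W,
# while the channel count roughly doubles per block.
h = 224
shapes = []
for c in [64, 128, 256, 512, 512]:
    shapes.append((h, h, c))  # after the block's 'same' ConV layers
    h //= 2                   # after MAX-POOL 2x2, s=2

print(shapes)                 # (224, 224, 64) ... (14, 14, 512)
print(conv_params(3, 3, 64))  # first ConV layer (3 -> 64 channels): 1792
```

Summing `conv_params` over all ConV and FC layers is what yields the 138-million total; the three FC layers account for most of it.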

ResNet

  • In theory, a deeper network should help, but in practice a very deep plain network is harder for the optimization algorithm to train, and the training error gets worse as depth grows. ResNet produces no higher training error than its shallower counterpart, which solves this problem.
  • If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping than to learn the function as a new one. The identity function is easy for a residual block to learn, since the skip connection makes it easy to get a[L+2] = a[L].
  • This assumes z[L+2] and a[L] have the same dimension; the shortcut works because 'same' convolutions preserve dimensions.
  • This paper focuses on the behavior of extremely deep networks, not on pushing state-of-the-art results, so it uses simple architectures.
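The "identity is easy to learn" point can be illustrated with a tiny NumPy residual block (a plain dense sketch rather than a ConV block; the weight names are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(a, W1, b1, W2, b2):
    """a[L+2] = g(z[L+2] + a[L]): the skip connection adds a[L] before the
    final activation (dimensions assumed to match, as with 'same' ConV)."""
    a1 = relu(W1 @ a + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a)  # skip connection

# If weight decay drives the weights toward zero, the block outputs
# relu(0 + a) = a for non-negative activations: the identity comes "for free",
# so adding the block cannot make training error worse.
a = np.array([1.0, 2.0, 3.0])
W = np.zeros((3, 3)); b = np.zeros(3)
out = residual_block(a, W, b, W, b)
print(out)  # [1. 2. 3.]
```

A plain (non-residual) block with zero weights would instead output zero, destroying the signal, which is the contrast the bullet above describes.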

Network in Network

  • With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional FC layers.
  • Even though the details of the architecture in this paper aren't widely used, the idea of a **1×1 convolution**, or the NIN idea, has been very influential.
  • This paper uses a micro network structure (a small MLP) to replace the Generalized Linear Model (GLM), i.e. the linear convolutional filter.
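Global average pooling itself is one line of NumPy: in the NIN scheme each class gets one feature map, and its spatial average feeds softmax directly, with no FC layer. The shapes below are made-up toy numbers:

```python
import numpy as np

# Feature maps: (channels, height, width); one map per class under NIN.
feature_maps = np.arange(3 * 4 * 4, dtype=float).reshape(3, 4, 4)

# Global average pooling: collapse each map to a single number,
# producing one confidence score per class.
gap = feature_maps.mean(axis=(1, 2))
print(gap.shape)  # (3,)
```

Because pooling has no parameters, this layer cannot overfit, and each output is directly tied to one feature map, which is what makes it easier to interpret than an FC layer.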

GoogLeNet

  • Instead of choosing what filter size you want in ConV layers or pooling layers, you can do them all in the Inception module.
  • The Inception module uses a **1×1 ConV layer** to create a bottleneck layer, reducing the computational cost significantly.
  • The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, this paper increased the depth and width of the network while keeping the computational budget constant.
  • The most straightforward way of improving the performance of a deep neural network is by increasing its size. This includes both increasing the depth (the number of network levels) and the width (the number of units at each level). However, this simple solution comes with two major drawbacks:
    1. A bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting.
    2. The dramatically increased use of computational resources.
    A fundamental way of solving both of these issues would be to introduce sparsity and replace the FC layers by sparse ones, even inside the convolutions.
    If the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the preceding layer's activations and clustering neurons with highly correlated outputs.
  • The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components.
  • The design follows the practical intuition that visual information should be processed at various scales and then aggregated, so that the next stage can abstract features from the different scales simultaneously.
  • The side branches in GoogLeNet take some hidden layers and use them to make a prediction. This helps ensure that the features computed even in the hidden or intermediate layers are not too bad for predicting the output class. It appears to have a regularizing effect on the Inception network and helps prevent it from overfitting.
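The bottleneck saving mentioned above can be checked with simple arithmetic. The 28×28×192 → 28×28×32 numbers below are the standard lecture example and are purely illustrative:

```python
def conv_mults(h, w, c_out, f, c_in):
    """Multiplications for a 'same' ConV layer: each of the h*w*c_out
    output values needs f*f*c_in multiplications."""
    return h * w * c_out * f * f * c_in

# Direct 5x5 ConV: 28x28x192 -> 28x28x32
direct = conv_mults(28, 28, 32, 5, 192)

# Bottleneck: 1x1 ConV down to 16 channels, then 5x5 ConV up to 32 channels
bottleneck = conv_mults(28, 28, 16, 1, 192) + conv_mults(28, 28, 32, 5, 16)

print(direct, bottleneck)  # ~120M vs ~12.4M multiplications
```

The bottleneck cuts the cost by roughly a factor of 10, and done with a reasonable bottleneck size it doesn't seem to hurt performance, which is what lets Inception try every filter size at once within a fixed budget.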

Author: WTY
Reprint policy: All articles in this blog, except where specially stated, follow the CC BY 4.0 reprint policy. If reproduced, please indicate the source: WTY!