Keywords: AlexNet, Dropout, ReLU

All the figures and tables in this post come from ‘ImageNet Classification with Deep Convolutional Neural Networks’

## Background

1. Large training sets are available
2. GPU
3. Convolutional neural networks, such as ‘Handwritten digit recognition with back-propagation networks’

## Inspiration

1. Controlling the capacity of CNNs by varying their depth and breadth.
2. CNNs can make strong and mostly correct assumptions about the nature of images.

## Contribution

1. Training one of the largest convolutional neural networks on ImageNet ILSVRC-2010

2. A highly optimized GPU implementation of the architecture

3. Several new features of CNNs that improve performance and reduce training time

1. ReLU Nonlinearity $f(x)=\max(0,x)$
• networks with ReLUs train several times faster than those with $\tanh(x)$ or $\frac{1}{1+e^{-x}}$
• Jarrett et al.[^3] had tried other nonlinearities, such as $f(x)=|\tanh(x)|$, in combination with contrast normalization
2. Local Response Normalization
• ReLU does not require input normalization to prevent it from saturating
• nevertheless, the following local response normalization scheme aids generalization:

$b^{i}_{x,y}=a^i_{x,y}\Big/\left(k+\alpha\sum^{\min(N-1,\,i+n/2)}_{j=\max(0,\,i-n/2)}(a^j_{x,y})^2\right)^\beta$

• where $a^i_{x,y}$ is the activity of a neuron computed by applying kernel $i$ at position $(x,y)$ and then applying the ReLU nonlinearity, and $b^i_{x,y}$ is the response-normalized activity. The sum runs over $n$ ‘adjacent’ kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer
• $k$, $n$, $\alpha$, and $\beta$ are hyper-parameters whose values are determined on a validation set; $k=2$, $n=5$, $\alpha = 10^{-4}$, and $\beta = 0.75$ are used
• normalization is applied after the ReLU
• on CIFAR-10, a four-layer CNN achieved 13% test error without normalization and 11% with normalization
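The normalization above can be sketched in a few lines of numpy (the channel-first layout and function name are my own; this is an illustration, not the paper's GPU code):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Cross-channel LRN; a has shape (N_kernels, H, W), post-ReLU."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        # sum squared activities over up to n 'adjacent' kernel maps
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

x = np.random.randn(8, 4, 4)
a = np.maximum(0.0, x)              # ReLU first, then normalize
b = local_response_norm(a)
```

With $k=2$ the denominator is always greater than 1, so normalization only ever shrinks an activity, most strongly where neighboring kernel maps are also highly active.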
3. Overlapping Pooling
• traditionally, the neighborhoods summarized by adjacent pooling units do not overlap
• using $stride = 2$ and $kernel\,size=3$ reduces the top-1 and top-5 error rates by 0.4% and 0.3%, compared with non-overlapping pooling with $stride = 2$ and $kernel\,size=2$
4. Preventing overfitting
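A minimal single-channel sketch of overlapping max pooling (names and loop structure are my own):

```python
import numpy as np

def max_pool(a, size=3, stride=2):
    """Max pooling over one channel (H, W); size > stride gives overlap."""
    H, W = a.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            win = a[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = win.max()
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
y = max_pool(x)        # overlapping: z=3, s=2 → 2×2 output
```

Because $size > stride$, adjacent windows share a row/column of inputs, which the paper observes makes the network slightly harder to overfit.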

1. Data augmentation
• extracting random patches from a bigger image
• horizontal reflections
• at test time, five patches (the four corners and the center) as well as their horizontal reflections are evaluated, and the predictions (softmax outputs) over all ten are averaged
• altering the intensities of the RGB channels in training image, PCA first:

$\begin{bmatrix}\boldsymbol{p}_1&\boldsymbol{p}_2&\boldsymbol{p}_3\end{bmatrix} \begin{bmatrix}\alpha_1 \lambda_1\\\alpha_2 \lambda_2\\\alpha_3 \lambda_3\\\end{bmatrix}$

• where $\boldsymbol{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the covariance matrix of RGB pixel values, and $\alpha_i$ is a random variable. For a particular training image, each $\alpha_i$ is drawn only once; when that image is used again, the $\alpha_i$ are re-drawn
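The PCA color augmentation above can be sketched as follows (a hedged illustration: the function name is mine, and I assume the paper's $\alpha_i \sim \mathcal{N}(0,\,0.1)$; the eigendecomposition is of the $3\times3$ RGB covariance):

```python
import numpy as np

def pca_color_augment(img, rng):
    """Add [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T to every pixel of an
    (H, W, 3) float RGB image, with a_i drawn once per image use."""
    pixels = img.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)         # 3x3 RGB covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # p_i are the columns of eigvecs
    alphas = rng.normal(0.0, 0.1, size=3)      # drawn once per use of the image
    delta = eigvecs @ (alphas * eigvals)       # one 3-vector for the whole image
    return img + delta                          # broadcast over all pixels

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))
aug = pca_color_augment(img, rng)
```

Note that the same 3-vector is added to every pixel, so the augmentation shifts the image's overall color along its principal color axes rather than adding per-pixel noise.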
2. Dropout
• set each hidden neuron’s output to $0$ with probability 0.5, so each input effectively sees a different architecture
• dropout is used in the first two fully-connected layers in figure 2
• without dropout the network exhibits substantial overfitting.
• dropout roughly doubles the number of iterations required to converge.
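A minimal sketch of this dropout variant (note the paper scales outputs by $1-p$ at test time, unlike the "inverted dropout" common today; the function name is mine):

```python
import numpy as np

def dropout(x, p=0.5, rng=None, train=True):
    """Training: zero each unit with probability p (no rescaling).
    Test: multiply all outputs by (1 - p), as in the paper."""
    if not train:
        return x * (1.0 - p)
    rng = rng or np.random.default_rng()
    return x * (rng.random(x.shape) >= p)   # random binary mask
```

At test time every neuron is active, so scaling by $1-p$ matches the expected activation seen during training.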

## Experiment

1. batch size of 128 examples, momentum 0.9, and weight decay of 0.0005:

$$\begin{aligned} v_{i+1} &:= 0.9 \cdot v_i - 0.0005\cdot\epsilon\cdot w_i-\epsilon\left\langle\frac{\partial L}{\partial w}\Big|_{w_i}\right\rangle_{D_i}\\ w_{i+1} &:= w_i + v_{i+1} \end{aligned}$$

• weight decay here is not merely a regularizer: it reduces the model’s training error
• where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\left\langle\frac{\partial L}{\partial w}\big|_{w_i}\right\rangle_{D_i}$ is the average gradient of batch $D_i$ with respect to $w$, evaluated at $w_i$
2. equal learning rate for every layer
3. the learning rate, initialized at $0.01$, is divided by $10$ whenever the validation error rate stops improving
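The update rule above can be written out directly (a sketch; the function name is mine, and the paper applies this per-weight on GPU rather than in numpy):

```python
import numpy as np

def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One momentum SGD update with weight decay, as in the paper:
    v <- 0.9*v - 0.0005*lr*w - lr*grad ;  w <- w + v."""
    v_new = momentum * v - weight_decay * lr * w - lr * grad
    return w + v_new, v_new

w = np.array([1.0, -2.0])
v = np.zeros(2)
grad = np.array([0.5, 0.5])     # average gradient over a batch
w, v = sgd_step(w, v, grad, lr=0.01)
```

The weight-decay term $-0.0005\cdot\epsilon\cdot w_i$ pulls every weight slightly toward zero on each step, which is the effect the paper reports as also lowering training error.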

## Result

## Personal Summary

This paper marked the rebirth of CNNs in 2012, and GPUs were an important enabler for training them.

1. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” In Advances in neural information processing systems, pp. 1097-1105. 2012. ↩︎

2. LeCun, Yann, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. “Handwritten digit recognition with a back-propagation network.” In Advances in neural information processing systems, pp. 396-404. 1990. ↩︎

3. Jarrett, Kevin, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. “What is the best multi-stage architecture for object recognition?” In 2009 IEEE 12th International Conference on Computer Vision, pp. 2146-2153. 2009. ↩︎