ImageNet Classification with Deep Convolutional Neural Networks

Keywords: AlexNet, Dropout, ReLU

All the figures and tables in this post come from ‘ImageNet Classification with Deep Convolutional Neural Networks’[1]

Basic Works

  1. Large training sets are available
  2. GPU
  3. Convolutional neural networks, such as ‘Handwritten digit recognition with back-propagation networks’[2]


  1. Controlling the capacity of CNNs by varying their depth and breadth.
  2. CNNs can make strong and mostly correct assumptions about the nature of images.


  1. Training one of the largest convolutional neural networks on the ImageNet ILSVRC-2010 dataset

    1. architecture
  2. GPUs are used

  3. Several new features of CNNs that improve performance

    1. ReLU Nonlinearity: f(x)=\max(0,x)
      • ReLU trains much faster than \tanh(x) or the sigmoid \frac{1}{1+e^{-x}}
      • Jarrett et al.[3] had tried other nonlinear functions such as f(x)=|\tanh(x)|
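As a quick illustration, the ReLU nonlinearity is a one-line element-wise operation; a minimal NumPy sketch (the function name `relu` is mine):

```python
import numpy as np

def relu(x):
    """ReLU nonlinearity f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5, 3.0])))  # negatives become 0
```

Unlike tanh or the sigmoid, the gradient is 1 for every positive input, which is what makes training so much faster.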
    2. Local Response Normalization
      • ReLUs do not require input normalization to prevent them from saturating.
      • Local response normalization scheme aids generalization:

      b^{i}_{x,y}=\frac{a^i_{x,y}}{\left(k+\alpha\sum^{\min(N-1,\,i+n/2)}_{j=\max(0,\,i-n/2)}\left(a^j_{x,y}\right)^2\right)^\beta}

      • where a^i_{x,y} is the activity of a neuron computed by applying kernel i at position (x,y) and then applying the ReLU nonlinearity, and b^i_{x,y} is the response-normalized activity. The sum runs over n ‘adjacent’ kernel maps at the same spatial position, and N is the total number of kernels in the layer
      • k, n, \alpha and \beta are hyper-parameters whose values are determined using a validation set; k=2, n=5, \alpha=10^{-4} and \beta=0.75 are used
      • normalization is applied after the ReLU
      • on CIFAR-10, a four-layer CNN achieved a 13% test error rate without normalization and 11% with it
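The normalization can be written directly from the formula. A minimal NumPy sketch, assuming activations are stored as (kernel, height, width) and using the paper’s hyper-parameter values as defaults (the function name is mine):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Local response normalization across adjacent kernel maps.

    a: ReLU activations of shape (N, H, W), where N is the number
       of kernels in the layer.
    """
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)              # j = max(0, i - n/2)
        hi = min(N - 1, i + n // 2)          # j = min(N - 1, i + n/2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```

Note that the sum runs over neighboring kernel maps (the channel axis) at the same spatial position, not over spatial neighbors.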
    3. Overlapping Pooling
      • Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap
      • using stride = 2 and kernel size = 3 (so neighboring windows overlap) reduces the top-1 and top-5 error rates by 0.4% and 0.3% compared with non-overlapping stride = 2, kernel size = 2
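Overlapping pooling is just max pooling where the window is larger than the stride; a plain NumPy sketch for a single 2-D feature map (names are mine):

```python
import numpy as np

def max_pool(x, size=3, stride=2):
    """Max pooling over size x size windows spaced `stride` apart.
    With size > stride, adjacent pooling windows overlap."""
    h = (x.shape[0] - size) // stride + 1
    w = (x.shape[1] - size) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out
```

With size = stride the windows tile the map without overlapping; with the paper’s choice (size 3, stride 2) each input position can contribute to up to four windows.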
  4. preventing overfitting

    1. Data augmentation
      • extracting random patches from a bigger image
      • horizontal reflections
      • at test time, five patches (the four corner patches and the center patch) as well as their horizontal reflections are evaluated, and the softmax outputs of these ten predictions are averaged
      • altering the intensities of the RGB channels in training images, by first performing PCA on the set of RGB pixel values:

      \begin{bmatrix}\boldsymbol{p}_1&\boldsymbol{p}_2&\boldsymbol{p}_3\end{bmatrix}\begin{bmatrix}\alpha_1 \lambda_1\\\alpha_2 \lambda_2\\\alpha_3 \lambda_3\end{bmatrix}

      • where \boldsymbol{p}_i and \lambda_i are the i-th eigenvector and eigenvalue of the 3\times 3 covariance matrix of RGB pixel values, and \alpha_i is a Gaussian random variable with mean zero and standard deviation 0.1. For a particular training image, \alpha_i is drawn only once; when the image is used again in training, \alpha_i is redrawn
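The color augmentation above (often called “fancy PCA”) can be sketched in NumPy. The σ = 0.1 Gaussian follows the paper’s description; the function and variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def pca_color_augment(img, sigma=0.1):
    """Add the PCA-based color shift [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    to every pixel of an H x W x 3 image."""
    pixels = img.reshape(-1, 3).astype(np.float64)
    cov = np.cov(pixels, rowvar=False)       # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are the p_i
    alphas = rng.normal(0.0, sigma, size=3)  # drawn once per use of the image
    shift = eigvecs @ (alphas * eigvals)     # one 3-vector added to all pixels
    return img + shift
```

The same 3-vector shift is applied to every pixel, so the augmentation changes the overall color balance of the image without altering its spatial content.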
    2. Dropout
      • select some neurons and set their outputs to 0, each with probability 0.5. This means that for each input, a different architecture is effectively used
      • dropout is used in the first two fully-connected layers in figure 2
      • without dropout the network exhibits substantial overfitting
      • dropout roughly doubles the number of iterations required to converge
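A minimal sketch of dropout in the paper’s non-inverted form, which multiplies outputs by 0.5 at test time rather than rescaling during training (names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Zero each activation with probability p during training;
    scale by (1 - p) at test time so expected activations match."""
    if train:
        mask = rng.random(x.shape) >= p  # keep each unit with probability 1 - p
        return x * mask
    return x * (1.0 - p)
```

Modern frameworks usually implement the equivalent “inverted” variant, which divides by (1 - p) during training so that no scaling is needed at test time.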


  1. batch size of 128 examples, momentum 0.9, and weight decay of 0.0005:

\begin{aligned} v_{i+1} &:= 0.9 \cdot v_i - 0.0005\cdot\epsilon\cdot w_i-\epsilon\left(\frac{\partial L}{\partial w}\Big|_{w_i}\right)_{D_i}\\ w_{i+1} &:= w_i + v_{i+1} \end{aligned}

  • weight decay is not merely a regularizer here: it reduces the model’s training error
  • where i is the iteration index, v is the momentum variable, \epsilon is the learning rate, and \left(\frac{\partial L}{\partial w}\big|_{w_i}\right)_{D_i} is the average over batch D_i of the gradient of L with respect to w, evaluated at w_i
  1. an equal learning rate is used for all layers
  2. when the validation error rate stops improving, the learning rate is divided by 10
  3. the learning rate is initialized at 0.01
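The update rule above fits in a few lines; a sketch for scalar or NumPy-array weights (the helper name is mine):

```python
def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One step of the paper's update rule:
      v <- momentum * v - weight_decay * lr * w - lr * grad
      w <- w + v
    where grad is the batch-averaged gradient dL/dw at the current w."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

# hypothetical scalar example using the initial learning rate of 0.01
w, v = sgd_step(w=1.0, v=0.0, grad=2.0, lr=0.01)
```

Note that the weight-decay term subtracts a small multiple of the weight itself on every step, which is exactly why it lowers training error rather than acting only as a regularizer.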


Personal Summary

This paper revived CNNs in 2012. GPUs were an essential enabler for training CNNs at this scale.


  1. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” In Advances in neural information processing systems, pp. 1097-1105. 2012. ↩︎

  2. LeCun, Yann, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. “Handwritten digit recognition with a back-propagation network.” In Advances in neural information processing systems, pp. 396-404. 1990. ↩︎

  3. Jarrett, Kevin, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun. “What is the best multi-stage architecture for object recognition?” In 2009 IEEE 12th International Conference on Computer Vision, pp. 2146-2153. 2009. ↩︎