Deep Residual Learning for Image Recognition

Learning residual functions significantly improves quality when training deep neural networks

ResNet

Overview

  • Deep convolutional neural networks have led to major breakthroughs in image classification, in large part due to increasing the depth of the network
  • However, there are two issues when simply stacking more layers
    • Vanishing gradients
    • Degradation
  • Vanishing Gradients
    • Vanishing gradients occur when the gradients reaching the earlier layers become very close to zero, so the weight updates for those layers also become very close to zero
    • This leads to:
      • Stagnant learning in the earlier layers, which are crucial for learning fundamental features (edges, patterns, etc.)
      • Poor feature extraction and learning in the deeper layers since they build upon features learned in the early layers
    • The vanishing gradient problem often arises when using certain activation functions like the sigmoid function or hyperbolic tangent (tanh) function

      [Figure: plots of the sigmoid and tanh activation functions]

    • For large negative or positive inputs, the sigmoid function “saturates” near 0 or 1 respectively, where its derivative is a very small positive number
    • For large negative or positive inputs, the tanh function “saturates” near -1 or 1 respectively, where its derivative is a very small positive number
    • During backpropagation, the chain rule multiplies these local gradients together layer by layer, shrinking the product towards zero
    • Vanishing gradients are largely solved by techniques like:
      • Rectified Linear Unit (ReLU) activation function whose derivative is 1 for all positive inputs
        • $\mathrm{ReLU}(x) = \max(0, x)$
      • Normalized initialization and intermediate normalization layers
      • Residual networks (this paper)
  • Degradation
    • As the depth of the network increases, accuracy saturates and then degrades rapidly
    • Unexpectedly, this degradation is not caused by overfitting: adding more layers to a suitably deep model leads to higher training error, not only higher test error
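
The saturation argument in the Vanishing Gradients bullets can be checked numerically. The following is a minimal sketch, not code from the paper: the helper `chained_gradient` is a hypothetical name, and it assumes every layer sees the same pre-activation value and has weights equal to 1, so that only the activation derivatives are multiplied.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    # d/dx tanh(x) = 1 - tanh(x)^2; its maximum is 1 at x = 0
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x: float) -> float:
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0


def chained_gradient(local_grad, pre_activation: float, depth: int) -> float:
    # Hypothetical helper: product of `depth` identical local gradients.
    # Assumes all weights are 1 and every layer sees the same pre-activation.
    return local_grad(pre_activation) ** depth


if __name__ == "__main__":
    for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
        print(name, chained_gradient(grad, 2.0, 20))
```

Under these assumptions, the sigmoid and tanh products collapse towards zero as depth grows, while the ReLU product stays at 1 for positive inputs.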

Residual Networks

  • Instead of hoping a few stacked layers directly fit a desired underlying mapping, we let these layers fit a residual mapping
    • Formally, let the underlying desired mapping be $\mathcal{H}(x)$
    • Then the stacked nonlinear layers fit a mapping of $\mathcal{F}(x) \coloneqq \mathcal{H}(x) - x$
    • The original mapping is now recast into $\mathcal{F}(x) + x$
  • The hypothesis is that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping
  • In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers
  • For the implementation, we use shortcut connections which simply perform an identity mapping, and their outputs are added to the outputs of the stacked layers
  • Identity shortcut connections add neither extra parameters nor computational complexity
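
The residual formulation $\mathcal{F}(x) + x$ can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it assumes $\mathcal{F}$ is two fully connected layers with a ReLU in between, and it omits biases, normalization, convolutions, and any activation after the addition. The names `residual_block`, `matvec`, `w1`, and `w2` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    # Plain matrix-vector product; w has one row per output element
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def residual_block(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    # Residual branch (assumed form): F(x) = W2 * ReLU(W1 * x)
    f = matvec(w2, relu(matvec(w1, x)))
    # Identity shortcut: add the input back element-wise, with no extra parameters
    return [f_i + x_i for f_i, x_i in zip(f, x)]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    zeros = [[0.0] * 3 for _ in range(3)]
    # With the residual branch pushed to zero, the block reduces to the identity
    print(residual_block(x, zeros, zeros))
```

With all weights at zero the block returns its input unchanged, which illustrates why an identity mapping is easy to represent: the layers only need to drive $\mathcal{F}(x)$ to zero. The shortcut requires the input and output of $\mathcal{F}$ to have the same dimension, which this sketch assumes.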