Vanishing gradient problem

Activation functions such as the hyperbolic tangent function have derivatives in the range $(0,1]$, and backpropagation computes gradients by the chain rule. This has the effect of multiplying $n$ of these small numbers to compute gradients of the early layers in an $n$-layer network, meaning that the gradient (error signal) decreases exponentially with $n$ while the early layers train very slowly.
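The exponential decay can be illustrated numerically. The following is a minimal sketch, assuming a scalar chain of $n$ tanh layers that share one weight; the function name `chain_gradient` and the values of `w` and `x0` are illustrative choices, not taken from the cited sources.

```python
import math

def chain_gradient(n, w=0.5, x0=0.3):
    """Derivative of the output of n stacked scalar layers x -> tanh(w * x)
    with respect to the input x0, accumulated by the chain rule."""
    x, grad = x0, 1.0
    for _ in range(n):
        pre = w * x
        # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2), at most w for w > 0
        grad *= w * (1.0 - math.tanh(pre) ** 2)
        x = math.tanh(pre)
    return grad

for n in (1, 5, 10, 20, 40):
    print(n, chain_gradient(n))
```

With these assumed values every factor is at most 0.5, so the gradient after `n` layers is at most `0.5 ** n`.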

Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success. Hochreiter's Diplom thesis of 1991 formally identified the reason for this failure as the "vanishing gradient problem",^[2]^[3] which affects not only many-layered feedforward networks^[4] but also recurrent networks.^[5] The latter are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network. (The combination of unfolding and backpropagation is termed backpropagation through time.)

When activation functions are used whose derivatives can take on larger values, one risks encountering the related exploding gradient problem.

Prototypical models

This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio.^[5]

Recurrent network model

A generic recurrent network has hidden states $h_{1},h_{2},...$, inputs $u_{1},u_{2},...$, and outputs $x_{1},x_{2},...$. Let it be parametrized by $\theta$, so that the system evolves as

$$(h_{t},x_{t})=F(h_{t-1},u_{t},\theta )$$

Often, the output $x_{t}$ is a function of $h_{t}$, as some $x_{t}=G(h_{t})$. The vanishing gradient problem already presents itself clearly when $x_{t}=h_{t}$, so we simplify our notation to the special case with:

$$x_{t}=F(x_{t-1},u_{t},\theta )$$

Now, take its differential:

$${\begin{aligned}dx_{t}&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )dx_{t-1}\\&=\nabla _{\theta }F(x_{t-1},u_{t},\theta )d\theta +\nabla _{x}F(x_{t-1},u_{t},\theta )(\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )d\theta +\nabla _{x}F(x_{t-2},u_{t-1},\theta )dx_{t-2})\\&=\cdots \\&=\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)d\theta \end{aligned}}$$

Training the network requires us to define a loss function to be minimized. Let it be $L(x_{T},u_{1},...,u_{T})$;^{[note 1]} then minimizing it by gradient descent gives

$$dL=\nabla _{x}L(x_{T},u_{1},...,u_{T})\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)d\theta \qquad {\text{(loss differential)}}$$

$$\Delta \theta =-\eta \cdot \left[\nabla _{x}L(x_{T})\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)\right]^{T}$$

where $\eta$ is the learning rate.

The vanishing/exploding gradient problem appears because there are repeated multiplications, of the form

$$\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\nabla _{x}F(x_{t-3},u_{t-2},\theta )\cdots$$

Example: recurrent network with sigmoid activation

For a concrete example, consider a typical recurrent network defined by

$$x_{t}=F(x_{t-1},u_{t},\theta )=W_{rec}\sigma (x_{t-1})+W_{in}u_{t}+b$$

where $\theta =(W_{rec},W_{in})$ is the network parameter, $\sigma$ is the sigmoid activation function^{[note 2]}, applied to each vector coordinate separately, and $b$ is the bias vector.

Then, $\nabla _{x}F(x_{t-1},u_{t},\theta )=W_{rec}\operatorname{diag} (\sigma '(x_{t-1}))$, and so

$${\begin{aligned}&\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{x}F(x_{t-2},u_{t-1},\theta )\cdots \nabla _{x}F(x_{t-k},u_{t-k+1},\theta )\\={}&W_{rec}\operatorname{diag} (\sigma '(x_{t-1}))W_{rec}\operatorname{diag} (\sigma '(x_{t-2}))\cdots W_{rec}\operatorname{diag} (\sigma '(x_{t-k}))\end{aligned}}$$

Since $|\sigma '|\leq 1$, the operator norm of the above multiplication is bounded above by $\|W_{rec}\|^{k}$. So if the spectral radius of $W_{rec}$ is $\gamma <1$, then at large $k$, the above multiplication has operator norm bounded above by $\gamma ^{k}\to 0$. This is the prototypical vanishing gradient problem.
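The decay of the Jacobian product can be checked numerically. The sketch below assumes a 2-neuron network with no input and no bias, uses illustrative weights `W_REC`, and measures the Frobenius norm, which is an upper bound on the operator norm and is used here only because it is easy to compute without extra libraries.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frobenius(a):
    return math.sqrt(sum(v * v for row in a for v in row))

def jacobian_product_norms(w_rec, x0, steps):
    """Run x_t = W_rec sigma(x_{t-1}) and return the Frobenius norms of the
    running product of the Jacobians W_rec diag(sigma'(x))."""
    n = len(x0)
    x = list(x0)
    prod = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    norms = []
    for _ in range(steps):
        s = [sigmoid(v) for v in x]
        # Jacobian of one step with respect to x: W_rec * diag(sigma'(x))
        jac = [[w_rec[i][j] * s[j] * (1.0 - s[j]) for j in range(n)]
               for i in range(n)]
        prod = matmul(jac, prod)  # the newest Jacobian multiplies on the left
        norms.append(frobenius(prod))
        x = [sum(w_rec[i][j] * s[j] for j in range(n)) for i in range(n)]
    return norms

W_REC = [[0.5, -0.3], [0.2, 0.4]]  # illustrative weights, Frobenius norm < 1
norms = jacobian_product_norms(W_REC, [0.1, -0.2], 30)
print(norms[0], norms[9], norms[29])
```

Under these assumptions the norm of the product of $k$ Jacobians stays below the $k$-th power of the norm of `W_REC` and shrinks towards zero as $k$ grows.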

The effect of a vanishing gradient is that the network cannot learn long-range effects. Recall Equation (loss differential):

$$\nabla _{\theta }L=\nabla _{x}L(x_{T},u_{1},...,u_{T})\left(\nabla _{\theta }F(x_{t-1},u_{t},\theta )+\nabla _{x}F(x_{t-1},u_{t},\theta )\nabla _{\theta }F(x_{t-2},u_{t-1},\theta )+\cdots \right)$$

The components of $\nabla _{\theta }F(x,u,\theta )$ are just components of $\sigma (x)$ and $u$, so if $u_{t},u_{t-1},...$ are bounded, then $\|\nabla _{\theta }F(x_{t-k-1},u_{t-k},\theta )\|$ is also bounded by some $M>0$, and so the terms in $\nabla _{\theta }L$ decay as $M\gamma ^{k}$. This means that, effectively, $\nabla _{\theta }L$ is affected only by the first $O(\gamma ^{-1})$ terms in the sum.

If $\gamma \geq 1$ , the above analysis does not quite work.^{[note 3]} For the prototypical exploding gradient problem, the next model is clearer.
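The counterexample in note 3 can be verified directly. This is a small sketch, assuming $\epsilon = 1$ and two illustrative values of $c$; the helper names `matmul2` and `power_of_wd` are hypothetical.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
            for i in range(2)]

def power_of_wd(eps, c, n_pairs):
    """Return (W_rec D)^(2 * n_pairs) for W_rec = [[0, 2], [eps, 0]], D = c * I."""
    wd = [[0.0, 2.0 * c], [eps * c, 0.0]]  # W_rec D = c * W_rec
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(2 * n_pairs):
        result = matmul2(result, wd)
    return result

# With eps = 1 the spectral radius of W_rec is sqrt(2) > 1 in both cases.
grow = power_of_wd(eps=1.0, c=0.9, n_pairs=20)   # 2 * eps * c^2 = 1.62 > 1
decay = power_of_wd(eps=1.0, c=0.5, n_pairs=20)  # 2 * eps * c^2 = 0.5 < 1
print(grow[0][0], decay[0][0])
```

The same $W_{rec}$ gives a product that blows up for one choice of $c$ and vanishes for the other, which is why the spectral radius alone does not settle the case $\gamma \geq 1$.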

Dynamical systems model

Following (Doya, 1993),^[6] consider this one-neuron recurrent network with sigmoid activation:

$$x_{t+1}=(1-\epsilon )x_{t}+\epsilon \sigma (wx_{t}+b)+\epsilon w'u_{t}$$

In the limit of small $\epsilon$, the dynamics of the network become

$${\frac {dx}{dt}}=-x(t)+\sigma (wx(t)+b)+w'u(t)$$

Consider first the autonomous case, with $u=0$. Set $w=5.0$, and vary $b$ in $[-3,-2]$. As $b$ decreases, the system has 1 stable point, then has 2 stable points and 1 unstable point, and finally has 1 stable point again. Explicitly, the fixed points (stable and unstable) are

$$(x,b)=\left(x,\ln \left({\frac {x}{1-x}}\right)-5x\right).$$
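The change in the number of equilibria can be reproduced with a short numerical search. The sketch below assumes the autonomous case $u=0$ and locates zeros of $-x+\sigma (wx+b)$ by a grid scan followed by bisection; the grid size and the function name `fixed_points` are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fixed_points(w, b, grid=1000):
    """Find fixed points of dx/dt = -x + sigmoid(w*x + b) on (0, 1).
    Returns a list of (x, is_stable) pairs in increasing order of x."""
    def g(x):
        return -x + sigmoid(w * x + b)

    points = []
    # Cell-centred grid, so an exact root such as x = 0.5 is never a node.
    xs = [(i + 0.5) / grid for i in range(grid)]
    for lo, hi in zip(xs, xs[1:]):
        if g(lo) * g(hi) < 0:
            for _ in range(60):  # bisection on the bracketing interval
                mid = 0.5 * (lo + hi)
                if g(lo) * g(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            x = 0.5 * (lo + hi)
            # At a fixed point sigmoid(w*x + b) = x, so g'(x) = w*x*(1-x) - 1;
            # the point is stable when this slope is negative.
            points.append((x, w * x * (1.0 - x) - 1.0 < 0))
    return points

for b in (-2.0, -2.5, -3.0):
    print(b, fixed_points(5.0, b))
```

For $w=5$ this search finds one stable point at $b=-2$ and at $b=-3$, and two stable points separated by an unstable one at $b=-2.5$.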

Now consider ${\frac {\Delta x(T)}{\Delta x(0)}}$ and ${\frac {\Delta x(T)}{\Delta b}}$ , where $T$ is large enough that the system has settled into one of the stable points.

If $(x(0),b)$ puts the system very close to an unstable point, then a tiny variation in $x(0)$ or $b$ would make $x(T)$ move from one stable point to the other. This makes ${\frac {\Delta x(T)}{\Delta x(0)}}$ and ${\frac {\Delta x(T)}{\Delta b}}$ both very large, a case of the exploding gradient.

If $(x(0),b)$ puts the system far from an unstable point, then a small variation in $x(0)$ would have no effect on $x(T)$ , making ${\frac {\Delta x(T)}{\Delta x(0)}}=0$ , a case of the vanishing gradient.
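Both regimes can be observed with a finite-difference experiment on the discrete-time network. This is a minimal sketch, assuming $\epsilon =0.1$, $T=3000$ steps, $u=0$ and a perturbation of $10^{-6}$; these values and the names `run` and `sensitivity` are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run(x0, b, w=5.0, eps=0.1, steps=3000):
    """Iterate x_{t+1} = (1 - eps) * x_t + eps * sigmoid(w * x_t + b), u = 0."""
    x = x0
    for _ in range(steps):
        x = (1.0 - eps) * x + eps * sigmoid(w * x + b)
    return x

def sensitivity(x0, b, delta=1e-6):
    """Central finite-difference estimate of Delta x(T) / Delta x(0)."""
    return (run(x0 + delta, b) - run(x0 - delta, b)) / (2.0 * delta)

near = sensitivity(0.5, -2.5)  # x(0) sits on the unstable point
far = sensitivity(0.9, -2.5)   # x(0) lies well inside one attractor basin
print(near, far)
```

Starting on the unstable point, the two perturbed runs end in different attractors and the ratio is of order $10^{5}$; starting far from it, both runs reach the same attractor and the ratio is numerically zero.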

Note that in this case, ${\frac {\Delta x(T)}{\Delta b}}\approx {\frac {\partial x(T)}{\partial b}}=\left({\frac {1}{x(T)(1-x(T))}}-5\right)^{-1}$ neither decays to zero nor blows up to infinity. Indeed, it is the only well-behaved gradient, which explains why early research focused on learning or designing recurrent network systems that could perform long-range computations (such as outputting the first input they see at the very end of an episode) by shaping their stable attractors.^[7]

For the general case, the intuition still holds (^[5] Figures 3, 4, and 5).

Geometric model

Continue using the above one-neuron network, fixing $w=5,x(0)=0.5,u(t)=0$ , and consider a loss function defined by $L(x(T))=(0.855-x(T))^{2}$ . This produces a rather pathological loss landscape: as $b$ approaches $-2.5$ from above, the loss approaches zero, but as soon as $b$ crosses $-2.5$ , the attractor basin changes, and the loss jumps to 0.50.^{[note 4]}

Consequently, attempting to train $b$ by gradient descent would "hit a wall in the loss landscape" and cause an exploding gradient. A slightly more complex situation is plotted in ^[5], Figure 6.
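The jump in the loss can be reproduced by simulating the network on either side of $b=-2.5$. The sketch below assumes the discrete-time form of the network with $\epsilon =0.1$ and $T=5000$ steps; these two values and the function name `loss` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(b, x0=0.5, w=5.0, eps=0.1, steps=5000, target=0.855):
    """Loss (target - x(T))^2 after iterating the one-neuron network, u = 0."""
    x = x0
    for _ in range(steps):
        x = (1.0 - eps) * x + eps * sigmoid(w * x + b)
    return (target - x) ** 2

for b in (-2.4, -2.49, -2.499, -2.501, -2.51, -2.6):
    print(b, loss(b))
```

Just above $b=-2.5$ the state settles near the upper attractor and the loss is close to zero; just below, it settles near the lower attractor and the loss is about 0.5, with no gradual transition in between.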

Solutions

To overcome this problem, several methods were proposed.

Batch normalization

Batch normalization is a standard method for solving both the exploding and the vanishing gradient problems.^[8]^[9]

Multi-level hierarchy

Jürgen Schmidhuber's multi-level hierarchy of networks (1992) is pre-trained one level at a time through unsupervised learning and then fine-tuned through backpropagation.^[10] Here each level learns a compressed representation of the observations that is fed to the next level.

Related approach

Similar ideas have been used in feed-forward neural networks for unsupervised pre-training to structure a neural network, making it first learn generally useful

log likelihood of the data, thus improving the model, if trained properly. Once sufficiently many layers have been learned the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.^[11] Hinton reports that his models are effective feature extractors over high-dimensional, structured data.^[12]

Long short-term memory

Another technique particularly used for

ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.^[14]^[15]

Faster hardware

Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) has increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way"^[16] since the original models tackling the vanishing gradient problem by Hinton and others were trained on a Xeon processor, not GPUs.^[11]

Residual networks

One of the newest and most effective ways to resolve the vanishing gradient problem is with residual neural networks,^[17] or ResNets (not to be confused with recurrent neural networks). ResNets refer to neural networks where skip connections or residual connections are part of the network architecture. These skip connections allow gradient information to pass through the layers, by creating "highways" of information, where the output of a previous layer/activation is added to the output of a deeper layer. This allows information from the earlier parts of the network to be passed to the deeper parts of the network, helping maintain signal propagation even in deeper networks. Skip connections are a critical component of what allowed successful training of deeper neural networks.
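The effect of a skip connection on the gradient can be seen in a scalar toy model. This is a minimal sketch, assuming each layer is either $x\mapsto \tanh(wx)$ or its residual form $x\mapsto x+\tanh(wx)$ with an illustrative shared weight; it is not the architecture of the cited paper.

```python
import math

def input_gradient(n, w=0.5, x0=0.3, residual=False):
    """Derivative of the output of n stacked scalar layers with respect to x0."""
    x, grad = x0, 1.0
    for _ in range(n):
        branch = math.tanh(w * x)
        d_branch = w * (1.0 - branch ** 2)  # derivative of the tanh branch
        if residual:
            grad *= 1.0 + d_branch  # the identity path contributes the 1
            x = x + branch
        else:
            grad *= d_branch
            x = branch
    return grad

print(input_gradient(30), input_gradient(30, residual=True))
```

In the plain chain every factor is at most 0.5, so the product collapses; with the skip connection every factor is at least 1, so the gradient with respect to the input cannot shrink in this toy setting.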

ResNets yielded lower training error (and test error) than their shallower counterparts simply by reintroducing outputs from shallower layers in the network to compensate for the vanishing data.^[17] Note that ResNets are an ensemble of relatively shallow nets and do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network – rather, they avoid the problem simply by constructing ensembles of many short networks together. (Ensemble by Construction^[18])

Other activation functions

Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.^[19]
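The difference can be illustrated by multiplying activation derivatives along a path of positive pre-activations. The values in `PRE` below are made up for illustration, and the comparison uses the logistic sigmoid as the saturating activation.

```python
import math

def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never exceeds 0.25

def relu_prime(z):
    return 1.0 if z > 0 else 0.0  # exactly 1 on the active side

def derivative_product(prime, pre_activations):
    """Product of activation derivatives along one path through the network."""
    prod = 1.0
    for z in pre_activations:
        prod *= prime(z)
    return prod

PRE = [0.5, 2.0, 4.0, 1.0, 3.0, 6.0, 0.2, 5.0]  # illustrative pre-activations
print(derivative_product(relu_prime, PRE), derivative_product(sigmoid_prime, PRE))
```

For positive inputs the ReLU factors are all 1, while each sigmoid factor is at most 0.25; for negative inputs the ReLU derivative is 0, which is the one direction in which it saturates.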

Weight initialization

Weight initialization is another approach that has been proposed to reduce the vanishing gradient problem in deep networks.

Recently, Yilmaz and Poli^[21] performed a theoretical analysis on how gradients are affected by the mean of the initial weights in deep neural networks using the logistic activation function and found that gradients do not vanish if the mean of the initial weights is set according to the formula $\max(-1,-8/N)$. This simple strategy allows networks with 10 or 15 hidden layers to be trained very efficiently and effectively using the standard backpropagation.

Other

Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid^[22] to solve problems like image reconstruction and face localization.^{[citation needed]}
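A sign-based update is insensitive to how small the gradient is. The sketch below is a simplified Rprop variant without weight backtracking, using the commonly quoted step factors 1.2 and 0.5 as assumptions; it is not Behnke's implementation, and the test function `tiny_grad` is invented for illustration.

```python
def sign(v):
    return (v > 0) - (v < 0)

def rprop_minimize(grad, x0, iters=200, step=0.1, up=1.2, down=0.5,
                   step_min=1e-6, step_max=50.0):
    """Minimise a 1-D function using only the sign of its gradient."""
    x, prev_g = x0, 0.0
    for _ in range(iters):
        g = grad(x)
        if g * prev_g > 0:    # same sign as last time: grow the step
            step = min(step * up, step_max)
        elif g * prev_g < 0:  # sign flipped: the minimum was overshot
            step = max(step * down, step_min)
        x -= sign(g) * step
        prev_g = g
    return x

def gd_minimize(grad, x0, iters=200, lr=0.1):
    """Plain gradient descent, for comparison."""
    x = x0
    for _ in range(iters):
        x -= lr * grad(x)
    return x

def tiny_grad(x):
    """Gradient of 1e-6 * (x - 3)^2, a deliberately vanishing signal."""
    return 2e-6 * (x - 3.0)

print(rprop_minimize(tiny_grad, 0.0), gd_minimize(tiny_grad, 0.0))
```

With the same number of iterations, the sign-based update reaches the minimum at 3 while plain gradient descent barely moves from its starting point.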

Neural networks can also be optimized by using a universal search algorithm on the space of the neural network's weights, e.g., random guessing or, more systematically, a genetic algorithm. This approach is not based on gradients and avoids the vanishing gradient problem.^[23]

Notes

^ A more general loss function could depend on the entire sequence of outputs, as $L(x_{1},...,x_{T},u_{1},...,u_{T})=\sum _{t=1}^{T}{\mathcal {E}}(x_{t},u_{1},...,u_{t})$ for which the problem is the same, just with more complex notations.
^ Any activation function works, as long as it is differentiable with bounded derivative.
^ Consider $W_{rec}={\begin{bmatrix}0&2\\\epsilon &0\end{bmatrix}}$ and $D={\begin{bmatrix}c&0\\0&c\end{bmatrix}}$ , with $\epsilon >{\frac {1}{2}}$ and $c\in (0,1)$ . Then $W_{rec}$ has spectral radius ${\sqrt {2\epsilon }}>1$ , and $(W_{rec}D)^{2N}=(2\epsilon \cdot c^{2})^{N}I_{2\times 2}$ , which might go to infinity or zero depending on choice of $c$ .
^ This is because at $b=-2.5$ , the two stable attractors are $x=0.145,0.855$ , and the unstable attractor is $x=0.5$ .

References

^ S2CID 219792172.
^ Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PDF) (Diplom thesis). Institut f. Informatik, Technische Univ. Munich.
^ ISBN 0-7803-5369-2.
^ S2CID 6831636.
^ arXiv:1211.5063 [cs.LG].
^ S2CID 15069221.
^ S2CID 206457500.
^ arXiv:1502.03167.
^ Santurkar, Shibani; Tsipras, Dimitris; Ilyas, Andrew; Madry, Aleksander (2018). "How Does Batch Normalization Help Optimization?". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc.
^ Schmidhuber, J. (1992). "Learning complex, extended sequences using the principle of history compression". Neural Computation. 4: 234–242.
^ S2CID 2309950.
^ doi:10.4249/scholarpedia.5947.
^ S2CID 1915014.
^ Graves, Alex; and Schmidhuber, Jürgen; Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks, in Bengio, Yoshua; Schuurmans, Dale; Lafferty, John; Williams, Chris K. I.; and Culotta, Aron (eds.), Advances in Neural Information Processing Systems 22 (NIPS'22), December 7th–10th, 2009, Vancouver, BC, Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.
^ S2CID 14635907.
^ S2CID 11715509.
^ ISBN 978-1-4673-8851-1.
^ arXiv:1605.06431 [cs.CV].
^ Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (14 June 2011). "Deep Sparse Rectifier Neural Networks". PMLR: 315–323.
^ Kumar, Siddharth Krishna (2017). "On weight initialization in deep neural networks". arXiv:1704.08863.
^ S2CID 249487697.
^ Behnke, Sven (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.
^ "Sepp Hochreiter's Fundamental Deep Learning Problem (1991)". people.idsia.ch. Retrieved 7 January 2017.


