Wednesday, October 21, 2020

Better Deep Learning - Jason Brownlee - Chapter 8

Introduction

  • This post is a summary of my notes on the excellent book by Jason Brownlee, "Better Deep Learning".
  • Only Chapter 8, which addresses fixing exploding gradients with gradient clipping, is explored.
  • This is a follow-up to the previous post on fixing the vanishing gradient problem.

The exploding gradient problem

  • Training a neural network can become unstable given the choice of the error function, learning rate, or even the scale of the target variable. Large updates to weights during training can cause a numerical overflow or underflow, often referred to as exploding gradients. The problem of exploding gradients is more common with recurrent neural networks, such as LSTMs, given the accumulation of gradients unrolled over hundreds of input time steps. A common and relatively easy solution to the exploding gradients problem is to change the derivative of the error before propagating it backward through the network and using it to update the weights. Two approaches include rescaling the gradients given a chosen vector norm and clipping gradient values that exceed a preferred range. Together, these methods are referred to as gradient clipping.
  • Neural networks are trained using the stochastic gradient descent optimization algorithm. This requires first the estimation of the loss on one or more training examples, then the calculation of the derivative of the loss, which is propagated backward through the network in order to update the weights. Weights are updated using a fraction of the back-propagated error, controlled by the learning rate. It is possible for the updates to the weights to be so large that the weights either overflow or underflow their numerical precision. In practice, the weights can take the value of NaN (not a number) or Inf (infinity) when they overflow or underflow, and for practical purposes the network will be useless from that point forward, forever predicting NaN values as signals flow through the invalid weights.
  • Some common causes of exploding gradients:
    • poor choice of learning rate that results in large weight updates
    • poor choice of data preparation, allowing large differences in the target variable
    • poor choice of loss function, allowing the calculation of large error values
  • Exploding gradients can be avoided in general by careful configuration of the network model, such as choice of a small learning rate, scaled target variables, and a standard loss function.
  • A common solution to exploding gradients is to change the error derivative before propagating it backward through the network and using it to update the weights.
  • There are two main methods for updating the error derivative; they are:
    • Gradient scaling
    • Gradient clipping
  • Gradient scaling involves normalizing the error gradient vector such that the vector norm (magnitude) equals a defined value, such as 1.0.
  • Gradient clipping involves forcing the gradient values (element-wise) to a specific minimum or maximum value if the gradient exceeds an expected range. Together, these methods are often simply referred to as gradient clipping (a small illustration of the difference follows this list).
  • It is common to use the same gradient clipping configuration for all layers of the network.
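  • To make the difference between the two methods concrete, here is a small NumPy illustration; the gradient vector and the thresholds are made up for the example. Norm scaling preserves the direction of the gradient while shrinking its magnitude, whereas value clipping truncates each element independently.

import numpy as np

g = np.array([3.0, -4.0, 12.0])  # made-up example gradient vector, L2 norm = 13.0

# gradient norm scaling: rescale the whole vector if its L2 norm exceeds a threshold
threshold = 1.0
norm = np.linalg.norm(g)
g_scaled = g * (threshold / norm) if norm > threshold else g

# gradient value clipping: force each element into [-0.5, 0.5]
g_clipped = np.clip(g, -0.5, 0.5)

print(g_scaled)   # direction preserved, magnitude reduced to 1.0
print(g_clipped)  # each element truncated independently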

Gradient Clipping Case Study

  • We create a regression predictive modeling problem using the standard regression problem generator provided by the scikit-learn library, the make_regression() function (a sketch of this step is shown below):
Target variable of the regression problem
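  • A minimal sketch of this step, assuming illustrative make_regression() parameters (the sample count, feature count, noise level, and random seed are not given in the original post):

from sklearn.datasets import make_regression
from matplotlib import pyplot

# generate a regression problem whose target variable is left unscaled
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# histogram of the target variable to inspect its spread
pyplot.hist(y)
pyplot.title('Target variable of the regression problem')
pyplot.show()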

  • Without scaling any data, we now create an MLP and get the following results (a sketch of the model follows below):
MLP with exploding gradient
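  • The model is roughly the following sketch, assuming a small Keras MLP; the train/test split, layer sizes, initializer, and learning rate are illustrative assumptions:

from sklearn.datasets import make_regression
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# unscaled data: the target variable has a very large spread
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
trainX, testX, trainy, testy = X[:500], X[500:], y[:500], y[500:]

# simple MLP for regression
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))

# plain SGD: with the unscaled target, the loss typically overflows to nan
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
print(model.evaluate(trainX, trainy, verbose=0))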
  • We update the previous model with gradient norm scaling. This consists of setting the clipnorm argument on the optimizer (a sketch follows below):
Mean Squared Error Loss with gradient norm scaling
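  • A sketch of the change, reusing the model above; the threshold of 1.0 is an illustrative choice:

# only the optimizer changes: clipnorm=1.0 rescales the whole gradient vector
# whenever its L2 norm exceeds 1.0
opt = SGD(learning_rate=0.01, momentum=0.9, clipnorm=1.0)
model.compile(loss='mean_squared_error', optimizer=opt)
model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)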


  • Another solution to address the exploding gradient problem is to clip the gradient values if they become too large or too small (a sketch follows below):
Mean Squared Error Loss with gradient clipping
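  • A sketch of gradient value clipping on the same model; the threshold of 0.5 is an illustrative assumption:

# clipvalue=0.5 forces every gradient element into the range [-0.5, 0.5]
opt = SGD(learning_rate=0.01, momentum=0.9, clipvalue=0.5)
model.compile(loss='mean_squared_error', optimizer=opt)
model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)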


Conclusion

  • By simply adding the clipvalue or clipnorm argument to the optimization algorithm, the examples above demonstrate how the exploding gradient problem can be removed.



