Introduction
- This post is a summary of my notes related to the excellent book from Jason Brownlee, "Better Deep Learning".
- Only Chapter 8, related to fix exploding gradient with gradient clipping is explored.
- This is a followup of previous post related to the fixing of vanishing gradient.
Valmestroff, Lorraine, France
The exploding gradient problem
- Training a neural network can become unstable given the choice of the error function, learning rate, or even the scale of the target variable. Large updates to weights during training can cause a numerical overflow or underflow often referred as exploding gradients. The problem of exploding gradients is more common with recurrent neural networks, such as LSTM given the accumulation of gradients unrolled or over hundreds of input time steps. A common and relatively easy solution to the exploding gradients problems is to change the derivative of the error before propagating it backward through the network and using it to update the weights. Two approaches include rescaling the gradients given a chosen vector norm and clipping gradient values that exceed a preferred range. Together, these methods are referred to as gradient clipping.
- Neural networks are trained using the stochastic gradient descent optimization algorithm. This requires first the estimation of the loss on one or more training examples, then the calculation of the derivative of the loss, which is propagated backward through the network in order to update the weights. Weights are updated using a fraction of the back propagated error control by the learning rate. It is possible for the updates to the weights to be so large that the weights either overflow or underflow their numerical precision. In practice, the weights can take the value of an NaN (not a number) of Inf (infinity) when they overflow or underflow and for practical purposes the network will be useless from that point forward, forever predicting NaN values as signals flow through the invalide weights.
- Some case cases of exploding gradients:
- poor choice of learning rate that results in large weights updates
- poor choice of data preparation, allowing large differences in the target variable
- poor choice of loss function, allowing the calculation of large error values
- Exploding gradients can be avoided in general by careful configuration of the network model, such as choice of a small learning rate, scaled target variables, and a standard loss function.
- A common solution to exploding gradients is ti change the error derivative before propagating it backward through the network and using it to update the weights.
- There are two main methods for updating the error derivative; they are:
- Gradient scaling
- Gradient clipping
- Gradient scaling involves normalizing the error gradient vector such that vector norm (magnitude) equals a defined value, such as 10.
- Gradient clipping involves forcing the gradient values (element-wise) to a specific minimum or maximum value if the gradient exceeded an expected range. Together, these methods are often simply referred to as gradient clipping.
- It is common to use the same gradient clipping configuration for all layers of the network.
Gradient Clipping Case Study
- We create a regression predictive problem using a standard regression problem generator provided by the scikit-learn library in the make_regression() function:
Target variable of the regression problem
- Without scaling any data, we create now a MLP and we get the following results:
MLP with exploding gradient
- We update the previous model with gradient norm scaling. This consists at setting clipnorm argument on the optimizer:
Mean Squared Error Loss with gradient norm scaling
- An other solution to address the exploding gradient problem is to clip the gradient if it becomes too large or to small:
Mean Squared Error Loss with gradient clipping
Conclusion
- By simply adding the clipvalue or clipnorm argument in the optimization algorithm, the example above has demonstrated the ability to remove the exploding gradient problem.