Preamble
- This blog post is an extract from the book "Better Deep Learning" by Jason Brownlee.
- This blog post is related to "Better Predictions": Combine Model Parameters with Average Model Weights Ensemble (Chapter 26)
Chapter 26: Combine Model Parameters with Average Model Weights Ensemble
- The model at the end of a training run may not be stable, or the best-performing set of weights may not be usable as a final model.
- One approach to address this problem is to use an average of the weights from multiple models seen toward the end of the training run. This is called Polyak-Ruppert averaging and can be further improved by using a linearly or exponentially decreasing weighted average of the model weights. In addition to producing a more stable model, the averaged model weights can also give better performance.
- Learning the weights for a deep neural network model requires solving a high-dimensional non-convex optimization problem. A challenge with solving this optimization problem is that there are many good solutions, and it is possible for the learning algorithm to bounce around and fail to settle on one. In the area of stochastic optimization, this is referred to as a problem with the convergence of the optimization algorithm on a solution, where a solution is defined by a set of specific weight values.
- "Polyak averaging consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm."
- "The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near of the bottom of the valley. The average of all the locations on either side should be close to the bottom of the valley though" - Deep Learning.
- The simplest implementation of Polyak-Ruppert averaging involves calculating the average of the weights of the models over the last few training epochs.
- This can be improved by calculating a weighted average, where more weight is applied to more recent models and is linearly decreased through prior epochs. An alternative and more widely used approach is to use an exponential decay in the weighted average.
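As a rough illustration of these weighting schemes (not code from the book), the per-model coefficients could be computed as follows. The number of saved models and the decay rate alpha are assumptions made for the example, and the saved models are taken to be ordered from oldest to most recent:

```python
import numpy as np

n_models = 10  # assumed number of saved models, ordered oldest to most recent

# Equal weighting: plain Polyak-Ruppert averaging.
equal = np.ones(n_models) / n_models

# Linearly decreasing weighting: the most recent model contributes the most.
linear = np.arange(1, n_models + 1, dtype=float)
linear /= linear.sum()

# Exponentially decreasing weighting, with an assumed decay rate alpha.
alpha = 2.0
exponential = np.exp(alpha * np.arange(n_models, dtype=float) / n_models)
exponential /= exponential.sum()

print(equal)
print(linear)
print(exponential)
```

Whichever scheme is used, the coefficients sum to one, so the averaged weights stay on the same scale as those of the individual models.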
Case Study
- Multiclass Classification problem
Plot of Multiclass Classification Samples
- Multilayer Perceptron Model: in the problem suggested by the author, the training dataset is relatively small. There is a 10:1 ratio of examples in the holdout dataset to the training dataset. This mimics a situation where we may have a vast number of unlabeled examples and a small number of labeled examples with which to train a model. We will create 1100 data points; the model will be trained on the first 100 points, and the remaining 1000 will be held back in a test dataset, unavailable to the model.
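The setup described above might look like the following minimal sketch. It assumes scikit-learn's make_blobs for a 3-class dataset and a small Keras MLP; the exact hyperparameters (25 hidden units, the SGD settings, the random seed) are illustrative choices, not necessarily the author's:

```python
from sklearn.datasets import make_blobs
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.utils import to_categorical

# Generate 1100 samples of a 2D, 3-class classification problem.
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
y = to_categorical(y)

# 100 training examples, 1000 held-out test examples (10:1 holdout-to-train ratio).
trainX, testX = X[:100], X[100:]
trainy, testy = y[:100], y[100:]

# A small MLP for the 3-class problem.
model = Sequential()
model.add(Dense(25, input_dim=2, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=0.01, momentum=0.9),
              metrics=['accuracy'])
```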
Plot of Learning Curves of Accuracy on Train and Test Datasets
- The preliminary step before working on a model weights ensemble consists of saving the model weights to file during training, and later combining the weights from the saved models in order to make a final model.
- So, after this first step that establishes a baseline, the author suggests creating a new model from multiple existing models with the same architecture.
- Each model has a get_weights() function that returns a list of NumPy arrays holding the weights and biases of each layer in the model.
- After saving the models, loading them back, and setting the new model's weights to the average of their weights, we are ready to make some predictions, as sketched below.
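Continuing from the dataset and model sketch above, the workflow of the last few bullets could look like this. The file names, the number of epochs, and the number of saved models (n_save) are assumptions made for the example; equal coefficients are used here, and the linear or exponential ones from the earlier sketch can be swapped in for a weighted average:

```python
import numpy as np
from tensorflow.keras.models import load_model

# 1. Train epoch by epoch and save the model to file over the last n_save epochs.
n_epochs, n_save = 500, 10
for epoch in range(n_epochs):
    model.fit(trainX, trainy, epochs=1, verbose=0)
    if epoch >= n_epochs - n_save:
        model.save('model_%03d.h5' % epoch)

# 2. Reload the saved models, ordered from oldest to most recent.
members = [load_model('model_%03d.h5' % epoch)
           for epoch in range(n_epochs - n_save, n_epochs)]

# 3. Average the weight arrays across models (equal coefficients here).
coeffs = np.ones(len(members)) / len(members)
all_weights = [m.get_weights() for m in members]  # one list of arrays per model
avg_weights = [np.average(np.array(arrays), axis=0, weights=coeffs)
               for arrays in zip(*all_weights)]

# 4. Build the final model: same architecture, averaged weights.
ensemble = load_model('model_%03d.h5' % (n_epochs - 1))
ensemble.set_weights(avg_weights)

# 5. Make predictions and evaluate on the held-out test set.
_, test_acc = ensemble.evaluate(testX, testy, verbose=0)
print('Averaged-weights model test accuracy: %.3f' % test_acc)
```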
Predictions with an Average Model Weight Ensemble
Predictions with a Linear Weighted Average Ensemble
Accuracy of a Linear Weighted Average Ensemble - I notice that the linearly weighted average ensemble is not better than the equally weighted average model ensemble. This is not what I expected. The results do vary given the stochastic nature of the learning algorithm.
Predictions with an Exponentially Decreasing Average Ensemble
- Again, I notice that this weighted average ensemble is not better than the equally weighted average model ensemble. This is not what I expected. The results do vary given the stochastic nature of the learning algorithm.
- Creating a model with the average of the weights from models observed towards the end of a training run can result in a more stable and sometimes better-performing solution.
- This was the last chapter of the "Better Deep Learning" book by Jason Brownlee.
- I am very grateful to Jason Brownlee, as this book helped me greatly in my understanding of how to tune neural networks.