Thursday, November 12, 2020

Better Deep Learning - Jason Brownlee - Cyclic Learning Rate and Snapshot Ensembles

 Preamble

  • This blog post is an extract of the book "Better Deep Learning" by Jason Brownlee. 
  • This blog post is related to "Better Predictions": how to make better predictions using a Cyclic Learning Rate and Snapshot Ensembles (chapter 24).
  • I write this series of blog posts because it is a way for me to memorize by writing and also to experiment with the pieces of code provided in the book.
  • This book, as well as the other series of books by Jason Brownlee, has been very helpful in my Deep Learning learning curve. 
  • The information related to neural networks is complex and Jason Brownlee digests it for you in a didactic, pragmatic way, with concrete examples.

Chapter 24: Cyclic Learning Rate and Snapshot Ensembles

  • Model ensembles can achieve lower generalization error than single models but are challenging to develop with deep learning neural networks given the computational cost of training each single model.
  • An alternative is to train multiple model snapshots during a single training run and combine their predictions to make an ensemble prediction. A limitation of this approach is that the saved models will be similar, resulting in similar predictions and prediction errors, and not offering much benefit from combining their predictions.
  • Effective ensembles require a diverse set of skillful ensemble members that have differing distributions of prediction errors. One approach to promoting a diversity of models saved during a single training run is to use an aggressive learning rate schedule that forces large changes in the model weights and, in turn, the nature of the model saved at each snapshot.
  • Snapshot ensembles combine the predictions from multiple models saved during a single training run.
  • Diversity in model snapshots can be achieved by aggressively cycling the learning rate during a single training run.
  • One approach to ensemble learning for deep learning neural networks is to collect multiple models from a single training run.
  • A key benefit of ensemble learning is improved performance compared to the predictions of single models.
  • A limitation of collecting multiple models during a single training run is that the models may be good, but too similar. This can be addressed by changing the learning algorithm for the deep neural network to force the exploration of different network weights during a single training run, which results, in turn, in models that have differing performance. One way this can be achieved is by aggressively changing the learning rate used during training. 
  • An approach to systematically and aggressively changing the learning rate during training in order to obtain different network weights is referred to as Stochastic Gradient Descent with Warm Restarts, or SGDR for short. This approach involves systematically changing the learning rate over training epochs using a schedule called cosine annealing.
  • The cosine annealing method has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being dramatically increased again. The model weights are subjected to dramatic changes during training, which has the effect of using "good weights" as the starting point for the subsequent learning rate cycle while allowing the learning algorithm to converge to a different solution.
  • "We let SGD converge M times to local minima along its optimization path. Each time the model converges, we save the weights and add the corresponding network to our ensemble. We then restart the optimization with a large learning rate to escape the current local minimum." The original paper on the method includes a nice graph that illustrates the process graphically; a minimal sketch of the schedule itself is given below.
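
To make the schedule concrete, here is a minimal sketch of the cosine annealing formula as a standalone Python function, assuming the simplified per-epoch form commonly used for snapshot ensembles (the function and argument names are illustrative, not taken from the book):

    from math import cos, floor, pi

    def cosine_annealing(epoch, n_epochs, n_cycles, lr_max):
        # position of this epoch within its cycle, scaled to [0, pi]
        epochs_per_cycle = floor(n_epochs / n_cycles)
        cos_inner = (pi * (epoch % epochs_per_cycle)) / epochs_per_cycle
        # decays from lr_max towards zero, then restarts at the next cycle
        return lr_max / 2 * (cos(cos_inner) + 1)

    # example: 100 epochs split into 5 cycles, maximum learning rate of 0.01
    for epoch in [0, 10, 19, 20, 99]:
        print(epoch, round(cosine_annealing(epoch, 100, 5, 0.01), 5))

With this schedule the learning rate jumps back to lr_max at the start of each cycle (epoch 20 in the example) and decays close to zero by the end of each cycle (epoch 19).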

Case Study

  • The first step is to establish a baseline for a multi-class classification problem. We will then be able to compare the baseline with the Snapshot Ensembles.
  • The multi-class classification problem (a sketch of how such a dataset can be generated is shown after the figure below):

A multi-class classification problem
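
For reference, here is a minimal sketch of how such a multi-class dataset can be generated and plotted with scikit-learn's make_blobs; the specific settings (1000 samples, three classes, two input features) are assumptions for illustration:

    from numpy import where
    from sklearn.datasets import make_blobs
    from matplotlib import pyplot

    # generate a 2D dataset with 3 partially overlapping classes
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2,
                      cluster_std=2, random_state=2)

    # scatter plot of the samples, coloured by class value
    for class_value in range(3):
        row_ix = where(y == class_value)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))
    pyplot.legend()
    pyplot.show()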

  • The baseline MLP for the multi-class classification problem (a sketch of the model is given after the figure below):

Model Accuracy on Train and Test Dataset over Each Training Epoch
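
Here is a minimal sketch of what a baseline MLP for this problem could look like in Keras, trained with plain SGD and a fixed learning rate; the layer sizes, learning rate and number of epochs are illustrative assumptions:

    from sklearn.datasets import make_blobs
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.optimizers import SGD
    from tensorflow.keras.utils import to_categorical

    # prepare the dataset: 3 classes, one-hot encoded targets, 50/50 split
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2,
                      cluster_std=2, random_state=2)
    y = to_categorical(y)
    n_train = 500
    trainX, testX = X[:n_train], X[n_train:]
    trainy, testy = y[:n_train], y[n_train:]

    # define and fit a small MLP with a fixed learning rate
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer=SGD(learning_rate=0.01, momentum=0.9),
                  metrics=['accuracy'])
    model.fit(trainX, trainy, validation_data=(testX, testy),
              epochs=200, verbose=0)

    # evaluate the baseline on the train and test sets
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
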
  • The cosine annealing schedule is an example of an aggressive learning rate schedule where the learning rate starts high and is dropped relatively rapidly to a minimum value near zero before being increased again to the maximum.
  • So the next step is to evaluate the impact of the cosine annealing learning rate schedule on the MLP (a sketch of a possible callback implementation is given after the figure below):

MLP with Cosine Annealing Learning Schedule
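
Here is a minimal sketch of how the cosine annealing schedule can be wired into training as a custom Keras callback that also saves a model snapshot at the end of each cycle; the class name, file names and hyperparameters are illustrative:

    from math import cos, floor, pi
    from tensorflow.keras import backend
    from tensorflow.keras.callbacks import Callback

    class SnapshotEnsemble(Callback):
        # cosine annealing with a snapshot saved at the end of each cycle
        def __init__(self, n_epochs, n_cycles, lr_max):
            super().__init__()
            self.epochs = n_epochs
            self.cycles = n_cycles
            self.lr_max = lr_max

        def cosine_annealing(self, epoch):
            # learning rate for this epoch, restarting at lr_max every cycle
            epochs_per_cycle = floor(self.epochs / self.cycles)
            cos_inner = (pi * (epoch % epochs_per_cycle)) / epochs_per_cycle
            return self.lr_max / 2 * (cos(cos_inner) + 1)

        def on_epoch_begin(self, epoch, logs=None):
            # update the optimizer's learning rate before the epoch starts
            backend.set_value(self.model.optimizer.lr, self.cosine_annealing(epoch))

        def on_epoch_end(self, epoch, logs=None):
            # save a snapshot at the end of every cycle
            epochs_per_cycle = floor(self.epochs / self.cycles)
            if (epoch + 1) % epochs_per_cycle == 0:
                cycle = int((epoch + 1) / epochs_per_cycle)
                self.model.save('snapshot_model_%d.h5' % cycle)

    # usage: 400 epochs split into 10 cycles of 40 epochs each
    # ca = SnapshotEnsemble(n_epochs=400, n_cycles=10, lr_max=0.01)
    # model.fit(trainX, trainy, epochs=400, verbose=0, callbacks=[ca])

Depending on the Keras version, setting the learning rate may need to go through model.optimizer.learning_rate instead of model.optimizer.lr.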

  • The final step is to evaluate the performance of the Snapshot Ensemble models. We will compare the ensemble against the single snapshot models (a sketch of the ensemble prediction is given after the figure below).

Snapshot Ensemble Performance
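
Here is a minimal sketch of how the saved snapshots could be loaded and combined by summing their softmax predictions; the file names match the hypothetical callback above and the helper names are illustrative:

    from numpy import argmax, array
    from tensorflow.keras.models import load_model

    def load_snapshot_members(n_cycles):
        # load the models saved at the end of each cosine annealing cycle
        return [load_model('snapshot_model_%d.h5' % i) for i in range(1, n_cycles + 1)]

    def ensemble_predictions(members, testX):
        # sum the softmax probabilities across members, then take the argmax
        yhats = array([model.predict(testX, verbose=0) for model in members])
        summed = yhats.sum(axis=0)
        return argmax(summed, axis=1)

    # usage (assuming 10 snapshots were saved and testX / testy exist):
    # members = load_snapshot_members(10)
    # yhat = ensemble_predictions(members[-5:], testX)  # last 5 snapshots
    # acc = (yhat == argmax(testy, axis=1)).mean()
    # print('Ensemble accuracy: %.3f' % acc)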

  • End result of the experiment: the snapshot ensemble achieved an accuracy of 81.9% vs a baseline accuracy of 81%.

Conclusion

Snapshot Ensembles make it possible to get the benefit of an ensemble for the cost of a single training run: by aggressively cycling the learning rate with cosine annealing, the models saved at the end of each cycle are diverse enough that combining their predictions improves on the single-model baseline (81.9% vs 81% accuracy in this experiment).