Saturday, November 14, 2020

Better Deep Learning - Jason Brownlee - Model Weights Ensembles

 Preamble

  • This blog post is an extract from the book "Better Deep Learning" by Jason Brownlee. 
  • This blog post is related to "Better Predictions": Combine Model Parameters with an Average Model Weights Ensemble (Chapter 26)

Chenonceaux

Chapter 26: Combine Model Parameters with Average Model Weights Ensemble

  • The model at the end of a training run may not be a stable or best-performing set of weights to use as a final model. 
  • One approach to address this problem is to use an average of the weights from multiple models seen toward the end of the training run. This is called Polyak-Ruppert averaging and can be further improved by using a linearly or exponentially decreasing weighted average of the model weights. In addition to resulting in a more stable model, averaging the model weights can also yield better performance.
  • Learning the weights for a deep neural network model requires solving a high-dimensional non-convex optimization problem. A challenge with solving this optimization is that there are many good solutions and it is possible for the learning algorithm to bounce around and fail to settle in on one. In the area of stochastic optimization, this is referred to as problems with the convergence of the optimization algorithm on a solution, where a solution is defined by a set of specific weights values.
  • "Polyak averaging consists of averaging together several points in the trajectory through parameter space visited by an optimization algorithm."
  • "The basic idea is that the optimization algorithm may leap back and forth across a valley several times without ever visiting a point near of the bottom of the valley. The average of all the locations on either side should be close to the bottom of the valley though" - Deep Learning.
  • The simplest implementation of Polyak-Ruppert averaging involves calculating the average of the weights of the models over the last few trains epochs.
  • This can be improved by calculating a weighted average, where more weight is applied to more recent models, which is linearly decreased through prior epochs. An alternative and more widely approach used approach is to use an exponential decay in the weighted average.
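To make these weighting schemes concrete, here is a minimal sketch (my own illustration, not the book's exact code) of how equal, linearly decaying and exponentially decaying contribution weights could be computed for the last n model checkpoints:

```python
# Minimal sketch (not the book's exact code): contribution weights for averaging
# the last n model checkpoints, most recent checkpoint first.
import numpy as np

def contribution_weights(n_models, scheme='equal', alpha=2.0):
    """Return normalized weights for the last n_models checkpoints."""
    if scheme == 'equal':
        w = np.ones(n_models)
    elif scheme == 'linear':
        # most recent checkpoint gets weight n, oldest gets weight 1
        w = np.arange(n_models, 0, -1, dtype=float)
    elif scheme == 'exponential':
        # exponential decay going backwards in time (alpha is illustrative)
        w = np.array([np.exp(-i / alpha) for i in range(n_models)])
    else:
        raise ValueError('unknown scheme: %s' % scheme)
    return w / w.sum()

print(contribution_weights(5, 'linear'))       # [0.333 0.267 0.2 0.133 0.067]
print(contribution_weights(5, 'exponential'))  # most recent checkpoint dominates
```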

Case Study

  • Multiclass Classification problem

Plot of multiclass classification samples

  • Multilayer Perceptron Model: in the problem suggested by the author, the training dataset is relatively small. There is a 10:1 ratio of examples in the holdout dataset to the training dataset. This mimics a situation where we may have a vast number of unlabeled examples and a small number of labeled examples with which to train a model. We will create 1100 data points; the model will be trained on the first 100 points, and the remaining 1000 will be held back as a test dataset, unavailable to the model.
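As a side note, here is a small sketch of how such a dataset could be generated and split with scikit-learn; the make_blobs parameters below are illustrative and may differ from the book's exact values:

```python
# Sketch of generating a small, noisy multiclass dataset and holding most of it
# back for testing (parameter values are illustrative, not necessarily the book's).
from sklearn.datasets import make_blobs
from tensorflow.keras.utils import to_categorical

# 1100 samples, 3 classes, 2 input features
X, y = make_blobs(n_samples=1100, centers=3, n_features=2, cluster_std=2, random_state=2)
y = to_categorical(y)

# train on the first 100 points only, hold back 1000 as an unseen test set
n_train = 100
trainX, testX = X[:n_train], X[n_train:]
trainy, testy = y[:n_train], y[n_train:]
print(trainX.shape, testX.shape)  # (100, 2) (1000, 2)
```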

Plot of Learning Curves of Accuracy on Train and Test Datasets

  • The preliminary step before working on a Model Weight Ensemble consists of saving the model weights to file during training, and later combining the weights from the saved models in order to make a final model.
  • So, after this first step that establishes a baseline, the author suggests creating a new model from multiple existing models with the same architecture.
  • Each model has a get_weights() function that returns a list of arrays, one for each layer in the model.
  • After saving the models, loading them, and creating a new model with the averaged weights, we are ready to make some predictions.
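The following sketch (an assumed helper, not the book's exact code) shows how the get_weights() lists of several saved models could be averaged, optionally with the contribution weights sketched earlier, and assigned to a fresh model with the same architecture:

```python
# Minimal sketch: build a new model whose weights are the (optionally weighted)
# average of several saved models with the same architecture.
import numpy as np
from tensorflow.keras.models import load_model, clone_model

def model_weight_ensemble(members, weights):
    """members: list of Keras models with identical architecture.
    weights: one contribution weight per member (should sum to 1)."""
    # collect the per-layer weight arrays of every member
    all_weights = [m.get_weights() for m in members]
    n_layers = len(all_weights[0])
    avg_weights = []
    for layer in range(n_layers):
        # weighted average of this layer's arrays across all members
        layer_weights = np.array([w[layer] for w in all_weights])
        avg_weights.append(np.average(layer_weights, axis=0, weights=weights))
    # create a fresh model with the same structure and assign the averaged weights
    model = clone_model(members[0])
    model.set_weights(avg_weights)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# usage sketch: models saved during the last epochs, e.g. 'model_490.h5' ... 'model_499.h5'
# members = [load_model('model_%d.h5' % e) for e in range(490, 500)]
# ensemble = model_weight_ensemble(members, contribution_weights(len(members), 'linear'))
```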

Predictions with an Average Model Weight Ensemble


Plot of single Model test Performance and Model Weight Ensemble

Predictions with a Linear Weighted Average Ensemble


Accuracy of a Linear Weighted Average Ensemble
  • I notice that the linearly weighted average ensemble is not better than the equally averaged model weights ensemble. This is not what I expected. The results in fact vary given the stochastic nature of the learning algorithm.
Predictions with an Exponentially Decreasing Average Ensemble


Accuracy of Single And Ensemble Model Weight Ensemble with an Exponential Decay

Conclusion

  • Creating a model with the average of the weights from models observed towards the end of a training run can result in a more stable and sometimes better-performing solution.
  • This was the last chapter of the book "Better Deep Learning" by Jason Brownlee. 
  • I am very grateful to Jason Brownlee, as this book helped me a great deal in understanding how to tune neural networks.

Friday, November 13, 2020

Better Deep Learning - Jason Brownlee - Stacked Generalization Ensemble

 Preamble

  • This blog post is an extract from the book "Better Deep Learning" by Jason Brownlee. 
  • This blog post is related to "Better Predictions": how to combine predictions with Stacked Generalization Ensembles (Chapter 25)

Alpine A310

Chapter 25: Learn to combine Predictions with Stacked Generalization
  • Model averaging is an ensemble technique where multiple submodels contribute equally to a combined prediction. Model averaging can be improved by weighting the contributions of each submodel to the combined prediction by the expected performance of the submodel. This can be extended further by training an entirely new model to learn how to best combine the contributions from each submodel. This approach is called stacked generalization, or stacking for short, and can result in better predictive performance than any single contributing model.
  • Stacked generalization is an ensemble method where a new model learns how to best combine the predictions from multiple existing models.
  • Stacked generalization (or stacking) (Wolpert, 1992) is a different way of combining multiple models, that introduces the concept of a meta learner. Although an attractive idea, it is less widely used than bagging and boosting. Unlike bagging and boosting, stacking may be (and normally is) used to combine models of different types. The procedure is as follows:

    1. Split the training set into two disjoint sets.
    2. Train several base learners on the first part.
    3. Test the base learners on the second part.
    4. Using the predictions from 3) as the inputs, and the correct responses as the outputs, train a higher level learner.
  • Note that steps 1) to 3) are the same as cross-validation, but instead of using a winner-takes-all approach, we combine the base learners, possibly nonlinearly.

Case Study

  • We are going to experiment with the technique of "Stacked Generalization Ensembles" on a multiclass classification problem. We will first try to solve this classification problem with a classical MLP to establish a baseline, then in a second phase we will experiment with Stacked Generalization Ensembles.

Plot of samples

  • Then we run our MLP on the samples. The author suggests creating a sample of 1100 data points, with the model trained on only the first 100 points; the remaining 1000 points are held back as a test dataset.

Learning curves of Model Accuracy on Train and Test Dataset

First experimentation: a separate stack model 

  • We start by training multiple submodels and saving them to files for later use in our stacking ensembles.
  • We then train a meta-learner that will best combine the predictions from the submodels and we will see if the performance is better.
    • The author prepares a training dataset for the meta-learner by providing examples from the test set to each of the submodels and collecting the predictions: the dstack() and reshape() NumPy functions are used for combining the arrays (see the sketch below).
    • The meta-learner is trained with a simple logistic regression algorithm from the scikit-learn library. The LogisticRegression class supports multiclass classification (more than two classes) using a multinomial scheme.
  • Once fit, the stacked model is used to make predictions on new data.
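Here is a minimal sketch of the separate stacking idea described above; the function names are mine, and the details may differ from the book's exact implementation:

```python
# Sketch of the "separate stacking model": build a meta-learner training set from
# the submodels' predictions, then fit a logistic regression on top of it.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_dataset(members, inputX):
    """Stack the class-probability predictions of each submodel as new features."""
    stackX = None
    for model in members:
        yhat = model.predict(inputX, verbose=0)           # (n_samples, n_classes)
        stackX = yhat if stackX is None else np.dstack((stackX, yhat))
    # flatten to (n_samples, n_members * n_classes)
    return stackX.reshape((stackX.shape[0], stackX.shape[1] * stackX.shape[2]))

def fit_stacked_model(members, inputX, inputy):
    stackedX = stacked_dataset(members, inputX)
    # recent scikit-learn versions use the multinomial scheme by default
    meta = LogisticRegression(multi_class='multinomial', max_iter=1000)
    meta.fit(stackedX, inputy)                             # inputy: integer class labels
    return meta

def stacked_prediction(members, meta, inputX):
    return meta.predict(stacked_dataset(members, inputX))
```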

Fitting a logistic regression stacking model

  • From the image above, the result of the experimentation is that the stacked model, with an accuracy of 83.4%, outperforms every single model.

Second experimentation: an integrated stack model

  • When using neural networks as submodels, it may be desirable to use a neural network as a meta-learner. Specifically, the sub-networks can be embedded in a larger multi-headed neural network that then learns how to best combine the predictions from each input submodel.
  • The graph of the stacked model:

Stacked Generalization Ensemble of Neural Network Models

  • The result of this stacked model is better than the separate one: 83.8% accuracy compared to 83.4%.
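For reference, here is a rough sketch of how such an integrated stacked model could be wired with the Keras functional API; the layer-renaming trick, the size of the combiner head and the calling convention are assumptions, not necessarily the book's exact code:

```python
# Sketch of an "integrated" stacked model: freeze the submodels and merge their
# outputs into one multi-headed network with a small learned combiner on top.
from tensorflow.keras.layers import concatenate, Dense
from tensorflow.keras.models import Model

def define_stacked_model(members, n_classes=3):
    # freeze the submodels and rename their layers to avoid name clashes
    for i, model in enumerate(members):
        for layer in model.layers:
            layer.trainable = False
            layer._name = 'ensemble_%d_%s' % (i + 1, layer.name)
    # one input head per submodel, outputs concatenated into the meta-learner
    ensemble_inputs = [model.input for model in members]
    ensemble_outputs = [model.output for model in members]
    merge = concatenate(ensemble_outputs)
    hidden = Dense(10, activation='relu')(merge)
    output = Dense(n_classes, activation='softmax')(hidden)
    model = Model(inputs=ensemble_inputs, outputs=output)
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# when fitting or predicting, the same input array must be provided once per head:
# stacked = define_stacked_model(members)
# stacked.fit([testX for _ in members], testy_onehot, epochs=300, verbose=0)
```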

Conclusion

  • Stacked generalization is an ensemble method where a new model learns how to best combine the predictions from multiple existing models.
  • In the next chapter we will experiment with combining model parameters using an average model weights ensemble.
  • Big thanks to Jason Brownlee for helping me to understand these notions of Ensemble Learning.

Thursday, November 12, 2020

Better Deep Learning - Jason Brownlee - Cyclic Learning Rate and Snapshot Ensembles

 Preamble

  • This blog post is an extract from the book "Better Deep Learning" by Jason Brownlee. 
  • This blog post is related to "Better Predictions": how to make better predictions using a Cyclic Learning Rate and Snapshot Ensembles (Chapter 24)
  • I write this series of blog posts because writing is a way for me to memorize, and also to experiment with the pieces of code provided in the book.
  • This book, as well as the other series of books from Jason Brownlee, is very helpful in my learning of Deep Learning. 
  • The information related to neural networks is complex, and Jason Brownlee digests it for you in a didactic, pragmatic way and with concrete examples.
Sports car

Chapter 24: Cyclic Learning Rate and Snapshot Ensembles

  • Model ensembles can achieve lower generalization error than single models but are challenging to develop with deep learning neural networks given the computational cost of training each single model.
  • An alternative is to save multiple model snapshots during a single training run and combine their predictions to make an ensemble prediction. A limitation of this approach is that the saved models will be similar, resulting in similar predictions and prediction errors, and not offering much benefit from combining their predictions.
  • Effective ensembles require a diverse set of skillful ensemble members that have differing distributions of prediction errors. One approach to promoting a diversity of models saved during a single training run is to use an aggressive learning rate schedule that forces large changes in the model weights and, in turn, in the nature of the model saved at each snapshot.
  • Snapshot ensembles combine the predictions from multiple models saved during a single training run.
  • Diversity in model snapshots can be achieved through the use of aggressively cycling the learning rate used during a single training run.
  • One approach to ensemble learning for deep learning neural networks is to collect multiple models from a single training run.
  • A key benefit of ensemble learning is in improved performance compared to the predictions from single models.
  • A limitation of collecting multiple models during a single training run is that the models may be good, but too similar. This can be addressed by changing the learning algorithm for the deep neural network to force the exploration of different network weights during a single training run that will result, in turn, with models that have differing performance. One way that this can be achieved is by aggressively changing the learning rate used during training. 
  • An approach to systematically and aggressively changing the learning rate during training to result in different network weights is referred to as Stochastic Gradient Descent with Warm Restarts or SGDR for short. This approach involves systematically changing the learning rate over training epochs, called cosine annealing.
  • The cosine annealing method ("annealing" = "recuit" in French) has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being dramatically increased again. The model weights are subjected to dramatic changes during training, which has the effect of using "good weights" as the starting point for the subsequent learning rate cycle, while allowing the learning algorithm to converge to a different solution.
  • "We let SGD converge M times to local minima along its optimization path. Each time the model converges, we save the weights and add the corresponding network to our ensemble. We then restart the optimization with a large learning rate to escape the current local minimum". The original paper of the method contains a nice graph that illustrates this process.

Case Study

  • The first step is to establish a baseline for a multi class classification problem. We will then be able to compare the baseline with the Snapshot Ensembles.
  • The multi class classification problem:

A multi class Classification problem

  • The baseline MLP aiming to solve the multiclass classification problem:

Model Accuracy on Train and Test Dataset over Each Training Epoch
  • The cosine annealing schedule is an example of an aggressive learning rate schedule where the learning rate starts high and is dropped relatively rapidly to a minimum value near zero before being increased again to the maximum.
  • So now the next step is to evaluate the impact of the cosine annealing learning rate schedule on the MLP:

MLP with Cosine Annealing Learning Schedule

  • The final step is to evaluate the performance of the Snapshot Ensemble models. We will compare the ensemble against the single snapshot models.

Snapshot Ensemble Performance

  • End result of the experimentation: the snapshot ensemble achieved an accuracy of 81.9% vs a baseline accuracy of 81%.

Conclusion

  • Snapshot ensembles combine the predictions of the models saved at the end of each learning rate cycle of a single training run; in this experimentation, the snapshot ensemble slightly outperformed the single-model baseline (81.9% vs 81%).

Wednesday, November 11, 2020

Better Deep Learning - Jason Brownlee - Horizontal Voting Ensembles

 Preamble

  • This post is an extract from the book "Better Deep Learning" by Jason Brownlee. 
  • This post is related to "Better Predictions": how to make better predictions using Horizontal Voting Ensembles (Chapter 23)
Schönbrunn

Chapter 23: Models from Contiguous Epochs with Horizontal Voting Ensembles

  • The horizontal voting ensemble is a simple method in which a collection of models saved over contiguous training epochs towards the end of a training run is used as an ensemble, resulting in more stable and, on average, better performance than randomly choosing a single final model.
  • It is challenging to choose a final model when a neural network has high variance on a training dataset.
  • Horizontal voting ensembles provide a way to reduce variance and improve average model performance for models with high variance using a single training run.
  • Ensemble learning combines the predictions from multiple models.
  • An alternative source of models that may contribute to an ensemble are the state of a single model at different points during training.
  • The method involves using multiple models from a contiguous block of epochs before the end of training in an ensemble to make predictions. The approach was developed specifically for those predictive modeling problems where the training dataset is relatively small compared to the number of predictions required by the model.

Case study

  • The case study is the same as the one used in the post "Resampling Ensembles": a multiclass classification problem.
  • In a second step we train the model for 1000 epochs and, with the help of the h5py library, save the models from only the last 50 epochs into a directory, giving us 50 saved models.
  • The last step consists of loading the 50 models and using them in a horizontal voting ensemble.
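A rough sketch of these two steps (the file names, folder and epoch range are illustrative, not the book's exact code):

```python
# Sketch of a horizontal voting ensemble. During training, the model is assumed to
# have been saved at each of the last 50 epochs, e.g.
# model.save('models/model_%d.h5' % epoch), which requires h5py support.
import numpy as np
from tensorflow.keras.models import load_model

def load_horizontal_models(start_epoch, end_epoch, folder='models'):
    return [load_model('%s/model_%d.h5' % (folder, e)) for e in range(start_epoch, end_epoch)]

def ensemble_predict(members, testX):
    # sum the softmax class probabilities over the members, then take the argmax
    yhats = np.array([m.predict(testX, verbose=0) for m in members])
    return np.argmax(np.sum(yhats, axis=0), axis=1)

# e.g. the models saved over the last 50 of 1000 training epochs
# members = load_horizontal_models(950, 1000)
# yhat = ensemble_predict(members, testX)
```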

Single Model Accuracy (blue dots) vs Accuracy of Ensembles of Varying Size with a Horizontal Voting Ensemble

Conclusion

  • The Horizontal Voting Ensemble experimentation did not clearly demonstrate that an ensemble of a given size sharply outperforms a randomly selected single model.
  • In the next chapter, "Cyclic Learning Rate and Snapshot Ensembles", we will experiment with another ensemble technique.

Sunday, November 08, 2020

Better Deep Learning - Jason Brownlee - Resampling Ensembles

 

Preamble

  • This post is an extract from the book "Better Deep Learning" by Jason Brownlee. 
  • This post is related to "Better Predictions": how to make better predictions using Resampling Ensembles.
  • The best sentences from the book are extracted as a reminder for my future self.
  • I run all code examples delivered with the book on my iMac environment (Anaconda + Spyder) and the results of my experimentations are reported in this post.

Church in Sofia - Bulgaria

Chapter 22: Fit Models on Different Samples with Resampling Ensembles

  • One way to achieve differences between models is to train each model on a different subset of the available training data. Models are trained on different subsets of the training data naturally through the use of resampling methods such as cross-validation and the bootstrap, designed to estimate the average performance of the model generally on unseen data. The models used in this estimation process can be combined in what is referred to as a resampling-based ensemble, such as a cross-validation ensemble or a bootstrap aggregation (or bagging) ensemble.
  • Thanks to Jason Brownlee's book, this chapter will help us learn more about how to estimate model performance using random splits and develop an ensemble from the resulting models. 
  • We will see how to estimate performance using 10-fold cross-validation and develop a cross-validation ensemble. 
  • And finally we will see how to estimate performance using the bootstrap and combine models using a bagging ensemble.
  • Effective ensembles require members that disagree. Each member must have skill (e.g. perform better than random chance), but ideally, perform well in different ways. Technically, we can say that we prefer ensemble members to have low correlation in their predictions, or prediction errors.
  • Multiple models are fit using slightly different perspectives on the training data and, in turn, make different errors and often more stable and better predictions when combined. We can refer to these methods generally as data resampling ensembles. A benefit of this approach is that resampling methods may be used that do not make use of all examples in the training data set. Any examples that are not used to fit the model can be used as a test dataset to estimate the generalization error of the chosen model configuration. There are three popular methods that we could use to create a resampling ensemble; they are:
    • Random splits: the dataset is repeatedly sampled with a random split of the data into train and test sets.
    • k-fold Cross-Validation: the dataset is split into k equally sized folds, k models are trained and each fold is given an opportunity to be used as the holdout set where the model is trained on all remaining folds.
    • Bootstrap Aggregation: random samples are collected with replacement and examples not included in a given sample are used as the test set.
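Here is a small sketch of how each of these three schemes yields train/test splits, with one ensemble member fit per split; the toy data and parameter values are illustrative:

```python
# Sketch of the three resampling schemes producing train/test splits that could
# each be used to fit one ensemble member (toy data, illustrative parameters).
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)   # toy data, 10 samples

# 1) repeated random splits
for _ in range(3):
    trainX, testX = train_test_split(X, test_size=0.1)

# 2) k-fold cross-validation: each fold serves once as the holdout set
for train_ix, test_ix in KFold(n_splits=10).split(X):
    trainX, testX = X[train_ix], X[test_ix]

# 3) bootstrap aggregation: sample with replacement, out-of-bag rows become the test set
for _ in range(3):
    train_ix = resample(np.arange(len(X)), replace=True, n_samples=len(X))
    test_ix = np.array([i for i in range(len(X)) if i not in train_ix])
    trainX, testX = X[train_ix], X[test_ix]
```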

Case Study

  • The case study shows how to use a resampling ensemble to reduce the variance of an MLP on a simple multiclass classification problem:
A multi class Classification problem to be submitted to a neural network

  • We define a simple Multilayer Perceptron Model in order to learn the problem. The model will predict a vector with three elements, giving the probability that the sample belongs to each of the three classes.
  • 90% of the data is used for training and 10% of the data for the test set. The author explains that this is because it is a noisy problem and a well-performing model requires as much data as possible to learn the complex classification problem.
MLP Model Accuracy on Train and Test Dataset without Random Splits Ensemble

  • The next step consists of using the technique of a random splits ensemble. For that purpose, the author suggests combining multiple models trained on the random splits, with the expectation that the performance of the ensemble is likely to be more stable and better than that of the average single model.
  • The author suggests generating 10 times more sample points from the problem domain and holding them back as an unseen dataset.
Random Splits Ensemble Performance on the classification problem

  • As a next step, we try the technique of Cross-Validation Ensemble. This approach is designed to be less optimistic. 
  • The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. A typical value for k is 10.
  • KFold class from scikit-learn can split the dataset into k folds.

Accuracy of a MLP Multiclass Classification problem with Cross-Validation Resampling

  • As a final step, we experiment with the Bagging Ensemble.
  • A limitation of random splits and k-fold cross-validation from the perspective of ensemble learning is that the models are very similar. The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples. Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.
  • The bootstrap is a robust method for estimating model performance. It does suffer a little from an optimistic bias, but is often almost as accurate as k-fold cross-validation in practice.
  • Generally, use of the bootstrap method in ensemble learning is referred to as bootstrap aggregation or bagging.

Accuracy of a MLP Multiclass Classification problem with Bagging Ensemble

Conclusion

  • During this journey across ensembles, we experimented with different kinds of ensembles:
    • Random Split Ensembles
    • Cross-validation Ensemble
    • Bagging Ensemble (or Bootstrap Aggregating)
  • Big thanks to Jason Brownlee for sharing his expertise in this field of deep learning.





Saturday, November 07, 2020

Better Deep Learning - Jason Brownlee - Weighted Average Ensemble

 Preamble

  • This post is an extract from the book "Better Deep Learning" by Jason Brownlee. 
  • This post is related to "Better Predictions": how to make better predictions using Weighted Average Ensembles.

Lisbon

Chapter 21: Contribute Proportional to Trust with Weighted Average Ensemble

  • It is well known that a combination of many different predictions can improve prediction.
  • Learning continuous-valued functions using neural network ensembles (committees) can improve accuracy, reliable estimation of the generalization error, and active learning.
  • Most often the networks in the ensemble are trained individually and their predictions are combined. This combination is usually done by majority vote (in classification) or by simple averaging (in regression), but one can also use a weighted combination of the networks.
  • A model averaging ensemble combines the prediction from each model equally and often results in better performance on average than a given single model. Sometimes there are very good models that we wish to contribute more to an ensemble prediction, and perhaps less skillful models that may be useful but should contribute less to an ensemble prediction. A weighted average ensemble is an approach that allows multiple models to contribute to a prediction in proportion to their trust or estimated performance.
  • A weighted ensemble is an extension of a model averaging ensemble where the contribution of each member to the final prediction is weighted by the performance of the model.

Case Study

  • A small multi class classification problem will be used to demonstrate the weighted averaging ensemble.

Three Classes 

  • The problem is a multi class classification problem, and we will model it using a softmax activation function on the output layer. This means that the model will predict a vector with three elements with the probability that the sample belongs to each of the three classes.

MLP without Averaging Ensemble 

  • As a first step, we develop a simple model averaging ensemble before we look at developing a weighted average ensemble:

Single Model Accuracy (blue dots) and Accuracy of Ensembles of Increasing Size (orange line)

  • Now we can extend, in a second step, with a weighted model of ensemble:
    • The model averaging ensemble allows each ensemble member to contribute an equal amount to the prediction of the ensemble. In this second step experimentation, the contribution of each ensemble member is weighted by a coefficient that indicates the trust or expected performance of the model. 
    • We can use efficient NumPy functions such as einsum() or tensordot() to implement the weighted sum (see the sketch after the figure below).
    • The experimentation consists of finding the best combination of ensemble weights:
Performance of a Weighted Average MLP Ensemble
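Here is the sketch of the weighted combination mentioned above, using tensordot (einsum would work equally well); the helper name and the example weights are mine:

```python
# Sketch of a weighted combination of ensemble member predictions using tensordot.
import numpy as np

def weighted_ensemble_predict(members, weights, testX):
    # yhats: (n_members, n_samples, n_classes) array of softmax outputs
    yhats = np.array([m.predict(testX, verbose=0) for m in members])
    # weighted sum over the member axis -> (n_samples, n_classes)
    summed = np.tensordot(yhats, np.array(weights), axes=((0,), (0,)))
    return np.argmax(summed, axis=1)

# e.g. three members, the first trusted twice as much as the others:
# yhat = weighted_ensemble_predict(members, [0.5, 0.25, 0.25], testX)
```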
  • As a final step, we will use a directed optimization process with the help of the SciPy library, which provides the differential_evolution() function (sketched after the figure below):

Performance of a Weighted Average MLP Ensemble
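And a sketch of the weight search with differential_evolution(); the loss function below follows the general idea (maximize ensemble accuracy) and reuses the weighted_ensemble_predict helper sketched above, but it is not the book's exact code:

```python
# Sketch of searching for good ensemble weights with SciPy's differential_evolution.
import numpy as np
from scipy.optimize import differential_evolution
from sklearn.metrics import accuracy_score

def normalize(weights):
    s = np.sum(weights)
    return weights if s == 0 else weights / s

def loss_function(weights, members, testX, testy):
    # minimize 1 - accuracy of the weighted ensemble (testy: integer class labels)
    yhat = weighted_ensemble_predict(members, normalize(weights), testX)
    return 1.0 - accuracy_score(testy, yhat)

# one weight in [0, 1] per ensemble member
# bounds = [(0.0, 1.0)] * len(members)
# result = differential_evolution(loss_function, bounds,
#                                 args=(members, testX, testy), maxiter=1000, tol=1e-7)
# best_weights = normalize(result.x)
```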

  • As a final note on these experimentations, I notice that the weighted average ensemble outperforms the model averaging ensemble and the individual models.

Conclusion

  • During this chapter, thanks to the explanations and code provided by Jason Brownlee, we were able to evaluate the model averaging ensemble and the weighted average ensemble.
  • Model averaging ensembles are limited because each member contributes equally to the overall prediction.
  • The weighted average ensembles provide a way for each model to contribute to the prediction with a weight that is proportional to the trust or performance of the model.




Friday, November 06, 2020

Kalenji Kip Run

  • After watching the videos from the "La Clinique du Coureur" website, I bought the Kalenji Kip Run shoes (€70).
  • They are minimalist shoes with a minimalist index of 60% and a weight of 248 g (with the original insoles).
  • Compare this with the Mizuno Wave Rider 23, which have a minimalist index of 32% and a weight of 338 g (with orthopedic insoles), or 417 g with the silicone retrocapital-support insoles.
  • Drop below 8 mm

Kalenji Kip Run, size 45


  • This is therefore my fourth pair of running shoes this year.
  • La Clinique du Coureur says that a marathon runner finishing in 4h20 can gain 20 minutes by running in minimalist shoes.
  • Following the MRI of 7/11/2020, the diagnosis is an intercapito-metatarsal bursitis of the first two spaces.
  • So I bought new insoles with retrocapital support.

Wednesday, November 04, 2020

"Better Deep Learning" - Jason Brownlee - Better Predictions -

 Preamble

  • This post is an extract from the book "Better Deep Learning" by Jason Brownlee. 
  • This post is related to "Better Predictions": how to make better predictions using model ensembles.
  • In this post, you get the results of the code examples that come with the book. The code examples are a great way to put the theory of neural networks into practice. This is a big advantage of the series of Deep Learning books written by Jason Brownlee.

Mer de Glace


Chapter 19: Reduce Model Variance with Ensemble Learning

  • Deep learning neural networks are non linear methods. This means that they can learn complex nonlinear relationships in the data.
  • A successful approach to reducing the variance of neural network models is to train multiple models instead of a single model and to combine the predictions from these models. This is called ensemble learning and not only reduces the variance of predictions but also can result in predictions that are better than any single model.
  • A solution to the high variance of neural networks is to train multiple models and combine their predictions.
  • Combining the predictions from multiple neural networks adds a bias that in turn counters the variance of a single trained neural network model.
  • Perhaps the oldest and still most commonly used ensembling approach for neural networks is called a committee of networks. A collection of networks with the same configuration and different initial random weights is trained on the same dataset. Each model is then used to make a prediction and the actual prediction is calculated as the average of the predictions.
  • Ensembles may be as small as three, five, or ten trained models.
  • Varying Training Data: a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
  • There are more sophisticated methods for stacking models, such as boosting where ensemble models are added one at a time in order to correct the mistakes of prior models. Another combination that is a little bit different is to combine the weights of multiple neural networks with the same structure. The weights of multiple networks can be averaged, to hopefully result in a new single model that has better overall performance than any original model. This approach is called model weight averaging.

Chapter 20: Combine Models From Multiple Runs with Model Averaging Ensemble

  • Model averaging is an ensemble learning technique that reduces the variance in a final neural network model, sacrificing spread (and possibly better scores) in the performance of the model for a confidence in what performance to expect from the model.
  • Deep learning neural network models are nonlinear methods that learn via a stochastic training algorithm. This means that they are highly flexible, capable of learning complex relationships between variables and approximating any mapping function, given enough resources. A downside of this flexibility is that the models suffer from high variance. This means that the models are highly dependent on the specific training data used to train the model, on the initial conditions (random initial weights), and on serendipity during the training process. The result is a final model that makes different predictions each time the same model configuration is trained on the same dataset.
  • The high variance of the approach can be addressed by training multiple models for the problem and combining their predictions.

Case study

  • First we define a multiclass classification problem. In a second step we will then look for a model to address this problem. As it is a multiclass classification problem, we will use a softmax activation function on the output layer.

Multiclass (3 classes) with Points colored by class value
  • Secondly we try to solve the suggested multiclass classification problem with an MLP neural network, which learns the relationships between the variables via a stochastic training algorithm.

Cross-Entropy Loss and Accuracy for training and test dataset over 200 epochs

  • Thirdly we are going to examine the variance of the previous model. For this, the code proposed by Jason Brownlee repeats the fit and evaluation of the same model configuration on the same dataset and summarizes the distribution of model accuracy scores:

Raw Data, Box and Whisker Plot, Histogram for accuracy over 30 repeats

  • In step #4, we can use model averaging to both reduce the variance of the model and possibly reduce its generalization error. The piece of code tries to find the right number of ensemble members (see the sketch after the figure below):

Line Plot of Ensemble Size Versus Model Test Accuracy
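A minimal sketch of that evaluation loop (helper names are mine, and fit_model is an assumed helper that trains one MLP on the training set):

```python
# Sketch of evaluating model-averaging ensembles of increasing size: fit N models,
# then score the accuracy of the ensemble built from the first n of them.
import numpy as np
from sklearn.metrics import accuracy_score

def evaluate_n_members(members, n, testX, testy):
    subset = members[:n]
    # average (sum) the softmax predictions of the first n members, then argmax
    yhats = np.array([m.predict(testX, verbose=0) for m in subset])
    yhat = np.argmax(np.sum(yhats, axis=0), axis=1)
    return accuracy_score(testy, yhat)      # testy: integer class labels

# members = [fit_model(trainX, trainy) for _ in range(20)]   # fit_model: assumed helper
# scores = [evaluate_n_members(members, n, testX, testy) for n in range(1, len(members) + 1)]
```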

  • For the final step #5, we repeat the evaluation experiment using an ensemble of five models instead of a single model and compare the distributions of scores. So five models are fit and evaluated, and this process is repeated 30 times:

Repeated Evaluation of a model average ensemble

  • This last experimentation confirms the theory of ensemble averaging which relies on two properties of artificial neural networks:
    • in any network, the bias can be reduced at the cost of increased variance
    • in a group of networks, the variance can be reduced at no cost to bias

Conclusion

  • Thanks to the code provided in the book from Jason Brownlee, we were able to test model averaging.
  • Ensemble Learning is a technique that can be used to reduce the variance of deep learning neural networks.


Tuesday, November 03, 2020

"Better Deep Learning" Jason Brownlee Early stopping topic

Preamble

  • This post is an excerpt from Chapter 18 of the book "Better Deep Learning" by Jason Brownlee.
  • This series of posts related to Deep Learning is here to help me in the future when working on Machine Learning, so that it is easy to come back and refresh specific topics such as this one.
  • Each post is written in parallel with the reading of the book. I also report in this post the different case studies provided in the book.

Aiguille du Midi cable car


Chapter 18: Halt Training at the Right Time with Early Stopping

  • A major challenge in training neural networks is how long to train them. Too little training will mean that the model will underfit the train and the test sets. Too much training will mean that the model will overfit the training dataset and have poor performance on the test set. A compromise is to train on the training dataset but to stop training at the point where performance on a validation dataset starts to degrade. This simple, effective, and widely used approach to training neural networks is called early stopping.
  • It is common to split the training dataset and use a subset, such as 30%, as a validation dataset used to monitor performance of the model during the training. This validation set is not used to train the model. It is also common to use the loss on a validation dataset as the metric to monitor, although you may also use prediction error in the case of regression, or accuracy in the case of classification.
  • At the time that training is halted, the model is known to have slightly worse generalization error than a model at a prior epoch.
  • Perhaps a simple approach is to always save the model weights if the performance of the model on a holdout dataset is better than at the previous epoch.
  • Early stopping could be used with k-fold cross-validation, although it is not recommended.
  • A callback is a snippet of code that can be executed at a specific point during training, such as before or after training, an epoch or a batch. They provide a way to execute code and interact with the training model process automatically. Callbacks can be provided to the fit() function via the callbacks argument.
  • Keras supports the early stopping of training via a callback called EarlyStopping.
  • Saving and loading models requires that HDF5 support has been installed on your workstation.
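As an illustration, here is a minimal sketch of wiring EarlyStopping and ModelCheckpoint together; the monitored metrics, patience value and file name are illustrative:

```python
# Sketch of combining EarlyStopping (with patience) and ModelCheckpoint in Keras.
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.models import load_model

# stop when the validation loss has not improved for 200 epochs
es = EarlyStopping(monitor='val_loss', mode='min', patience=200, verbose=1)
# but keep the single best model seen so far on disk
# (the metric may be named 'val_acc' in older Keras versions)
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max',
                     save_best_only=True, verbose=1)

# history = model.fit(trainX, trainy, validation_data=(testX, testy),
#                     epochs=4000, verbose=0, callbacks=[es, mc])
# best = load_model('best_model.h5')   # requires h5py support
```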

Early Stopping Case Study

  • We are using a classical binary classification problem as a guinea pig to test early stopping with an MLP:

A binary classification problem with two classes

  • The first step of the case study is to establish a baseline. We run a classical MLP without any measure to address overfitting:

MLP with Overfitting

  • In the second step, we introduce early stopping, but the performance result is worse:

MLP on classification problem with early Stopping

  • So, as the third step, we add a new parameter, patience=200, which means that we will wait 200 more epochs without improvement before training is stopped:

Plot of Cross-Entropy Loss and Accuracy of a MLP with Early Stopping and patience

  • As a final step, we can try to find the best moment to stop the training by using a ModelCheckpoint callback. In this case we are interested in saving the model with the best accuracy on the holdout dataset:

Plot of Cross-Entropy and Accuracy of a MLP with Early Stopping and Checkpointing


Conclusion

  • We were able to test the ability to stop the training of a model early in order to address the overfitting problem.
  • Thanks to the Keras API and callbacks, we are able to monitor the performance of an MLP network so that we can stop the training of the model before it overfits.
  • Thanks to Jason Brownlee for this excellent book. As always, the code examples run perfectly, without any bug.