Saturday, October 31, 2020

Mizuno Wave Rider 23

  • Latest pair of shoes, bought this week on the advice of my physiotherapist: one size up, a 45, so that the foot is more comfortable.
  • I hope this will help me get rid of the capsulitis under the second metatarsal of my right foot.
  • I run with insoles made after a gait analysis by a podiatrist.

Mizuno Wave Rider 23
  • I bought my first pair of Mizunos in 2006, 14 years ago.
  • This is already my third pair of running shoes this year: first a pair of Brooks, then a second pair of Mizuno, and finally this pair of Mizuno Wave Rider 23.
  • Drop of 10 to 13 mm
  • Weight of 338 g with orthopedic insoles
  • Weight of 417 g with retrocapital-support insoles
  • Weight of the retrocapital insole: 121 g

Friday, October 30, 2020

"Better Deep Learning" Jason Brownlee Chapter 15

 Preamble

  • This post is an extract of the book "Better Deep Learning" by Jason Brownlee. It covers forcing small weights with weight constraints, as described in chapter 15 of the book.
    Col des Montets

Chapter 15: Force Small Weights with Weight Constraints

  • Unlike weight regularization, a weight constraint is a trigger that checks the size or magnitude of the weights and scales them so that they are all below a pre-defined threshold. The constraint forces weights to be small and can be used instead of weight decay and in conjunction with more aggressive network configurations, such as very large learning rates.
  • Weight penalties encourage but do not require neural networks to have small weights.
  • Weight constraints, such as the L2 norm and maximum norm, can be used to force neural networks to have small weights during training.
  • Weight constraints can improve generalization when used in conjunction with other regularization methods like dropout.
  • An alternate solution to using a penalty for the size of the network weights is to use a weight constraint. A weight constraint is an update to the network that checks the size of the weights (e.g. their vector norm), and if the size exceeds a predefined limit, the weights are rescaled so that their size is below the limit or within a range.
  • Although dropout alone gives significant improvements, using dropout along with weight constraint regularization provides a significant boost over just using dropout.
  • The use of a weight constraint allows you to be more aggressive during the training of the network. Specifically, a larger learning rate can be used, allowing the network to, in turn, make larger updates to the weights each update.
  • Using a constraint rather than a penalty prevents weights from growing very large no matter how large the proposed weight-update is. This makes it possible to start with a very large learning rate which decays during learning, thus allowing a far more thorough search of the weight-space than methods that start with small weights and use a small learning rate.
  • The Keras API supports weight constraints. The constraints are specified per-layer, but applied and enforced per-node within the layer. Using a constraint generally involves setting the kernel_constraint argument on the layer for the input weights and the bias_constraint for the bias weights.
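
As a minimal sketch of this API (assuming the standalone Keras imports used in the book; the layer sizes and the max_norm threshold are illustrative), a hidden layer can be constrained like so:

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.constraints import max_norm

    # each node's input weight vector is rescaled whenever its L2 norm exceeds 3
    model = Sequential()
    model.add(Dense(32, input_dim=2, activation='relu', kernel_constraint=max_norm(3)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])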

Case study

  • The example uses a standard binary classification problem that defines two semi-circles of observations, one semi-circle for each class. We then use weight constraints to reduce overfitting.

Dataset showing the class value of each sample

  • Then we run a classical MLP on the dataset of 100 samples:
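
A sketch of that baseline, assuming the make_moons generator and the deliberately oversized 500-node layer from the book's example (the exact parameters are assumptions):

    from sklearn.datasets import make_moons
    from keras.models import Sequential
    from keras.layers import Dense

    # 100 noisy two-moons samples, split in half for train and test
    X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
    trainX, testX, trainy, testy = X[:50], X[50:], y[:50], y[50:]

    # overparameterized MLP trained long enough to overfit
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)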

Line plots of accuracy on train and test datasets while training, showing an overfit

  • Finally we apply the constraint on the weights by setting the kernel_constraint:
    • model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))

Line plots of accuracy on train and test datasets while training with weight constraints

Conclusion

  • Thanks to Jason Brownlee, I was able to test weight constraint mechanisms such as the L2 norm and maximum norm and demonstrate their effect on overfitting.

Sunday, October 25, 2020

VO2max Curve - A Pleasing Curve Despite the Lockdown

  • For the record, some statistics on my VO2max and the number of kilometers run to date.
VO2max curve from November 2019 to October 2020

  • Despite the lockdown, my VO2max keeps improving, which is encouraging.
  • The number of kilometers is lower than in 2019, which makes sense given the lockdown and the capsulitis under the second metatarsal of my right foot:
    Curve of kilometers run in 2020, from January 2020 to October 2020


"Better Deep Learning" - Jason Brownlee (chapter 12 to 14)

 Preamble

  • This post is an extract of the book "Better Deep Learning" by Jason Brownlee. The post covers fixing overfitting with regularization, described in chapter 12 of the book, penalizing large weights with weight regularization, described in chapter 13, and sparse representations with activity regularization, in chapter 14.
  • The post is a collection of the sentences that ring a bell for me and that I find useful to remember: the theory.
  • The post also describes the results of running the code provided in the book: the practice.
  • You get both the theory and the practice, which is what I like about Jason Brownlee's books.
The Alps: view from the Aiguille du Midi.

 

Chapter 12: Fix overfitting with regularization

  • Training a deep neural network that can generalize well to new data is a challenging problem. A model with too little capacity cannot learn the problem, whereas a model with too much capacity can learn it too well and overfit the training dataset. Both cases result in a model that does not generalize well. A modern approach to reducing generalization error is to use a larger model, which may require regularization during training to keep the weights of the model small.
  • Underfitting can easily be addressed by increasing the capacity of the network, but overfitting requires the use of specialized techniques.
  • Regularization methods like weight decay provide an easy way to control overfitting for large neural networks models.
  • A modern recommendation for regularization is to use early stopping with dropout and a weight constraint.
  • The challenge in machine learning is that we must perform well on new, previously unseen inputs, not just those on which the model was trained. The ability to perform well on previously unobserved inputs is called generalization.
  • A model fit can be considered in the context of the bias-variance trade-off. An underfit model has high bias and low variance. An overfit model has low bias and large variance.
  • A benefit of very deep neural networks is that their performance continues to improve as they are fed larger and larger datasets.
  • A problem is said to be ill-posed if small changes in the given information cause large changes in the solution. This instability with respect to the data makes solutions unreliable because small measurement errors or uncertainties in parameters may be greatly magnified and lead to wildly different responses. The idea behind regularization is to use supplementary information to restate an ill-posed problem in a stable form.
  • Regularization methods:
    • weight regularization: penalize the model during the training based on the magnitude of the weights
    • activity regularization: penalize the model during training based on the magnitude of the activations
    • weight constraint: contain the magnitude of weights to be within a range or below a limit
    • dropout: probabilistically remove inputs during training
    • noise: add statistical noise to inputs during training
    • early stopping: monitor model performance on a validation set and stop training when performance degrades
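
For orientation, here is where each of these levers sits in the Keras API; this is a sketch of my own, not code from the book, and the argument values are purely illustrative:

    from keras.models import Sequential
    from keras.layers import Dense, Dropout, GaussianNoise
    from keras.regularizers import l1, l2
    from keras.constraints import max_norm
    from keras.callbacks import EarlyStopping

    model = Sequential()
    model.add(GaussianNoise(0.1, input_shape=(2,)))    # noise: perturb the inputs
    model.add(Dense(64, activation='relu',
                    kernel_regularizer=l2(0.001),      # weight regularization
                    activity_regularizer=l1(0.0001),   # activity regularization
                    kernel_constraint=max_norm(3)))    # weight constraint
    model.add(Dropout(0.5))                            # dropout
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')

    # early stopping: halt training when validation loss stops improving
    es = EarlyStopping(monitor='val_loss', patience=10)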

Chapter 13: Penalize large weights with weight regularization

  • Neural networks learn a set of weights that best map inputs to outputs. A network with large network weights can be a sign of an unstable network where small changes in the input can lead to large changes in the output. This can be a sign that the network has overfit the training dataset and will likely perform poorly when making predictions on new data. A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called weight regularization and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model.
  • Large weights in a neural network are a sign of a more complex network that has overfit the training data.
  • Penalizing a network based on the size of the network weights during training can reduce overfitting.
  • An L1 or L2 vector norm penalty can be added to the optimization of the network to encourage smaller weights.
  • Simpler models are less likely to overfit than complex ones.
  • Remember that when we train a neural network, we minimize a loss function, such as the log loss in classification or the mean squared error in regression. In calculating the loss between the predicted and expected values in a batch, we can add the current size of all weights in the network, or of a layer, to this calculation. This is called a penalty because we are penalizing the model proportionally to the size of the weights in the model.
  • Larger weights result in a larger penalty, in the form of a larger loss score.
  • Smaller weights are considered more regular or less specialized and as such, we refer to this penalty as weight regularization.
  • The addition of a weight size penalty or weight regularization to a neural network has the effect of reducing generalization error and of allowing the model to pay less attention to less relevant input variables.
  • To calculate the size of the weights, there are two approaches:
    • calculate the sum of the absolute values of the weights, called the L1 norm
    • calculate the sum of the squared values of the weights, called the L2 norm
  • The use of L2 in linear and logistic regression is often referred to as Ridge Regression.
  • The weights may be considered a vector and the magnitude of a vector is called its norm. As such, penalizing the model based on the size of the weights is also referred to as a weight or parameter norm penalty. It is possible to include both L1 and L2 approaches to calculating the size of the weights as the penalty. This is akin to the use of both penalties in the Elastic Net algorithm for linear and logistic regression. The L2 approach is perhaps the most used and is traditionally referred to as weight decay in the field of neural networks. It is called shrinkage in statistics, a name that encourages you to think of the impact of the penalty on the model weights during the learning process.
  • Recall that each node has input weights and a bias weight.
  • When using weight regularization, it is possible to use larger networks with less risk of overfitting. A good configuration strategy may be to start with larger networks and use weight decay.
  • A weight regularizer can be added to each layer when the layer is defined in a Keras model. This is achieved by setting the kernel_regularizer argument on each layer.
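
A minimal sketch (the 0.001 coefficient is just a common starting point, not a recommendation from the book):

    from keras.layers import Dense
    from keras.regularizers import l1, l2, l1_l2

    # L2 penalty (weight decay) on the layer's input weights
    layer = Dense(32, activation='relu', kernel_regularizer=l2(0.001))

    # the L1 norm, or both penalties as in the Elastic Net, work the same way:
    # Dense(32, kernel_regularizer=l1(0.001))
    # Dense(32, kernel_regularizer=l1_l2(l1=0.001, l2=0.001))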

Case Study

The case study aims to reduce the overfitting of an MLP network.
Moons dataset showing the class value of each sample

Overfitting the MLP

Overfitting correction of the MLP with L2 regularization

Grid searching of the best weight regularization parameter
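
The grid search loops over candidate regularization coefficients on a log scale and reports the test accuracy for each; a sketch of the idea, reusing the moons train/test split from the case study (the value grid is an assumption):

    from keras.models import Sequential
    from keras.layers import Dense
    from keras.regularizers import l2

    for param in [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:
        model = Sequential()
        model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(param)))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.fit(trainX, trainy, epochs=4000, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        print('param: %f, test accuracy: %.3f' % (param, test_acc))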

Chapter 14: Sparse representations with activity regularization

  • The outputs of a hidden layer within the network represent the features learned by the model at that point in the network.
  • There is a field of study focused on the efficient and effective automatic learning of features, often investigated by having a network reduce an input to a small learned feature before using a second network to reconstruct the original input from the learned feature. Models of this type are called autoencoders, or encoder-decoders, and their learned features can be useful to learn more about the domain (e.g. via visualization) and in predictive models.
  • In the same way that large weights in the network can signify an unstable and overfit model, large output values in the learned features can signify the same problems. It is desirable to have small values in the learned features, e.g. small outputs or activations from the encoder network.
  • The loss function of the network can be updated to penalize models in proportion to the magnitude of their activations. This is similar to weight regularization, where the loss function is updated to penalize the model in proportion to the magnitude of the weights. The output of a layer is referred to as its activation or activity; as such, this form of penalty or regularization is referred to as activation regularization or activity regularization.
  • The desire to have small activations or even very few activations with mostly zero values is also called a desire for sparsity. As such, this type of penalty is also referred to as sparse learning.
  • Autoencoder models that encourage sparse learned features are referred to as sparse autoencoders.
  • A constraint can be applied that adds a penalty proportional to the magnitude of the vector output of the layer. Two common methods for calculating the magnitude of the activation are:
    • the sum of the absolute activation values, called the L1 vector norm. This regularization has the effect of encouraging a sparse representation (lots of zeros), which is supported by the rectified linear activation function that permits true zero values.
    • the sum of the squared activation values, called the L2 vector norm.
  • Keras supports activity regularization. There are three different regularization techniques supported, each provided as a class in the keras.regularizers module.
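
A short sketch of the argument (the coefficient is illustrative):

    from keras.layers import Dense
    from keras.regularizers import l1

    # penalize the layer's outputs (activations) rather than its weights
    layer = Dense(32, activation='relu', activity_regularizer=l1(0.0001))
    # l2(...) and l1_l2(...) can be passed the same way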

Case study

  • For the case study, we use a standard classification problem that defines two two-dimensional concentric circles of observations, one circle for each class. Each observation has two input variables with the same scale and a class output value of either 0 or 1.
Two circles problem

  • Then we run a classical MLP on this problem and try to evaluate the accuracy of the model:
MLP on a binary classification problem without activity regularization

  • Then we apply activity regularization before the activation function:
MLP on a binary classification problem with activity regularization before the activation function
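
To place the penalty before the activation, one way in Keras is to give the Dense layer a linear activation and apply the ReLU as a separate layer, so the regularizer sees the raw outputs; a sketch under that assumption:

    from keras.models import Sequential
    from keras.layers import Dense, Activation
    from keras.regularizers import l1

    model = Sequential()
    # activity_regularizer is applied to the linear outputs of the layer
    model.add(Dense(500, input_dim=2, activation='linear', activity_regularizer=l1(0.0001)))
    model.add(Activation('relu'))  # the nonlinearity comes afterwards
    model.add(Dense(1, activation='sigmoid'))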


  • For the last test, we try the activity regularization positioned after the activation function:
MLP on a binary classification problem with activity regularization after the activation function


Conclusion

  • In these three chapters, we have seen how to:
    • fix overfitting with regularization
    • penalize large weights with weight regularization
    • build sparse representations with activity regularization.
  • The theory provided in the book and the case studies are an excellent way to refresh all the hyperparameters already seen in Jason Brownlee's previous book.
  • A big thanks to Jason Brownlee for the quality of this book.





Thursday, October 22, 2020

"Better Deep Learning" - Jason Brownlee (Chapter 9 to 11)

Preamble

  • This post is an extract of the book "Better Deep Learning" by Jason Brownlee. The post covers batch normalization, described in chapter 9 of the book, greedy layer-wise pre-training, described in chapter 10, and transfer learning, in chapter 11.
  • I experimented with the code provided in the book related to batch normalization, greedy layer-wise pre-training, and transfer learning, and this post presents the results of that experimentation.
The Big Apple


Chapter 9: Accelerate learning with Batch normalization

  • Training deep neural networks with tens of layers is challenging as they can be sensitive to the initial random weights and the configuration of the learning algorithm. One possible reason for this difficulty is that the distribution of the inputs to layers deep in the network may change after each minibatch when the weights are updated.
  • Batch normalization is a technique for training very deep neural networks that standardizes the inputs to a layer for each mini-batch. This has the effect of stabilizing the learning process and dramatically reducing the number of training epochs required to train deep networks.
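
In Keras this is the BatchNormalization layer, inserted between existing layers; a minimal sketch (layer sizes illustrative):

    from keras.models import Sequential
    from keras.layers import Dense, BatchNormalization

    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(BatchNormalization())   # standardize this layer's outputs per mini-batch
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])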

MLP on binary classification without batch normalization

MLP on binary classification with batch normalization after activation function


MLP on binary classification with batch normalization before activation function

Chapter 10: Deeper models with greedy layer-wise pre-training

  • As the number of hidden layers is increased, the amount of error information propagated back to earlier layers is dramatically reduced. This means that weights in hidden layers close to the output layer are updated normally, whereas weights in hidden layers close to the input layer are updated minimally or not at all. Generally, this problem prevented the training of very deep neural networks and was referred to as the vanishing gradient problem. An important milestone in the resurgence of neural networks that initially allowed the development of deeper neural network models was the technique of greedy layer-wise pre-training, often simply referred to as pre-training.
  • Pre-training involves successively adding a new hidden layer to a model and refitting, allowing the newly added model to learn the inputs from the existing hidden layer, often while keeping the weights for the existing hidden layers fixed (see the sketch after this list). This gives the technique the name layer-wise, as the model is trained one layer at a time. The technique is referred to as greedy because of the piecewise or layer-wise approach to solving the harder problem of training a deep network.
  • "Greedy algorithms break a problem into many components, then solve for the optimal version of each component in isolation"..."builds on the premise that training a shallow network is easier than training a deep one."
  • Although the weights in prior layers are held constant, it is common to fine tune all weights in the network at the end after the addition of the final layer.
  • There are two main approaches to pre-training:
    • supervised greedy layer-wise pre-training
    • unsupervised greedy layer-wise pre-training: we can expect unsupervised pretraining to be most helpful when the number of labels is very small. Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing.
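
A minimal sketch of the supervised variant, assuming a prepared trainX/trainy multiclass dataset (one-hot encoded targets); one way to implement it in Keras is to pop and re-add the output layer:

    from keras.models import Sequential
    from keras.layers import Dense

    # base model: one hidden layer plus the output layer
    model = Sequential()
    model.add(Dense(10, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainX, trainy, epochs=100, verbose=0)

    # greedily insert new hidden layers just before the output layer and refit
    for _ in range(5):
        output_layer = model.layers[-1]          # remember the output layer
        model.pop()                              # remove it
        model.add(Dense(10, activation='relu'))  # add a new hidden layer
        model.add(output_layer)                  # put the output layer back
        model.fit(trainX, trainy, epochs=100, verbose=0)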

Case study

  • Supervised greedy layer-wise pre-training:
Supervised greedy layer-wise pre-training model

  • Unsupervised greedy layer-wise pre-training:
    Unsupervised greedy layer-wise pre-training


Chapter 11: Jump-start training with transfer learning

  • An interesting benefit of deep learning neural networks is that they can be reused on related problems. Transfer learning refers to reusing a model trained on a different but somehow similar predictive modeling problem, partly or wholly, to accelerate training and improve the performance of a model on the problem of interest.
  • For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting.
  • Transfer learning has the benefit of decreasing the training time for a neural network model and can result in lower generalization error.

Case study

  • Step #1: A small multiclass classification problem is used. A model will first be trained on problem 1, and the weights of this model will be stored and reused for the second problem.
Problem 1 and 2 with three classes

  • Step #2: We develop an MLP for problem #1 and save the model so that we can reuse the weights later:
Loss and accuracy learning curves on the train and test datasets for an MLP on problem 1

  • Step #3: We now develop an MLP for problem 2 that we will use as a baseline:
Loss and accuracy learning curves on the train and test datasets for an MLP on problem 2
  • Step #4: We now use the weights of the problem #1 MLP to fit the MLP for problem #2:
Loss and accuracy learning curves on the train and test sets for an MLP with transfer learning on problem #2
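
A sketch of the reuse step, assuming the problem 1 model was saved with model.save('model.h5') and that trainX2/trainy2 hold the problem 2 data:

    from keras.models import load_model

    # load the model trained on problem 1 and reuse its weights for problem 2
    model = load_model('model.h5')
    # optionally hold the first hidden layers fixed so only later layers adapt
    for layer in model.layers[:2]:
        layer.trainable = False
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainX2, trainy2, epochs=100, verbose=0)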

  • Final step: We evaluate transfer learning by performing 30 repeats to obtain an average performance, varying the number of layers whose weights are updated or held fixed:
    Comparing standalone and transfer learning models

Conclusion

  • By simply adding batch normalization before the activation function of an MLP neural network, the speed of learning is increased: it needs half as many epochs compared to a scenario without normalization.
  • Pre-training may be useful for problems with small amounts of labeled data and large amounts of unlabeled data.
  • Transfer learning is a method for reusing a model trained on a related predictive modeling problem.
  • Thanks to Jason Brownlee for the code and for his explanation.

Wednesday, October 21, 2020

Better Deep Learning - Jason Brownlee - Chapter 8

Introduction

  • This post is a summary of my notes related to the excellent book from Jason Brownlee, "Better Deep Learning".
  • Only Chapter 8, related to fixing exploding gradients with gradient clipping, is explored.
  • This is a follow-up to the previous post related to fixing vanishing gradients.
Valmestroff, Lorraine, France

The exploding gradient problem

  • Training a neural network can become unstable given the choice of the error function, learning rate, or even the scale of the target variable. Large updates to weights during training can cause a numerical overflow or underflow, often referred to as exploding gradients. The problem of exploding gradients is more common with recurrent neural networks, such as LSTMs, given the accumulation of gradients unrolled over hundreds of input time steps. A common and relatively easy solution to the exploding gradients problem is to change the derivative of the error before propagating it backward through the network and using it to update the weights. Two approaches include rescaling the gradients given a chosen vector norm and clipping gradient values that exceed a preferred range. Together, these methods are referred to as gradient clipping.
  • Neural networks are trained using the stochastic gradient descent optimization algorithm. This requires first the estimation of the loss on one or more training examples, then the calculation of the derivative of the loss, which is propagated backward through the network in order to update the weights. Weights are updated using a fraction of the back-propagated error controlled by the learning rate. It is possible for the updates to the weights to be so large that the weights either overflow or underflow their numerical precision. In practice, the weights can take the value of NaN (not a number) or Inf (infinity) when they overflow or underflow, and for practical purposes the network will be useless from that point forward, forever predicting NaN values as signals flow through the invalid weights.
  • Some causes of exploding gradients:
    • poor choice of learning rate that results in large weight updates
    • poor choice of data preparation, allowing large differences in the target variable
    • poor choice of loss function, allowing the calculation of large error values
  • Exploding gradients can be avoided in general by careful configuration of the network model, such as choice of a small learning rate, scaled target variables, and a standard loss function.
  • A common solution to exploding gradients is to change the error derivative before propagating it backward through the network and using it to update the weights.
  • There are two main methods for updating the error derivative; they are:
    • Gradient scaling
    • Gradient clipping
  • Gradient scaling involves normalizing the error gradient vector such that the vector norm (magnitude) equals a defined value, such as 10.
  • Gradient clipping involves forcing the gradient values (element-wise) to a specific minimum or maximum value if the gradient exceeded an expected range. Together, these methods are often simply referred to as gradient clipping.
  • It is common to use the same gradient clipping configuration for all layers of the network.
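
Both variants are a one-argument change on the Keras optimizer; a sketch (the threshold values are illustrative):

    from keras.optimizers import SGD

    # gradient norm scaling: rescale the whole gradient vector when its norm exceeds 1.0
    opt_norm = SGD(lr=0.01, momentum=0.9, clipnorm=1.0)

    # gradient value clipping: force each gradient element into [-0.5, 0.5]
    opt_clip = SGD(lr=0.01, momentum=0.9, clipvalue=0.5)

    # then pass either optimizer to model.compile(loss=..., optimizer=...)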

Gradient Clipping Case Study

  • We create a regression predictive problem using a standard regression problem generator provided by the scikit-learn library in the make_regression() function:
Target variable of the regression problem
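
A sketch of the dataset creation (the parameters are assumptions, chosen to match the spirit of the book's example):

    from sklearn.datasets import make_regression

    # 1000 samples, 20 input features, statistical noise added to the target
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)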

  • Without scaling any data, we now create an MLP and we get the following results:
MLP with exploding gradient
  • We update the previous model with gradient norm scaling. This consists of setting the clipnorm argument on the optimizer:
Mean Squared Error Loss with gradient norm scaling


  • Another solution to address the exploding gradient problem is to clip the gradient if it becomes too large or too small:
Mean Squared Error Loss with gradient clipping


Conclusion

  • By simply adding the clipvalue or clipnorm argument in the optimization algorithm, the example above has demonstrated the ability to remove the exploding gradient problem.




Sunday, October 18, 2020

Better Deep Learning - Jason Brownlee - (Chapter 7)

  • This post is a summary of the best sentences, tips, and tricks extracted from my reading of the book by Jason Brownlee, "Better Deep Learning".
  • This post is limited to chapter 7, which is about fixing vanishing gradients.
Rose - Yvoire

Fix Vanishing Gradients with ReLU

  • In a neural network, the activation function is responsible for transforming the summed weighted input from the node into the activation of the node or output for that input. The rectified linear function is a piecewise linear function that will output the input directly if it is positive, otherwise, it will output zero. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
  • The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.
  • The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn faster and perform better.
  • The rectified linear activation function is the default activation when developing Multilayer Perceptron and convolutional neural networks.
  • For a given node, the inputs are multiplied by the weights in a node and summed together. This value is referred to as the summed activation of the node. The summed activation is then transformed via an activation function and defines the specific output or activation of the node. The simplest activation function is referred to as the linear activation function, where no transform is applied at all. A network comprised of only linear activation functions is very easy to train, but cannot learn complex mapping functions. Linear activations are still used in the output layer for networks that predict a quantity (e.g. regression problems).
  • Non linear activation functions are preferred as they allow the nodes to learn more complex structures in the data. Traditionally, two widely used nonlinear activation functions are the sigmoid and hyperbolic tangent activation functions.
  • The sigmoid activation function, also called the logistic function, is traditionally a very popular activation function for neural networks. The input to the function is transformed into a value between 0.0 and 1.0.
  • The hyperbolic tangent function, or tanh for short, is a similar shaped nonlinear activation function that outputs values between -1.0 and 1.0.
  • A general problem with both the sigmoid and tanh functions is that they saturate. Layers deep in large networks using these nonlinear functions fail to receive useful gradient information. Error is back propagated through the network and used to update the weights. The amount of error decreases dramatically with each additional layer through which it is propagated, given the derivative of the chosen activation function. This is called the vanishing gradient problem and prevents deep (multilayered) networks from learning effectively.
  • Adoption of ReLU may easily be considered one of the few milestones in the deep learning revolution, e.g. the techniques that now permit the routine development of very deep neural networks. The rectified linear activation function is a simple calculation that returns the value provided as input directly, or the value 0.0 if the input is 0.0 or less.
Rectified Linear Activation for negative and positive inputs
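
The calculation itself is a one-liner; in plain Python:

    def relu(x):
        # rectified linear activation: identity for positive inputs, zero otherwise
        return max(0.0, x)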

  • The derivative of the rectified linear function is also easy to calculate. Recall that the derivative of the activation function is required when updating the weights of a node as part of the backpropagation error. The derivative of the function is the slope.
  • Tips for using the Rectified Linear Activation:
    • Use ReLU as the default activation function. For a long time, the default activation to use was the sigmoid activation function. Later, it was the tanh activation function. For modern deep learning neural networks, the default activation function is the rectified linear activation function.
    • Deep convolutional neural networks with ReLU train several times faster than their equivalents with tanh units
    • Use ReLU with MLPs and CNNs, but probably not RNNs
    • Try a smaller bias input value
    • Use He Weight initialization
    • It is good practice to scale input data prior to using a neural network.
    • The Leaky ReLU (LReLU or LReL) modifies the function to allow small negative values when the input is less than zero.

ReLU Case Study

  • How to use ReLU to counter the vanishing gradient problem with an MLP on a simple classification problem. This will be a simple feedforward neural network model, designed as we were taught in the late 1990s and early 2000s.
  • The hidden layer will use the hyperbolic tangent activation function (tanh) and the output layer will use the logistic activation function (sigmoid) to predict class 0 or class 1 or something in between. Using the hyperbolic tangent activation function in hidden layers was the best practice in the 1990s and 2000s.

Train and test set accuracy over training epochs for MLP


  • Deeper MLP model with multiple tanh layers:
Train and test set accuracy over training epochs for deep MLP with tanh

  • Deeper MLP model with ReLU:
Train and Test set accuracy over training epochs for deep MLP with ReLU

  • Use of the ReLU activation function has allowed us to fit a much deeper model, but this capability does not extend infinitely:
Train and test set accuracy over training epochs for deep MLP with ReLU with 20 hidden layers





Better Deep Learning - Jason Brownlee (Chapters 5 and 6)

  • This post is the follow-up to the first part of the book "Better Deep Learning".
  • This post is a summary of the very best parts of the book "Better Deep Learning" by Jason Brownlee.
  • In this post, you will get information about:
    • How to configure the speed of learning with Learning rate
    • How to stabilize learning with data scaling
Abondance, Alpes

Chapter 5: configure speed of learning with Learning Rate

  • The weights of a neural network cannot be calculated using an analytical method. Instead, the weights must be discovered via an empirical optimization procedure called stochastic gradient descent.
  • The amount of change to the model during each step of this search process, or the step size, is called the learning rate and provides perhaps the most important hyperparameter to tune for your neural network in order to achieve good performance on your problem.
  • Deep learning neural networks are trained using the stochastic gradient descent algorithm. Stochastic gradient descent is an optimization algorithm that estimates the error gradient for the current state of the model using examples from the training dataset, then updates the weights of the model using the backpropagation of errors algorithm, referred to simply as backpropagation.
  • The amount that the weights are updated during training is referred to as the step size or the learning rate. Specifically, the learning rate is a configurable hyperparameter used in the training of neural networks that has a small positive value, often in the range between 0.0 and 1.0.
  • The learning rate controls the amount of apportioned error that the weights of the model are updated with each time they are updated, such as at the end of each batch of training examples.
  • We should not use a learning rate that is too large or too small.
  • The learning rate may, in fact, be the most important hyperparameter to configure for your model.
  • Smaller batch sizes are better suited to smaller learning rates given the noisy estimate of the error gradient.
  • Diagnostic plots can be used to investigate how the learning rate impacts the rate of learning and learning dynamics of the model. One example is to create a line plot of loss over training epochs during training.

Case study on multi class classification problem

Different learning rates

  • Training a neural network can be made easier with the addition of history to the weight update. Specifically, an exponentially weighted average of the prior updates to the weight can be included when the weights are updated. This change to stochastic gradient descent is called momentum and adds inertia to the update procedure, causing many past updates in one direction to continue in that direction in the future.
  • The method of momentum is designed to accelerate learning.
  • Momentum is set to a value greater than 0.0 and less than one, where common values such as 0.9 and 0.99 are used in practice.
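
In Keras, momentum is an argument of the SGD optimizer; a minimal sketch:

    from keras.optimizers import SGD

    # classical momentum: each update blends in 90% of the previous update
    opt = SGD(lr=0.01, momentum=0.9)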

Suite of momentums

  • An alternative to using a fixed learning rate is to instead vary the learning rate over the training process. The way in which the learning rate changes over time (training epochs) is referred to as learning rate schedule or learning rate decay. Perhaps the simplest learning rate schedule is to decrease the learning rate linearly from a large initial value to a small value. This allows large weight changes in the beginning of the learning process and small changes or fine-tuning towards the end of the learning process.
  • Instead of choosing a fixed learning rate hyperparameter, the configuration challenge involves choosing the initial learning rate and a learning rate schedule.
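
Keras's built-in SGD decay applies lr = initial_lr / (1 + decay * iteration) at every batch update; a sketch (values illustrative):

    from keras.optimizers import SGD

    # the learning rate shrinks a little on every weight update
    opt = SGD(lr=0.1, momentum=0.9, decay=0.001)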

Effect of decay on Learning Rate over multiple weight updates

Suite of decay rates

  • Adaptive learning rate: the performance of the model on the training dataset can be monitored by the learning algorithm and the learning rate can be adjusted in response. 
  • There are three adaptive learning rate methods that have proven to be robust over many types of neural network architectures and problem types: AdaGrad, RMSProp, and Adam. All maintain and adapt the learning rates for each of the weights in the model. Perhaps the most popular is Adam, as it builds upon RMSProp and adds momentum.
  • Keras supports rate schedules via callbacks. 
  • Keras provides the ReduceLROnPlateau callback that will adjust the learning rate when a plateau in model performance is detected, e.g. no change for a given number of training epochs.
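
A sketch of the callback, assuming a compiled model and train/test arrays (factor and patience values are illustrative):

    from keras.callbacks import ReduceLROnPlateau

    # drop the learning rate by 10x after 10 epochs without validation-loss improvement
    rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=1e-5)
    model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, callbacks=[rlrp])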

Learning rate for different patience values used in the ReduceLROnPlateau schedule

Training loss over epochs for different patience values used in the ReduceLROnPlateau schedule


Accuracy over epochs for different patience values used in the ReduceLROnPlateau schedule

Accuracy for a suite of adaptive learning rate

Chapter 6: Stabilize Learning with Data Scaling

  • Given the use of small weights in the model and the use of error between predictions and actual values, the scale of the inputs and outputs used to train the model is an important factor. Unscaled input variables can result in a slow or unstable learning process, whereas unscaled target variables on regression problems can result in exploding gradients causing the learning process to fail. Data preparation involves using techniques such as normalization and standardization to rescale input and output variables prior to training a neural network model.
  • Differences in the scales across input variables may increase the difficulty of the problem being modeled.
  • Scaling input and output is a critical step in using neural network models.
  • A good rule of thumb is that input variables should be small values, probably in the range of 0-1 or standardized with a zero mean and a standard deviation of one.
  • If the distribution of the quantity is normal, then it should be standardized, otherwise the data should be normalized.
  • Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.
  • You can normalize your dataset using the scikit-learn object MinMaxScaler.
  • The default scale for the MinMaxScaler is to rescale variables into the range [0, 1], although a preferred scale can be specified via the feature_range argument, which takes a tuple of the min and the max for all variables.
  • If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting. This can be done by calling inverse_transform() (see the sketch after this list).
  • Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. It is sometimes referred to as whitening.
  • Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well-behaved mean and standard deviation.
  • You can standardize your dataset using the scikit-learn object StandardScaler.
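
A sketch of both scalers, assuming trainX/testX arrays; note that each scaler is fit on the training data only:

    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # normalization: rescale each variable into [0, 1]
    scaler = MinMaxScaler(feature_range=(0, 1))
    scaler.fit(trainX)                             # learn min and max from training data
    trainX_scaled = scaler.transform(trainX)
    testX_scaled = scaler.transform(testX)
    trainX_restored = scaler.inverse_transform(trainX_scaled)  # back to the original scale

    # standardization: zero mean, unit standard deviation
    std = StandardScaler()
    std.fit(trainX)
    trainX_std = std.transform(trainX)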

Data Scaling Case Study

  • A regression predictive problem involves predicting a real-valued quantity. We can use a standard regression problem generator provided by the scikit-learn library in the make_regression() function:
Histograms of two of the twenty input variables for the regression problem

Histogram of the target variable for the regression problem

  • We then run a classical MLP model and the result is:
Example output from evaluating an MLP model on the unscaled regression problem
  • This demonstrates that, at the very least, some data scaling is required for the target variable. A line plot of training history is created but does not show anything, as the model almost immediately results in a NaN mean squared error.
Mean Squared Error on the regression problem with a standardized target variable


Mean Squared Error with unscaled, normalized, and standardized input variables for the regression problem.