Better Deep Learning - Jason Brownlee (Chapter 1 to 4)

Preamble

This post is my own summary of the book "Better Deep Learning" by Jason Brownlee.
This is the eighth book by Jason Brownlee in the field of neural networks that I have read.
"Focus on how to get things done" and "learn by doing" capture the spirit of the book.
In this post you will find information on how to:

Improve learning by understanding optimization
Configure capacity with nodes and layers
Configure gradient precision with batch size
Configure what to optimize with loss functions

Introduction

The last 5 to 10 years have seen the development and adoption of modern network configurations, regularization techniques, and ensemble algorithms that result in superior performance.
The challenge of getting good performance can be broken down into three main areas: problems with learning, problems with generalization, and problems with predictions.
The techniques are available; you just need to know what they are and when to use them.
You must diagnose the type of performance problem you are having with your model, then carefully choose and evaluate a given intervention tailored to that diagnosed problem. There are three types of problems:

problems with learning: techniques for better learning improve or accelerate the adaptation of neural network model weights in response to a training dataset.
problems with generalization: techniques for better generalization improve the performance of a neural network model on a holdout dataset.
problems with predictions: techniques for better predictions reduce the variance in the performance of a final model.

Diagnostic Learning Curves: a learning curve is a plot of model learning performance over experience or time.
Train Learning Curve: learning curve calculated from the training dataset that gives an idea of how well the model is learning.
Validation Learning Curve: learning curve calculated from a hold-out validation dataset that gives an idea of how well the model is generalizing.
Underfit Learning Curves: underfitting refers to a model that cannot learn the training dataset. An underfit model can be identified from the learning curve of the training loss only. An underfit model may also be identified by a training loss that is decreasing and continues to decrease at the end of the plot, which suggests that training was halted prematurely.
Overfit Learning Curves: the problem with overfitting is that the more specialized the model becomes to the training data, the less well it is able to generalize to new data, resulting in an increase in generalization error. The loss of the model will almost always be lower on the training dataset than on the validation dataset.
Unrepresentative Train Dataset: means that the train dataset does not provide sufficient information to learn the problem, relative to the validation dataset used to evaluate it.
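To make the diagnosis a little more concrete, here is a minimal sketch in plain Python. The loss values are invented for illustration, and the helper names `generalization_gap` and `turning_point` are my own assumptions, not something from the book or from Keras. The sketch computes the gap between the validation and train curves, and the epoch after which the validation loss stops improving, which is where an overfit curve starts to turn upward.

```python
# Hypothetical per-epoch losses, made up for this illustration.
train_loss = [1.00, 0.70, 0.50, 0.35, 0.25, 0.18, 0.12]
val_loss = [1.10, 0.80, 0.62, 0.55, 0.57, 0.63, 0.72]


def generalization_gap(train, val):
    """Validation loss minus train loss, epoch by epoch."""
    return [v - t for t, v in zip(train, val)]


def turning_point(val):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val)), key=lambda epoch: val[epoch])


gaps = generalization_gap(train_loss, val_loss)
best_epoch = turning_point(val_loss)

# Train loss keeps falling while validation loss rises after best_epoch:
# the widening gap is the signature of overfitting described above.
print("best epoch:", best_epoch)
print("gap at first and last epoch:", gaps[0], gaps[-1])
```

A train loss that is still falling at the last epoch, with no turning point in the validation curve, would point toward underfitting instead.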

Part I Better Learning

Chapter 1: Improve Learning by Understanding Optimization

Deep learning networks learn a mapping function from inputs to outputs. This is achieved by updating the weights of the network in response to the errors the model makes on the training dataset. Updates are made to continually reduce this error until either a good enough model is found or the learning process gets stuck and stops. Training is the most challenging part of using neural networks in general, and by far the most time consuming, both in terms of the effort required to configure the process and the computational complexity required to execute it.
Developing a model requires historical data from the domain that is used as training data.
For example, a problem where the output is a quantity would be described generally as a regression predictive modeling problem, whereas a problem where the output is a label would be described generally as a classification predictive modeling problem.
The ability to work well on specific examples and new examples is called the ability of the model to generalize.
A multilayer perceptron is just a mathematical function mapping some set of input values to output values.
As such, we can describe the broader problem that neural networks solve as function approximation. They learn to approximate an unknown underlying mapping function given a training dataset. They do this by learning weights (the model parameters), given a specific network structure that we must specify.
The error of every possible set of weights can be pictured as a landscape, the error surface. The lowest point in the landscape is referred to as the global minimum, and the other low points are local minima.
It is hard to know whether the optimization algorithm has settled in a local minimum or not.
A classical approach to addressing the problem of local minima is to restart the search process multiple times with a different starting point (random initial weights) and allow the optimization algorithm to find a different, and hopefully better, local minimum. This is called multiple restarts or random restarts.
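The random restarts idea can be sketched on a toy problem. The one-dimensional error surface below is a hypothetical polynomial that I picked because it has one local and one global minimum. It is not taken from the book, and plain gradient descent stands in for the training procedure.

```python
import random


def loss(w):
    # Hypothetical non-convex error surface with two minima.
    return w ** 4 - 3 * w ** 2 + w


def grad(w):
    # Derivative of the loss above.
    return 4 * w ** 3 - 6 * w + 1


def gradient_descent(w, lr=0.01, steps=500):
    """Walk downhill from the starting point w."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def random_restarts(n_restarts=10, seed=1):
    """Run the search from several random starting points, keep the best."""
    rng = random.Random(seed)
    best_w, best_loss = None, float("inf")
    for _ in range(n_restarts):
        w = gradient_descent(rng.uniform(-2.0, 2.0))
        if loss(w) < best_loss:
            best_w, best_loss = w, loss(w)
    return best_w, best_loss


# A single run started on the right-hand side gets stuck in the local minimum.
stuck_w = gradient_descent(1.5)
best_w, best_loss = random_restarts()
print("single run:", stuck_w, loss(stuck_w))
print("best of restarts:", best_w, best_loss)
```

With a real network the starting point is the random weight initialization, so a restart simply means training again with a different seed.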

Convex and non-convex

Navigating the non-convex error surface: this involves repeating the steps of evaluating the model and updating the model parameters in order to step down the error surface.
The algorithm that is most commonly used to navigate the error surface is called stochastic gradient descent, or SGD for short.
Stochastic gradient descent is efficient because it uses gradient information to update the model weights, with the gradient computed by an algorithm called backpropagation.
Strictly speaking, backpropagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.
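The split between "computing the gradient" and "using the gradient" can be shown with a minimal sketch. I assume a one-weight linear model with squared error and a noise-free toy dataset. The function names are hypothetical, and the hand-written chain rule plays the role that backpropagation plays in a real network.

```python
def gradients(w, b, x, y):
    """Gradient computation (the backpropagation role): chain rule on
    the squared error of a single example for the model w * x + b."""
    error = (w * x + b) - y
    return 2 * error * x, 2 * error


def sgd_step(w, b, grad_w, grad_b, lr):
    """Learning (the stochastic gradient descent role): move the
    parameters a small step against the gradient."""
    return w - lr * grad_w, b - lr * grad_b


# Toy data generated from y = 2x + 1, an assumption for this example.
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]

w, b = 0.0, 0.0
for epoch in range(200):
    for x, y in data:  # one example per update
        grad_w, grad_b = gradients(w, b, x, y)
        w, b = sgd_step(w, b, grad_w, grad_b, lr=0.1)

print(w, b)  # approaches the generating values 2 and 1
```

Swapping `sgd_step` for another update rule would leave `gradients` untouched, which is the point made in the text.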
Hyperparameters:

Network topology
Loss function

An error function must be chosen, often called the objective function, cost function, or loss function. Typically, a specific probabilistic framework for inference is chosen, called Maximum Likelihood. Under this framework, the commonly chosen loss functions are cross-entropy for classification problems and mean squared error for regression problems.

Weight initialization
Batch size
Learning rate
Epochs
Data preparation

Neural networks learn a mapping function from inputs to outputs that can be summarized as solving the problem of function approximation.
Unlike other machine learning algorithms, the parameters of a neural network must be found by solving a non-convex optimization problem with many good solutions and many misleadingly good solutions.
Stochastic gradient descent is used to solve the optimization problem, where model parameters are updated each iteration using the backpropagation algorithm.

Chapter 2: Configure capacity with Nodes and Layers

We can control whether a model is more likely to overfit or underfit by altering its capacity.
The capacity of a neural network can be controlled by two aspects of the model:

number of nodes: the width
number of layers: the depth

Keras allows you to easily add nodes and layers to your model.
Convolutional neural networks, or CNNs, don't have nodes; instead you specify the number of filter maps and their shape.
When stacking recurrent layers, an important difference is that recurrent layers expect a three-dimensional input, so the prior recurrent layer must return the full sequence of outputs rather than the single output for each node at the end of the input sequence (in Keras, by setting return_sequences=True on that layer).
The scikit-learn library provides the make_blobs() function, which can be used to create a multi class classification problem with a prescribed number of samples, input variables, classes, and variance of samples within a class.
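To show what such a generator does, here is a simplified stand-in written with the standard library only. It is not the scikit-learn implementation: `make_blobs_sketch` is a hypothetical name, and the way centers are drawn is my own assumption. The real function is `sklearn.datasets.make_blobs`, whose parameters include `n_samples`, `n_features`, `centers`, `cluster_std` and `random_state`.

```python
import random


def make_blobs_sketch(n_samples, n_features, centers, cluster_std, seed):
    """Draw Gaussian clouds of points around randomly placed class centers."""
    rng = random.Random(seed)
    # One center per class, placed uniformly in a box (an assumption).
    center_points = [
        [rng.uniform(-10.0, 10.0) for _ in range(n_features)]
        for _ in range(centers)
    ]
    X, y = [], []
    for i in range(n_samples):
        label = i % centers  # cycle through classes to keep them balanced
        X.append([rng.gauss(c, cluster_std) for c in center_points[label]])
        y.append(label)
    return X, y


X, y = make_blobs_sketch(
    n_samples=1000, n_features=2, centers=3, cluster_std=2.0, seed=2
)
print(len(X), len(X[0]), sorted(set(y)))
```

A larger `cluster_std` makes the clouds overlap more, which makes the classification problem harder.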

Example of a multi class classification problem

Influence of the number of nodes

Influence of the number of layers

A model with a hidden layer of 10 nodes is not equivalent to a model with two hidden layers of five nodes each. The latter has a much greater capacity. The danger is that a model with more capacity than is required is likely to overfit the training data. As with a model that has too many nodes, a model with too many layers will likely be unable to learn the training dataset, getting lost or stuck during the optimization process.
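A quick way to compare the two layouts is to count their weights. The sketch below assumes fully connected layers with one bias per node, 2 input features and 3 output classes. These sizes are my own choice for illustration, and `count_parameters` is a hypothetical helper.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the number of nodes per layer, input first.
    """
    return sum(
        (n_in + 1) * n_out  # +1 for the bias of each output node
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


wide = count_parameters([2, 10, 3])    # one hidden layer of 10 nodes
deep = count_parameters([2, 5, 5, 3])  # two hidden layers of 5 nodes

print(wide, deep)  # 63 63
```

Under these assumptions both layouts happen to have the same number of weights, which suggests that the difference described above comes from how the layers are composed rather than from the raw parameter count.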

Chapter 3: Configure Gradient Precision with Batch Size

Neural networks are trained using gradient descent, where the estimate of the error used to update the weights is calculated based on a subset of the training dataset. The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm.
Batch size controls the accuracy of the estimate of the error gradient when training neural networks.
There is a tension between batch size and the speed and stability of the learning process.
Neural networks are trained using the stochastic gradient descent optimization algorithm. This involves using the current state of the model to make a prediction, comparing the predictions to the actual values, and using the difference as an estimate of the error gradient. This error gradient is then used to update the model weights and the process is repeated. The error gradient is a statistical estimate. The more training examples used in the estimate, the more accurate this estimate will be and the more likely that the weights of the network will be adjusted in a way that will improve the performance of the model.
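The claim that more examples give a more accurate gradient estimate can be checked with a small experiment. The sketch below uses an invented dataset and a one-weight model, both assumptions of mine. It measures how much the minibatch gradient scatters from one random batch to the next for several batch sizes.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical data: y = 3x plus noise. The model is y_hat = w * x.
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]
w = 0.0  # current (untrained) weight


def gradient(indices):
    """Gradient of the mean squared error with respect to w,
    estimated from the selected examples only."""
    return sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)


def spread(batch_size, trials=200):
    """Standard deviation of the gradient estimate over random batches."""
    estimates = [
        gradient(rng.sample(range(len(xs)), batch_size)) for _ in range(trials)
    ]
    return statistics.pstdev(estimates)


full_gradient = gradient(range(len(xs)))
spreads = {batch_size: spread(batch_size) for batch_size in (1, 32, 256)}

print("full-batch gradient:", full_gradient)
for batch_size, value in spreads.items():
    print("batch size", batch_size, "spread", value)
```

The spread shrinks as the batch grows, so each update is more reliable, but each estimate also costs more computation.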
Optimization algorithms that use the entire train set are called batch or deterministic gradient methods, because they process all of the train examples simultaneously in a large batch.
The number of training examples used in the estimate of the error gradient is a hyperparameter for the learning algorithm called the batch size, or simply the batch. A batch size of 32 means that 32 samples from the train dataset will be used to estimate the error gradient before the model weights are updated. One training epoch means that the learning algorithm has made one pass through the train dataset (using every sample once), where examples were separated into randomly selected groups of batch size.
Batch gradient descent: batch size is set to the total number of examples in the train dataset
Stochastic gradient descent: batch size is set to one
Minibatch gradient descent: batch size is set to more than one and less than the total number of examples in the training dataset.
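The three flavors differ only in the batch size, as the sketch below shows. It assumes a train set of 500 examples, a number I picked for illustration, and the helper names are hypothetical. It splits one epoch into shuffled batches and counts the resulting weight updates.

```python
import math
import random


def iterate_minibatches(n_examples, batch_size, rng):
    """Yield lists of example indices covering the dataset once."""
    indices = list(range(n_examples))
    rng.shuffle(indices)  # random grouping, new for every epoch
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]


def updates_per_epoch(n_examples, batch_size):
    """Number of weight updates in one pass over the data."""
    return math.ceil(n_examples / batch_size)


n_train = 500
rng = random.Random(1)
for name, batch_size in [("batch", n_train), ("stochastic", 1), ("minibatch", 32)]:
    batches = list(iterate_minibatches(n_train, batch_size, rng))
    print(name, len(batches))  # 1, 500 and 16 updates respectively
```

When the batch size does not divide the dataset evenly, the last batch is simply smaller (here 20 examples instead of 32).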
We need to one hot encode the target variable, transforming the integer class values into binary vectors.
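One hot encoding itself is a small transformation. The function below is a plain Python sketch with a hypothetical name. In Keras the same job is typically done with the `to_categorical` utility.

```python
def one_hot(labels, n_classes):
    """Turn integer class labels into binary indicator vectors."""
    return [
        [1 if k == label else 0 for k in range(n_classes)]
        for label in labels
    ]


encoded = one_hot([0, 2, 1], n_classes=3)
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```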

Cross-entropy and accuracy for an MLP with Batch Gradient Descent

Cross-entropy and accuracy for an MLP with Stochastic Gradient Descent

MLP fit with Stochastic Gradient Descent and smaller learning rate

Batch gradient descent: use a relatively larger learning rate and more training epochs.
Stochastic gradient descent: use a relatively smaller learning rate and fewer training epochs.
An alternative to using stochastic gradient descent and tuning the learning rate is to hold the learning rate constant and change the batch size. In effect, it means that we specify the rate of learning, or amount of change to apply to the weights each time we estimate the error gradient, but vary the accuracy of the gradient based on the number of samples used to estimate it. Holding the learning rate at 0.01 as we did with batch gradient descent, we can set the batch size to 32, a widely adopted batch size.

MLP fit with Minibatch Gradient Descent

Effect of batch size

The plots above show that a small batch size generally results in rapid learning but a volatile learning process with higher variance in the classification accuracy. Larger batch sizes slow down the learning process (in terms of the learning curves), but the final stages converge to a more stable model, exemplified by lower variance in classification accuracy.

Chapter 4: Configure what to optimize with Loss Functions

Neural networks are trained using an optimization process that requires a loss function to calculate the model error.
Maximum Likelihood provides a framework for choosing a loss function when training neural networks and machine learning models in general. Maximum Likelihood Estimation, or MLE, is a framework for inference for finding the best statistical estimates of parameters from historical training data.
Cross-entropy and mean squared error are the two main types of loss functions to use when training neural network models.
The gradient in gradient descent refers to an error gradient. The model with a given set of weights is used to make predictions and the error for those predictions is calculated. The gradient descent algorithm seeks to change the weights so that the next evaluation reduces the error, meaning the optimization algorithm is navigating down the gradient (or slope) of error.
Importantly, the choice of loss function is directly related to the activation function used in the output layer of your neural network (see chapter 4 of this post).

The case study for regression

Mean Squared Error loss, or MSE for short, is calculated as the average of the squared differences between the predicted and actual values.
There may be regression problems in which the target value has a spread of values and when predicting a large value, you may not want to punish as heavily as mean squared error. Instead, you can first calculate the natural logarithm of each of the predicted values, then calculate the mean squared error. This is called the Mean Squared Logarithmic Error loss, or MSLE for short. It has the effect of relaxing the punishing effect of large differences in large predicted values.
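The relaxing effect of the logarithm can be seen with two invented predictions that are each wrong by a factor of two, one on a small target and one on a large target. The sketch below is plain Python. It uses log(1 + value) rather than log(value) so that zero targets do not break the calculation, which is my assumption here and worth checking against the library you use.

```python
import math


def mse(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    """Mean squared error computed on log(1 + value)."""
    return sum(
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ) / len(y_true)


small = ([10.0], [20.0])      # off by 10 on a small target
large = ([1000.0], [2000.0])  # off by 1000 on a large target

print("MSE :", mse(*small), mse(*large))    # grows by a factor of 10000
print("MSLE:", msle(*small), msle(*large))  # stays in the same range
```

MSLE therefore behaves like a penalty on the relative error, while MSE penalizes the absolute error.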

Comparison of MSLE vs MSE

On some regression problems, the distribution of the target variable may be mostly Gaussian, but may have outliers, e.g. large or small values far from the mean value. The Mean Absolute Error, or MAE, loss is an appropriate loss function in this case as it is more robust to outliers. It is calculated as the average of the absolute difference between the actual and predicted values.
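The robustness to outliers can be illustrated on a handful of invented values in which the last target is an outlier that the model misses badly. This is a sketch with made-up numbers, not a result from the book.

```python
def mae(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is the outlier
y_pred = [1.5, 2.5, 2.5, 3.5, 4.0]

# Loss without the outlier, then with it.
mae_clean, mae_all = mae(y_true[:4], y_pred[:4]), mae(y_true, y_pred)
mse_clean, mse_all = mse(y_true[:4], y_pred[:4]), mse(y_true, y_pred)

print("MAE grows by a factor of", mae_all / mae_clean)
print("MSE grows by a factor of", mse_all / mse_clean)
```

A single outlier inflates the MSE far more than the MAE, so the weight updates driven by MSE would be dominated by that one example.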

Mean Absolute Error and Mean Squared Error over training epochs

The case study for binary classification

Cross-entropy is the default loss function to use for binary classification problems. It is intended for use with binary classification where the target values are in the set {0, 1}. Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason. Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized and a perfect cross-entropy value is 0.
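For reference, binary cross-entropy can be written in a few lines of plain Python. The predicted probabilities are invented, and the clipping constant `eps` is an assumption I added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of targets in {0, 1}."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


targets = [1, 0]
confident_right = binary_cross_entropy(targets, [0.99, 0.01])
unsure = binary_cross_entropy(targets, [0.5, 0.5])
confident_wrong = binary_cross_entropy(targets, [0.01, 0.99])

print(confident_right, unsure, confident_wrong)
```

The loss is close to 0 for confident correct predictions, equals log(2) when the model is undecided, and grows quickly for confident mistakes.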

Two circles classification problem used for binary classification

Cross-Entropy Loss and Classification accuracy over training epochs on the two circles binary classification problem.

An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with Support Vector Machine (SVM) models. It is intended for use with binary classification where the target values are in the set {-1, 1}. The hinge loss function encourages examples to have the correct sign, assigning more error when there is a difference in the sign between the actual and predicted class values. Reports of performance with the hinge loss are mixed, sometimes resulting in better performance than cross-entropy on binary classification problems.

Hinge loss and classification accuracy over training epochs on the two circles binary classification problem

A popular extension of hinge loss is the squared hinge loss, which simply calculates the square of the hinge loss score.
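Both variants fit in a few lines. The sketch below uses the standard definition max(0, 1 - target * score) with targets in {-1, 1}. The scores are invented, and the function names are my own.

```python
def hinge(y_true, y_score):
    """Mean hinge loss for targets in {-1, 1}."""
    return sum(
        max(0.0, 1.0 - t * s) for t, s in zip(y_true, y_score)
    ) / len(y_true)


def squared_hinge(y_true, y_score):
    """Mean of the squared per-example hinge loss."""
    return sum(
        max(0.0, 1.0 - t * s) ** 2 for t, s in zip(y_true, y_score)
    ) / len(y_true)


targets = [1, -1, 1]
scores = [2.0, -0.5, -1.0]  # right and confident, right but weak, wrong sign

hinge_value = hinge(targets, scores)            # (0 + 0.5 + 2) / 3
squared_value = squared_hinge(targets, scores)  # (0 + 0.25 + 4) / 3
print(hinge_value, squared_value)
```

The example with the wrong sign contributes most of the loss, and squaring increases its share further while shrinking the penalty on small margin violations.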

Squared Hinge Loss

The case study for multi class classification

Defining a contrived multi class classification case study: multi class classification predictive modeling problems are those where examples are assigned one of more than two classes. The make_blobs() function provided by scikit-learn provides a way to generate examples given a specified number of classes and input features.

Scatter plot of examples generated from the blobs multi class classification problem

Multiclass cross-entropy loss: cross-entropy is the default loss function for multiclass classification problems. Cross-entropy will calculate a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem. The score is minimized and a perfect cross-entropy value is 0. Cross-entropy can be specified as the loss function in Keras by specifying 'categorical_crossentropy' when compiling the model.
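To see what the loss measures for a single example, here is a plain Python sketch of a softmax output followed by cross-entropy against a one hot target. The raw scores are invented, and the helper names are hypothetical rather than Keras functions.

```python
import math


def softmax(scores):
    """Turn raw output scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Negative log probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_onehot, y_prob))


probs = softmax([2.0, 1.0, 0.1])
loss_value = categorical_cross_entropy([1, 0, 0], probs)
perfect = categorical_cross_entropy([1, 0, 0], [1.0, 0.0, 0.0])

print(probs, loss_value, perfect)
```

Only the probability of the true class enters the sum, and the loss reaches 0 when that probability is 1.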

Cross-entropy loss and classification accuracy over training epochs on the blobs multi class classification problem

A possible cause of frustration when using cross-entropy with classification problems with a large number of labels is the one hot encoding. For example, predicting words in a vocabulary may involve tens of thousands of categories, one for each label. This can mean that the target element of each training example may require a one hot encoded vector with tens or hundreds of thousands of zero values, requiring significant memory. Sparse cross-entropy addresses this by performing the same cross-entropy calculation for error, without requiring that the target variable be one hot encoded prior to training. Sparse cross-entropy can be used in Keras for multi class classification by using 'sparse_categorical_crossentropy'.
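The memory argument becomes clear when both forms are written side by side. The sketch below assumes a vocabulary of 10,000 labels and an invented predicted distribution. It shows that an integer label yields the same loss value as the one hot vector.

```python
import math


def one_hot_cross_entropy(y_onehot, y_prob):
    """Cross-entropy with a one hot encoded target vector."""
    return -sum(math.log(p) for t, p in zip(y_onehot, y_prob) if t)


def sparse_cross_entropy(label, y_prob):
    """Same calculation, with the target given as a class index."""
    return -math.log(y_prob[label])


vocab_size = 10000
label = 0

# Invented prediction: half of the probability mass on the true label.
probs = [0.5] + [0.5 / (vocab_size - 1)] * (vocab_size - 1)

# The one hot target needs vocab_size numbers, the sparse target one.
one_hot_target = [0] * vocab_size
one_hot_target[label] = 1

dense_loss = one_hot_cross_entropy(one_hot_target, probs)
sparse_loss = sparse_cross_entropy(label, probs)
print(dense_loss, sparse_loss)
```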

Sparse cross-entropy loss and classification accuracy over training epochs on the blobs multi class classification problem

Kullback Leibler Divergence, or KL Divergence for short, is a measure of how one probability distribution differs from a baseline distribution. A KL divergence loss of 0 suggests the distributions are identical. In practice, the behavior of KL divergence is very similar to cross-entropy. KL divergence loss can be used in Keras by specifying 'kullback_leibler_divergence' in the compile() function.
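One way to see why the two losses behave so similarly is the identity KL(p, q) = cross-entropy(p, q) - entropy(p). The sketch below checks it numerically on invented distributions. With a one hot target the entropy term is 0, so both losses give the same number.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """How much distribution q differs from the baseline p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


baseline = [0.7, 0.2, 0.1]
predicted = [0.5, 0.3, 0.2]
one_hot_target = [1.0, 0.0, 0.0]

kl_soft = kl_divergence(baseline, predicted)
kl_hard = kl_divergence(one_hot_target, predicted)

print(kl_soft, cross_entropy(baseline, predicted) - entropy(baseline))
print(kl_hard, cross_entropy(one_hot_target, predicted))
print(kl_divergence(baseline, baseline))  # identical distributions
```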

KL Divergence loss and classification accuracy over training epochs on the blobs multi class classification problem

Conclusion

The first four chapters of the book "Better Deep Learning" by Jason Brownlee were worth reading.
I will continue posting the most important lessons learned from this book in the next blog post.

Saturday, October 17, 2020