europe: "Better Deep Learning" - Jason Brownlee (chapters 12 to 14)

Preamble

This post is an extract from the book "Better Deep Learning" by Jason Brownlee. It covers fixing overfitting with regularization (chapter 12), penalizing large weights with weight regularization (chapter 13), and sparse representations with activity regularization (chapter 14).
The post collects the sentences that rang a bell with me and that I find useful to remember: the theory.
It also describes the results of running the code provided in the book: the practice.
You get both the theory and the practice, which is what I like in Jason Brownlee's books.

Alps: view from the Aiguille du Midi.

Chapter 12: Fix overfitting with regularization

Training a deep neural network that can generalize well to new data is a challenging problem. A model with too little capacity cannot learn the problem, whereas a model with too much capacity can learn it too well and overfit the training dataset. Both cases result in a model that does not generalize well. A modern approach to reducing generalization error is to use a larger model together with regularization during training that keeps the weights of the model small.
Underfitting can easily be addressed by increasing the capacity of the network, but overfitting requires the use of specialized techniques.
Regularization methods like weight decay provide an easy way to control overfitting for large neural network models.
A modern recommendation for regularization is to use early stopping with dropout and a weight constraint.
The challenge in machine learning is that we must perform well on new, previously unseen inputs, not just those on which the model was trained. The ability to perform well on previously unobserved inputs is called generalization.
A model fit can be considered in the context of the bias-variance trade-off. An underfit model has high bias and low variance. An overfit model has low bias and large variance.
A benefit of very deep neural networks is that their performance continues to improve as they are fed larger and larger datasets.
A problem is said to be ill-posed if small changes in the given information cause large changes in the solution. This instability with respect to the data makes solutions unreliable because small measurement errors or uncertainties in parameters may be greatly magnified and lead to wildly different responses. The idea behind regularization is to use supplementary information to restate an ill-posed problem in a stable form.
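To make the idea of an ill-posed problem concrete for myself, here is a minimal sketch that is not from the book. It assumes a tiny linear system with two nearly identical columns, and the names `solve_plain`, `solve_ridge` and the value `lam=0.01` are my own choices. A very small change in the data moves the plain solution a lot, while the solution with an L2 penalty barely moves.

```python
import numpy as np

# Two nearly identical columns make this linear problem ill-posed.
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])
b_noisy = b + np.array([0.0, 0.0001])  # tiny measurement error


def solve_plain(A, b):
    """Exact solution of A x = b, without any regularization."""
    return np.linalg.solve(A, b)


def solve_ridge(A, b, lam=0.01):
    """Minimize ||A x - b||^2 + lam * ||x||^2 (L2 penalty on the solution)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)


# How far does the solution move when the data changes slightly?
plain_shift = np.linalg.norm(solve_plain(A, b_noisy) - solve_plain(A, b))
ridge_shift = np.linalg.norm(solve_ridge(A, b_noisy) - solve_ridge(A, b))

print(f"shift without penalty: {plain_shift:.4f}")
print(f"shift with L2 penalty: {ridge_shift:.6f}")
```

The penalty plays the role of the "supplementary information" mentioned above: it states a preference for small solutions, which is enough to stabilize the answer in this toy case.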
Regularization methods:

weight regularization: penalize the model during the training based on the magnitude of the weights
activity regularization: penalize the model during training based on the magnitude of the activations
weight constraint: constrain the magnitude of weights to be within a range or below a limit
dropout: probabilistically remove inputs during training
noise: add statistical noise to inputs during training
early stopping: monitor model performance on a validation set and stop training when performance degrades
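As a reminder of where each method in the list plugs into a model, here is a rough sketch, assuming the `tensorflow.keras` API. The layer sizes, the penalty strengths and the patience value are arbitrary placeholders of mine, not values from the book, and in practice you would rarely combine all of these at once.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, GaussianNoise
from tensorflow.keras.constraints import MaxNorm
from tensorflow.keras.regularizers import l1, l2
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Input(shape=(2,)),
    GaussianNoise(0.1),                     # noise: added to the inputs during training
    Dense(64, activation="relu",
          kernel_regularizer=l2(0.001),     # weight regularization
          activity_regularizer=l1(0.0001),  # activity regularization
          kernel_constraint=MaxNorm(3.0)),  # weight constraint
    Dropout(0.5),                           # dropout: randomly drops inputs to the next layer
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# early stopping: watch the validation loss and stop when it stops improving
early_stop = EarlyStopping(monitor="val_loss", patience=20,
                           restore_best_weights=True)

# Hypothetical usage, with X, y, X_val, y_val being your own arrays:
# model.fit(X, y, validation_data=(X_val, y_val), epochs=500, callbacks=[early_stop])
```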

Chapter 13: Penalize large weights with weight regularization

Neural networks learn a set of weights that best map inputs to outputs. A network with large weights can be a sign of an unstable network where small changes in the input can lead to large changes in the output. This can be a sign that the network has overfit the training dataset and will likely perform poorly when making predictions on new data. A solution to this problem is to update the learning algorithm to encourage the network to keep the weights small. This is called weight regularization and it can be used as a general technique to reduce overfitting of the training dataset and improve the generalization of the model.
Large weights in a neural network are a sign of a more complex network that has overfit the training data.
Penalizing a network based on the size of the network weights during training can reduce overfitting.
An L1 or L2 vector norm penalty can be added to the optimization of the network to encourage smaller weights.
Simpler models are less likely to overfit than complex ones.
Remember that when we train a neural network, we minimize a loss function, such as the log loss in classification or mean squared error in regression. In calculating the loss between the predicted and expected values in a batch, we can add the current size of all weights in the network, or of the weights in a single layer, to this calculation. This is called a penalty because we are penalizing the model in proportion to the size of its weights.
Larger weights result in a larger penalty, in the form of a larger loss score.
Smaller weights are considered more regular or less specialized and as such, we refer to this penalty as weight regularization.
The addition of a weight size penalty or weight regularization to a neural network has the effect of reducing generalization error and of allowing the model to pay less attention to less relevant input variables.
To calculate the size of the weights, there are two approaches:

calculate the sum of the absolute values of the weights, called the L1 norm
calculate the sum of the squared values of the weights, called the L2 norm (strictly speaking, the squared L2 norm)
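The two ways of measuring the size of the weights, and how the result is added to the loss, can be sketched in a few lines of NumPy. The weight values, the data loss of 0.30 and the penalty strength `lam` are made-up numbers for illustration, and the function names are mine.

```python
import numpy as np


def l1_penalty(weights):
    """Sum of the absolute values of the weights."""
    return float(np.sum(np.abs(weights)))


def l2_penalty(weights):
    """Sum of the squared values of the weights."""
    return float(np.sum(np.square(weights)))


def penalized_loss(data_loss, weights, lam, penalty):
    """Loss seen by the optimizer: data loss plus the scaled weight penalty."""
    return data_loss + lam * penalty(weights)


weights = np.array([0.5, -2.0, 0.0, 1.5])
data_loss = 0.30  # hypothetical loss between predictions and targets
lam = 0.01        # hypothetical regularization strength

loss_l1 = penalized_loss(data_loss, weights, lam, l1_penalty)
loss_l2 = penalized_loss(data_loss, weights, lam, l2_penalty)
print(l1_penalty(weights), l2_penalty(weights))
print(loss_l1, loss_l2)
```

Note how the single weight of -2.0 dominates the squared penalty: the L2 form punishes a few large weights much more than many small ones.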

The use of L2 in linear and logistic regression is often referred to as Ridge Regression.
The weights may be considered a vector, and the magnitude of a vector is called its norm. As such, penalizing the model based on the size of the weights is also referred to as a weight or parameter norm penalty. It is possible to include both the L1 and L2 approaches to calculating the size of the weights in the penalty. This is akin to the combination of both penalties used in the Elastic Net algorithm for linear and logistic regression. The L2 approach is perhaps the most used and is traditionally referred to as weight decay in the field of neural networks. It is called shrinkage in statistics, a name that encourages you to think of the impact of the penalty on the model weights during the learning process.
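The names "weight decay" and "shrinkage" made more sense to me after writing out one gradient step. The sketch below is my own illustration, assuming plain gradient descent and a data gradient of zero so that only the penalty acts. The gradient of `lam * sum(w**2)` is `2 * lam * w`, so each step multiplies the weights by `1 - 2 * lr * lam`. A combined L1 and L2 penalty in the spirit of Elastic Net is included as well, and all names and values are hypothetical.

```python
import numpy as np


def sgd_step_with_l2(w, data_grad, lr, lam):
    """One gradient descent step on data_loss + lam * sum(w**2)."""
    return w - lr * (data_grad + 2.0 * lam * w)


def elastic_net_penalty(w, lam1, lam2):
    """Combined penalty: lam1 * sum(|w|) + lam2 * sum(w**2)."""
    return float(lam1 * np.sum(np.abs(w)) + lam2 * np.sum(np.square(w)))


w_start = np.array([1.0, -4.0])
lr, lam = 0.1, 0.5

# With no signal from the data, the penalty alone shrinks the weights
# by a factor of (1 - 2 * lr * lam) = 0.9 at every step.
w = w_start.copy()
for step in range(3):
    w = sgd_step_with_l2(w, data_grad=np.zeros_like(w), lr=lr, lam=lam)
    print(step + 1, w)

combined = elastic_net_penalty(w_start, lam1=0.01, lam2=0.01)
print(combined)
```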
Recall that each node has input weights and a bias weight.
When using weight regularization, it is possible to use larger networks with less risk of overfitting. A good configuration strategy may be to start with larger networks and use weight decay.
A weight regularizer can be added to each layer when the layer is defined in a Keras model. This is achieved by setting the kernel_regularizer argument on each layer.
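A minimal sketch of what this looks like, assuming the `tensorflow.keras` API. The layer size and the value 0.001 are placeholders of mine. The `bias_regularizer` argument is the counterpart for the bias weights; it is left commented out here because I have not checked what the book does with it.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2, l1_l2

model = Sequential([
    Input(shape=(2,)),
    Dense(500, activation="relu",
          kernel_regularizer=l2(0.001),  # L2 penalty on the input weights
          # bias_regularizer=l2(0.001),  # optional: penalize the bias weights too
          ),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# The other penalties are set the same way:
l1_version = Dense(500, activation="relu", kernel_regularizer=l1(0.001))
both_version = Dense(500, activation="relu",
                     kernel_regularizer=l1_l2(l1=0.001, l2=0.001))
```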

Case Study

The case study aims at reducing the overfitting of an MLP network.

Moons dataset showing the class value of each sample
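For reference, a dataset of this kind can be generated with scikit-learn's `make_moons`. The sample count, noise level, seed and the small 30-sample training split below are assumptions on my side, chosen so that an MLP overfits easily; check the book for the exact values.

```python
from sklearn.datasets import make_moons

# Assumed settings: 100 noisy samples, fixed seed for repeatability.
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)

# A deliberately small training set makes overfitting easy to provoke.
n_train = 30
trainX, testX = X[:n_train], X[n_train:]
trainy, testy = y[:n_train], y[n_train:]

print(X.shape, y.shape)
print(trainX.shape, testX.shape)
```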

Overfitting the MLP

Overfitting correction of the MLP with L2 regularization

Grid searching of the best weight regularization parameter
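The grid search itself is just a loop over candidate penalty strengths. Below is a rough sketch of the idea, not the book's listing: the dataset settings, the 500-node hidden layer, the list of values and the helper name `evaluate_l2` are my assumptions. The number of epochs is kept very small so the sketch runs quickly; a real comparison would need far more training.

```python
from sklearn.datasets import make_moons
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l2


def evaluate_l2(lam, trainX, trainy, testX, testy, epochs=50):
    """Train one MLP with L2 strength `lam`; return (train_acc, test_acc)."""
    model = Sequential([
        Input(shape=(2,)),
        Dense(500, activation="relu", kernel_regularizer=l2(lam)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    model.fit(trainX, trainy, epochs=epochs, verbose=0)
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return float(train_acc), float(test_acc)


X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
trainX, testX, trainy, testy = X[:30], X[30:], y[:30], y[30:]

# Candidate strengths on a log scale (assumed values).
results = {}
for lam in [1e-1, 1e-2, 1e-3, 1e-4]:
    results[lam] = evaluate_l2(lam, trainX, trainy, testX, testy)
    print(f"lam={lam:g}: train={results[lam][0]:.3f}, test={results[lam][1]:.3f}")
```

Plotting train and test accuracy against the penalty strength then shows which value gives the best trade-off.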

Chapter 14: Sparse representations with activity regularization

The output of a hidden layer within the network represents the features learned by the model at that point in the network.
There is a field of study focused on the efficient and effective automatic learning of features, often investigated by having a network reduce an input to a small learned feature before using a second network to reconstruct the original input from the learned feature. Models of this type are called autoencoders, or encoder-decoders, and their learned features can be useful to learn more about the domain (e.g. via visualization) and in predictive models.
In the same way that large weights in the network can signify an unstable and overfit model, large output values in the learned features can signify the same problems. It is desirable to have small values in the learned features, e.g. small outputs or activations from the encoder network.
The loss function of the network can be updated to penalize models in proportion to the magnitude of their activations. This is similar to weight regularization, where the loss function is updated to penalize the model in proportion to the magnitude of the weights. The output of a layer is referred to as its activation or activity; as such, this form of penalty or regularization is referred to as activation regularization or activity regularization.
The desire to have small activations or even very few activations with mostly zero values is also called a desire for sparsity. As such, this type of penalty is also referred to as sparse learning.
Autoencoder models that are encouraged to learn sparse features are referred to as sparse autoencoders.
A constraint can be applied that adds a penalty proportional to the magnitude of the vector output of the layer. Two common methods for calculating the magnitude of the activation are:

sum of the absolute activation values, called the L1 vector norm. This regularization has the effect of encouraging a sparse representation (lots of zeros), which is supported by the rectified linear activation function that permits true zero values.
sum of the squared activation values, called the L2 vector norm.
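The same two measures, applied this time to the output of a layer rather than to its weights, can be sketched with NumPy. The pre-activation values are invented for illustration; the point is that the rectified linear function produces exact zeros, which is what makes a sparse representation possible.

```python
import numpy as np

# Hypothetical outputs of one hidden layer before the activation function.
pre_activation = np.array([-1.5, 0.2, -0.3, 2.0, 0.0, -0.7])

# Rectified linear activation: negative values become exactly zero.
activation = np.maximum(pre_activation, 0.0)

l1_activity = float(np.sum(np.abs(activation)))     # sum of absolute activations
l2_activity = float(np.sum(np.square(activation)))  # sum of squared activations
sparsity = float(np.mean(activation == 0.0))        # fraction of exact zeros

print(activation)
print(l1_activity, l2_activity, sparsity)
```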

Keras supports activity regularization. There are three different regularization techniques supported, each provided as a class in the keras.regularizers module.
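As far as I can tell, the three techniques are the L1 penalty, the L2 penalty and the combination of both, exposed as `l1`, `l2` and `l1_l2`. A minimal sketch, assuming the `tensorflow.keras` API and arbitrary penalty strengths, of passing them through the `activity_regularizer` argument of a layer:

```python
from tensorflow.keras.layers import Dense
from tensorflow.keras.regularizers import l1, l2, l1_l2

# Penalize the sum of absolute activations of the layer output.
layer_l1 = Dense(32, activation="relu", activity_regularizer=l1(0.001))

# Penalize the sum of squared activations of the layer output.
layer_l2 = Dense(32, activation="relu", activity_regularizer=l2(0.001))

# Penalize both at once.
layer_both = Dense(32, activation="relu",
                   activity_regularizer=l1_l2(l1=0.001, l2=0.001))
```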

Case study

For the case study, we are using a standard classification problem that defines two two-dimensional concentric circles of observations, one circle for each class. Each observation has two input variables with the same scale and a class output value of either 0 or 1.

Two circles problem
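A dataset like this can be produced with scikit-learn's `make_circles`. The sample count, noise level and seed are assumptions of mine, not necessarily the book's values.

```python
from sklearn.datasets import make_circles

# Assumed settings: 100 noisy samples on two concentric circles.
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)

print(X.shape)              # two input variables per observation
print(sorted(set(y.tolist())))  # class values
```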

Then we run a classical MLP on this problem and try to evaluate the accuracy of the model:

MLP on a binary classification problem without activity regularization

Then we apply activity regularization before the activation function:

MLP on a binary classification problem with activity regularization before the activation function

For the last test we make a trial with the activity regularization positioned after the activation function:

MLP on a binary classification problem with activity regularization after the activation function
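The difference between the last two experiments is only where the penalty is measured. Here is a sketch of both placements, assuming the `tensorflow.keras` API; the 500-node hidden layer and the L1 strength of 0.0001 are placeholder values. To regularize before the activation, the Dense layer is kept linear and the ReLU is moved into its own `Activation` layer.

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense
from tensorflow.keras.regularizers import l1

# Penalty applied BEFORE the activation: the regularizer sees the linear output.
model_before = Sequential([
    Input(shape=(2,)),
    Dense(500, activation="linear", activity_regularizer=l1(0.0001)),
    Activation("relu"),
    Dense(1, activation="sigmoid"),
])

# Penalty applied AFTER the activation: the regularizer sees the ReLU output.
model_after = Sequential([
    Input(shape=(2,)),
    Dense(500, activation="relu", activity_regularizer=l1(0.0001)),
    Dense(1, activation="sigmoid"),
])

for model in (model_before, model_after):
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
```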

Conclusion

In these three chapters, we have seen how to:

fix overfitting with regularization
penalize large weights with weight regularization
encourage sparse representations with activity regularization

The theory provided in the book and the case studies are an excellent way to review the hyperparameters already seen in Jason Brownlee's previous book.
A big thanks to Jason Brownlee for the quality of this book.

Sunday, October 25, 2020