Sunday, October 18, 2020

Better Deep Learning - Jason Brownlee - (Chapter 7)

  • This post is a summary of the key sentences, tips, and tricks extracted from my reading of Jason Brownlee's book, "Better Deep Learning".
  • This post is limited to Chapter 7, which is about fixing the vanishing gradient problem.

Fix Vanishing Gradients with ReLU

  • In a neural network, the activation function is responsible for transforming the summed weighted input to the node into the activation of the node, or output, for that input. The rectified linear function is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise. It has become the default activation function for many types of neural networks because a model that uses it is easier to train and often achieves better performance.
  • The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.
  • The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn faster and perform better.
  • The rectified linear activation function is the default activation when developing Multilayer Perceptron and convolutional neural networks.
  • For a given node, the inputs are multiplied by the weights in the node and summed together. This value is referred to as the summed activation of the node. The summed activation is then transformed via an activation function and defines the specific output or activation of the node. The simplest activation function is referred to as the linear activation function, where no transform is applied at all. A network comprised of only linear activation functions is very easy to train but cannot learn complex mapping functions. Linear activations are still used in the output layer for networks that predict a quantity (e.g. regression problems).
  • Nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data. Traditionally, the two most widely used nonlinear activation functions have been the sigmoid and hyperbolic tangent activation functions.
  • The sigmoid activation function, also called the logistic function, is traditionally a very popular activation function for neural networks. The input to the function is transformed into a value between 0.0 and 1.0.
  • The hyperbolic tangent function, or tanh for short, is a similar shaped nonlinear activation function that outputs values between -1.0 and 1.0.
  • A general problem with both the sigmoid and tanh functions is that they saturate. Layers deep in large networks using these nonlinear functions fail to receive useful gradient information. Error is backpropagated through the network and used to update the weights. Because the derivative of the chosen activation function is small in the saturated regions (at most 0.25 for the sigmoid), the amount of error decreases dramatically with each additional layer through which it is propagated. This is called the vanishing gradient problem and prevents deep (multilayered) networks from learning effectively.
  • Adoption of ReLU may easily be considered one of the few milestones in the deep learning revolution, e.g. among the techniques that now permit the routine development of very deep neural networks. The rectified linear activation function is a simple calculation that returns the value provided as input directly, or the value 0.0 if the input is 0.0 or less (a minimal sketch follows the figure caption below).
Rectified Linear Activation for negative and positive inputs
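As an illustration, here is a minimal sketch of the rectified linear function and its derivative in plain Python (the helper names relu and relu_derivative are my own, not from the book):

```python
def relu(x):
    # Return the input directly if it is positive, otherwise return 0.0
    return max(0.0, x)

def relu_derivative(x):
    # The slope is 1.0 for positive inputs and 0.0 otherwise
    return 1.0 if x > 0.0 else 0.0

print(relu(3.0), relu(-2.0))                        # 3.0 0.0
print(relu_derivative(3.0), relu_derivative(-2.0))  # 1.0 0.0
```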

  • The derivative of the rectified linear function is also easy to calculate. Recall that the derivative of the activation function is required when updating the weights of a node as part of backpropagating the error. The derivative of the function is its slope: 1.0 for positive inputs and 0.0 for negative inputs.
  • Tips for using the Rectified Linear Activation (a minimal Keras sketch follows this list):
    • Use ReLU as the default activation function. For a long time, the default activation to use was the sigmoid activation function. Later, it was the tanh activation function. For modern deep learning neural networks, the default activation function is the rectified linear activation function.
    • Deep convolutional neural networks with ReLU train several times faster than their equivalents with tanh units.
    • Use ReLU with MLPs and CNNs, but probably not RNNs.
    • Try a smaller bias input value
    • Use He Weight initialization
    • It is good practice to scale input data prior to using a neural network.
    • The Leaky ReLU (LReLU or LReL) modifies the function to allow small negative values when the input is less than zero.
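The sketch below applies these tips in Keras: ReLU hidden layers paired with He weight initialization, a Leaky ReLU variant, and a sigmoid output for binary classification. The layer sizes and the alpha value are illustrative choices, not values taken from the book, and inputs are assumed to be scaled beforehand.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

model = Sequential()
# ReLU as the default hidden-layer activation, paired with He weight initialization
model.add(Dense(10, activation='relu', kernel_initializer='he_uniform', input_dim=2))
# Leaky ReLU allows a small negative slope instead of a hard zero for negative inputs
model.add(Dense(10, kernel_initializer='he_uniform'))
model.add(LeakyReLU(alpha=0.1))
# Sigmoid output for a binary classification problem
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
```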

ReLU Case Study

  • This case study shows how to use ReLU to counter the vanishing gradient problem with an MLP on a simple classification problem. It is a simple feedforward neural network model, designed as we were taught in the late 1990s and early 2000s.
  • The hidden layer will use the hyperbolic tangent activation function (tanh) and the output layer will use the logistic activation function (sigmoid) to predict class 0 or class 1, or something in between. Using the hyperbolic tangent activation function in hidden layers was the best practice in the 1990s and 2000s (a minimal sketch of such a model follows).
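Here is a minimal sketch of such a baseline model in Keras, assuming a toy two-class dataset such as scikit-learn's make_circles; the exact dataset and hyperparameters in the book's case study may differ.

```python
from sklearn.datasets import make_circles
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD

# Simple two-class problem, split into train and test halves
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
train_X, test_X, train_y, test_y = X[:500], X[500:], y[:500], y[500:]

# One tanh hidden layer, sigmoid output
model = Sequential()
model.add(Dense(5, activation='tanh', input_dim=2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer=SGD(learning_rate=0.01, momentum=0.9),
              metrics=['accuracy'])

# Track train and test accuracy over training epochs
history = model.fit(train_X, train_y, validation_data=(test_X, test_y),
                    epochs=500, verbose=0)
```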

Train and test set accuracy over training epochs for MLP


  • Deeper MLP model with multiple tanh layers (a sketch follows the figure caption below):
Train and test set accuracy over training epochs for deep MLP with tanh
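A minimal sketch of the deeper variant, repeating the tanh hidden layer several times (the depth of five hidden layers is an illustrative choice):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(5, activation='tanh', input_dim=2))
# Stack additional tanh hidden layers; gradients shrink as they pass through each one
for _ in range(4):
    model.add(Dense(5, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
```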

  • Deeper MLP model with ReLU (a sketch follows the figure caption below):
Train and test set accuracy over training epochs for deep MLP with ReLU
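A minimal sketch of the same deep MLP with the hidden layers switched to ReLU and He weight initialization (again, the depth and layer sizes are illustrative):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform', input_dim=2))
# ReLU hidden layers do not saturate for positive inputs, so gradients pass through
for _ in range(4):
    model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
```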

  • Use of the ReLU activation function has allowed us to fit a much deeper model, but this capability does not extend infinitely:
Train and test set accuracy over training epochs for deep MLP with ReLU with 20 hidden layers




