Monday, August 31, 2020

Deep Learning for Natural Language Processing - Jason Brownlee

Preamble 

  • I am currently reading and working through my sixth book from Jason Brownlee. I don't remember ever having read this many books by the same author before, which is a good indication that I still get a lot of value out of these books.
  • You pick up knowledge at every corner of the book: at the turn of every single sentence, you are likely to learn something new or to view a topic from a different angle.
  • The fact that you're learning by doing is also a key principle.

Introduction

Part II - Foundations

Chapter 1: Natural Language Processing

  • Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software.
  • Linguistics is the scientific study of language, including its grammar, semantics and phonetics.
  • The interesting problems in natural language understanding resist clean mathematical formulation.
  • Computational linguistics is the modern study of linguistics using the tools of computer science.
  • The statistical approach to studying natural language now dominates the field.
  • Linguistic science permits discussion of both classical linguistics and modern statistical methods.
  • Statistical NLP aims to do statistical inference for the field of natural language.

Chapter 2: Deep learning

  • Andrew Ng: The core of deep learning is that we now have fast enough computers and enough data to actually train large neural networks.
  • ... almost all the value today of deep learning is through supervised learning or learning from labeled data.
  • Automatic feature extraction from raw data is also called feature learning.
  • Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
  • "Deep" in deep learning is hype. Andrew prefers "reinforcement learning".
  • Modern state-of-the-art deep learning is focused on training deep (many-layered) neural network models using the backpropagation algorithm.

Chapter 3: Promise of deep learning for natural language

  • Deep learning methods have the ability to learn feature representations rather than requiring experts to manually specify and extract features from natural language.
  • The promise of deep learning methods is the automatic feature learning.
  • The large blocks of an automatic speech recognition pipeline are speech processing, acoustic models, pronunciation models, and language models. The problem is that the properties, and importantly the errors, of each subsystem are different. This motivates the need to develop one neural network to learn the whole problem end-to-end.

Chapter 4: How to develop deep learning models with Keras

  • Connecting layers: the layers in the model are connected pairwise. A bracket or functional notation is used, such that after a layer is created, the layer from which the current layer's input comes is specified:
    • visible = Input(shape=(2,))
    • hidden = Dense(2)(visible)
  • Keras provides a Model class:
    • model = Model(inputs=visible, outputs=hidden)
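  • Put together, a minimal functional-API model could look like the sketch below (the layer sizes and activations are illustrative, not taken from the book):

    # Minimal functional-API model: 2 inputs -> hidden layer -> 1 output
    from keras.layers import Input, Dense
    from keras.models import Model

    visible = Input(shape=(2,))                      # input layer with 2 features
    hidden = Dense(2, activation='relu')(visible)    # hidden layer connected to the input
    output = Dense(1, activation='sigmoid')(hidden)  # illustrative output layer
    model = Model(inputs=visible, outputs=output)
    model.summary()                                  # print the layer graph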

Part III - Data preparation

Chapter 5: How to clean text manually and with NLTK

  • Have a strong idea about what you're trying to achieve.
  • The Natural Language Toolkit, or NLTK for short, is a Python library written for working and modeling text.
  • NLTK provides the sent_tokenize() function to split text into sentences.
  • Stop words are those words that do not contribute to the deeper meaning of the phrase. They are the most common words such as: the, a and is.
  • Stem words: Stemming refers to the process of reducing each word to its root or base. For example fishing, fished, fisher all reduce to the stem fish.
  • Things always jump out at you when you take time to review your data.
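  • A minimal cleaning sketch tying these NLTK tools together (sentence splitting, word tokenization, stop-word removal and stemming) on a made-up text:

    # Clean a small text with NLTK: sentences -> tokens -> no stop words -> stems
    import nltk
    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    nltk.download('punkt')      # tokenizer models
    nltk.download('stopwords')  # stop-word lists

    text = "The fisher fished all day. Fishing is the most relaxing hobby."
    sentences = sent_tokenize(text)                       # split into sentences
    tokens = [w.lower() for w in word_tokenize(text) if w.isalpha()]
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]   # drop 'the', 'is', ...
    porter = PorterStemmer()
    stems = [porter.stem(w) for w in tokens]              # fishing/fished/fisher -> fish
    print(sentences)
    print(stems)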

Chapter 6: How to prepare text data with scikit-learn

  • The text must be parsed to remove words, called tokenization. Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm, called feature extraction (or vectorization). The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.
  • Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.
  • A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words, or BoW.
  • The model is simple in that it throws away all the order information in the words and focuses on the occurrence of words in a document. This can be done by assigning each word a unique number.
  • This is the bag-of-words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information of order.
  • TF-IDF: Term Frequency - Inverse Document Frequency. The IDF of a rare term is high, whereas the IDF of a frequent term is likely to be low.
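  • A small sketch of both scikit-learn vectorizers on a toy corpus (the documents are made up):

    # Bag-of-words counts and TF-IDF scores with scikit-learn
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(docs)          # sparse matrix of word counts
    print(count_vec.vocabulary_)                    # word -> column index
    print(counts.toarray())

    tfidf_vec = TfidfVectorizer()
    tfidf = tfidf_vec.fit_transform(docs)           # rare words get higher IDF weights
    print(tfidf_vec.idf_)                           # 'the' has the lowest IDF
    print(tfidf.toarray())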

Chapter 7: How to prepare text data with Keras

  • You cannot feed raw text directly into deep learning models. Text data must be encoded as numbers to be used as input or output for machine learning and deep learning models, such as word embeddings.
  • Words are called tokens and the process of splitting text into tokens is called tokenization.
  • It is popular to represent a document as a sequence of integer values, where each word in the document is represented as a unique integer. Keras provides the one_hot() function that you can use to tokenize and integer encode a text document in one step.
  • Keras provides the hashing_trick() function that tokenizes and then integer encodes the document, just like the one_hot() function.
  • Keras provides the Tokenizer class for preparing text documents for deep learning.
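  • A sketch of the three utilities mentioned above (standalone Keras imports, as in the book; with TensorFlow 2 the same classes live under tensorflow.keras):

    # Integer-encode a document with one_hot(), hashing_trick() and the Tokenizer class
    from keras.preprocessing.text import one_hot, hashing_trick, Tokenizer

    doc = "the quick brown fox jumped over the lazy dog"
    vocab_size = 20  # larger than the real vocabulary to limit hash collisions

    print(one_hot(doc, vocab_size))                             # hash-based integer encoding
    print(hashing_trick(doc, vocab_size, hash_function='md5'))  # same idea, md5 hash

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([doc])                  # learn the vocabulary
    print(tokenizer.word_index)                    # word -> integer mapping
    print(tokenizer.texts_to_sequences([doc]))     # encoded document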

Part IV - Bag-of-Words

Chapter 8: The Bag-of-Words model

  • The bag-of-words is a way of representing text data when modeling text with machine learning algorithms. The bag-of-words model has seen great success in problems such as language modeling and document classification.
  • A popular and simple method of feature extraction with text data is called the bag-of-words model of text.
  • It is called bag-of-words, because any information about the order or structure of words in the document is discarded.
  • As the vocabulary size increases, so does the vector representation of documents.
  • A vector with a lot of zero scores is called a sparse vector or sparse representation. Sparse vectors require more memory and computational resources when modeling, and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.
  • A bag-of-bigrams representation is much more powerful than bag-of-words.
  • example of bigram: "please turn"
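  • As an illustration of the bag-of-bigrams idea, scikit-learn's CountVectorizer can be asked to count pairs of consecutive words instead of single words (toy sentence):

    # Bag-of-bigrams: count pairs of consecutive words instead of single words
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(ngram_range=(2, 2))   # bigrams only
    X = vectorizer.fit_transform(["please turn your homework in please"])
    print(vectorizer.vocabulary_)   # e.g. 'please turn', 'turn your', ...
    print(X.toarray())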

Chapter 9: How to prepare movie review data for sentiment analysis

  • When working with predictive models of text, like a bag-of-words model, there is a pressure to reduce the size of the vocabulary. The larger the vocabulary, the more sparse the representation of each word or document.

Chapter 10: develop a neural Bag-of-words model for sentiment analysis

  • We can develop a vocabulary as a Counter, which is a dictionary mapping of words to their counts that allows us to easily update and query it.
  • A bag-of-words model is a way of extracting features from text so that the text input can be used with machine learning algorithms like neural networks. Each document is converted into a vector representation.
  • We will use the Keras API to convert reviews to encoded document vectors. Keras provides the Tokenizer class that can do some of the cleaning and vocab definition tasks. The tokenizer class is convenient and will easily transform documents into encoded vectors.
  • Because neural networks are stochastic, they can produce different results when the same model is fit on the same data. This is mainly because of the random initial weights and the shuffling of patterns during mini batch gradient descent.
  • The texts_to_matrix() function for the Tokenizer in the Keras API provides 4 different methods for scoring words:
Encoding schemes for texts_to_matrix()
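  • The four modes are 'binary', 'count', 'tfidf' and 'freq'. A small sketch on made-up documents:

    # Score documents with the Tokenizer's four texts_to_matrix() modes
    from keras.preprocessing.text import Tokenizer

    docs = ["well done", "good work", "great effort", "nice work", "excellent"]
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(docs)

    for mode in ['binary', 'count', 'tfidf', 'freq']:
        matrix = tokenizer.texts_to_matrix(docs, mode=mode)  # one row per document
        print(mode, matrix.shape)
        print(matrix)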

Part V: Word Embeddings

Chapter 11: The Word Embedding Model

  • Word embeddings are considered to be among a small number of successful applications of unsupervised learning at present.
  • The word2vec tool takes a text corpus as input and produces the word vectors as output. It first constructs a vocabulary from the training text data and then learns vector representation of words.
  • GloVe is an unsupervised learning algorithm for obtaining vector representations of words.

Chapter 12: How to Develop Word Embeddings with Gensim

  • Embedding algorithms like Word2Vec and GloVe are key to the state-of-the-art results achieved by neural network models on natural language processing problems like machine translation.
  • A word embedding is an approach to provide a dense vector representation of words that capture something about their meaning. Word embeddings are an improvement over simpler bag-of-word model word encoding schemes like word counts and frequencies that result in large and sparse vectors (mostly 0 values) that describe documents but not the meaning of the words.
  • The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.
  • The use of word embeddings over text representations is one of the key methods that has led breakthrough performance with deep neural networks on problems like machine translation.
  • Gensim is an open source Python library for natural language processing, with a focus on topic modeling. It is billed as "topic modelling for humans". It supports an implementation of the Word2Vec word embedding for learning new word vectors from text.
  • Below is a small example of Word2Vec usage and visualization with PCA (Principal Component Analysis) on a single sentence:
    Plotting word vectors
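  • A sketch of what such an example can look like, assuming the gensim 4.x API (older versions use size= and iter= instead of vector_size= and epochs=); the tokenized sentence is a toy one, not the book's:

    # Train Word2Vec on one toy sentence and plot the vectors in 2D with PCA
    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA
    from matplotlib import pyplot

    sentences = [["le", "dormeur", "du", "val", "est", "un", "trou", "de", "verdure"]]
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    words = list(model.wv.index_to_key)                   # vocabulary in frequency order
    vectors = model.wv[words]                             # one 50-d vector per word
    points = PCA(n_components=2).fit_transform(vectors)   # project to 2D

    pyplot.scatter(points[:, 0], points[:, 1])
    for i, word in enumerate(words):
        pyplot.annotate(word, xy=(points[i, 0], points[i, 1]))
    pyplot.show()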
  • Training your own word vectors may be the best approach for a given NLP problem. But it can take a long time, a fast computer with a lot of RAM and disk space, and perhaps some expertise in finessing the input data and training algorithm. An alternative is to simply use an existing pre-trained word embedding.
  • A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google Word2Vec model was trained on Google news data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors. It is a 1.53 Gigabyte file.
  • You can do arithmetic with word vectors. Example: (king - man) + woman => the Word2Vec model will give you ... queen.
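  • A sketch of that arithmetic with gensim, assuming the pre-trained Google News file has been downloaded into the working directory:

    # king - man + woman ≈ queen, using the pre-trained Google News Word2Vec vectors
    from gensim.models import KeyedVectors

    # ~1.5 GB file, loaded in its original binary C format
    wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

    result = wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print(result)   # expected: [('queen', ...)]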

Chapter 13: How to Learn and Load Word Embeddings in Keras

  • The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used. The position of a word in the learned vector space is referred to as its embedding.
  • If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
  • "The smaller the vocabulary is, the lower is the memory complexity, and the more robustly are the parameters of the words estimated". Tomas Mikolov word2vec-toolkit
  • Keras offers an Embedding layer that can be used for neural networks on text data. The Keras Embedding layer can also use a word embedding learned elsewhere. It is common in the field of Natural Language Processing to learn, save and make freely available word embeddings.
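  • A minimal sketch of an Embedding layer feeding a Dense layer through Flatten (documents, labels and sizes are made up):

    # Learn an embedding as part of a small classification model
    from numpy import array
    from keras.preprocessing.text import one_hot
    from keras.preprocessing.sequence import pad_sequences
    from keras.models import Sequential
    from keras.layers import Embedding, Flatten, Dense

    docs = ["well done", "good work", "poor effort", "weak"]
    labels = array([1, 1, 0, 0])
    vocab_size = 50
    encoded = [one_hot(d, vocab_size) for d in docs]           # integer-encode each doc
    padded = pad_sequences(encoded, maxlen=4, padding='post')  # same length for all docs

    model = Sequential()
    model.add(Embedding(vocab_size, 8, input_length=4))  # 8-dimensional word vectors
    model.add(Flatten())                                 # 2D (4 x 8) -> 1D (32)
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(padded, labels, epochs=50, verbose=0)
    print(model.evaluate(padded, labels, verbose=0))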

Part VI: Text Classification

Chapter 14: Neural Models for Document Classification

  • The modus operandi for text classification involves the use of word embedding for representing words and a Convolutional Neural Network (CNN) for learning how to discriminate documents on classification problems.
  • "Convolutional neural networks are effective at document classification, namely because they are able to pick up salient features (e.g. tokens or sequences of tokens) in a way that is invariant to their position within the input sequences". Yoav Goldberg.


CNN Filter and Pooling architecture for Natural Language Processing

Chapter 15: Develop an Embedding + CNN Model for Sentiment Analysis

  • We use binary cross-entropy loss function because the problem we are learning is a binary classification problem. (see chapter 4 of Long Short Term Memory Networks with Python)
  • The model is trained for 10 epochs, or 10 passes through the training data.
  • Increasing the number of epochs, even to 40, did not increase the reliability of the predictions for the two examples.
  • However, increasing the level of detail in the review examples submitted for prediction gave a good result.
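  • For reference, a sketch of an Embedding + CNN sentiment model of this kind (vocabulary size, review length and layer sizes are illustrative, not necessarily the book's):

    # Embedding + 1D CNN for binary sentiment classification
    from keras.models import Sequential
    from keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

    vocab_size = 5000   # assumed vocabulary size after cleaning
    max_length = 500    # assumed (padded) review length

    model = Sequential()
    model.add(Embedding(vocab_size, 100, input_length=max_length))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))  # detect local word patterns
    model.add(MaxPooling1D(pool_size=2))                             # keep the most salient features
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))                        # positive / negative
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    # model.fit(Xtrain, ytrain, epochs=10, verbose=2)  # Xtrain: padded integer sequences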

Chapter 16: Project: Develop an n-gram CNN Model for Sentiment Analysis

  • A standard deep learning model for text classification and sentiment analysis uses a word embedding layer and a one-dimensional convolutional neural network. The model can be expanded by using multiple parallel convolutional neural networks that read the source document using different kernel sizes. This, in effect, creates a multi-channel convolutional neural network for text that reads texts with different n-gram sizes (groups of words); see the sketch below.
  • Keras functional API vs Keras sequential API
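  • A sketch of such a multi-channel model with the functional API, using three parallel channels with kernel sizes 4, 6 and 8 (vocabulary size, document length and the single shared input are my assumptions, not necessarily the book's exact setup):

    # Multi-channel CNN: three parallel Conv1D "readers" with different kernel sizes
    from keras.models import Model
    from keras.layers import Input, Embedding, Conv1D, Dropout, MaxPooling1D
    from keras.layers import Flatten, Dense, concatenate

    vocab_size = 5000   # assumed
    length = 500        # assumed padded document length

    def channel(inputs, kernel_size):
        """One embedding + convolution channel reading n-grams of a given size."""
        embedding = Embedding(vocab_size, 100)(inputs)
        conv = Conv1D(filters=32, kernel_size=kernel_size, activation='relu')(embedding)
        drop = Dropout(0.5)(conv)
        pool = MaxPooling1D(pool_size=2)(drop)
        return Flatten()(pool)

    inputs = Input(shape=(length,))
    merged = concatenate([channel(inputs, 4), channel(inputs, 6), channel(inputs, 8)])
    dense = Dense(10, activation='relu')(merged)
    outputs = Dense(1, activation='sigmoid')(dense)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()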
Birieux

Part VII: Language Modeling

Chapter 17: Neural Language Modeling

  • The use of neural networks in language modeling is often called Neural Language Modeling, or NLM for short. Neural network approaches are achieving better results than classical methods both on standalone language models and when models are incorporated into larger models on challenging tasks like speech recognition and machine translation.
  • LSTMs allow the models to learn the relevant context over much longer input sequences than the simpler feedforward networks.

Chapter 18: How to Develop a Character-Based Neural Language Model

  • A language model predicts the next word in the sequence based on the specific words that have come before it in the sequence. It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.
  • One hot encode:
    • We need to one hot encode each character. That is, each character becomes a vector as long as the vocabulary (38 items) with a 1 marked for the specific character. This provides a more precise input representation for the network. It also provides a clear objective for the network to predict, where a probability distribution over characters can be output by the model and compared to the ideal case of all 0 values with a 1 for the actual next character.
    • We can use the to_categorical() function in the Keras API to one hot encode the input and output sequences.
  • The model is learning a multi class classification problem, therefore we use the categorical log loss intended for this type of problem. (see chapter 4 of Long Short Term Memory Networks with Python)
  • A small example of text generation with "Le dormeur du val", a poem by Arthur Rimbaud, gave me the following results. The highlighted parts of the sentences are the ones automatically generated by the LSTM network:
Le Dormeur du val
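  • A small sketch of the character encoding step on a toy string (the 38-item vocabulary mentioned above comes from the book's text, not from this snippet):

    # Integer-encode characters and one hot encode the sequence
    from keras.utils import to_categorical

    text = "le dormeur du val"
    chars = sorted(set(text))
    mapping = {c: i for i, c in enumerate(chars)}             # char -> integer
    vocab_size = len(mapping)

    encoded = [mapping[c] for c in text]                      # sequence of integers
    onehot = to_categorical(encoded, num_classes=vocab_size)  # one vector per character
    print(onehot.shape)                                       # (len(text), vocab_size)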

Chapter 19: How to Develop a Word-Based Neural Language Model

  • Language modeling involves predicting the next word in a sequence given the sequence of words already present. A language model is a key element in many natural language processing models such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.
  • When making predictions, the process can be seeded with one or few words, then predicted words can be gathered and presented as input on subsequent predictions in order to build a generated output sequence.
  • First the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the texts_to_sequences() function.
  • The following example is a very simple model: with one word as input, the model will learn the next word in the sequence:
Source text

Generated text (highlighted) with "Colas" as input seed
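  • A minimal sketch of such a one-word-in, one-word-out model (the toy source text and the "Colas" seed are illustrative):

    # One-word-in, one-word-out neural language model
    import numpy as np
    from keras.preprocessing.text import Tokenizer
    from keras.utils import to_categorical
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    text = "Colas went down the hill to fetch a pail of water"  # toy source text

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text])
    encoded = tokenizer.texts_to_sequences([text])[0]   # words -> integers
    vocab_size = len(tokenizer.word_index) + 1          # +1 because indexing starts at 1

    # build (current word -> next word) pairs
    sequences = np.array([encoded[i-1:i+1] for i in range(1, len(encoded))])
    X = sequences[:, 0].reshape(-1, 1)
    y = to_categorical(sequences[:, 1], num_classes=vocab_size)

    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))  # probability of each next word
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X, y, epochs=300, verbose=0)

    # predict the word following 'colas'
    seed = np.array(tokenizer.texts_to_sequences(['Colas']))   # shape (1, 1)
    yhat = int(np.argmax(model.predict(seed, verbose=0)))
    print(tokenizer.index_word[yhat])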


Chapter 20: Develop a Neural Language Model for Text Generation

  • A language model can predict the probability of the next word in the sequence, based on the words already observed in the sequence. Neural network models are a preferred method for developing statistical language models because they can use a distributed representation where different words with similar meanings have similar representation and because they can use a large context of recently observed words when making predictions.
  • A key design decision is how long the input sequences should be. They need to be long enough to allow the model to learn the context for the words to predict.
  • Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a smaller model that trains faster.
  • The model uses a distributed representation for words so that different words with similar meanings will have a similar representation.
  • The Tokenizer must be trained on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer.
  • We can access the mapping of words to integers as a dictionary attribute called word_index on the Tokenizer object.
  • Words are assigned values from 1 to the total number of words.
  • The learned embedding needs to know the size of the vocabulary and the length of the input sequences.
  • The project of this Chapter 20 consists of generating a sequence of 50 words out of a text by Plato (quite big: 15802 lines of text), on which an LSTM network is trained first. 
    • The project is developed in three steps: first the preparation of the text, second the training of the network, and then the generation of the 50-word sequence from a seed of 50 words. The training part for the Plato text took 3 hours and 50 minutes on my iMac (100 epochs and a batch size of 128).
    • The 50-word seed sequence fed to the neural network was: "which were attributed by us before to the just seeing that you do not hesitate to rank injustice with wisdom and virtue you have guessed most infallibly he replied then i certainly ought not to shrink from going through with the argument so long as i have reason to think that"
    • The 50 words generated by the neural network were: "the same are celebrating in song and intellect with the world of the state and the like in order that he was alive in the days of the soul and the like in order to be sure he said and i will endeavour to explain that they are not a"
  • There was a high load on the CPU but no GPU usage:
    CPU usage during the training phase on the Plato text

  • This project is in the same league as the Victor Hugo text generation I developed previously.
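  • For reference, a sketch of the 50-word generation loop, assuming a trained model, its Tokenizer and the 50-word input length are already available (the helper name generate_seq is mine):

    # Generate n words from a seed text with a trained word-level language model
    import numpy as np
    from keras.preprocessing.sequence import pad_sequences

    def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
        result, in_text = [], seed_text
        for _ in range(n_words):
            # encode the current text and keep only the last seq_length words
            encoded = tokenizer.texts_to_sequences([in_text])[0]
            encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
            # predict the most probable next word and map it back to a string
            yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
            out_word = tokenizer.index_word.get(yhat, '')
            in_text += ' ' + out_word
            result.append(out_word)
        return ' '.join(result)

    # generated = generate_seq(model, tokenizer, 50, seed_text, 50)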

Part VIII: Image Captioning

Chapter 21: Neural Image Caption Generation

  • The need to combine breakthroughs from computer vision and natural language processing.
  • Image caption generation is also named image annotation or image tagging.
  • Image tagging combines both computer vision and natural language processing and marks a truly challenging problem in broader artificial intelligence.
  • Neural network models for captioning involve two main elements:
    • feature extraction
    • language model
  • The feature extraction model is a neural network that given an image is able to extract the salient features, often in the form of a fixed-length vector.
  • A language model predicts the probability of the next word in the sequence given the words already present in the sequence.
  • It is popular to use a recurrent neural network, such as the Long Short Term Memory network, or LSTM, as the language model.
  • This is an architecture developed for machine translation where an input sequence, say in French, is encoded as a fixed-length vector by an encoder network. A separate decoder network then reads the encoding and generates an output sequence in the new language, say English. A benefit of this approach, in addition to its impressive skill, is that a single end-to-end model can be trained on the problem. When adapted for image captioning, the encoder network is a deep convolutional neural network, and the decoder network is a stack of LSTM layers.
  • We investigate models that can attend to salient parts of an image while generating its caption.

Chapter 22: Neural Network Models for Caption Generation

Chapter 23: How to Load and Use a Pre-Trained Object Recognition model

  • ImageNet is a research project to develop a large database of images with annotations, e.g. images and their descriptions.
  • VGG released two different CNN models, specifically a 16-layer model and a 19-layer model.
  • More information related to this topic in the excellent book from Jason Brownlee "Deep Learning for Computer Vision"

Chapter 24: How to Evaluate Generated Text With the BLEU score

  • BLEU, or Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations. Although developed for translation, it can be used to evaluate text generated for a suite of natural language tasks.
  • The Python Natural Language Toolkit library, or NLTK, provides a function called corpus_bleu() for calculating the BLEU score for multiple sentences such as a paragraph or a document.
  • The sentence_bleu() and corpus_bleu() scores calculate the cumulative 4-gram BLEU score, also called BLEU-4.
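  • A small sketch of both NLTK functions on toy sentences:

    # Sentence-level and corpus-level BLEU with NLTK
    from nltk.translate.bleu_score import sentence_bleu, corpus_bleu

    reference = [['the', 'cat', 'is', 'on', 'the', 'mat']]   # list of reference token lists
    candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
    print(sentence_bleu(reference, candidate))               # cumulative 4-gram BLEU-4

    # corpus_bleu expects one list of references per candidate sentence
    print(corpus_bleu([reference], [candidate]))

    # individual / cumulative n-gram scores via weights, e.g. BLEU-1 only:
    print(sentence_bleu(reference, candidate, weights=(1.0, 0, 0, 0)))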

Chapter 25: How to Prepare a Photo Caption Dataset for Modeling

  • Flickr8K is a dataset comprised of more than 8000 photos and up to 5 captions for each photo.
  • We may want to use a pre-defined feature extraction model, such as a state-of-the-art deep image classification network trained on ImageNet. The Oxford Visual Geometry Group (VGG) model is popular for this purpose and is available in Keras.
  • A generator is the term used to describe a function used to return batches of samples for the model to train on. As a reminder, a model is fit for multiple epochs, where one epoch is one pass through the entire training dataset, such as all photos. One epoch is comprised of multiple batches of examples where the model weights are updated at the end of each batch.

Chapter 26: Develop a Neural Image Caption Model

  • The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class that can learn this mapping from the loaded description data. The to_lines() function converts the dictionary of descriptions into a list of strings and the create_tokenizer() function will fit a Tokenizer on the loaded photo description text.
  • There are two input arrays to the model based on the merge-model described by Marc Tanti: one for photo features and one for the encoded text. There is one output for the model which is the encoded next word in the text sequence.
    The Marc Tanti Merge model
    The caption generation model

  • To reduce overfitting the training dataset, we use a regularization in the form of 50% dropout.
  • Training the model took about 5 hours and 24 minutes on my iMac computer, each epoch running in about 15 minutes. The best epoch was epoch #5 with a loss of 3.83631.
For the model evaluation, I got the following BLEU scores:
BLEU scores evaluating the caption generation model
  • For the prediction part, it is true that with the Flickr image I tried, it worked perfectly:

Caption prediction 29-08-2020

  • However, when I took 5 other photos from my own photo library and made caption predictions, the results were disappointing.
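  • For reference, a sketch of the merge architecture described above (the 4096-dimension photo features correspond to pre-extracted VGG16 outputs; the vocabulary size and maximum caption length are assumptions):

    # Merge model for image captioning: photo features + text sequence -> next word
    from keras.models import Model
    from keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

    vocab_size = 7500   # assumed vocabulary size
    max_length = 34     # assumed maximum caption length

    # photo feature branch (pre-extracted 4096-d VGG16 features)
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # sequence processor branch (partial caption so far)
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # decoder: merge both branches and predict the next word
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.summary()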

Part IX: Machine Translation

Chapter 27: Neural Machine Translation

  • « Automatic or machine translation is perhaps one of the most challenging artificial intelligence tasks given the fluidity of human language. Classically, rule-based systems were used for this task, which were replaced in the 1990s with statistical methods. More recently, deep neural network models achieve state-of-the-art results in a field that is aptly named neural machine translation »
  • « Statistical machine translation, or SMT for short, is the use of statistical models that learn to translate text from a source language to a target language given a large corpus of examples. »
  • « The approach is data-driven, requiring only a corpus of examples with both source and target language text. This means linguists are no longer required to specify the rules of translation. »
  • « Neural machine translation, or NMT for short, is the use of neural network models to learn a statistical model for machine translation. »
  • « Key to the encoder-decoder architecture is the ability of the model to encode the source text into an internal fixed-length representation called the context vector. »

Chapter 28: What are Encoder-Decoder Models for Neural Machine Translation

  • « The encoder-decoder recurrent neural network architecture is the core technology inside Google’s translate service. »
  • Two models:

Chapter 29: How to Configure Encoder-Decoder Model for Machine Translation

  • « Research scientists have used Google-scale hardware to provide a set of heuristics for how to configure the encoder-decoder model for neural machine translation and for sequence prediction generally. »

Chapter 30: Develop a Neural Machine Translation Model

  • « Each input and output sequence must be encoded to integers and padded to the maximum phrase length. This is because we will use a word embedding for the input sequences and one hot encode the output sequences. »
  • « The output sequence needs to be one hot encoded. This is because the model will predict the probability of each word in the vocabulary as output »
  • « The model is trained using the efficient Adam approach to stochastic gradient descent and minimizes the categorical loss function because we have framed the prediction problem as multiclass classification. »
  • Running the example for a translation of German to English gave me the following result:
Source, Target and Expected translation - BLEU scores

  • You can observe from the above translation examples that there is still room for human translation work.
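  • For reference, a sketch of a simple encoder-decoder translation model along these lines (vocabulary sizes and phrase lengths are placeholders, not the book's values):

    # Simple encoder-decoder model for German-to-English translation
    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

    ger_vocab, eng_vocab = 2000, 1500     # placeholder vocabulary sizes
    ger_length, eng_length = 10, 5        # placeholder (padded) phrase lengths
    n_units = 256

    model = Sequential()
    model.add(Embedding(ger_vocab, n_units, input_length=ger_length, mask_zero=True))
    model.add(LSTM(n_units))                         # encoder: source phrase -> fixed-length vector
    model.add(RepeatVector(eng_length))              # repeat it once per output time step
    model.add(LSTM(n_units, return_sequences=True))  # decoder
    model.add(TimeDistributed(Dense(eng_vocab, activation='softmax')))  # one word per step
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.summary()
    # model.fit(trainX, trainY, epochs=30, batch_size=64)  # trainY one hot encoded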

Conclusion

  • This is the end of the journey in « Deep Learning for Natural Language Processing » and I did not regret it. 
  • I learned not only a lot of concepts around NLP, but I also had the opportunity to put in practice those concepts with the code examples.
  • The systematic approach of Jason Brownlee suits me well, with its mix of concepts simply explained and then put into practice. 
  • The provided Python code examples are easy to read and all the different code sequences are clearly separated for understanding.
  • Each chapter comes with a « Further reading » section which is very rich and lets you dig into related research documents.
  • Big thanks to Jason Brownlee for this journey. 

Wednesday, August 19, 2020

Long Short-Term Memory Networks With Python - Jason Brownlee

 

The more I learn about Deep Learning, the more I feel that this field is never ending and the knowledge required to master the domain is infinite. I am happy to start this 5th book from Jason Brownlee, whose promise in the introduction is that by the end of the book I should "get good at LSTMs fast". According to Jason Brownlee, "Long Short-Term Memory (LSTM) recurrent neural networks are one of the most interesting types of deep learning at the moment".

Chapter 1: What are LSTMs?

  • "Sequence-to-sequence prediction is a subtle but challenging extension of sequence prediction, where, rather than predicting a single next value in the sequence, a new sequence is predicted that may or may not have the same length or be of the same time as the input sequence. This type of problem has recently seen a lot of study in the area of automatic text translation (e.g. translating English to French) and may be referred to by the abbreviation seq2seq."
  • "Long Short-Term Memory (LSTM) is able to solve many time series tasks unsolvable by feedforward networks using fixed size time windows."
  • "LSTMs have internal state, they are explicitly aware of the temporal structure in the inputs, are able to model multiple parallel input series separately, and can step through varied length input sequences to produce variable length output sequences, one observation at a time."
  • "The computational unit of the LSTM network is called the memory cell, memory block, or just cell for short. The term neuron as the computational unit is so ingrained when describing MLPs that it too is often used to refer to the LSTM memory cell. LSTM cells are comprised of weights and gates."
  • Applications of LSTMs:
    • Automatic Image Caption Generation
    • Automatic Translation of text
    • Automatic handwriting generation
  • Limitations of LSTMs:
    • "If your problem looks like a traditional autoregression type problem with the most relevant lag observations within a small window, then perhaps develop a baseline of performance with an MLP and sliding window before considering an LSTM."

Chapter 2: How to train LSTMs

  • Backpropagation refers to two things:
    • The mathematical method used to calculate derivatives and an application of the derivative chain rule.
    • The training algorithm for updating network weights to minimize error.
  • Backpropagation is a supervised learning algorithm that allows the network to be corrected with regard to the specific errors made.
  • Backpropagation algorithm:
    • Present a training input pattern and propagate it through the network to get an output.
    • Compare the predicted outputs to the expected outputs and calculate the error.
    • Calculate the derivatives of the error with respect to the network weights.
    • Adjust the weights to minimize the error.
    • Repeat
  • Recurrent Neural Network: the deeper layers take as input the output of the prior layer as well as a new input time step.
  • Each time step requires a new copy of the network which in turn takes more memory, especially for large networks with thousands or millions of weights.
  • Backpropagation Through Time, or BPTT, is the application of the Backpropagation training algorithm to Recurrent neural Networks.
  • Truncated Backpropagation Through Time, or TBPTT, is a modified version of the BPTT training algorithm for recurrent neural networks where the sequence is processed one step at a time and periodically an update is performed back for a fixed number of time steps.
  • The Keras deep learning library provides an implementation of TBPTT for training recurrent neural networks.
  • You must split long sequences into subsequences that are both long enough to capture relevant context for making predictions, but short enough to efficiently train the network.

Chapter 3: How to prepare data for LSTMs

  • Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.
  • Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and standard deviation is 1. This can be thought of as subtracting the mean value or centering the data.
  • One Hot Encoding
  • Sequence padding
  • Sequence truncating
  • "For example, it may make sense to truncate very long text in a sentiment analysis for efficiency, or it may make sense to pad short text and let the model learn to ignore or explicitly mask zero input values to ensure no data is lost."
  • Pandas shift() function
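  • A small sketch of these preparation steps (scaling, one hot encoding, padding/truncating and the Pandas shift) on made-up data:

    # Common data preparation steps for LSTM inputs
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler, StandardScaler
    from keras.utils import to_categorical
    from keras.preprocessing.sequence import pad_sequences

    values = np.array([10.0, 20.0, 30.0, 40.0]).reshape(-1, 1)
    print(MinMaxScaler().fit_transform(values))      # normalization to [0, 1]
    print(StandardScaler().fit_transform(values))    # standardization to mean 0, std 1

    labels = [0, 2, 1, 2]
    print(to_categorical(labels))                    # one hot encoding

    sequences = [[1, 2, 3, 4], [1, 2]]
    print(pad_sequences(sequences, maxlen=4))                     # pre-pad with 0
    print(pad_sequences(sequences, maxlen=3, truncating='post'))  # truncate longer ones

    series = pd.DataFrame({'t': [1, 2, 3, 4]})
    series['t-1'] = series['t'].shift(1)             # lagged copy for supervised framing
    print(series)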

Chapter 4: How to develop LSTMs in Keras

  • Input must be three-dimensional, comprised of samples, time steps, and features.
  • You can convert a 1D or 2D dataset to a 3D dataset using the reshape() function in NumPy.
  • You can specify the input_shape argument that expects a tuple containing the number of time steps and the number of features.
Predictive modeling: standard activation function and loss function relation
  • Compilation transforms the simple sequence of layers that we defined into a highly efficient series of matrix transforms in a format intended to be executed on your GPU or CPU.
  • Once the network is compiled, it can be fit, which means adapting the weights on a training dataset.
  • Each epoch can be partitioned into groups of input-output pattern pairs called batches. This defines the number of patterns that the network is exposed to before the weights are updated within an epoch.
  • Batch: a pass through a subset of samples in the training dataset after which the network weights are updated. One epoch is comprised of one or more batches.
  • Each LSTM memory unit maintains internal state that is accumulated. By default, the internal state of all LSTM memory units in the network is reset after each batch, e.g. when the network weights are updated.
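  • Putting these pieces together, a minimal define-compile-fit sketch with random, purely illustrative data:

    # Reshape 2D data to 3D and fit a small LSTM
    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    data = np.random.rand(100, 10)              # 100 samples of 10 values
    X = data.reshape((100, 10, 1))              # [samples, time steps, features]
    y = np.random.rand(100, 1)                  # illustrative regression target

    model = Sequential()
    model.add(LSTM(32, input_shape=(10, 1)))    # (time steps, features); samples are implicit
    model.add(Dense(1))                         # linear output for regression
    model.compile(optimizer='adam', loss='mse')
    model.fit(X, y, epochs=5, batch_size=32, verbose=0)  # weights updated after each batch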

Chapter 5: models for sequence prediction

  • LSTMs work by learning a function (f(...)) that maps input sequence values (X) onto output sequence values (y).
  • Models and applications:

Chapter 6: How to develop vanilla LSTMs

  • A simple LSTM configuration is the vanilla LSTM. It is named Vanilla in this book to differentiate it from deeper LSTMs and the suite of more elaborate configurations

Chapter 7: how to develop stacked LSTMs

  • The stacked LSTM is a model that has multiple hidden LSTM layers where each layer contains multiple memory cells.
  • It is the depth of neural networks that is generally attributed to the success of the approach on a wide range of challenging prediction problems.
  • A stacked LSTM architecture can be defined as an LSTM model comprised of multiple LSTM layers.
  • Each LSTM memory cell requires a 3D input.
  • In time series forecasting, it is good practice to make the series stationary, that is remove any systematic trends and seasonality from the series before modeling the problem.

Chapter 8: How to develop CNN LSTMs

  • The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to support sequence prediction.
  • CNN LSTMs were developed for:
    • Activity recognition
    • Image description
    • Video description
    • Speech recognition
    • Natural Language Processing
  • The architecture is appropriate for problems that:
    • have spatial structure in their input such as the 2D structure of pixels in an image or the 1D structure of words in a sentence, paragraph or document.
    • have a temporal structure in their input such as the order of images in a video, or words in text.
  • The CNN model is transforming a single image from input pixels into an internal matrix or vector representation.
  • We can define a CNN LSTM model in Keras by first defining the CNN layer or layers, wrapping them in a TimeDistributed layer and then defining the LSTM and output layers.
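  • A sketch of that TimeDistributed wrapping, assuming each sample is a short sequence of small single-channel images (all dimensions are illustrative):

    # CNN LSTM: TimeDistributed CNN feature extraction followed by an LSTM
    from keras.models import Sequential
    from keras.layers import TimeDistributed, Conv2D, MaxPooling2D, Flatten, LSTM, Dense

    steps, size, channels = 5, 32, 1   # 5 images of 32x32x1 per sample (illustrative)

    model = Sequential()
    # the same CNN is applied to each of the 5 images in the sequence
    model.add(TimeDistributed(Conv2D(16, (3, 3), activation='relu'),
                              input_shape=(steps, size, size, channels)))
    model.add(TimeDistributed(MaxPooling2D((2, 2))))
    model.add(TimeDistributed(Flatten()))        # one feature vector per time step
    model.add(LSTM(50))                          # interpret the sequence of feature vectors
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy')
    model.summary()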

Chapter 9: How to develop Encoder-Decoder LSTMs

  • Sequence-to-sequence prediction problems, or seq2seq for short
  • One approach to seq2seq prediction problems that has proven very effective is called the Encoder-Decoder LSTM. This architecture is comprised of two models: one for reading the input sequence and encoding it into a fixed-length vector, and a second for decoding the fixed-length vector and outputting the predicted sequence.
  • It is clear that the RNN Encoder-Decoder captures both semantic and syntactic structures of the phrase.
  • It is natural to use a CNN as an image "encoder", by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences.
  • Applications:
    • Machine translation
    • Learning to execute
    • Image captioning
    • Conversational modeling
    • Movement classification
  • The encoder-decoder can be implemented directly in Keras.
The RepeatVector layer is used as an adapter to fit the fixed-sized 2D output of the encoder to the differing length and 3D input expected by the decoder.
Abreuvoir - 1951

Chapter 10: How to develop bidirectional LSTMs

  • We were surprised by the extent of the improvement obtained by reversing the words in the source sentences.
  • Bidirectional LSTMs focus on the problem of getting the most out of the input sequence by stepping through input time steps in both the forward and backward directions.
  • "...relying on knowledge of the future seems at first sight to violate causality. How can we base our understanding of what we've heard on something that hasn't been said yet? However, human listeners do exactly that. Sounds, words, and even whole sentences that at first mean nothing are found to make sense in the light of future context." - Framewise Phoneme Classification with bidirectional LSTM and other Neural Network Architectures, 2005 -
  • Bi-directional LSTMs were developed for speech recognition. 
  • The cumulative sum of the input sequence can be calculated using the cumsum() NumPy function.
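  • A sketch of a Bidirectional LSTM on a cumulative-sum toy problem of the kind mentioned above (the threshold of a quarter of the sequence length is an assumption):

    # Bidirectional LSTM on a cumulative-sum sequence classification toy problem
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

    def make_sample(n_steps=10):
        x = np.random.rand(n_steps)
        y = (np.cumsum(x) > n_steps / 4.0).astype(int)   # 0 until the running sum passes a threshold
        return x.reshape(n_steps, 1), y.reshape(n_steps, 1)

    X, Y = zip(*[make_sample() for _ in range(1000)])
    X, Y = np.array(X), np.array(Y)

    model = Sequential()
    # reads the input sequence forwards and backwards, then merges both passes
    model.add(Bidirectional(LSTM(20, return_sequences=True), input_shape=(10, 1)))
    model.add(TimeDistributed(Dense(1, activation='sigmoid')))  # one decision per time step
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X, Y, epochs=5, batch_size=10, verbose=0)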

Chapter 11: How to develop generative LSTMs

  • LSTMs can be used as a generative model. Given a large corpus of sequence data, such as text documents, LSTM models can be designed to learn the general structural properties of the corpus, and when given a seed input, can generate new sequences that are representative of the original corpus.
  • The problem of developing a model to generalize a corpus of text is called language modeling in the field of Natural Language Processing.
  • The approach has also been applied to different domains where a large corpus of existing sequence information is available and new sequences can be generated one step at a time, such as:
    • handwriting generation
    • music generation
    • speech recognition

Chapter 12: How to diagnose and tune LSTMs

  • Certainly one of the most interesting chapters, as the author provides many code examples on how to diagnose and address common LSTM problems: underfitting, overfitting and good fitting. There is also a grid search example on how to find the best number of memory cells.

LSTMs are stochastic, meaning that you will get a different diagnostic plot on each run.
Underfitting
An underfit model is one that is demonstrated to perform well on the training dataset and poorly on the test dataset. This can be diagnosed from a plot where the training loss is lower than the validation loss, and the validation loss has a trend that suggests further improvements are possible.
Underfitting with more epochs
Overfitting
An overfitting model is one where performance on the train set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade. This can be diagnosed from a plot where the train loss slopes down and the validation loss slopes down, hits an inflection point, and starts to slope up again.
Goodfit
Box and whisker plots of memory cells

Chapter 13: How to make predictions with LSTMs

  • « A final LSTM model is one that you use to make predictions on new data. That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (a real value). »
  • « Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem. The training dataset is used to prepare a model, to train it. We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set. »
  • « Using k-fold cross-validation is a more robust and more computationally expensive way of calculating this same estimate. We use the estimate of the skill of our LSTM model on a training dataset as a proxy for estimating what the skill of the model will be in practice when making predictions on new data. »
  • « You finalize a model by applying the chosen LSTM architecture and configuration on all of your data. There is no train and test split and no cross-validation folds »
  • « You can save the model architecture (e.g. layers and how they connect) and weights (arrays of numbers) to separate files. I recommend this approach as it allows you to develop updated model weights and replace one file while ensuring the model architecture is left unchanged. »
  • « Keras provides two formats for preserving the model architecture: JSON and YAML formats. The benefit of these formats is that they are human readable. »
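  • A sketch of the JSON-plus-weights round trip described above (model and file names are arbitrary):

    # Save a Keras model as JSON architecture + HDF5 weights, then reload it
    from keras.models import Sequential, model_from_json
    from keras.layers import LSTM, Dense

    model = Sequential()
    model.add(LSTM(10, input_shape=(5, 1)))
    model.add(Dense(1))
    model.compile(optimizer='adam', loss='mse')

    # save architecture and weights to separate files
    with open('model.json', 'w') as f:
        f.write(model.to_json())
    model.save_weights('model.h5')

    # later: rebuild the architecture, then load the weights into it
    with open('model.json') as f:
        loaded = model_from_json(f.read())
    loaded.load_weights('model.h5')
    loaded.compile(optimizer='adam', loss='mse')   # compile again before use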

Chapter 14: How to update LSTM models

  • « I recommend storing model weights and model structure in separate configuration files.»

Conclusion

  • This relatively short book of only 14 chapters, compared to the other books which usually run to about 28 chapters, is a deep dive (at least for me) into LSTM networks. 
  • The good side for me is that the LSTM examples I went through in the previous books are explained here in more detail, and I think my previous readings were helpful.
  • Reviewing notions in a different context helps me to memorise the LSTM characteristics and how to manipulate LSTMs.
  • What I liked less was the contrived examples (sum, mul, shapes...), although I understand that this was necessary for didactic purposes. I would have preferred concrete examples from real life.
  • What I liked the most in this book was the tuning of the LSTM hyperparameters (Chapter 12).
  • I am amazed at this huge quantity of knowledge shared by Jason Brownlee, and also his didactic way of explaining.

Wednesday, August 12, 2020

Deep Learning For Time Series Forecasting - Jason Brownlee


The aim of this post is to provide a review of the book "Deep Learning for Time Series Forecasting" from Jason Brownlee. This kind of post is useful for me as an online reminder of the key concepts of this book, so I am able to quickly spot where I can retrieve information related to a subject. It might also be useful for the anonymous reader in order to make up his/her mind about the book.
Reading this book will give you a sense of mastery, achievement or control. Practice and achievement are one and the same, and in the book you get both practice and achievement.

Chapter 3: How to Develop a Skillful Forecasting Model


  • Given the iterative nature of modeling and evaluating performance.
  • The forecasting method is applied only to a subset of the series.
  • Descriptive modeling = time series analysis
  • Predictive modeling = time series forecasting

Chapter 4: How to Transform Time Series to a Supervised Learning Problem

  • Sliding window method = lag method
  • Supervised learning is where you have input variables (X) and an output variable (y) and you can use an algorithm to learn the mapping function from the input to the output.
  • Multivariate and multi-steps forecasting can be framed as supervised learning using the sliding window method.
  • Multivariate time series data means data where there is more than one observation for each time step.

Chapter 5: Review of Simple and Classical Forecasting Methods

  • ARIMA: AutoRegressive Integrated Moving Average. A model where the prediction is a weighted linear sum of recent past observations or lags.
  • SARIMA: Seasonal ARIMA
  • Exponential smoothing forecasting methods are similar to ARIMA in that a prediction is a weighted sum of past observations, but the model explicitly uses an exponentially decreasing weight for past observations. In other words, the more recent the observations the higher the associated weights.

Chapter 6: How to prepare Time Series Data for CNNs & LSTMs

  • shape[0] # refers to the number of rows
  • print(data[:5, :]) # print the first 5 rows of an array with more than 1 column
  • numpy.reshape() # reshape an array while keeping its data
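  • A tiny sketch of these NumPy operations, going from a 1D series to the 3D [samples, time steps, features] shape expected by CNNs and LSTMs:

    # From a 1D series to the 3D input shape expected by CNNs and LSTMs
    import numpy as np

    series = np.array([10, 20, 30, 40, 50, 60])
    print(series.shape[0])            # number of rows (observations)

    data = series.reshape(-1, 1)      # 1D -> 2D column: [samples, features]
    print(data[:5, :])                # first 5 rows

    # 3 samples of 2 time steps with 1 feature each
    samples = data[:6].reshape(3, 2, 1)
    print(samples.shape)              # (3, 2, 1)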

Chapter 7: How to develop MLPs for Time Series forecasting

  • sample: multiple input/output patterns
  • The model will view each time step as a separate feature instead of separate time steps.
  • hstack() # horizontal stack

Chapter 8 - How to Develop CNNs for Time Series forecasting

  • Univariate time series are datasets comprised of a single series of observations with a temporal ordering and a model is required to learn from the series of past observations to predict the next value in the sequence.

Chapter 9 - How to develop LSTMs for Time Series Forecasting

  • "Key to LSTM is that they offer native support for sequences. Unlike a CNN that reads across the entire input vector, the LSTM model reads one time step of the sequence at a time and builds up an internal state representation that can be read as a learned context for making a prediction."
  • "The CNN can be very effective at automatically extracting and learning features from one-dimensional sequence such as univariate time series data."
  • "Encoder-Decoder Model: The model was designed for prediction problems where there are both input and output sequences, so-called sequence-to-sequence, or seq2seq problems, such as translating text from one language to another."

Chapter 10: Review of Top Methods For Univariate Time Series Forecasting

  • "Classical methods like Theta and ARIMA out-perform machine learning and deep learning methods for multi-step forecasting on univariate datasets."
  • "Machine learning and deep learning methods do not yet deliver on their promise for univariate time series forecasting and there is much work to do."

Chapter 11: How to develop simple methods for univariate forecasting

  • Median: use it when the distribution of the data is not Gaussian.
  • # split a univariate dataset into train/test sets:
  • def train_test_split(data, n_test):
    • return data[:-n_test], data[-n_test:]

Chapter 12: How to develop ETS models for univariate Forecasting

  • ETS : Exponential smoothing for Time Series
  • "Exponential smoothing is a time series forecasting method for univariate data that can be extended to support data with a systematic trend or seasonal component.

Chapter 13: How to develop SARIMA models for univariate forecasting
Chapter 14: How to develop MLPs, CNNs & LSTM for univariate forecasting

  • Walk forward validation "is an approach where the model makes a forecast for each observation in the test dataset one at a time. After each forecast is made for a time step in the test dataset, the true observation for the forecast is added to the test dataset and made available to the model."
  • batch size: "how often the weights are updated within each epoch"
  • RNN: "Recurrent Neural Network use an output of the network from a prior step as an input in attempt to automatically learn across sequence data. LSTM is a type of RNN."

Chapter 15: How to grid search Deep Learning Models for Univariate forecasting

  • Differencing is the transform of the data such that the value of a prior observation is subtracted from the current observation, removing trend or seasonality structure.
  • In this chapter, the author uses the following time series (univariate with trend and seasonality) and searches for the hyperparameters needed for the best forecast. This so-called "grid search" is applied to a naive persistence method, an MLP, a CNN and an LSTM.
A univariate time series with trend and seasonality


RMSE results with different models

Chapter 16: How to load and explore household energy usage data

  • Are the distributions of Gaussian type?
  • The distribution of active power appears to be bi-modal, meaning it looks like it has two main groups of observations:
Example of bi-modal distribution

Chapter 17: How to develop naive models for multi-step energy usage forecasting

  • "It is important to test naive forecast models on any new prediction problem. The result provides a baseline performance by which more sophisticated forecast methods can be evaluated"
Naive forecast strategies for household power forecasting 

Chapter 18: How to develop ARIMA models for multi-step energy usage forecasting

  • "The Statsmodel library provides multiple ways of developing an AR model, such as using the AR, ARMA, ARIMA, SARIMAX classes."

Chapter 19: How to develop CNNs for multi-step energy usage forecasting

  • "Unlike other ML algorithms, convolutional neural networks are capable of automatically learning features from sequence data, support multivariate data, and can directly output a vector for multi-step forecasting."
Example of a multi-headed CNN model

Chapter 20: How to develop LSTMs for multi-step energy usage forecasting

  • "The first step in any project is defining your problem."
  • "Perhaps the biggest opportunity for programmers is to put learning machine methods in the application you are developing."
  • "Machine learning methods address a specific decision problem."

Chapter 21: Review of deep learning models for Human Activity Recognition

  • HAR: Human Activity Recognition
  • Sliding window approach
  • "RNN and LSTM are recommended to recognize short activities that have natural order while CNN is better at inferring long term repetitive activities. The reason is that RNN could make use of the time-order relationship between sensor readings, and CNN is more capable of learning deep features contained in recursive patterns." Deep learning for Sensor-based activity recognition: A survey, 2018.

Chapter 22: How to load and explore human activity data

  • One of the strengths of the book is that the code for the examples comes along with it. If I had to write all the lines of code myself, it would take me a huge amount of time and might be discouraging. Here you pick up the code and run the examples to see the results. 

Chapter 23: How to develop ML models for Human Activity Recognition

  • A list of machine learning models is evaluated:
    • Non linear algorithms
      • k-Nearest Neighbors
      • Classification and regression tree
      • Support Vector Machine
      • Naive Bayes
    • Ensemble algorithms
      • Bagged decision trees
      • Random Forest
      • Extra trees
      • Gradient Boosting Machine

Chapter 24: How to develop CNNs for Human Activity Recognition

  • "Convolutional neural network models were developed for image classification problems, where the model learns an internal representation of a two-dimensional input, in a process referred to as feature learning. Although we refer to the model as 1D, it supports multiple dimensions of input as separate channels, like the color channels of an image (red, green and blue)."
  • "The benefits of using CNNs for sequence classification is that they can learn from the raw time series data directly, and in turn do not require domain expertise to manually engineer input features."
  • "We must define the CNN model using the Keras deep learning library."
  • "CNNs learn very quickly, so the dropout layer is intended to help slow down the learning process and hopefully result in a better final model. The pooling layer reduces the learned features to 1/4 their size, consolidating them to only the most essential elements. After the CNN and pooling, the learned features are flattened to one long vector and pass through a fully connected layer before the output layer used to make prediction."
  • "The feature maps are the number of times the input is processed or interpreted."
  • "The kernel size is the number of input time steps considered as the input sequence is read or processed onto the feature maps."
  • "The model is fit for a fixed number of epochs, in this case 10, and a batch size of 32 samples, where 32 windows of data will be exposed to the model before the weights of the model are updated."
  • Standardization refers to shifting the distribution of each variable such that it has a mean of zero and a standard deviation of 1. It really makes sense only if the distribution of each variable is Gaussian.
  • The StandardScaler class from scikit-learn will be used to perform the transform.
1D CNN with and without standardization
  • CNN kernel size: a large kernel size means a less rigorous reading of the data, but may result in a more generalized snapshot of the input.
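  • A sketch of such a 1D CNN; the input dimensions follow the UCI HAR dataset used in the book (128 time steps, 9 sensor channels, 6 activities), while the remaining hyperparameters are taken from the notes above or assumed:

    # 1D CNN for human activity recognition from multichannel accelerometer windows
    from keras.models import Sequential
    from keras.layers import Conv1D, Dropout, MaxPooling1D, Flatten, Dense

    n_timesteps, n_features, n_outputs = 128, 9, 6

    model = Sequential()
    model.add(Conv1D(64, kernel_size=3, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(Conv1D(64, kernel_size=3, activation='relu'))
    model.add(Dropout(0.5))                 # slow down the fast-learning CNN
    model.add(MaxPooling1D(pool_size=2))    # keep only the most essential features
    model.add(Flatten())
    model.add(Dense(100, activation='relu'))
    model.add(Dense(n_outputs, activation='softmax'))   # one probability per activity
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    # model.fit(trainX, trainy, epochs=10, batch_size=32)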

Holidays


Abondance

Chapter 25: How to develop LSTMs for Human Activity Recognition

  • LSTM network models are a type of Recurrent Neural Network that are able to learn and remember over long sequences of input data.
Human Activity Recognition accuracy with different models

Conclusion

  • Yet another book from Jason Brownlee. This book is helpful for summarizing all the good practices learned so far: (i) the need for data preparation, with plenty of code examples on how to prepare the data, (ii) the mixing of different neural networks for achieving better results at time series forecasting, (iii) the process of achieving ever better performance by starting with a baseline based on traditional simple methods and then adding neural networks to go beyond it.
  • Although the book is rich in terms of code examples and good practices, nearly all examples are targeted at getting better performance on human activity prediction. So I am missing a last step here: what to do with this? How can I use it for my own ideas? The previous book gave me more insight into how to use the examples on my own data (photo recognition).
  • Once again a very big thanks to Jason Brownlee. All these didactic lines of code will certainly be helpful in the future.
  • Now let's start with the next one: "Long Short-Term Memory Networks With Python".