
Processing Sequences using Recurrent Neural Networks (RNNs)

In the world of deep learning, Recurrent Neural Networks (RNNs) are a fundamental building block for processing sequential data. They have been used in various applications, such as natural language processing, time series analysis, speech recognition, and more. In this blog, we will delve into the workings of RNNs and explore how they can be harnessed to process sequences effectively.

Figure 1. A Recurrent Neural Network (RNN) architecture

The Jupyter Notebook for this blog can be found here.

Table of Contents:

  1. Understanding Sequences
  2. Recurrent Neurons and Layers
    • Memory Cells
    • Input and Output Sequences
  3. Training RNNs
  4. Handling Long Sequences
    • Tackling the Short-Term Memory Problem
  5. Challenges and Considerations
  6. Conclusion

1. Understanding Sequences

Before diving into RNNs, let's first understand what sequences are. A sequence is a series of data points arranged in a specific order. This order carries important information that an algorithm needs to capture to make sense of the data. Sequences can be found in various forms, such as:
  1. Natural Language: Sentences, paragraphs, and documents.
  2. Time Series: Stock prices, weather data, and sensor readings.
  3. Speech: Audio signals.
  4. Genomic Data: DNA sequences
  5. Video Frames: A sequence of images
To process these types of data effectively, we need models that can handle the temporal dependencies within the sequence.

Here is how an RNN works:
  1. The input sequence is fed to the RNN one element at a time, and the hidden state (internal memory) is updated at each step.
  2. At each time step, the RNN processes the current input and combines it with the information stored in the hidden state from the previous time step.
  3. The RNN generates an output at each step, which can be used for prediction or classification tasks.
  4. The hidden state is passed to the next time step, allowing the network to maintain a sense of context and memory of past inputs.
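The steps above can be sketched with a minimal NumPy loop. This is an illustrative forward pass only (no training); the weight names `W_x`, `W_h`, and the hidden size of 5 are arbitrary choices for the example.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a simple RNN over a sequence, one element at a time."""
    h = np.zeros(W_h.shape[0])          # hidden state starts at zero
    outputs = []
    for x in inputs:                    # step 1: feed one element per time step
        # step 2: combine the current input with the previous hidden state
        h = np.tanh(W_x @ x + W_h @ h + b)
        outputs.append(h)               # step 3: emit an output at each step
    return np.stack(outputs), h         # step 4: the final state carries context

rng = np.random.default_rng(42)
seq = rng.normal(size=(6, 3))           # 6 time steps, 3 features each
W_x = rng.normal(size=(5, 3))           # input-to-hidden weights (5 hidden units)
W_h = rng.normal(size=(5, 5))           # hidden-to-hidden (recurrent) weights
b = np.zeros(5)

outputs, last_state = rnn_forward(seq, W_x, W_h, b)
print(outputs.shape)                    # (6, 5): one output vector per time step
```

Because the same weights are reused at every step, the loop can process a sequence of any length.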

2. Recurrent Neurons and Layers

Recurrent neurons and layers are essential components within Recurrent Neural Networks (RNNs), playing a pivotal role in processing sequential data. Unlike traditional feedforward neural networks, recurrent neurons possess internal memory, allowing them to maintain a sense of context and capture dependencies over time. At each time step, these neurons take input from the current data point and combine it with information stored in their internal memory, known as the hidden state, from the previous step. 

This recurrent structure enables the network to effectively model and learn patterns within sequences. In the context of RNNs, recurrent layers consist of interconnected recurrent neurons, forming a dynamic network capable of handling sequential information.

Let's look at the simplest possible RNN, composed of one neuron receiving inputs, producing an output, and sending that output back to itself, as shown in Figure 2 (left). At each time step t (also called a frame), this recurrent neuron receives the inputs x(t) as well as its own output from the previous time step, y(t-1). Since there is no previous output at the first time step, it is generally set to 0. We can represent this tiny network against the time axis, as shown in Figure 2 (right). This is called unrolling the network through time.

Figure 2. A recurrent neuron (left) unrolled through time (right)

You can easily create a layer of recurrent neurons. At each time step t, every neuron receives both the input vector x(t) and the output vector from the previous time step y(t-1), as shown in Figure 3.

Figure 3. A layer of recurrent neurons (left) unrolled through time (right)

Each recurrent neuron has two sets of weights: one for the input x(t) and the other for the outputs of the previous time step, y(t-1).
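For a whole layer, this computation can be vectorized over a mini-batch: Y(t) = φ(X(t)·Wx + Y(t-1)·Wy + b), where φ is the activation function. A small sketch (the batch size, layer size, and weight names are illustrative):

```python
import numpy as np

def rnn_layer_step(X_t, Y_prev, W_x, W_y, b):
    # Y(t) = tanh(X(t)·W_x + Y(t-1)·W_y + b), for a whole mini-batch at once
    return np.tanh(X_t @ W_x + Y_prev @ W_y + b)

batch, n_inputs, n_neurons = 4, 3, 5
rng = np.random.default_rng(0)
X_t = rng.normal(size=(batch, n_inputs))
Y_prev = np.zeros((batch, n_neurons))          # no previous output at t = 0
W_x = rng.normal(size=(n_inputs, n_neurons))   # weights for the inputs x(t)
W_y = rng.normal(size=(n_neurons, n_neurons))  # weights for the outputs y(t-1)
b = np.zeros(n_neurons)

Y_t = rnn_layer_step(X_t, Y_prev, W_x, W_y, b)
print(Y_t.shape)  # (4, 5): one output vector per instance in the batch
```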

Memory Cells

Since the output of a recurrent neuron at time step t is a function of all the inputs from previous time steps, you could say it has a form of memory. A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell). Memory cells serve as information storage units within the network, allowing it to selectively retain and update information over long sequences.

Input and Output Sequences

An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs (top-left network in Figure 4). This type of sequence-to-sequence network is useful for predicting time series such as stock prices: you feed it the prices over the last N days, and it must output the prices shifted by one day into the future (i.e., from N-1 days ago to tomorrow).

Alternatively, you could feed the network a sequence of inputs and ignore all outputs except the last one (top-right network in Figure 4). In other words, this is a sequence-to-vector network. For example, you could feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score (e.g., from -1 [hate] to +1 [love]).

Conversely, you could feed the network the same input vector over and over again at each time step and let it output a sequence (see the bottom-left network of Figure 4). This is a vector-to-sequence network. For example, the input could be an image, and the output could be a caption for that image. 

Lastly, you could have a sequence-to-vector network, called an encoder, followed by a vector-to-sequence network, called a decoder (see the bottom-right network in Figure 4). For example, this could be used for translating a sentence from one language to another. You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation and then the decoder would decode this vector into a sentence in another language. This two-step model is called an Encoder-Decoder.

Figure 4. Seq-to-seq (top left), seq-to-vector (top right), vector-to-seq (bottom left), and Encoder-Decoder (bottom right) networks
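In Keras, the difference between the first two variants comes down to the `return_sequences` argument. A minimal sketch (the layer sizes and input shapes are arbitrary choices for illustration):

```python
import tensorflow as tf
from tensorflow import keras

# Sequence-to-sequence: an output at every time step
seq_to_seq = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.SimpleRNN(20, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(1)),
])

# Sequence-to-vector: keep only the last output
seq_to_vec = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.SimpleRNN(20),   # return_sequences=False by default
    keras.layers.Dense(1),
])

x = tf.random.normal([32, 50, 1])   # 32 series, 50 time steps, 1 feature
print(seq_to_seq(x).shape)          # (32, 50, 1): one output per step
print(seq_to_vec(x).shape)          # (32, 1): a single output per series
```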

3. Training RNNs

To train an RNN, the trick is to unroll it through time (like we just did) and then simply use regular backpropagation (see Figure 5). This strategy is called backpropagation through time (BPTT).

Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows). Then the output sequence is evaluated using a cost function. The gradients of that cost function are then propagated backward through the unrolled network (represented by the solid arrows). Finally, the model parameters are updated using the gradients computed during BPTT. 
Figure 5. Backpropagation through time

Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output.
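In practice, frameworks like Keras handle the unrolling and BPTT automatically when you call `fit()`. A minimal sketch on a toy sine-wave series (the window length, layer size, and dataset are invented for the example):

```python
import numpy as np
from tensorflow import keras

# Toy dataset: predict the next value of a noisy sine wave
rng = np.random.default_rng(1)
t = np.linspace(0, 30, 1000, dtype=np.float32)
series = np.sin(t) + 0.1 * rng.normal(size=1000).astype(np.float32)

window = 20  # each training instance is a window of 20 past values
X = np.stack([series[i:i + window] for i in range(len(series) - window)])[..., np.newaxis]
y = series[window:]  # target: the value right after each window

model = keras.Sequential([
    keras.Input(shape=(window, 1)),
    keras.layers.SimpleRNN(20),
    keras.layers.Dense(1),
])
# Gradients are computed by backpropagation through the unrolled steps (BPTT)
model.compile(loss="mse", optimizer="adam")
model.fit(X, y, epochs=2, verbose=0)
```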

4. Handling Long Sequences

To train an RNN on long sequences, we must run it over many time steps, making the unrolled RNN a very deep network. Just like any deep neural network, it may suffer from the unstable gradients problem: it may take forever to train, or training may be unstable. Moreover, when an RNN processes a long sequence, it will gradually forget the first inputs in the sequence.

Tackling the Short-Term Memory Problem

Due to the transformations that the data goes through when traversing an RNN, some information is lost at each step. After a while, the RNN's state contains virtually no trace of the first inputs. To tackle this problem, various types of cells with long-term memory have been introduced.


LSTM cells
The Long Short-Term Memory (LSTM) cell was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. If you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better: training will converge faster, and it will detect long-term dependencies in the data.

LSTMs utilize memory cells equipped with three gates: the input gate, which controls the flow of new information; the forget gate, which decides what information to discard from the cell; and the output gate, which determines the information to be passed to the next time step (Figure 6). This gating mechanism allows LSTMs to selectively store and retrieve information over extended sequences.

Figure 6. An LSTM cell
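Treating the LSTM cell as a black box is exactly how it is used in Keras: swapping a basic cell for an LSTM is a one-line change. A small sketch (the layer sizes and input shape are illustrative):

```python
import tensorflow as tf
from tensorflow import keras

# An LSTM layer is a drop-in replacement for a basic recurrent layer
model = keras.Sequential([
    keras.Input(shape=(None, 1)),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.LSTM(32),   # the second layer keeps only its last output
    keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="adam")

x = tf.random.normal([8, 50, 1])   # 8 series, 50 time steps, 1 feature
print(model(x).shape)              # (8, 1): one prediction per series
```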


GRU cells
The Gated Recurrent Unit (GRU) cell in Figure 7 was proposed by Kyunghyun Cho et al. in 2014. The GRU is another variant of RNNs that, like LSTMs, aims to address the vanishing gradient problem while simplifying the network architecture. GRUs have a more streamlined structure with two gates: the update gate, which combines the functions of the input and forget gates in LSTMs, and the reset gate, which controls the information to be discarded.

Figure 7. A GRU cell

The reduced complexity of GRUs compared to LSTMs makes them computationally more efficient and easier to train. GRUs have proven effective in various applications, striking a balance between performance and simplicity.
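The reduced complexity is easy to verify: with the same number of units, a GRU layer has fewer trainable parameters than an LSTM layer (three gate computations instead of four). A quick comparison in Keras (the helper `n_params` and the sizes are invented for illustration):

```python
from tensorflow import keras

def n_params(layer):
    # count trainable parameters for a recurrent layer on 1-feature input
    model = keras.Sequential([keras.Input(shape=(None, 1)), layer])
    return model.count_params()

lstm_params = n_params(keras.layers.LSTM(32))
gru_params = n_params(keras.layers.GRU(32))
print(lstm_params, gru_params)  # the GRU needs fewer parameters per unit
```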

5. Challenges and Considerations

While RNNs are powerful tools for sequence processing, they come with their own set of challenges:
  1. Vanishing Gradient Problem: Training deep RNNs can be challenging because of the vanishing gradient problem, where gradients diminish as they are propagated backward through time. LSTMs and GRUs were developed to mitigate this issue.
  2. Training Time: RNNs can be computationally expensive to train, especially on long sequences. Techniques like mini-batch training and GPU acceleration can help.
  3. Overfitting: RNNs are prone to overfitting, so regularization techniques and proper validation are essential.
  4. Choosing the Right Architecture: Deciding between a vanilla RNN, LSTM, or GRU depends on the specific task and dataset. Experimentation is often required.

6. Conclusion

Recurrent Neural Networks (RNNs) have revolutionized the field of sequence processing. Their ability to capture temporal dependencies makes them invaluable for tasks ranging from natural language processing to time series analysis and beyond. As you explore the world of RNNs, remember to experiment with different architectures and techniques to find the best approach for your specific problem. With practice and creativity, you can leverage RNNs to unlock the potential of sequential data and drive innovation in various domains.

Stay tuned for more blogs on other important topics!

