Transformative Tales: Unleashing the Power of Natural Language Processing with RNNs and Attention Mechanisms
In the ever-evolving landscape of artificial intelligence, Natural Language Processing (NLP) has emerged as a captivating frontier, revolutionizing how machines comprehend and interact with human language. Among the many tools in the NLP arsenal, Recurrent Neural Networks (RNNs) and attention mechanisms stand out as key players, empowering models to understand context, capture nuances, and deliver more sophisticated language processing capabilities.
Let's embark on a journey into the world of NLP, where the synergy of RNNs and attention mechanisms is reshaping the way machines interpret and generate human-like text.
Figure 1. An RNN unrolled through time
The Jupyter Notebook for this blog can be found here.
Table of Contents:
- What is Natural Language Processing (NLP)?
- Generating Shakespearean Text Using a Character RNN
- Creating the Training Dataset
- How to Split a Sequential Dataset
- Chopping the Sequential Dataset into Multiple Windows
- Building and Training the Char-RNN Model
- Using the Char-RNN Model
- Generating Fake Shakespearean Text
- Attention Mechanisms
- Conclusion
1. What is Natural Language Processing (NLP)?
Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language. The ultimate objective of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and contextually relevant.
NLP involves a combination of linguistic, statistical, and machine-learning techniques. Early approaches to NLP relied heavily on rule-based systems, but recent advances have been driven by the rise of machine learning and deep learning methods, particularly using neural networks.
Key components and tasks within NLP include:
- Tokenization: Breaking down text into smaller units, such as words or phrases, known as tokens (see the short example after this list).
- Part-of-Speech (POS) Tagging: Assigning grammatical labels (e.g., noun, verb, adjective) to each word in a sentence.
- Named Entity Recognition (NER): Identifying and classifying entities (e.g., names of people, locations, organizations) in text.
- Sentiment Analysis: Determining the sentiment or emotional tone expressed in a piece of text, such as positive, negative, or neutral.
- Language Modeling: Understanding the structure and relationships within a given language, often through the use of statistical models or deep learning techniques.
- Text Summarization: Generating concise and coherent summaries of longer texts.
- Speech Recognition: Converting spoken language into written text.
- Question Answering: Building systems that can understand and respond to user queries with relevant information.
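As a quick illustration of the first task, here is word-level tokenization using Keras's text_to_word_sequence helper. This is a minimal sketch; any simple tokenizer would do:
from tensorflow.keras.preprocessing.text import text_to_word_sequence
# Split a sentence into lowercase word tokens, stripping punctuation
print(text_to_word_sequence("To be, or not to be?"))
# ['to', 'be', 'or', 'not', 'to', 'be']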
Prominent models in the NLP space include recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and more recently, transformer-based architectures like BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).
2. Generating Shakespearean Text Using a Character RNN
In a famous 2015 blog post titled "The Unreasonable Effectiveness of Recurrent Neural Networks," Andrej Karpathy showed how to train an RNN to predict the next character in a sentence. This Char-RNN can then be used to generate novel text, one character at a time.
Here is a small sample of the text generated by a Char-RNN model after it was trained on all of Shakespeare's work:
PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fad,
And who is but a chain and subjects of his death,
I should not sleep.
Not exactly a masterpiece, but it is still impressive that the model was able to learn words, grammar, proper punctuation, and more, just by learning to predict the next character in a sentence.
Let's look at how to build a Char-RNN, step by step, starting with the creation of the dataset.
Creating the Training Dataset
First, let's download all of Shakespeare's work using Keras's handy get_file() function, fetching the data from Andrej Karpathy's Char-RNN project:
import numpy as np
import tensorflow as tf
from tensorflow import keras
shakespeare_url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
filepath = keras.utils.get_file("shakespeare.txt", shakespeare_url)
with open(filepath) as f:
    shakespeare_text = f.read()
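To sanity-check the download, we can print the first few characters (the corpus opens with a speaker's name, as in a play script):
print(shakespeare_text[:80])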
Next, we must encode every character as an integer. We need to fit a tokenizer to the text: it will find all the characters used in the text and map each of them to a different character ID, from 1 to the number of distinct characters:
tokenizer = keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(shakespeare_text)
We set char_level=True to get character-level encoding rather than the default word-level encoding. Note that this tokenizer converts the text to lowercase by default. Now the tokenizer can encode a sentence (or a list of sentences) to a list of character IDs and back, and it tells us how many distinct characters there are and the total number of characters in the text:
max_id = len(tokenizer.word_index) # number of distinct characters
dataset_size = tokenizer.document_count # total number of characters
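For example, we can encode a sentence to character IDs and decode it back (the exact IDs depend on the fitted vocabulary):
tokenizer.texts_to_sequences(["First"])           # e.g. [[20, 6, 9, 8, 3]]
tokenizer.sequences_to_texts([[20, 6, 9, 8, 3]])  # e.g. ['f i r s t']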
Let's encode the full text so each character is represented by its ID (we subtract 1 to get IDs from 0 to 38, rather than from 1 to 39):
[encoded] = np.array(tokenizer.texts_to_sequences([shakespeare_text])) - 1
Before we continue, we need to split the dataset into a training set, a validation set, and a test set. We can't just shuffle all the characters in the text.
How to Split a Sequential Dataset
It is very important to avoid any overlap between the training set, the validation set, and the test set, and splitting a time series into these three sets is not a trivial task. A simple approach is to split by position in the sequence: let's take the first 90% of the text for the training set (keeping the next 5% for the validation set and the final 5% for the test set), and create a tf.data.Dataset that will return each character one by one from the training set:
train_size = dataset_size * 90 // 100
dataset = tf.data.Dataset.from_tensor_slices(encoded[:train_size])
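We can peek at the first few elements to confirm the pipeline so far (each element is a single character ID, and the values depend on the fitted tokenizer):
for item in dataset.take(5):
    print(item.numpy())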
Chopping the Sequential Dataset into Multiple Windows
We will use the dataset's window() method to convert this long sequence of characters into many smaller windows of text. Every instance in the dataset will be a fairly short substring of the whole set, and the RNN will be unrolled only over the length of these substrings. This is called truncated backpropagation through time.
Let's call the window() method to create a dataset of short text windows:
n_steps = 100
window_length = n_steps + 1 # target = input shifted 1 character ahead
dataset = dataset.window(window_length, shift=1, drop_remainder=True)
You can try tuning n_steps: it is easier to train RNNs on shorter input sequences, but of course the RNN will not be able to learn any pattern longer than n_steps, so don't make it too small.
By default, the window() method creates nonoverlapping windows, but to get the largest possible training set we use shift=1 so that the first window contains characters 0 to 100, the second contains characters 1 to 101, the third contains characters 2 to 102, and so on. To ensure that all windows are exactly 101 characters long, we set drop_remainder=True (otherwise the last 100 windows will contain 100 characters, 99 characters, and so on down to 1 character).
The window() method creates a dataset that contains windows, each of which is also represented as a dataset. It's a nested dataset, analogous to a list of lists. However, we cannot use a nested dataset directly for training, as our model will expect tensors as inputs, not datasets. So we must call the flat_map() method: it converts a nested dataset into a flat dataset. For example, suppose {1, 2, 3} represents a dataset containing the sequence of tensors 1, 2, and 3. If you flatten the nested dataset {{1, 2}, {3, 4, 5, 6}}, you get back the flat dataset {1, 2, 3, 4, 5, 6}.
dataset = dataset.flat_map(lambda window: window.batch(window_length))  # batch each window into a single tensor
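To see window() and flat_map() in action, here is a tiny self-contained example on the integer sequence 0 to 4 (independent of the Shakespeare data):
toy = tf.data.Dataset.range(5).window(3, shift=1, drop_remainder=True)
toy = toy.flat_map(lambda window: window.batch(3))
for tensor in toy:
    print(tensor.numpy())  # [0 1 2], then [1 2 3], then [2 3 4]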
Now the dataset contains consecutive windows of 101 characters each. Gradient Descent works best when the instances in the training set are independent and identically distributed, so we need to shuffle these windows. Then we can batch the windows and separate the inputs (the first 100 characters of each window) from the targets (the last 100 characters, i.e., the inputs shifted one character ahead):
batch_size = 32
dataset = dataset.shuffle(10000).batch(batch_size)
dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:]))
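A quick sanity check: each batch should now contain 32 input windows of 100 character IDs each, along with 32 matching target windows:
for X_batch, Y_batch in dataset.take(1):
    print(X_batch.shape, Y_batch.shape)  # (32, 100) (32, 100)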
Figure 2 summarizes the dataset preparation steps discussed so far (showing windows of length 11 rather than 101, and a batch size of 3 instead of 32).
Figure 2. Preparing a dataset of shuffled windows
Categorical input features should generally be encoded, usually as one-hot vectors or as embeddings. Here, we will encode each character using a one-hot vector because there are fairly few distinct characters (only 39).
dataset = dataset.map(
    lambda X_batch, Y_batch: (tf.one_hot(X_batch, depth=max_id), Y_batch))
Finally, we just need to add prefetching:
dataset = dataset.prefetch(1)
That's it! Preparing the dataset was the hardest part. Now let's create the model.
Building and Training the Char-RNN Model
To predict the next character based on the previous 100 characters, we can use an RNN with 2 GRU layers of 128 units each and 20% dropout on both the inputs (dropout) and the hidden states (recurrent_dropout).
The output layer is a time-distributed Dense layer. This must have 39 units (max_id) because there are 39 distinct characters in the text, and we want to output a probability for each possible character (at each time step). The output probabilities should sum up to 1 at each time step, so we apply the softmax activation function to the outputs of the Dense layer. We can then compile this model, using the "sparse_categorical_crossentropy" loss and an Adam optimizer.
Finally, we are ready to train the model for several epochs:
model = keras.models.Sequential([
    keras.layers.GRU(128, return_sequences=True, input_shape=[None, max_id],
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.GRU(128, return_sequences=True,
                     dropout=0.2, recurrent_dropout=0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(max_id,
                                                    activation="softmax"))
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
history = model.fit(dataset, epochs=10)
Using the Char-RNN Model
Now we have a model that can predict the next character in a text written by Shakespeare. To feed it some text, we first need to preprocess it as we did earlier, so let's create a little function for this:
def preprocess(texts):
    X = np.array(tokenizer.texts_to_sequences(texts)) - 1
    return tf.one_hot(X, max_id)
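For instance, a single 10-character string becomes a one-hot tensor of shape [1, 10, 39]: one sequence, 10 time steps, and max_id possible characters:
print(preprocess(["How are yo"]).shape)  # (1, 10, 39)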
Now let's use the model to predict the next letter in some text:
X_new = preprocess(["How are yo"])
Y_pred = np.argmax(model(X_new), axis=-1)
tokenizer.sequences_to_texts(Y_pred + 1)[0][-1] # 1st sentence, last char
Output: 'u'
Success! The model guessed right. Now let's use this model to generate new text.
Generating Fake Shakespearean Text
To generate new text using the Char-RNN model, we could feed it some text, make the model predict the most likely next letter, add it at the end of the text, then give the extended text to the model to guess the next letter, and so on. But in practice, this often leads to the same words being repeated over and over again. Instead, we can pick the next character randomly, with a probability equal to its estimated probability. This will generate more diverse and interesting text.
TensorFlow's tf.random.categorical() function samples random class indices, given the class log probabilities (logits). To have more control over the diversity of the generated text, we can divide the logits by a number called the temperature: a temperature close to 0 will favor the high-probability characters, while a very high temperature will give all characters a nearly equal probability.
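Here is a minimal illustration of the effect, using made-up probabilities for three classes. Dividing the log probabilities by the temperature and re-normalizing sharpens or flattens the distribution:
probas = tf.constant([[0.6, 0.3, 0.1]])  # hypothetical class probabilities
for T in (0.2, 1.0, 2.0):
    rescaled = tf.math.log(probas) / T
    print(T, tf.nn.softmax(rescaled).numpy())
# T=0.2 concentrates almost all the mass on the most likely class,
# while T=2.0 flattens the distribution toward uniform.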
The following next_char() function uses this approach to pick the next character to add to the input text:
def next_char(text, temperature=1):
    X_new = preprocess([text])
    y_proba = model(X_new)[0, -1:, :]
    rescaled_logits = tf.math.log(y_proba) / temperature
    char_id = tf.random.categorical(rescaled_logits, num_samples=1) + 1
    return tokenizer.sequences_to_texts(char_id.numpy())[0]
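For example (the next character is sampled at random, so your result may differ):
next_char("How are yo", temperature=1)  # e.g. 'u'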
Next, we can write a small function that will repeatedly call next_char() to get the next character and append it to the given text:
def complete_text(text, n_chars=50, temperature=1):
    for _ in range(n_chars):
        text += next_char(text, temperature)
    return text
We are now ready to generate some text! Let's try with different temperatures:
print(complete_text("t", temperature=0.2))
Output: the more than any man
and so we would be so to the
print(complete_text("t", temperature=1))
Output: to co band year enough to her ragh solvice.
horten
print(complete_text("t", temperature=2))
Output: tpenio;
my vanratigo, and leb'd we deniburs:
-faee-
Apparently, our Shakespeare model works best at a temperature close to 1. To generate more convincing text, you could try using more GRU layers and more neurons per layer, train for longer, and add some regularization.
3. Attention Mechanisms
In the realm of Natural Language Processing (NLP), the introduction of attention mechanisms has been a game-changer, revolutionizing the way models understand and process sequential data. This dynamic concept, inspired by human cognitive processes, enables models to selectively focus on relevant parts of input sequences, fostering enhanced contextual understanding and improved performance across various tasks.
Key Concepts:
- Contextual Focus:
- Attention mechanisms allow models to prioritize specific elements in an input sequence rather than treating every position equally.
- This selective attention mimics the way humans process information, emphasizing the significance of certain elements over others.
- Self-Attention:
- Self-attention, a crucial variant of attention mechanisms, empowers models to consider relationships between different words in a sequence simultaneously.
- This ability to weigh the importance of each word relative to others enhances the model's understanding of intricate dependencies (a minimal code sketch follows this list).
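To make this concrete, here is a minimal sketch of scaled dot-product self-attention, the operation at the heart of Transformer models. The shapes and variable names are illustrative, and this snippet is separate from the Char-RNN model built above:
def scaled_dot_product_attention(Q, K, V):
    d_k = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = tf.matmul(Q, K, transpose_b=True) / tf.sqrt(d_k)  # query/key similarity
    weights = tf.nn.softmax(scores, axis=-1)  # attention weights sum to 1 over the keys
    return tf.matmul(weights, V)  # weighted sum of the values

X = tf.random.normal([1, 4, 8])  # 1 sequence, 4 tokens, 8-dimensional representations
outputs = scaled_dot_product_attention(X, X, X)  # self-attention: Q = K = V
print(outputs.shape)  # (1, 4, 8): each token becomes a context-aware mixture of all tokens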
Applications of RNNs and Attention in NLP:
- Machine Translation: RNNs with attention mechanisms have significantly improved machine translation systems, allowing models to focus on specific words or phrases during the translation process.
- Text Summarization: Attention mechanisms enhance the quality of text summarization by enabling the model to prioritize important information while generating concise summaries.
- Sentiment Analysis: The contextual understanding provided by RNNs, coupled with attention mechanisms, improves sentiment analysis models, allowing them to capture the nuances and context of sentiment in a piece of text.
Challenges and Future Directions:
- Computational Complexity: Attention compares every position in a sequence with every other position, so its cost grows quadratically with sequence length, which becomes demanding in large-scale models and for long inputs.
- Interpretable Attention: Ensuring that attention mechanisms are interpretable is an ongoing challenge. Researchers are exploring methods to make attention more transparent and explainable to users.
4. Conclusion
As we navigate the intricate landscape of NLP, the marriage of Recurrent Neural Networks and attention mechanisms has ushered in a new era of language processing capabilities. From capturing context in machine translation to fine-tuning sentiment analysis, these dynamic duos are at the forefront of innovation. As research and development in this space continue to flourish, the future holds promise for even more sophisticated models that unravel the complexities of human language with unprecedented depth and precision. The transformative tales of NLP with RNNs and attention mechanisms are only just beginning.
Stay tuned for more interesting topics on machine learning!