In the ever-evolving landscape of artificial intelligence, the quest for machines to understand and generate meaningful representations of data has led to remarkable breakthroughs. Representational learning, a subfield of machine learning, explores the intricate process of learning hierarchical and abstract features from raw data. Two powerful techniques that have gained significant traction in this domain are Autoencoders and Generative Adversarial Networks (GANs).
| Figure 1. Generative Adversarial Network |
In this blog post, we will embark on a journey to explore the fascinating world of representational learning and generative models, delving into the mechanics of Autoencoders and GANs.
The Jupyter Notebook for this blog can be found here.
Table of Contents:
- Autoencoders: Unveiling Latent Representations
- Efficient Data Representations
- Performing PCA with an Undercomplete Linear Autoencoder
- Stacked Autoencoders
- Implementing a Stacked Autoencoder Using Keras
- Visualizing the Reconstructions
- Visualizing the Fashion MNIST Dataset
- Convolutional Autoencoders
- Recurrent Autoencoders
- Denoising Autoencoders
- Generative Adversarial Networks: An Introduction
- The Difficulties of Training GANs
- Conclusion
1. Autoencoders: Unveiling Latent Representations
Autoencoders, a class of neural networks, serve as unsupervised learning algorithms designed to encode and decode input data. The architecture comprises an encoder and a decoder, working in tandem to compress the input data into a latent representation and subsequently reconstruct it. The key to their efficacy lies in the creation of a bottleneck layer, forcing the network to capture the most salient features of the data.
Components of Autoencoders:
- Encoder: Responsible for mapping the input data to a lower-dimensional representation.
- Decoder: Reconstructs the original input from the encoded representation.
- Latent Space: The compressed representation of the input data.
Applications of Autoencoders:
- Data Compression: Efficiently represent and reconstruct data.
- Anomaly Detection: Identify outliers by examining reconstruction errors (see the sketch after this list).
- Image Denoising: Remove noise from images while preserving essential features.
- Generative Models: Randomly generate new data that looks very similar to the training data.
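As a concrete illustration of the anomaly-detection use case, here is a minimal sketch. It assumes ae is any trained autoencoder (such as the ones built later in this post) and X is a batch of inputs; the 99th-percentile cutoff is a hypothetical choice:
import numpy as np
# assuming `ae` is a trained autoencoder and `X` is a batch of inputs
reconstructions = ae.predict(X)
# mean squared reconstruction error per sample (averaged over all non-batch axes)
errors = np.mean(np.square(X - reconstructions), axis=tuple(range(1, X.ndim)))
threshold = np.percentile(errors, 99)  # hypothetical cutoff
anomalies = X[errors > threshold]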
2. Efficient Data Representations
An autoencoder looks at the inputs, converts them to an efficient latent representation, and then spits out something that (hopefully) looks very close to the inputs. An autoencoder is always composed of two parts: an encoder (or recognition network) that converts the inputs to a latent representation, followed by a decoder (or generative network) that converts the internal representation to the outputs (see Figure 2).
As you can see, an autoencoder typically has the same architecture as a Multi-Layer Perceptron (MLP), except that the number of neurons in the output layer must be equal to the number of inputs. The outputs are often called the reconstructions because the autoencoder tries to reconstruct the inputs, and the cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs.
Because the internal representation has a lower dimensionality than the input data (it is 2D instead of 3D), the autoencoder is said to be undercomplete. An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs. It is forced to learn the most important features in the input data (and drop the unimportant ones).
3. Performing PCA with an Undercomplete Linear Autoencoder
If the autoencoder uses only linear activations and the cost function is the mean squared error (MSE), then it ends up performing Principal Component Analysis.
The following code builds a simple linear autoencoder to perform PCA on a 3D dataset, projecting it to 2D:
import tensorflow as tf
tf.random.set_seed(42) # ensures reproducibility
encoder = tf.keras.Sequential([tf.keras.layers.Dense(2, input_shape=[3])])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(3, input_shape=[2])])
autoencoder = tf.keras.Sequential([encoder, decoder])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.5)
autoencoder.compile(loss="mse", optimizer=optimizer)
This code is really not very different from the MLPs we have built before, but there are a few things to note:
- Both the encoder and the decoder are regular Sequential models with a single Dense layer each, and the autoencoder is itself a Sequential model containing the encoder followed by the decoder.
- The autoencoder's number of outputs is equal to the number of inputs (i.e., 3).
- To perform simple PCA, we do not use any activation function (i.e., all neurons are linear and the cost function is the MSE).
Now let's train the model on a simple generated 3D dataset and use it to encode that same dataset (i.e., project it to 2D).
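X_train is not shown above; here is one way to generate such a dataset (a hypothetical example for illustration: 3D points lying close to a 2D plane, plus some noise):
import numpy as np
np.random.seed(42)  # hypothetical data-generation code, for reproducibility
m = 60
angles = np.random.rand(m) * 2 * np.pi
X_train = np.empty((m, 3))
X_train[:, 0] = np.cos(angles) + 0.1 * np.random.randn(m)
X_train[:, 1] = np.sin(angles) * 0.7 + 0.1 * np.random.randn(m)
# the third coordinate is roughly a linear combination of the first two
X_train[:, 2] = 0.1 * X_train[:, 0] + 0.3 * X_train[:, 1] + 0.1 * np.random.randn(m)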
history = autoencoder.fit(X_train, X_train, epochs=500, verbose=False)
codings = encoder.predict(X_train)
Note that the same dataset, X_train, is used as both the inputs and the targets. Figure 3 shows the output of the autoencoder's hidden layer. The autoencoder finds the best 2D plane to project the data onto, preserving as much variance in the data as it can (just like PCA).
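If you want to reproduce such a plot yourself, here is a quick matplotlib sketch:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(4, 3))
plt.plot(codings[:, 0], codings[:, 1], "b.")  # the 2D codings found by the encoder
plt.xlabel("$z_1$")
plt.ylabel("$z_2$", rotation=0)
plt.grid(True)
plt.show()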
You can think of autoencoders as a form of self-supervised learning (i.e., using a supervised learning technique with automatically generated labels, in this case simply equal to the inputs).
4. Stacked Autoencoders
Stacked Autoencoders, a variant of traditional autoencoders, elevate the capabilities of representational learning by employing multiple layers to capture increasingly complex features from input data. Like all autoencoders, they comprise an encoder and a decoder, but each is now built from several hidden layers, creating a hierarchical structure; such models are also known as deep autoencoders.
The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer).
In Figure 4, a stacked autoencoder for MNIST may have 784 inputs, followed by a hidden layer with 100 neurons, then a central hidden layer of 30 neurons, then another hidden layer with 100 neurons, and an output layer with 784 neurons.
Implementing a Stacked Autoencoder Using Keras
You can implement a stacked autoencoder very much like a regular deep MLP. For example, the following code builds a stacked autoencoder for Fashion MNIST using the ReLU activation function:
tf.random.set_seed(42)
stacked_encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu"),
])
stacked_decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(28 * 28),
    tf.keras.layers.Reshape([28, 28]),
])
stacked_ae = tf.keras.Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(loss="mse", optimizer="nadam")
history = stacked_ae.fit(X_train, X_train, epochs=20,
                         validation_data=(X_valid, X_valid))
The encoder takes 28 x 28 grayscale images and flattens them so that each image is represented as a vector of size 784. These vectors are passed through two Dense layers of diminishing sizes (100 units, then 30 units), so for each input image the encoder outputs a vector of size 30. The decoder takes these codings of size 30 and passes them through two Dense layers of increasing sizes (100 units, then 784 units), maintaining symmetry, and reshapes the final vectors into 28 x 28 arrays so that the outputs have the same shape as the inputs. When compiling the stacked autoencoder, we use the MSE loss and the Nadam optimizer, as in the code above. Finally, we train the model using X_train as both the inputs and the targets (and similarly, X_valid as both the validation inputs and targets).
Visualizing the Reconstructions
One way to ensure that an autoencoder is properly trained is to compare the inputs and the outputs: the differences should not be too significant. Let's plot a few images from the validation set, as well as their reconstructions:
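A minimal plotting helper along these lines does the job (a sketch; plot_reconstructions is a hypothetical name, not part of Keras):
import numpy as np
import matplotlib.pyplot as plt

def plot_reconstructions(model, images=X_valid, n_images=5):
    reconstructions = np.clip(model.predict(images[:n_images]), 0, 1)
    fig = plt.figure(figsize=(n_images * 1.5, 3))
    for image_index in range(n_images):
        plt.subplot(2, n_images, 1 + image_index)  # original on top
        plt.imshow(images[image_index], cmap="binary")
        plt.axis("off")
        plt.subplot(2, n_images, 1 + n_images + image_index)  # reconstruction below
        plt.imshow(reconstructions[image_index], cmap="binary")
        plt.axis("off")

plot_reconstructions(stacked_ae)
plt.show()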
| Figure 5. Original images (top) and their reconstructions (bottom) |
The reconstructions are recognizable, but a bit too lossy. We may need to train the model for longer, or make the encoder and decoder deeper, or make the codings larger.
Visualizing the Fashion MNIST Dataset
Now that we have trained a stacked autoencoder, we can use it to reduce the dataset's dimensionality.
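For example, we can compress the validation set to 30 dimensions with the encoder, then use scikit-learn's t-SNE implementation to bring the dimensionality down to 2D for plotting (a sketch, assuming scikit-learn is installed and y_valid holds the class labels):
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X_valid_compressed = stacked_encoder.predict(X_valid)
tsne = TSNE(init="pca", learning_rate="auto", random_state=42)
X_valid_2D = tsne.fit_transform(X_valid_compressed)

plt.scatter(X_valid_2D[:, 0], X_valid_2D[:, 1], c=y_valid, s=10, cmap="tab10")
plt.show()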
5. Convolutional Autoencoders
Convolutional neural networks are far better suited than dense networks to work with images. So if you want to build an autoencoder for images, you will need to build a convolutional autoencoder. The encoder is a regular CNN composed of convolutional layers and pooling layers. It typically reduces the spatial dimensionality of the inputs (i.e., height and width) while increasing the depth (i.e., the number of feature maps). The decoder must do the reverse (upscale the image and reduce its depth back to the original dimensions).
Here is a simple convolutional autoencoder for Fashion MNIST:
tf.random.set_seed(42)
conv_encoder = tf.keras.Sequential([
    tf.keras.layers.Reshape([28, 28, 1]),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=2),  # 28 x 28 -> 14 x 14
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=2),  # 14 x 14 -> 7 x 7
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=2),  # 7 x 7 -> 3 x 3
    tf.keras.layers.Conv2D(30, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAvgPool2D()  # output: codings of size 30
])
conv_decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(3 * 3 * 16),  # 30 -> 144
    tf.keras.layers.Reshape((3, 3, 16)),
    tf.keras.layers.Conv2DTranspose(32, 3, strides=2, activation="relu"),  # -> 7 x 7
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same",
                                    activation="relu"),  # -> 14 x 14
    tf.keras.layers.Conv2DTranspose(1, 3, strides=2, padding="same"),  # -> 28 x 28
    tf.keras.layers.Reshape([28, 28]),
])
conv_ae = tf.keras.Sequential([conv_encoder, conv_decoder])
conv_ae.compile(loss="mse", optimizer="nadam")
history = conv_ae.fit(X_train, X_train, epochs=10,
                      validation_data=(X_valid, X_valid))
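If you defined the plot_reconstructions() helper sketched earlier, you can reuse it here to sanity-check the convolutional autoencoder's outputs:
plot_reconstructions(conv_ae)
plt.show()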
6. Recurrent Autoencoders
If you want to build an autoencoder for sequences, such as time series or text (e.g., for unsupervised learning or dimensionality reduction), then recurrent neural networks may be better suited than dense networks. Building a recurrent autoencoder is straightforward: the encoder is typically a sequence-to-vector RNN which compresses the input sequence down to a single vector. The decoder is a vector-to-sequence RNN that does the reverse:
tf.random.set_seed(42)
recurrent_encoder = tf.keras.Sequential([
    tf.keras.layers.LSTM(100, return_sequences=True, input_shape=[None, 28]),
    tf.keras.layers.LSTM(30)
])
recurrent_decoder = tf.keras.Sequential([
    tf.keras.layers.RepeatVector(28),
    tf.keras.layers.LSTM(100, return_sequences=True),
    tf.keras.layers.Dense(28)
])
recurrent_ae = tf.keras.Sequential([recurrent_encoder, recurrent_decoder])
recurrent_ae.compile(loss="mse", optimizer="nadam")
This recurrent autoencoder can process sequences of any length, with 28 dimensions per time step. Conveniently, this means it can process Fashion MNIST images by treating each image as a sequence of rows: at each time step, the RNN will process a single row of 28 pixels.
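As with the other autoencoders, training simply uses the images as both inputs and targets (a sketch, assuming the same X_train and X_valid as before; each 28 x 28 image is already a valid sequence of 28 time steps with 28 dimensions each):
history = recurrent_ae.fit(X_train, X_train, epochs=10,
                           validation_data=(X_valid, X_valid))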
7. Denoising Autoencoders
Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs. The noise can be pure Gaussian noise added to the inputs, or it can be randomly switched-off inputs, just like in dropout (Figure 8).
The implementation is straightforward: it is a regular stacked autoencoder with an additional Dropout layer applied to the encoder's inputs. Recall that the Dropout layer is only active during training (and so is the GaussianNoise layer):
tf.random.set_seed(42) # extra code – ensures reproducibility on CPU
dropout_encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu")
])
dropout_decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(28 * 28),
    tf.keras.layers.Reshape([28, 28])
])
dropout_ae = tf.keras.Sequential([dropout_encoder, dropout_decoder])
# extra code – compiles and fits the model
dropout_ae.compile(loss="mse", optimizer="nadam")
history = dropout_ae.fit(X_train, X_train, epochs=10,
                         validation_data=(X_valid, X_valid))
Figure 9 shows a few noisy images (with half the pixels turned off), and the images reconstructed by the dropout-based denoising autoencoder. Notice how the autoencoder guesses details that are actually not in the input.
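As mentioned earlier, Gaussian noise is a drop-in alternative to dropout. Here is a minimal variant of the encoder (the decoder would be unchanged; the 0.2 standard deviation is a hypothetical choice):
gaussian_encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.GaussianNoise(0.2),  # like Dropout, only active during training
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(30, activation="relu")
])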
8. Generative Adversarial Networks: An Introduction
GANs, introduced by Ian Goodfellow and his colleagues in 2014, have revolutionized generative modeling. The fundamental idea involves pitting a generator against a discriminator in a competitive game, where the generator learns to create realistic data, and the discriminator learns to distinguish between real and generated samples.
Generator
Takes a random distribution as input (typically Gaussian) and outputs some data, typically an image. You can think of the random inputs as the latent representations (i.e., codings) of the image to be generated.
Discriminator
Takes either a fake image from the generator or a real image from the training set as input and must guess whether the input image is fake or real.
Applications of GANs:
- Image Synthesis: Generate lifelike images from random noise.
- Style Transfer: Apply the artistic style of one image to another.
- Data Augmentation: Expand training datasets by generating additional samples.
Let's go ahead and build a simple GAN for Fashion MNIST.
First, we need to build the generator and the discriminator. The generator is similar to an autoencoder's decoder: it takes codings as input and outputs images. The discriminator is a regular binary classifier: it takes an image as input and ends with a Dense layer containing a single unit and using the sigmoid activation function.
tf.random.set_seed(42) # ensures reproducibility
codings_size = 30
Dense = tf.keras.layers.Dense
generator = tf.keras.Sequential([
    Dense(100, activation="relu", kernel_initializer="he_normal"),
    Dense(150, activation="relu", kernel_initializer="he_normal"),
    Dense(28 * 28, activation="sigmoid"),
    tf.keras.layers.Reshape([28, 28])
])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    Dense(150, activation="relu", kernel_initializer="he_normal"),
    Dense(100, activation="relu", kernel_initializer="he_normal"),
    Dense(1, activation="sigmoid")
])
gan = tf.keras.Sequential([generator, discriminator])
Next, we need to compile these models. As the discriminator is a binary classifier, we can naturally use the binary cross-entropy loss. The gan model is also a binary classifier, so it can use the binary cross-entropy loss too. Importantly, the discriminator should not be updated while we train the generator through the gan model, so we make it non-trainable before compiling gan (this only affects gan, since the discriminator itself was already compiled). The generator is only ever trained through the gan model, so we do not need to compile it at all.
discriminator.compile(loss="binary_crossentropy", optimizer="rmsprop")
discriminator.trainable = False
gan.compile(loss="binary_crossentropy", optimizer="rmsprop")
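Because training alternates between two phases (train the discriminator, then train the generator through the gan model), we cannot simply call fit(). A custom loop along the following lines is a common approach (a sketch, assuming X_train contains the Fashion MNIST training images scaled to the [0, 1] range):
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices(X_train).shuffle(buffer_size=1000)
dataset = dataset.batch(batch_size, drop_remainder=True).prefetch(1)

def train_gan(gan, dataset, batch_size, codings_size, n_epochs=50):
    generator, discriminator = gan.layers
    for epoch in range(n_epochs):
        for X_batch in dataset:
            # phase 1: train the discriminator on half fake, half real images
            noise = tf.random.normal(shape=[batch_size, codings_size])
            generated_images = generator(noise)
            X_fake_and_real = tf.concat([generated_images, X_batch], axis=0)
            y1 = tf.constant([[0.]] * batch_size + [[1.]] * batch_size)
            discriminator.train_on_batch(X_fake_and_real, y1)
            # phase 2: train the generator (through gan) to fool the discriminator
            noise = tf.random.normal(shape=[batch_size, codings_size])
            y2 = tf.constant([[1.]] * batch_size)
            gan.train_on_batch(noise, y2)

train_gan(gan, dataset, batch_size, codings_size, n_epochs=50)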
The Difficulties of Training GANs
Training GANs poses several challenges that have been the focus of extensive research within the machine-learning community.
- GANs are trained as a competitive two-player game, and reaching a stable equilibrium between the generator and the discriminator is challenging.
- GANs are susceptible to training instability and sensitivity to hyperparameters, which require considerable experimentation and fine-tuning.
- The generator may struggle to learn effectively when the discriminator gets too good, as the generator's gradients vanish; GANs are also prone to mode collapse, where the generator's outputs gradually become less diverse.
- They are notorious for being computationally intensive, demanding significant resources and time for training.
9. Conclusion
Representational learning through Autoencoders and GANs represents a pivotal step forward in the realm of artificial intelligence. These models not only facilitate the extraction of meaningful representations from data but also empower machines to generate content that mirrors human creativity. As researchers continue to refine these techniques, the possibilities for innovation and application in diverse fields are boundless, marking an exciting era in the evolution of machine learning.
Stay tuned for more interesting topics on machine learning!