
Introduction to Artificial Neural Networks (ANN) with Keras

Artificial Neural Networks (ANNs) have emerged as a powerful tool for solving complex problems in fields ranging from image and speech recognition to natural language processing and financial predictions. With the advent of deep learning, ANNs have become even more effective at tackling intricate tasks. In this blog, we'll explore the fundamentals of ANNs and demonstrate how to build them using the popular deep-learning framework Keras.

Table of Contents:

  1. From Biological Neurons to Artificial Neurons
    • Biological Neurons
    • Logical Computations with Neurons
    • The Perceptron
    • The Multilayer Perceptron and Backpropagation
    • Regression MLPs
    • Classification MLPs
  2. Implementing MLPs with Keras
    • Building an image classifier using the Sequential API
    • Saving and Restoring a Model
    • Using Callbacks
  3. Fine-tuning a Neural Network
    • Number of Hidden Layers
    • Number of Neurons per hidden layer
    • Learning Rate, Batch Size, and Other Hyperparameters
  4. Conclusion


The Jupyter Notebook implementation for this blog can be found here.

1. From Biological Neurons to Artificial Neurons

Biological Neurons

Biological neurons have served as the inspirational foundation for artificial neurons utilized in Artificial Neural Networks (ANNs). Just as biological neurons transmit signals through synapses, artificial neurons receive input signals, apply mathematical transformations, and produce an output. This conceptual parallel helps ANNs mimic certain aspects of human cognition, including learning, recognition, and decision-making. While artificial neurons are highly simplified abstractions of their biological counterparts, they share the fundamental principle of information processing through interconnected units, making ANNs a powerful tool for tackling complex computational tasks in diverse domains.
Figure 1. Comparison of a biological neuron with an artificial neuron

Logical Computations with Neurons

Logical computations with neurons involve replicating the basic operations of Boolean logic gates, such as AND, OR, and NOT, using artificial neurons. Neurons can be designed to produce specific outputs based on input combinations, effectively emulating the logical operations seen in digital circuits. For example, an artificial neuron can be configured to mimic the behavior of an AND gate by requiring multiple input signals to be active in order to generate an output signal. This capability is fundamental in building the foundation of artificial neural networks, enabling them to process and make decisions on data, making them versatile tools for solving problems in various fields, from image recognition to natural language processing.

Figure. ANNs performing simple logical computations

Let's see what these networks do:
  1. The first network on the left is the identity function: if neuron A is activated, then neuron C gets activated as well; but if neuron A is off, then neuron C is off as well.
  2. The second network performs a logical AND: neuron C is activated only when both neurons A and B are activated (a single input signal is not enough to activate neuron C).
  3. The third network performs a logical OR: neuron C gets activated if either neuron A or neuron B is activated (or both).
  4. The fourth network computes a slightly more complex logical proposition: neuron C is activated only if neuron A is active and neuron B is off. If neuron A is active all the time, then you get a logical NOT: neuron C is active when neuron B is off, and vice versa.
You can imagine how these networks can be combined to compute complex logical expressions.
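As a small illustration (the weights and thresholds below are chosen for this sketch, not taken from the figure), these logical computations can be reproduced with simple threshold neurons in a few lines of NumPy:

import numpy as np

def tlu(inputs, weights, threshold):
    # Threshold neuron: fires (returns 1) if the weighted sum reaches the threshold
    return int(np.dot(inputs, weights) >= threshold)

for a in (0, 1):
    for b in (0, 1):
        logical_and = tlu([a, b], weights=[1, 1], threshold=2)   # C = A AND B
        logical_or  = tlu([a, b], weights=[1, 1], threshold=1)   # C = A OR B
        a_and_not_b = tlu([a, b], weights=[1, -1], threshold=1)  # C = A AND (NOT B)
        print(a, b, logical_and, logical_or, a_and_not_b)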

The Perceptron

The perceptron, often considered the simplest neural network architecture, is a single-layer feedforward model that played a pivotal role in the development of artificial intelligence. Invented by Frank Rosenblatt in 1957, it operates by taking multiple inputs, applying weighted sums, and a threshold activation function to produce an output (Figure 2). Perceptrons are used primarily for binary classification tasks and can learn linear decision boundaries. Though limited to linearly separable problems, they laid the foundation for more complex neural network structures and the broader field of deep learning, showcasing the potential of artificial neurons in making simple yet effective decisions in machine learning applications. 

Figure 2. A simple perceptron model: it takes multiple inputs (x1, x2, ..., xm), applies a weight (w1, w2, ..., wm) to each input, and computes the weighted sum (x1·w1 + x2·w2 + ... + xm·wm). This sum is passed through the neuron's activation function, and the result of that function is the neuron's output.

A Threshold Logic Unit (TLU) is an artificial neuron that computes a weighted sum of its inputs and then applies a step function to that sum.
A common step function used in perceptrons (assuming a threshold of 0) is the Heaviside step function: heaviside(z) = 0 if z < 0, and heaviside(z) = 1 if z >= 0.

Scikit-Learn provides a Perceptron class that implements a single-TLU network; here we apply it to the Iris flower dataset:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris(as_frame=True)
X = iris.data[["petal length (cm)", "petal width (cm)"]].values
y = (iris.target == 0) # Iris setosa

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

X_new = [[2, 0.5], [3, 1]]
y_pred = per_clf.predict(X_new) # predicts True and False for these 2 flowers


Perceptrons do not output a class probability; rather, they make predictions based on a hard threshold. 

The Multilayer Perceptron and Backpropagation

A multilayer perceptron (MLP) is composed of one (passthrough) input layer, one or more layers of TLUs, called hidden layers, and one final layer of TLUs called the output layer (see Figure 3). The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

Figure 3. The architecture of a Multilayer Perceptron with three inputs, one hidden layer of four neurons, and two output neurons

The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN).

When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations. 

For many years researchers struggled to find a way to train MLPs, without success. But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams introduced the backpropagation training algorithm, which is still used today. In short, it is Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), it computes the gradient of the network's error with regard to every single model parameter.

ANNs operate in two main phases: feedforward and backpropagation. During feedforward, input data is passed through the network, and predictions are made. Then, during backpropagation, the network's error is calculated, and weights are updated to minimize this error. This iterative process continues until the network converges to a satisfactory solution.
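To make these two phases concrete, here is a deliberately tiny sketch (one linear neuron, one training example, all numbers invented for illustration) of a single forward pass followed by a gradient-descent weight update:

# One forward pass and one backward (gradient) update for a single linear neuron
# y_hat = w * x + b with a squared-error loss; the numbers are made up.
x, y = 2.0, 5.0    # one training example
w, b = 0.5, 0.0    # initial parameters
lr = 0.1           # learning rate

# Forward pass: compute the prediction and the loss
y_hat = w * x + b          # 1.0
loss = (y_hat - y) ** 2    # 16.0

# Backward pass: gradients of the loss with respect to each parameter
grad_w = 2 * (y_hat - y) * x   # -16.0
grad_b = 2 * (y_hat - y)       # -8.0

# Gradient Descent step: move the parameters against the gradient
w -= lr * grad_w   # w becomes 2.1
b -= lr * grad_b   # b becomes 0.8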
Figure 4. Feedforward and backpropagation in a neural network

Regression MLPs

First, MLPs can be used for regression tasks. If you want to predict a single value (e.g., the price of a house, given many of its features), then you just need a single output neuron: its output is the predicted value. For multivariate regression (i.e., to predict multiple values at once), you need one output neuron per output dimension. For example, to locate the center of an object in an image, you need to predict 2D coordinates, so you need two output neurons.

The loss function to use during training is typically the mean squared error [MSE].
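As a rough sketch (the layer sizes and dataset names below are illustrative, not from the text), a regression MLP for predicting a single value could be built like this:

import tensorflow as tf

# Illustrative regression MLP: two hidden layers and a single output neuron.
# X_train_reg and y_train_reg are placeholder names for some regression dataset.
reg_model = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1)   # no activation: the output can take any value
])
reg_model.compile(loss="mse", optimizer="sgd")
# reg_model.fit(X_train_reg, y_train_reg, epochs=20)

For multivariate regression, you would simply change the output layer to Dense(n_outputs).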

Classification MLPs

MLPs can also be used for classification tasks. For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class. The estimated probability of the negative class is equal to one minus that number.

MLPs can also easily handle multilabel binary classification tasks. For example, you could have an email classification system that predicts whether each incoming mail is ham or spam and simultaneously predicts whether it is an urgent or nonurgent email. In this case, you would need two output neurons, both using the logistic activation function.

If each instance can belong only to a single class, out of three or more possible classes (e.g., classes 0 through 9 for digit image classification), then you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer. The softmax function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1. This is called multiclass classification.

Regarding the loss function, since we are predicting probability distributions, the cross-entropy loss is generally a good choice.
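As a hedged sketch (the layer sizes are illustrative, not from the text), a binary classifier ends with a single sigmoid neuron and uses the binary cross-entropy loss:

import tensorflow as tf

# Illustrative binary classification MLP
binary_clf = tf.keras.Sequential([
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")   # estimated probability of the positive class
])
binary_clf.compile(loss="binary_crossentropy", optimizer="sgd", metrics=["accuracy"])

# For multilabel classification with two labels, the output layer would instead be
# Dense(2, activation="sigmoid"); for multiclass classification with 10 classes,
# Dense(10, activation="softmax") with the "sparse_categorical_crossentropy" loss
# (when the labels are integer class indices).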

2. Implementing MLPs with Keras

Keras is a high-level Deep Learning API that allows you to easily build, train, evaluate, and execute all sorts of neural networks.

Building an image classifier using the Sequential API

First, we need to load a dataset. In this section, we will tackle Fashion MNIST, which is a drop-in replacement for MNIST. It has the exact same format as MNIST (70,000 grayscale images of 28 x 28 pixels each, with 10 classes), but the images represent fashion items rather than handwritten digits.

Using Keras to load the dataset
Keras provides some utility functions to fetch and load common datasets, including Fashion MNIST. Let's load Fashion MNIST:
import tensorflow as tf

fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist

When loading Fashion MNIST with Keras, every image comes as a 28 x 28 array rather than a 1D array of size 784. Moreover, the pixel intensities are represented as integers (from 0 to 255) rather than floats. Let's take a look at the shape and data type of the training set:
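X_train_full.shape   # (60000, 28, 28)
X_train_full.dtype   # dtype('uint8')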

The dataset is already split into a training set and a test set, but there is no validation set, so we'll create one now.
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]

Additionally, since we are going to train the neural network using Gradient Descent, we must scale the input features. For simplicity, we'll scale the pixel intensities down to the 0-1 range by dividing them by 255.0 (this also converts them to floats).
X_train, X_valid, X_test = X_train / 255., X_valid / 255., X_test / 255.

Unlike MNIST, where each label is simply the digit itself, Fashion MNIST needs a list of class names so we know what we are dealing with:
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle Boot"]

For example, the first image in the training set represents an ankle boot.

Figure 5 shows some samples from the Fashion MNIST dataset.
Figure 5. Samples from Fashion MNIST

Creating the model using the Sequential API
Now let's build the neural network! Here is a classification MLP with two hidden layers:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
model.add(tf.keras.layers.Dense(300, activation="relu"))
model.add(tf.keras.layers.Dense(100, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))

Let's go through this code line by line:
  1. The first line creates a Sequential model. This is the simplest kind of Keras model for neural networks that are composed of a single stack of layers connected sequentially.
  2. Next, we build the first layer and add it to the model. It is a Flatten layer whose role is to convert each input image into a 1D array. Since it is the first layer in the model, you should specify the input_shape (which does not include the batch size, only the shape of the instances).
  3. Next, we add a Dense hidden layer with 300 neurons. It will use the ReLU activation function. Each Dense layer manages its weight matrix, containing all the connection weights between the neurons and their inputs. It also manages a vector of bias terms (one per neuron).
  4. Next, we add a second Dense hidden layer with 100 neurons.
  5. Finally, we add a Dense output layer with 10 neurons (one per class), using the softmax activation function.

Compiling the model
After a model is created, you must call its compile() method to specify the loss function and the optimizer to use. Here we use the "sparse_categorical_crossentropy" loss because the labels are integer class indices (not one-hot vectors) and the classes are mutually exclusive, along with plain Stochastic Gradient Descent and accuracy as an extra metric:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="sgd",
              metrics=["accuracy"])

Training and evaluating the model
Now the model is ready to be trained. For this, we simply need to call its fit() method, training for 30 epochs and passing along the validation set:
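history = model.fit(X_train, y_train, epochs=30,
                    validation_data=(X_valid, y_valid))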


We pass it the input features (X_train) and the target classes (y_train), as well as the number of epochs to train. We also pass a validation set (this is optional). Keras will measure the loss and the extra metrics on this set at the end of each epoch, which is very useful to see how well the model really performs.  If the performance on the training set is much better than on the validation set, your model is overfitting the training set.

And that's it! The neural network is trained. You can see that the training loss went down, which is a good sign, and the validation accuracy reached 88.86% after 30 epochs. That's not far from the training accuracy, so there does not seem to be much overfitting.

You can plot the learning curves shown in Figure 6 using the History object returned by the fit() method. The History object contains the training parameters, the list of epochs it went through, and the loss and extra metrics measured at the end of each epoch.
Figure 6. Learning curves: the mean training loss and accuracy measured over each epoch, and the mean validation loss and accuracy measured at the end of each epoch

The validation curves are close to the training curves, which means that there is not too much overfitting. In this particular case, the model looks like it performed better on the validation set than on the training set at the beginning of the training. But that's not the case: indeed, the validation error is computed at the end of each epoch, while the training error is computed using a running mean during each epoch. So the training curve should be shifted by half an epoch to the left.

Once you are satisfied with your model's validation accuracy, you should evaluate it on the test set to estimate the generalization error before you deploy the model to production. You can easily do this using the evaluate() method.
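For example:

model.evaluate(X_test, y_test)   # returns the test loss and the test accuracy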

Using the model to make predictions
Next, we can use the model's predict() method to make predictions on new instances. Since we don't have actual new instances, we will just use the first three instances of the test set:
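X_new = X_test[:3]
y_proba = model.predict(X_new)
y_proba.round(2)   # one row of 10 class probabilities per image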

As you can see, for each instance the model estimates one probability per class, from class 0 to class 9. For example, for the first image, it estimates that the probability of class 9 (ankle boot) is 96%, the probability of class 5 (sandal) is 1%, the probability of class 7 (sneaker) is 3%, and the probabilities of the other classes are negligible. In other words, it "believes" that the first image is footwear, most likely ankle boots but possibly sandals or sneakers.

Similarly, for the 2nd and 3rd instances, the model believes they belong to class 2 (Pullover) and class 1 (Trouser), respectively.
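If you only care about the most likely class for each instance, you can take the argmax of the estimated probabilities (a small illustrative step, reusing the class_names list defined earlier):

import numpy as np

y_pred = y_proba.argmax(axis=-1)   # e.g., array([9, 2, 1])
np.array(class_names)[y_pred]      # ['Ankle Boot', 'Pullover', 'Trouser']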


Here, the classifier actually classified all three images correctly (Figure 7).
Figure 7. Correctly classified Fashion MNIST images

Saving and Restoring a Model

When using the Sequential API, saving a trained Keras model is simple:
model.save("my_keras_model", save_format="tf")

With save_format="tf", Keras uses TensorFlow's SavedModel format to save both the model's architecture and the values of all the model parameters for every layer.

Loading the model is just as easy:
model = tf.keras.models.load_model("my_keras_model")

But what if training lasts several hours? This is quite common, especially when training on large datasets. In this case, you should not only save your model at the end of training but also save checkpoints at regular intervals during training, to avoid losing everything if your computer crashes. This is where callbacks, explained below, come in.

Using Callbacks

The fit() method accepts a callbacks argument that lets you specify a list of objects that Keras will call at the start and end of training, at the start and end of each epoch, and even before and after processing each batch.

For example, the ModelCheckpoint callback saves checkpoints of your model at regular intervals during training, by default at the end of each epoch:
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_keras_model.h5",
                                                   save_weights_only=True)

history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb])

The EarlyStopping callback is another callback that interrupts training when it measures no progress on the validation set for a number of epochs (defined by the patience argument), and it can optionally roll back to the best model. You can combine it with the ModelCheckpoint callback by passing both in the callbacks list.

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     restore_best_weights=True)

history = model.fit(X_train, y_train, epochs=100,
                    validation_data=(X_valid, y_valid),
                    callbacks=[checkpoint_cb, early_stopping_cb])

The number of epochs can be set to a large value since training will stop automatically when there is no more progress.

3. Fine-tuning a Neural Network

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. How do you know what combination of hyperparameters is the best for your task? One option is to simply try many combinations of hyperparameters and see which one works best on the validation set (or use K-fold cross-validation). For example, we can use GridSearchCV or RandomizedSearchCV to explore the hyperparameter space.
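As one hedged sketch of this idea (using the Keras Tuner library rather than the Scikit-Learn search classes named above; the search ranges are illustrative), a random search over the number of hidden layers, the number of neurons, and the learning rate could look like this:

import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # Build an MLP whose depth, width, and learning rate are sampled by the tuner
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=[28, 28]))
    for _ in range(hp.Int("n_hidden", min_value=1, max_value=3)):
        model.add(tf.keras.layers.Dense(hp.Int("n_neurons", 32, 256), activation="relu"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    learning_rate = hp.Float("learning_rate", 1e-4, 1e-2, sampling="log")
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
                  metrics=["accuracy"])
    return model

tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=5)
tuner.search(X_train, y_train, epochs=5, validation_data=(X_valid, y_valid))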

Number of Hidden Layers

For many problems, you can begin with a single hidden layer and get reasonable results: an MLP with just one hidden layer can theoretically model even the most complex functions, provided it has enough neurons. For complex problems, however, deep networks are far more parameter-efficient than shallow ones. Real-world data is often structured in a hierarchical way, and deep neural networks automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces).

Not only does this hierarchical architecture help DNNs converge faster to a good solution, but it also improves their ability to generalize to new datasets. 

Number of Neurons per hidden layer

The number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 * 28 = 784 input neurons and 10 output neurons.

Just like the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. But in practice, it's often simpler and more efficient to pick a model with more layers and neurons than you actually need, and then use early stopping and other regularization techniques to prevent it from overfitting.

Learning Rate, Batch Size, and Other Hyperparameters

Learning rate
One way to find a good learning rate is to train the model for a few hundred iterations, starting with a very low learning rate (e.g., 10^(-5)) and gradually increasing it up to a very large value (e.g., 10). This is done by multiplying the learning rate by a constant factor at each iteration.
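A minimal sketch of such a sweep (the callback below is an illustrative helper, not a built-in Keras class): multiply the learning rate by a constant factor after each training batch and record the loss, then plot the loss against the learning rate and pick a value a bit below the point where the loss starts to climb.

import tensorflow as tf

class ExponentialLearningRate(tf.keras.callbacks.Callback):
    # Grows the learning rate geometrically after each batch and records the loss
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.rates = []
        self.losses = []

    def on_batch_end(self, batch, logs=None):
        lr = self.model.optimizer.learning_rate
        self.rates.append(float(tf.keras.backend.get_value(lr)))
        self.losses.append(logs["loss"])
        tf.keras.backend.set_value(lr, self.rates[-1] * self.factor)

# Example: go from 1e-5 up to 10 over roughly 500 iterations
# expon_lr = ExponentialLearningRate(factor=(10 / 1e-5) ** (1 / 500))
# history = model.fit(X_train, y_train, epochs=1, callbacks=[expon_lr])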

Optimizer
Choosing a better optimizer than plain old Mini-batch Gradient Descent (and tuning its hyperparameters) is also quite important.

Batch size
The batch size can have a significant impact on your model's performance and training time. One strategy is to try to use a large batch size, using a learning rate warm-up, and if training is unstable or the final performance is disappointing, then try using a small batch size instead.

Activation functions
In general, the ReLU activation function will be a good default for all hidden layers. For the output layer, it really depends on your task.

Number of iterations
In most cases, the number of training iterations does not actually need to be tweaked: just use early stopping instead.

4. Conclusion

Artificial Neural Networks are a fundamental component of modern machine learning and deep learning. Keras, with its easy-to-use API, simplifies the process of building, training, and evaluating ANNs. As you delve deeper into the world of deep learning, you can explore more advanced concepts like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for tasks such as image recognition and natural language processing. 

Stay tuned for more interesting topics on machine learning!
