Machine Learning is revolutionizing how we process and analyze data, and one of its fundamental tasks is classification. Classification is the process of categorizing data points into predefined classes or categories based on their features. It has many applications, from spam email detection and image recognition to medical diagnosis and sentiment analysis. In this blog, we will dive into the world of classification in machine learning, exploring its key concepts and algorithms, and building a machine learning model using the MNIST dataset.
Understanding Classification
At its core, classification involves assigning labels or categories to data points. These labels can be binary (yes/no, true/false) or multi-class (such as identifying different types of fruits). To perform classification, machine learning models use patterns and relationships within the data to make predictions about the class labels of new, unseen data points.
Key concepts in Classification:
- Features: Features are the characteristics or attributes of data points that the model uses to make predictions. In image classification, for example, features might include pixel values, color histograms, or texture patterns.
- Labels: Labels are the categories or classes that the data point belongs to. In a spam email classifier, the labels could be "spam" and "not spam".
- Training Data: This is the labeled dataset used to train the classification model. It consists of data points with known features and corresponding labels.
- Test Data: Test data is used to evaluate the model's performance. It contains data points whose labels are withheld from the model, and the model's predictions are compared to the true labels to assess accuracy.
Classification tasks come with their own set of challenges, including:
- Imbalanced Data: When one class is significantly more prevalent than others, the model may be biased towards the majority class.
- Feature Engineering: Selecting the right features and preprocessing the data are crucial for model performance.
- Overfitting: Models may perform well on the training data but poorly on new data if they overfit, meaning they've learned noise in the training set.
- Evaluation Metrics: Choosing appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score) depends on the specific problem and dataset.
Table of Contents:
- Understanding the Dataset: MNIST
- Training a Binary Classifier
- Performance Measures
- Measuring Accuracy Using Cross-Validation
- Confusion Matrix
- Precision and Recall
- Precision/Recall Trade-off
- The ROC Curve
- Multiclass Classification
- Error Analysis
- Multilabel Classification
- Multioutput Classification
- Conclusion
The full notebook for this blog can be found here.
1. Understanding the Dataset: MNIST
In this blog, we will be using the MNIST dataset, which is a set of 70,000 small images of handwritten digits. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the "hello world" of Machine Learning.
Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The following code fetches the MNIST dataset:
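A minimal sketch using the fetch_openml() function (assuming a reasonably recent Scikit-Learn; as_frame=False returns NumPy arrays instead of a pandas DataFrame):

from sklearn.datasets import fetch_openml

# Download MNIST from OpenML (cached locally after the first call)
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
X.shape  # (70000, 784)
y.shape  # (70000,)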
There are 70,000 images, and each image has 784 features. This is because each image has dimensions of 28 x 28, and each feature represents one pixel's intensity, from 0 (white) to 255 (black).
Let's take a peek at one digit from the dataset. All you need to do is grab one instance's feature vector, reshape it into a 28 x 28 array, and display it using matplotlib's imshow() function:
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()
Note that the labels are strings. Most ML algorithms expect numbers, so let's convert y to integers:
import numpy as np
y = y.astype(np.uint8)
Figure 1. Digits from the MNIST dataset
We need to create a test set and set it aside before inspecting the data closely. The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
The training set is already shuffled for us, which is good because this guarantees that all cross-validation folds will be similar (you don't want one fold to be missing some digits).
2. Training a Binary Classifier
Let's simplify the problem for now and train a binary classifier that identifies a single digit, for example the digit 5. This "5-detector" distinguishes between just two classes: 5 and not-5.
Let's create the target vectors for this classification task:
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)
Let's pick a classifier and train it. A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class.
The SGD classifier has the advantage of being capable of handling very large datasets efficiently. It deals with training instances independently, one at a time, which makes it well suited for online learning.
Training the classifier on the whole training dataset:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
Now we can use it to detect images of the number 5:
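For example, checking it on the digit we displayed earlier (which is in fact a 5):

sgd_clf.predict([some_digit])
# array([ True])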
The classifier guesses that this image represents a 5 (True). It guessed right in this particular case! Now let's evaluate this model's performance.
3. Performance Measures
Evaluating a classifier is often significantly trickier than evaluating a regressor.
Measuring Accuracy Using Cross-Validation
Let's use the cross_val_score() function to evaluate our SGDClassifier model, using K-fold cross-validation with three folds.
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
In each fold, the accuracy is above 95%! At first glance this suggests the model is performing very well. But accuracy is generally not a reliable metric for evaluating a classifier, especially when you are dealing with skewed datasets (where one class has far more data points than the rest). Only about 10% of the images in the training set are 5s, so a model that blindly predicted "not 5" every single time would still be right about 90% of the time.
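To see this concretely, here is a sketch using Scikit-Learn's DummyClassifier, which in "most_frequent" mode always predicts the majority class ("not 5"):

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# roughly 0.90 accuracy in every fold, despite never detecting a single 5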
Confusion Matrix
A much better way to evaluate the performance of a classifier is to look at the confusion matrix.
A confusion matrix provides a clear and concise summary of a model's predictions by showing the count of true positive, true negative, false positive, and false negative predictions.
- True positives (TP) represent correctly predicted positive instances,
- True negatives (TN) are correctly predicted negative instances,
- False positives (FP) are instances wrongly predicted as positive, and
- False negatives (FN) are instances wrongly predicted as negative.
Analyzing these values helps in assessing the model's accuracy, precision, recall, and other performance metrics, enabling data scientists to gain valuable insights into the model's strengths and weaknesses.
To compute the confusion matrix, you first need to have a set of predictions so that they can be compared to the actual targets.
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
In the above code, cross_val_predict() performs K-fold cross-validation and returns the predictions made on each fold. Then just pass the target classes (y_train_5) and the predicted classes (y_train_pred) to the confusion_matrix() function:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train_5, y_train_pred)
cm
A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):
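For example, comparing the labels against themselves simulates perfect predictions and yields a purely diagonal matrix:

y_train_perfect_predictions = y_train_5  # pretend we reached perfection
confusion_matrix(y_train_5, y_train_perfect_predictions)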
Precision, Recall, and F1-score
Precision and Recall are two crucial performance metrics for evaluating the effectiveness of classification models, especially in scenarios where imbalanced classes or different costs for false positives and false negatives are at play.
Precision measures the accuracy of positive predictions made by a model. A high precision indicates that when the model predicts a positive outcome, it is likely to be correct, minimizing false positives. Precision is essential in situations where false positives have high costs, such as in medical diagnoses or fraud detection.
On the other hand, recall, also known as sensitivity or true positive rate, gauges the model's ability to capture all positive instances.
High recall means the model is effective at identifying most of the positive cases, reducing false negatives. Recall is vital in scenarios where missing a positive instance is costly, like in search and rescue operations or disease detection.
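Scikit-Learn provides functions for both metrics; a quick sketch using the predictions computed above:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)     # TP / (TP + FN)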
It is often convenient to combine precision and recall into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both precision and recall are high.
To compute the F1 score, simply call the f1_score() function:
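from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)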
Precision/Recall Trade-off
The precision/recall tradeoff is a fundamental concept in machine learning that reflects the delicate balance between two critical performance metrics: precision and recall. Increasing precision typically comes at the cost of reduced recall, and vice versa.
When we adjust the threshold for classification in a model, we affect its ability to make positive predictions. A higher threshold tends to improve precision by making the model more conservative in its positive predictions, but it may miss some true positive cases, resulting in lower recall. Conversely, a lower threshold increases recall by capturing more positive instances but may also lead to false positives, reducing precision. Striking the right balance depends on the specific goals and requirements of the project.
Coming back to our problem:
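Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores it uses to make predictions. A sketch on our example digit (the SGDClassifier's predict() method effectively uses a threshold of 0):

y_scores = sgd_clf.decision_function([some_digit])
threshold = 0
y_some_digit_pred = (y_scores > threshold)  # same result as calling predict()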
Let's raise the threshold:
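threshold = 8000  # an arbitrary, deliberately high value for illustration
y_some_digit_pred = (y_scores > threshold)
# The 5 is now missed (False): raising the threshold reduced recall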
How do you decide which threshold to use? First, use the cross_val_predict() function to get the scores of all instances in the training set, but this time specify that you want to return decision scores instead of predictions:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5,
cv=3, method='decision_function')
With these scores, use the precision_recall_curve() function to compute precision and recall for all possible thresholds:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
Now, plot precision and recall against the decision threshold.
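A minimal matplotlib sketch (precision_recall_curve() returns one more precision and recall value than thresholds, hence the [:-1] slicing):

plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
plt.xlabel('Threshold')
plt.legend()
plt.show()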
Figure 3. Precision and recall vs. the decision threshold
Another way to select a good precision/recall trade-off is to plot precision directly against recall.
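A quick sketch of that plot:

plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()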
You can see that precision really starts to fall sharply around 80% recall. You might want to select a precision-recall trade-off just before that drop - for example, at around 60% recall. But of course, the choice depends on your project.
The ROC Curve
The Receiver Operating Characteristic (ROC) Curve is a graphical representation used to assess the performance of binary classification models. It displays the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the classification threshold is varied. A well-constructed ROC curve shows how well a model can distinguish between the positive and negative classes, with an ideal model having a curve that hugs the top-left corner (high sensitivity and low false positive rate).
The area under the ROC curve (AUC-ROC) is a common metric used to quantify the overall performance of a model, where higher values indicate better discrimination.
Data scientists and machine learning practitioners use ROC curves and AUC-ROC to compare and select the most suitable model for various applications, particularly when handling imbalanced datasets or when the cost of false positives and false negatives varies.
To plot the ROC curve, you first use the roc_curve() function to compute the TPR and FPR for various threshold values:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
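Then plot the FPR against the TPR with matplotlib; a minimal sketch, with a diagonal reference line representing a purely random classifier:

plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.legend()
plt.show()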
Figure 5. This ROC curve plots the false positive rate against the true positive rate for all possible thresholds; the black circle highlights the chosen ratio
A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Your model's ROC AUC will lie in between; the higher, the better.
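Scikit-Learn can compute it directly:

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)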
Let's now train a RandomForestClassifier and compare its ROC curve and ROC AUC score to those of the SGDClassifier. First, you need to get scores for each instance in the training set.
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
method='predict_proba')
The roc_curve() function expects labels and scores, but instead of scores you can give it class probabilities. Let's use the positive class's probability as the score:
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
As you can see in Figure 6, the RandomForestClassifier's ROC curve looks much better than the SGDClassifier's: it comes much closer to the top-left corner. Its ROC AUC score is computed below:
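roc_auc_score(y_train_5, y_scores_forest)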
Let's look at its precision and recall as well:
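Assuming the usual 50% probability threshold to turn the predicted probabilities into hard predictions:

y_train_pred_forest = (y_probas_forest[:, 1] >= 0.5)
precision_score(y_train_5, y_train_pred_forest)
recall_score(y_train_5, y_train_pred_forest)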
4. Multiclass Classification
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.
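Scikit-Learn detects a multiclass task automatically when you train a classifier on targets with more than two classes. As a sketch, we can train the same kind of SGDClassifier on the full ten-class targets (this may take a while on 60,000 images):

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5: all ten digit classes
sgd_clf.predict([some_digit])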
If you call the decision_function() method, you will see that it returns 10 scores per instance (instead of just 1). That's one score per class:
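some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores.shape  # (1, 10): one score per class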
5. Error Analysis
Try out multiple models (shortlisting the best ones and fine-tuning their hyperparameters using GridSearchCV), and automate as much as possible. Here, we will assume that you have found a promising model and you want to find ways to improve it. One way to do this is to analyze the types of errors it makes.
Plotting the confusion matrix below:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler

# Scaling the inputs helps SGD converge and improves accuracy
X_train_scaled = StandardScaler().fit_transform(X_train.astype("float64"))
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        normalize="true", values_format=".0%")
This confusion matrix looks pretty good, since most images are on the main diagonal, which means they are classified correctly.
Let's focus the plot on the errors. First, you need to divide each value in the confusion matrix by the number of images in the corresponding class so that you can compare error rates instead of absolute numbers of errors.
Then fill the diagonal with zeros to keep only the errors, and plot the result:
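A sketch of that computation with NumPy and matplotlib, starting from the unnormalized confusion matrix:

conf_mx = confusion_matrix(y_train, y_train_pred)
row_sums = conf_mx.sum(axis=1, keepdims=True)  # number of images per actual class
norm_conf_mx = conf_mx / row_sums              # error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)              # keep only the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()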
You can clearly see the kinds of errors the classifier makes. Remember that rows represent actual classes, while columns represent predicted classes.
The column for class 8 is quite bright, which tells you that many images are misclassified as 8s (65%). However, the row for class 8 is not that bad, telling you that actual 8s in general get properly classified as 8s. You can also see that 3s and 5s often get confused (in both directions).
Looking at this plot, it seems that you should focus on reducing the false 8s. For example, you can try to gather more training data for digits that look like 8s (but are not 8s) so that the classifier can learn to distinguish them from real 8s.
Let's plot examples of 3s and 5s:
cl_a, cl_b = 3, 5  # integer labels, since y was converted to uint8 earlier
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]  # 3s misclassified as 5s
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]  # correctly classified 3s
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]  # 5s misclassified as 3s
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]  # correctly classified 5s
Figure 7. The model confusing 3s and 5s
6. Multilabel Classification
Multilabel classification is a machine learning task where a model assigns multiple labels or categories to each data point, as opposed to traditional single-label classification where each data point belongs to a single category. This type of classification is prevalent in various real-world scenarios, such as tagging content, topic labeling in text, and object recognition in images, where multiple labels may be relevant to describe the data accurately.
Consider a face-recognition classifier: what should it do if it recognizes several people in the same picture? It should attach one tag per person it recognizes. Say the classifier has been trained to recognize three faces: Alice, Bob, and Charlie. Then when the classifier is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning "Alice yes, Bob no, Charlie yes"). In other words, it assigns two labels to that particular image, identifying Alice and Charlie.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
This code creates a y_multilabel array containing two target labels for each digit image: the first indicates whether or not the digit is large (7, 8, or 9), and the second indicates whether or not it is odd. The next lines create a KNeighborsClassifier instance (which supports multilabel classification), and we train it using the multiple targets array.
Now let's make a prediction, and notice that the output contains two labels:
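knn_clf.predict([some_digit])
# array([[False,  True]]): the digit 5 is not large (not >= 7), but it is odd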
Let's evaluate the multilabel classifier by measuring the F1 score averaged across all labels.
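A sketch that averages the F1 score across the two labels (average="macro" assumes all labels are equally important; note that this cross-validation is slow for KNN on 60,000 images):

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")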
7. Multioutput Classification
The last type of classification that we are going to discuss here is called multioutput-multiclass classification (or simply multioutput classification). It is a specialized machine learning task where the goal is to predict multiple output variables for each input data point, where each output variable can take on multiple possible values, as in multiclass classification.
However, in multioutput classification, the output variables are not independent and there may be relationships or dependencies among them. This task is often encountered in various domains, including image processing, natural language processing, and remote sensing, where a single input may correspond to multiple distinct output variables or attributes.
To illustrate this, let's build a system that removes noise from images. It will take as input a noisy digit image, and it will (hopefully) output a clean digit image, represented as an array of pixel intensities, just like the MNIST images. Notice that the classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255). It is thus an example of a multioutput classification system.
Let's start by creating the training and test sets by taking the MNIST images and adding noise to their pixel intensities with NumPy's randint() function. The target images will be the original images:
np.random.seed(42)
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test
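The plot_digit() helper used below isn't defined in this post; a minimal sketch consistent with the imshow() code from earlier:

def plot_digit(image_data):
    # Reshape the 784-pixel vector back into a 28 x 28 image and display it
    plt.imshow(image_data.reshape(28, 28), cmap="binary")
    plt.axis("off")
    plt.show()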
Now let's train the classifier and make it clean this image:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
plot_digit(clean_digit)
8. Conclusion
Classification is a fundamental task in machine learning with widespread applications across various domains. Understanding the key concepts, algorithms, and challenges associated with classification is essential for building accurate and effective models.
Stay tuned for more interesting topics on machine learning!