Machine Learning is revolutionizing how we process and analyze data, and one of its fundamental tasks is classification. Classification is the process of categorizing data points into predefined classes or categories based on their features. It has many applications, from spam email detection and image recognition to medical diagnosis and sentiment analysis. In this blog, we will dive into the world of classification in machine learning, exploring its key concepts and algorithms, and building a machine learning model using the MNIST dataset.
Understanding Classification
At its core, classification involves assigning labels or categories to data points. These labels can be binary (yes/no, true/false) or multi-class (such as identifying different types of fruits). To perform classification, machine learning models use patterns and relationships within the data to make predictions about the class labels of new, unseen data points.
Key concepts in Classification:
- Features: Features are the characteristics or attributes of data points that the model uses to make predictions. In image classification, for example, features might include pixel values, color histograms, or texture patterns.
- Labels: Labels are the categories or classes that the data point belongs to. In a spam email classifier, the labels could be "spam" and "not spam".
- Training Data: This is the labeled dataset used to train the classification model. It consists of data points with known features and corresponding labels.
- Test Data: Test data is used to evaluate the model's performance. It contains data points whose labels are withheld from the model, and the model's predictions are compared to the true labels to assess accuracy.
Classification tasks come with their own set of challenges, including:
- Imbalanced Data: When one class is significantly more prevalent than others, the model may be biased towards the majority class.
- Feature Engineering: Selecting the right features and preprocessing the data are crucial for model performance.
- Overfitting: Models may perform well on the training data but poorly on new data if they overfit, meaning they've learned noise in the training set.
- Evaluation Metrics: Choosing appropriate evaluation metrics (e.g., accuracy, precision, recall, F1-score) depends on the specific problem and dataset.
Table of Contents:
- Understanding the Dataset: MNIST
- Training a Binary Classifier
- Performance Measures
- Measuring Accuracy Using Cross-Validation
- Confusion Matrix
- Precision and Recall
- Precision/Recall Trade-off
- The ROC Curve
- Multiclass Classification
- Error Analysis
- Multilabel Classification
- Multioutput Classification
- Conclusion
The full notebook for this blog can be found here.
1. Understanding the Dataset: MNIST
In this blog, we will be using the MNIST dataset, which is a set of 70,000 small images of handwritten digits. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the "hello world" of Machine Learning.
Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The following code fetches the MNIST dataset:
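A minimal sketch using the fetch_openml() function (assuming a reasonably recent Scikit-Learn; as_frame=False returns NumPy arrays instead of a pandas DataFrame):

from sklearn.datasets import fetch_openml

# Download MNIST from OpenML (cached locally after the first call)
mnist = fetch_openml('mnist_784', as_frame=False)
X, y = mnist.data, mnist.target
X.shape  # (70000, 784)
y.shape  # (70000,)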
There are 70,000 images, and each image has 784 features. This is because each image has dimensions of 28 x 28, and each feature represents one pixel's intensity, from 0 (white) to 255 (black).
Let's take a peek at one digit from the dataset. All you need to do is grab one instance's feature vector, reshape it into a 28 x 28 array, and display it using matplotlib's imshow() function:
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
plt.show()
Note that the labels are strings. Most ML algorithms expect numbers, so let's convert y to integers:
import numpy as np
y = y.astype(np.uint8)
Figure 1. Digits from the MNIST dataset
We need to create a test set and set it aside before inspecting the data closely. The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
The training set is already shuffled for us, which is good because this guarantees that all cross-validation folds will be similar (you don't want one fold to be missing some digits).
2. Training a Binary Classifier
Let's simplify the problem for now and train a binary classifier that identifies a single digit, for example the digit 5. This "5-detector" distinguishes between just two classes: 5 and not-5.
Let's create the target vectors for this classification task:
y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits
y_test_5 = (y_test == 5)
Let's pick a classifier and train it. A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class.
The SGD classifier has the advantage of being capable of handling very large datasets efficiently. It deals with training instances independently, one at a time, which makes it well suited for online learning.
Training the classifier on the whole training dataset:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
Now we can use it to detect images of the number 5:
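For example, checking it on the digit we displayed earlier (which is in fact a 5):

sgd_clf.predict([some_digit])
# array([ True])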
The classifier guesses that this image represents a 5 (True). It guessed right in this particular case! Now let's evaluate this model's performance.
3. Performance Measures
Evaluating a classifier is often significantly trickier than evaluating a regressor.
Measuring Accuracy Using Cross-Validation
Let's use the cross_val_score() function to evaluate our SGDClassifier model, using K-fold cross-validation with three folds.
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
In each fold, the accuracy is above 95%! At first glance this suggests the model is performing very well. But accuracy is generally not a reliable metric for evaluating a classifier, especially when you are dealing with skewed datasets (where one class has far more data points than the rest). Only about 10% of the images in the training set are 5s, so a model that blindly predicted "not 5" every single time would still be right about 90% of the time.
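To see this concretely, here is a sketch using Scikit-Learn's DummyClassifier, which in "most_frequent" mode always predicts the majority class ("not 5"):

from sklearn.dummy import DummyClassifier

dummy_clf = DummyClassifier(strategy="most_frequent")
cross_val_score(dummy_clf, X_train, y_train_5, cv=3, scoring="accuracy")
# roughly 0.90 accuracy in every fold, despite never detecting a single 5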
Confusion Matrix
A much better way to evaluate the performance of a classifier is to look at the confusion matrix.
A confusion matrix provides a clear and concise summary of a model's predictions by showing the count of true positive, true negative, false positive, and false negative predictions.
- True positives (TP) represent correctly predicted positive instances,
- True negatives (TN) are correctly predicted negative instances,
- False positives (FP) are instances wrongly predicted as positive, and
- False negatives (FN) are instances wrongly predicted as negative.
Analyzing these values helps in assessing the model's accuracy, precision, recall, and other performance metrics, enabling data scientists to gain valuable insights into the model's strengths and weaknesses.
To compute the confusion matrix, you first need to have a set of predictions so that they can be compared to the actual targets.
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
In the above code, cross_val_predict() performs K-fold cross-validation and returns the predictions made on each fold. Then just pass the target classes (y_train_5) and the predicted classes (y_train_pred) to the confusion_matrix() function:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train_5, y_train_pred)
cm
A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):
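For example, comparing the labels against themselves simulates perfect predictions and yields a purely diagonal matrix:

y_train_perfect_predictions = y_train_5  # pretend we reached perfection
confusion_matrix(y_train_5, y_train_perfect_predictions)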
Precision, Recall, and F1-score
Precision and Recall are two crucial performance metrics for evaluating the effectiveness of classification models, especially in scenarios where imbalanced classes or different costs for false positives and false negatives are at play.
Precision measures the accuracy of positive predictions made by a model. A high precision indicates that when the model predicts a positive outcome, it is likely to be correct, minimizing false positives. Precision is essential in situations where false positives have high costs, such as in medical diagnoses or fraud detection.
On the other hand, recall, also known as sensitivity or true positive rate, gauges the model's ability to capture all positive instances.
High recall means the model is effective at identifying most of the positive cases, reducing false negatives. Recall is vital in scenarios where missing a positive instance is costly, like in search and rescue operations or disease detection.
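Scikit-Learn provides functions for both metrics; a quick sketch using the predictions computed above:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)  # TP / (TP + FP)
recall_score(y_train_5, y_train_pred)     # TP / (TP + FN)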
It is often convenient to combine precision and recall into a single metric called the F1 score. The F1 score is the harmonic mean of precision and recall. Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both precision and recall are high.
To compute the F1 score, simply call the f1_score() function:
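from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred)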
Precision/Recall Trade-off
The precision/recall tradeoff is a fundamental concept in machine learning that reflects the delicate balance between two critical performance metrics: precision and recall. Increasing precision typically comes at the cost of reduced recall, and vice versa.
When we adjust the threshold for classification in a model, we affect its ability to make positive predictions. A higher threshold tends to improve precision by making the model more conservative in its positive predictions, but it may miss some true positive cases, resulting in lower recall. Conversely, a lower threshold increases recall by capturing more positive instances but may also lead to false positives, reducing precision. Striking the right balance depends on the specific goals and requirements of the project.
Coming back to our problem:
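Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores it uses to make predictions. A sketch on our example digit (the SGDClassifier's predict() method effectively uses a threshold of 0):

y_scores = sgd_clf.decision_function([some_digit])
threshold = 0
y_some_digit_pred = (y_scores > threshold)  # same result as calling predict()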
Let's raise the threshold:
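threshold = 8000  # an arbitrary, deliberately high value for illustration
y_some_digit_pred = (y_scores > threshold)
# The 5 is now missed (False): raising the threshold reduced recall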
How do you decide which threshold to use? First, use the cross_val_predict() function to get the scores of all instances in the training set, but this time specify that you want to return decision scores instead of predictions:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5,
cv=3, method='decision_function')
With these scores, use the precision_recall_curve() function to compute precision and recall for all possible thresholds:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
Now, plot precision and recall against the decision threshold.
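A minimal matplotlib sketch (precision_recall_curve() returns one more precision and recall value than thresholds, hence the [:-1] slicing):

plt.plot(thresholds, precisions[:-1], 'b--', label='Precision')
plt.plot(thresholds, recalls[:-1], 'g-', label='Recall')
plt.xlabel('Threshold')
plt.legend()
plt.show()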
Figure 3. Precision and recall vs. the decision threshold
Another way to select a good precision/recall trade-off is to plot precision directly against recall.
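A quick sketch of that plot:

plt.plot(recalls, precisions)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()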
You can see that precision really starts to fall sharply around 80% recall. You might want to select a precision-recall trade-off just before that drop - for example, at around 60% recall. But of course, the choice depends on your project.
The ROC Curve
The Receiver Operating Characteristic (ROC) Curve is a graphical representation used to assess the performance of binary classification models. It displays the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) as the classification threshold is varied. A well-constructed ROC curve shows how well a model can distinguish between the positive and negative classes, with an ideal model having a curve that hugs the top-left corner (high sensitivity and low false positive rate).
The area under the ROC curve (AUC-ROC) is a common metric used to quantify the overall performance of a model, where higher values indicate better discrimination.
Data scientists and machine learning practitioners use ROC curves and AUC-ROC to compare and select the most suitable model for various applications, particularly when handling imbalanced datasets or when the cost of false positives and false negatives varies.
To plot the ROC curve, you first use the roc_curve() function to compute the TPR and FPR for various threshold values:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
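Then plot the FPR against the TPR with matplotlib; a minimal sketch, with a diagonal reference line representing a purely random classifier:

plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--', label='Random classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.legend()
plt.show()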
Figure 5. This ROC curve plots the false positive rate against the true positive rate for all possible thresholds; the black circle highlights the chosen ratio
A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Your model's ROC AUC will lie in between; the higher, the better.
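Scikit-Learn can compute it directly:

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)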
Let's now train a RandomForestClassifier and compare its ROC curve and ROC AUC score to those of the SGDClassifier. First, you need to get scores for each instance in the training set.
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
method='predict_proba')
The roc_curve() function expects labels and scores, but instead of scores you can give it class probabilities. Let's use the positive class's probability as the score:
y_scores_forest = y_probas_forest[:, 1]
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
As you can see in Figure 6, the RandomForestClassifier's ROC curve looks much better than the SGDClassifier's: it comes much closer to the top-left corner. Its ROC AUC score is computed below:
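roc_auc_score(y_train_5, y_scores_forest)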
Let's look at its precision and recall as well:
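Assuming the usual 50% probability threshold to turn the predicted probabilities into hard predictions:

y_train_pred_forest = (y_probas_forest[:, 1] >= 0.5)
precision_score(y_train_5, y_train_pred_forest)
recall_score(y_train_5, y_train_pred_forest)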
4. Multiclass Classification
Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.
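Scikit-Learn detects a multiclass task automatically when you train a classifier on targets with more than two classes. As a sketch, we can train the same kind of SGDClassifier on the full ten-class targets (this may take a while on 60,000 images):

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5: all ten digit classes
sgd_clf.predict([some_digit])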
If you call the decision_function() method, you will see that it returns 10 scores per instance (instead of just 1). That's one score per class:
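some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores.shape  # (1, 10): one score per class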
5. Error Analysis
Try out multiple models (shortlisting the best ones and fine-tuning their hyperparameters using GridSearchCV), and automate as much as possible. Here, we will assume that you have found a promising model and you want to find ways to improve it. One way to do this is to analyze the types of errors it makes.
Plotting the confusion matrix below:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler

# Scaling the inputs helps SGD converge and improves accuracy
X_train_scaled = StandardScaler().fit_transform(X_train.astype("float64"))
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
ConfusionMatrixDisplay.from_predictions(y_train, y_train_pred,
                                        normalize="true", values_format=".0%")
This confusion matrix looks pretty good, since most images are on the main diagonal, which means they are classified correctly.
Let's focus the plot on the errors. First, you need to divide each value in the confusion matrix by the number of images in the corresponding class so that you can compare error rates instead of absolute numbers of errors.
Then fill the diagonal with zeros to keep only the errors, and plot the result:
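A sketch of that computation with NumPy and matplotlib, starting from the unnormalized confusion matrix:

conf_mx = confusion_matrix(y_train, y_train_pred)
row_sums = conf_mx.sum(axis=1, keepdims=True)  # number of images per actual class
norm_conf_mx = conf_mx / row_sums              # error rates instead of absolute counts
np.fill_diagonal(norm_conf_mx, 0)              # keep only the errors
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()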
You can clearly see the kinds of errors the classifier makes. Remember that rows represent actual classes, while columns represent predicted classes.
The column for class 8 is quite bright, which tells you that many images are misclassified as 8s (65%). However, the row for class 8 is not that bad, telling you that actual 8s in general get properly classified as 8s. You can also see that 3s and 5s often get confused (in both directions).
Looking at this plot, it seems that you should focus on reducing the false 8s. For example, you can try to gather more training data for digits that look like 8s (but are not 8s) so that the classifier can learn to distinguish them from real 8s.
Let's plot examples of 3s and 5s:
cl_a, cl_b = 3, 5  # integer labels, since y was converted to uint8 earlier
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]  # 3s misclassified as 5s
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]  # correctly classified 3s
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]  # 5s misclassified as 3s
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]  # correctly classified 5s
Figure 7. The model confusing 3s and 5s
6. Multilabel Classification
Multilabel classification is a machine learning task where a model assigns multiple labels or categories to each data point, as opposed to traditional single-label classification where each data point belongs to a single category. This type of classification is prevalent in various real-world scenarios, such as tagging content, topic labeling in text, and object recognition in images, where multiple labels may be relevant to describe the data accurately.
Consider a face-recognition classifier: what should it do if it recognizes several people in the same picture? It should attach one tag per person it recognizes. Say the classifier has been trained to recognize three faces: Alice, Bob, and Charlie. Then when the classifier is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning "Alice yes, Bob no, Charlie yes"). In other words, it assigns two labels to that particular image, identifying Alice and Charlie.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
This code creates a y_multilabel array containing two target labels for each digit image: the first indicates whether or not the digit is large (7, 8, or 9), and the second indicates whether or not it is odd. The next lines create a KNeighborsClassifier instance (which supports multilabel classification), and we train it using the multiple targets array.
Now let's make a prediction, and notice that the output contains two labels:
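knn_clf.predict([some_digit])
# array([[False,  True]]): the digit 5 is not large (not >= 7), but it is odd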
Let's evaluate the multilabel classifier by measuring the F1 score averaged across all labels.
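A sketch that averages the F1 score across the two labels (average="macro" assumes all labels are equally important; note that this cross-validation is slow for KNN on 60,000 images):

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")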
7. Multioutput Classification
The last type of classification that we are going to discuss here is called multioutput-multiclass classification (or simply multioutput classification). It is a specialized machine learning task where the goal is to predict multiple output variables for each input data point, where each output variable can take on multiple possible values, as in multiclass classification.
However, in multioutput classification, the output variables are not independent and there may be relationships or dependencies among them. This task is often encountered in various domains, including image processing, natural language processing, and remote sensing, where a single input may correspond to multiple distinct output variables or attributes.
To illustrate this, let's build a system that removes noise from images. It will take as input a noisy digit image, and it will (hopefully) output a clean digit image, represented as an array of pixel intensities, just like the MNIST images. Notice that the classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255). It is thus an example of a multioutput classification system.
Let's start by creating the training and test sets by taking the MNIST images and adding noise to their pixel intensities with NumPy's randint() function. The target images will be the original images:
np.random.seed(42)
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test
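The plot_digit() helper used below isn't defined in this post; a minimal sketch consistent with the imshow() code from earlier:

def plot_digit(image_data):
    # Reshape the 784-pixel vector back into a 28 x 28 image and display it
    plt.imshow(image_data.reshape(28, 28), cmap="binary")
    plt.axis("off")
    plt.show()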
Now let's train the classifier and make it clean this image:
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[0]])
plot_digit(clean_digit)
8. Conclusion
Classification is a fundamental task in machine learning with widespread applications across various domains. Understanding the key concepts, algorithms, and challenges associated with classification is essential for building accurate and effective models.
Stay tuned for more interesting topics on machine learning!