Machine learning is a powerful tool for extracting insights and making predictions from data, and one of its most intriguing subfields is unsupervised machine learning. Unlike supervised learning, where the model is trained on labeled data to predict specific outcomes, unsupervised learning involves working with unlabeled data to uncover hidden patterns and structures.
In this blog, we will delve into the fascinating world of unsupervised machine learning, exploring its key concepts, techniques, and real-world applications.
Table of Contents
- Fundamental Techniques in Unsupervised Learning
- Clustering: K-Means
- The K-Means algorithm
- Centroid initialization methods
- Finding the optimal number of clusters
- Limits of K-Means
- Using Clustering for Image Segmentation
- DBSCAN
- Conclusion
The Jupyter Notebook for this blog can be found here.
1. Fundamental Techniques in Unsupervised Learning
The primary goal of unsupervised learning is to perform tasks such as:
- Clustering: Grouping data points into clusters based on their similarities. Standard algorithms include K-Means, Hierarchical Clustering, and DBSCAN.
- Dimensionality Reduction: Reducing the number of features or dimensions in a dataset while preserving its essential information. Principal Component Analysis (PCA) is a popular technique.
- Anomaly Detection: Identifying unusual or outlier points that deviate significantly from the majority of the data. Techniques such as One-Class SVM are commonly used.
- Association Rule Mining: Discovering interesting relationships and patterns in transactional data, often used in market basket analysis.
- Density Estimation: Estimating the probability density function of the data, which can be useful in statistical analysis.
2. Clustering: K-Means
Clustering is aimed at grouping data points with similar characteristics into clusters. By identifying patterns and structures in unlabeled data, clustering algorithms enable us to gain insights, discover relationships, and make data-driven decisions.
Figure 1 visualizes the scatter plots of the Iris dataset:
| Figure 1. Scatter plots of the Iris dataset |
Consider another unlabeled dataset represented in Figure 2: you can clearly see five blobs of instances. K-Means is a simple algorithm capable of clustering this kind of dataset very quickly and efficiently, often in just a few iterations. It aims to partition a dataset into K distinct, non-overlapping clusters.
| Figure 2. An unlabeled dataset composed of five blobs of instances |
Let's train a K-Means clusterer on this dataset. It will try to find each blob's center and assign each instance to the closest blob:
from sklearn.cluster import KMeans

k = 5  # number of clusters the algorithm must find
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)  # X holds the unlabeled instances shown in Figure 2
Note that you have to specify the number of clusters k that the algorithm must find. In this example, it is pretty obvious from looking at the data that k should be set to 5, but in general, it is not that easy.
Each instance was assigned to one of the five clusters. In the context of clustering, an instance's label is the index of the cluster that this instance gets assigned to.
The KMeans instance keeps a copy of the labels of the instances it was trained on, available via the labels_ instance variable:
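For example, here is a minimal sketch of inspecting the labels and centroids and assigning new instances to the closest centroid (it assumes the 2-D blobs dataset X and the kmeans model fitted above; the X_new points are made up purely for illustration):

import numpy as np

kmeans.labels_           # cluster index assigned to each training instance (same values as y_pred)
kmeans.cluster_centers_  # coordinates of the five centroids the algorithm found

# assign a few new, hypothetical 2-D points to their nearest centroid
X_new = np.array([[0, 2], [3, 2], [-3, 3], [-3, 2.5]])
kmeans.predict(X_new)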
| Figure 3. K-Means decision boundaries (Voronoi tessellation) |
Instead of assigning each instance to a single cluster, which is called hard clustering, it can be useful to give each instance a score per cluster, which is called soft clustering. The score can be the distance between the instance and each centroid, or alternatively a similarity or affinity score.
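As a quick illustration of soft clustering (a sketch reusing the kmeans model and the hypothetical X_new points from the previous snippet), Scikit-Learn's transform() method returns the distance from each instance to every centroid:

kmeans.transform(X_new)  # one row per instance, one distance per centroid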
The K-Means algorithm
The algorithm starts by randomly initializing K cluster centroids, then iteratively assigns each data point to the nearest centroid and updates each centroid to the mean of the data points assigned to it. This process continues until convergence, with centroids and assignments being refined in each iteration.
| Figure 4. The K-Means algorithm |
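To make these steps concrete, here is a minimal NumPy sketch of the loop (an illustrative implementation, not the optimized one Scikit-Learn actually uses; the function name and stopping check are assumptions):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # 1. randomly pick k instances to serve as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. assign each instance to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. move each centroid to the mean of the instances assigned to it
        #    (this simple sketch assumes no cluster ever ends up empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop once the centroids no longer move
            break
        centroids = new_centroids
    return centroids, labels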
Although the algorithm is guaranteed to converge, it may not converge to the right solution (i.e., it may converge to a local optimum). Figure 5 shows two suboptimal solutions that the algorithm can converge to if you are not lucky with the random initialization step.
Centroid initialization methods
To reduce the risk of converging to a suboptimal solution, run the algorithm multiple times with different random initializations and keep the best solution. The number of random initializations is controlled by the n_init hyperparameter: by default it is equal to 10, which means that the whole algorithm described earlier runs 10 times when you call fit(), and Scikit-Learn keeps the best solution. To decide which solution is best, it uses a performance metric known as the model's inertia: the sum of the squared distances between each instance and its closest centroid. The solution with the lowest inertia wins.
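For instance (a small sketch assuming the blobs dataset X from Figure 2), you can set n_init explicitly and inspect the inertia of the best run that was kept:

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(X)
kmeans.inertia_  # sum of squared distances from each instance to its closest centroid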
Finding the optimal number of clusters
So far, we have set the number of clusters k to 5 because it was obvious by looking at the data. But in general, it will not be so easy to know how to set k, and the result might be quite bad if you set it to the wrong value. As you can see in Figure 6, setting k to 3 or 8 results in fairly bad models.
| Figure 6. Bad choices for the number of clusters: when k is too small, separate clusters get merged (left), and when k is too large, some clusters get chopped into multiple pieces (right) |
You might be thinking that we could just pick the model with the lowest inertia. Unfortunately, it is not that simple. The inertia for k=3 is 653.2, which is much higher than for k=5 (which was 211.6). But with k=8, the inertia is just 119.1.
The inertia is not a good performance metric when trying to choose k because it keeps getting lower as we increase k. Indeed, the more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be. Let's plot the inertia as a function of k (Figure 7).
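A minimal sketch of how such a plot can be produced (assuming the blobs dataset X and that Matplotlib is available):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# fit one K-Means model per candidate value of k and record its inertia
inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_ for k in range(1, 10)]

plt.plot(range(1, 10), inertias, "bo-")
plt.xlabel("$k$")
plt.ylabel("Inertia")
plt.show()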
| Figure 7. When plotting the inertia as a function of the number of clusters k, the curve often contains an inflection point called the "elbow" |
As you can see, the inertia drops very quickly as we increase k up to 4, but then it decreases much more slowly as we keep increasing k. This curve has roughly the shape of an arm, and there is an "elbow" at k=4.
This technique for choosing the best value for the number of clusters is rather coarse. A more precise (but also more computationally expensive) approach is to use the silhouette score, which is the mean silhouette coefficient over all the instances. An instance's silhouette coefficient is equal to (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster and b is the mean distance to the instances of the nearest other cluster. It can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, a coefficient close to 0 means that it is close to a cluster boundary, and a coefficient close to -1 means that the instance may have been assigned to the wrong cluster.
Computing the silhouette score:
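A minimal sketch with Scikit-Learn (assuming the kmeans model fitted on X with k=5):

from sklearn.metrics import silhouette_score

silhouette_score(X, kmeans.labels_)  # mean silhouette coefficient over all instances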
Let's compare the silhouette scores for different numbers of clusters (Figure 8).
As you can see, this visualization is much richer than the previous one: although it confirms that k=4 is a very good choice, it also highlights that k=5 is quite good as well, and much better than k=6 or 7. This was not visible when comparing inertias.
3. Limits of K-Means
Despite its many merits, most notably being fast and scalable, K-Means is not perfect. As we saw, it is necessary to run the algorithm several times to avoid suboptimal solutions, and you need to specify the number of clusters, which can be quite a hassle.
Moreover, K-Means does not behave very well when the clusters have varying sizes, different densities, or nonspherical shapes.
Figure 9 shows how K-Means clusters a dataset containing three ellipsoidal clusters of different dimensions, densities, and orientations.
As you can see, neither of these solutions is any good. The solution on the left is better, but it still chops off 25% of the middle cluster and assigns it to the cluster on the right. The solution on the right is just terrible, even though its inertia is lower.
4. Using Clustering for Image Segmentation
Image segmentation is the task of partitioning an image into multiple segments. We are going to do something much simpler: color segmentation. We will simply assign pixels to the same segment if they have a similar color.
The image is represented as a 3D array. The first dimension's size is the height; the second is the width; and the third is the number of color channels, in this case red, green, and blue (RGB).
The following code reshapes the array to get a long list of RGB colors, and then it clusters these colors using K-Means:
# image is assumed to be a (height, width, 3) RGB array, e.g. loaded with matplotlib.pyplot.imread()
X = image.reshape(-1, 3)                                  # one row per pixel, one column per color channel
kmeans = KMeans(n_clusters=8, random_state=42).fit(X)     # cluster the pixel colors into eight groups
segmented_img = kmeans.cluster_centers_[kmeans.labels_]   # replace each pixel with its cluster's mean color
segmented_img = segmented_img.reshape(image.shape)        # restore the original image dimensions
This outputs the image shown in the upper right of Figure 10. When you use fewer than eight clusters, notice that the ladybug's flashy red color fails to get a cluster of its own: it gets merged with colors from the environment. This is because K-Means prefers clusters of similar sizes. The ladybug is small, much smaller than the rest of the image, so K-Means fails to dedicate a cluster to it.
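To see this effect for yourself, you could re-run the segmentation with fewer color clusters (a hedged sketch reusing X, the reshaped pixel array from above):

# with fewer clusters, the ladybug's red tends to get absorbed into a larger color cluster
segmented_imgs = []
for n_clusters in (8, 4, 2):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42).fit(X)
    segmented_imgs.append(kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape))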
5. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm known for its ability to find clusters of varying shapes in data. Unlike K-Means, DBSCAN doesn't require the user to specify the number of clusters beforehand, making it a valuable tool for real-world data. DBSCAN operates by defining clusters as areas of high data point density and isolating points that lie in less dense regions as noise.
This algorithm defines clusters as continuous regions of high density. Here is how it works:
- For each instance, the algorithm counts how many instances are located within a small distance ε (epsilon) from it. This region is called the instance's ε-neighborhood.
- If an instance has at least min_samples instances in its ε-neighborhood (including itself), then it is considered a core instance. In other words, core instances are those that are located in dense regions.
- All instances in the neighborhood of a core instance belong to the same cluster. This neighborhood may include other core instances; therefore, a long sequence of neighboring core instances forms a single cluster.
- Any instance that is not a core instance and does not have one in its neighborhood is considered an anomaly.
The DBSCAN class in Scikit-Learn is simple to use. Let's test it on the moons dataset:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=1000, noise=0.05, random_state=42)  # y is ignored: we cluster without labels
dbscan = DBSCAN(eps=0.05, min_samples=5)  # small neighborhood radius, at least 5 instances per dense region
dbscan.fit(X)
The labels of all the instances are now available in the labels_ instance variable:
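For example (a quick peek; the exact values depend on the fit):

dbscan.labels_[:10]  # cluster index of the first ten instances; -1 marks an anomaly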
Notice that some instances have a cluster index equal to -1, which means that they are considered anomalies by the algorithm.
This clustering is represented in the lefthand plot in Figure 11. As you can see, it identified quite a lot of anomalies, plus seven different clusters. That is disappointing! Fortunately, if we widen each instance's neighborhood by increasing eps to 0.20, we get the clustering on the right, which looks perfect.
| Figure 11. DBSCAN clustering using two different neighborhood radii |
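A minimal sketch of that second fit, using the wider neighborhood radius behind the righthand plot:

dbscan = DBSCAN(eps=0.20, min_samples=5)  # larger epsilon-neighborhood
dbscan.fit(X)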
In short, DBSCAN is a very simple yet powerful algorithm capable of identifying any number of clusters of any shape. It is robust to outliers, and it has just two hyperparameters (eps and min_samples).
6. Conclusion
Unsupervised machine learning is a fascinating field with a broad range of applications. It excels in scenarios where labeled data is limited or nonexistent, allowing us to extract valuable insights, uncover hidden patterns, and make sense of complex, unlabeled data. As machine learning continues to evolve, unsupervised learning techniques will play a crucial role in harnessing the full potential of our data-rich world. Whether it's customer segmentation, anomaly detection, or simplifying high-dimensional data, unsupervised machine learning is a key player in the quest for knowledge and understanding in the data-driven age. Stay tuned for more interesting topics!