In the age of big data, we're often confronted with datasets that contain an overwhelming number of features. While more data can lead to better insights, it can also make analysis and machine-learning tasks incredibly challenging. This is where dimensionality reduction comes to the rescue. Dimensionality reduction techniques help us simplify complex datasets without losing critical information.
In this blog, we'll explore what dimensionality reduction is, its importance, common techniques, and visual implementations on the MNIST dataset.
The Jupyter Notebook implementation can be found here.
Table of Contents:
- What is Dimensionality Reduction?
- The Curse of Dimensionality
- Main Approaches for Dimensionality Reduction
- Projection
- Manifold Learning
- PCA (Principal Component Analysis)
- Preserving the Variance
- Principal Components
- Using Scikit-Learn
- Explained Variance Ratio
- Choosing the Right Number of Dimensions
- PCA for Compression
- Kernel PCA
- Selecting a Kernel and Tuning Hyperparameters
- Conclusion
1. What is Dimensionality Reduction?
Dimensionality reduction is the process of reducing the number of random variables under consideration. In simpler terms, it's about simplifying the complexity of data by transforming it from a high-dimensional space to a lower-dimensional one. This is done while preserving as much relevant information as possible.
Why is dimensionality reduction important?
- Curse of Dimensionality - As the number of features in a dataset increases, the volume of the data increases exponentially. This makes computations slower and can lead to overfitting when building machine learning models.
- Improved Visualization - Lower-dimensional data is easier to visualize. By reducing data to 2D or 3D, we can visualize relationships between data points more effectively.
- Noise Reduction - High-dimensional data often contains noise or irrelevant features. Dimensionality reduction can help filter out this noise, making the data cleaner.
Practical Applications:
- Image and Video Processing - In image and video analysis, dimensionality reduction techniques are used to reduce the complexity of the data while preserving essential information.
- Text Analysis - Dimensionality reduction is valuable in natural language processing for tasks like sentiment analysis, text classification, and document clustering. Reducing the dimensionality of word embeddings can improve the efficiency and effectiveness of these tasks.
- Genomics - Genomic data often comes with a vast number of features, making analysis and modeling challenging. Dimensionality reduction techniques are employed to simplify this data, aiding in tasks like gene expression analysis and disease classification.
- Recommendation Systems - In recommendation systems like those used by Netflix and Amazon, dimensionality reduction helps reduce the complexity of user-product interaction data, making personalized recommendations more efficient.
2. The Curse of Dimensionality
The curse of dimensionality is a concept that describes the challenges and issues that arise as the dimensionality or the number of features in a dataset increases.
There is simply a lot of room in high-dimensional spaces. As a result, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other, and any new instance is likely to be far away from all of them, which makes predictions much less reliable than in lower dimensions. In short, the more dimensions the training set has, the greater the risk of overfitting it.
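To see this concretely, here is a small simulation (a sketch, not part of the original post) that estimates the average distance between points drawn at random from a unit hypercube as the number of dimensions grows:

import numpy as np

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    a = rng.random((1000, d))   # 1,000 random points in the d-dimensional unit hypercube
    b = rng.random((1000, d))
    avg_dist = np.linalg.norm(a - b, axis=1).mean()
    print(f"d={d}: average distance is about {avg_dist:.2f}")

The average distance grows roughly like the square root of the number of dimensions, so points that would be close neighbors in 2D end up far apart in a high-dimensional space.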
3. Main Approaches for Dimensionality Reduction
Projection
In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. In Figure 2, you can see a 3D dataset represented by circles.
Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. If we project every training instance perpendicularly onto this subspace, we get the new 2D dataset shown in Figure 3 below.
| Figure 3. The new 2D dataset after projection |
However, projection is not always the best approach to dimensionality reduction. In many cases, the subspace may twist and turn, such as the famous Swiss roll toy dataset in Figure 4.
| Figure 4. Swiss roll dataset |
Simply projecting onto a plane (e.g., by dropping x3) would squash different layers of the Swiss roll together, as shown on the left side of Figure 5. What you really want is to unroll the Swiss roll to obtain the 2D dataset on the right side of Figure 5.
| Figure 5. Squashing by projecting onto a plane (left) versus unrolling the Swiss roll (right) |
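If you want to experiment with this yourself, here is a minimal sketch (not from the original post) that generates a Swiss roll with scikit-learn; dropping one of the rolled coordinates reproduces the squashing effect described above:

from sklearn.datasets import make_swiss_roll

X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
# X_swiss has shape (1000, 3); t encodes each instance's position along the roll.
# Dropping the third coordinate projects onto a plane and squashes the layers together.
X_squashed = X_swiss[:, :2]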
Manifold Learning
The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d<n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d=2 and n=3: it locally resembles a 2D plane, but it is rolled in the third dimension. Modeling the manifold on which the training instances lie is the idea behind manifold learning.
It relies on the manifold assumption, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.
Another implicit assumption of manifold learning is that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. For example, in the top row of Figure 6, the Swiss roll is split into two classes: in the 3D space (on the left), the decision boundary would be fairly complex, but in the 2D unrolled manifold space (on the right), the decision boundary is a straight line.
However, this implicit assumption does not always hold. For example, in the bottom row of Figure 6, the decision boundary is located at x1=5.
| Figure 6. The decision boundary may not always be simpler with lower dimensions |
In short, reducing the dimensionality of your training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.
4. PCA (Principal Component Analysis)
Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. PCA transforms high-dimensional data into a new coordinate system, where the first principal component explains the most variance in the data, the second explains the second most, and so on.
PCA is valuable for simplifying complex datasets, reducing redundancy, and identifying the most important features. It finds applications in data compression, noise reduction, and visualization, making it an essential tool in exploratory data analysis and machine learning.
Preserving the Variance
Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. For example, a simple 2D dataset is represented on the left in Figure 7, along with three different axes (i.e., 1D hyperplanes). On the right is the result of the projection of the dataset onto each of these axes.
As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance and the projection onto the dashed line preserves an intermediate amount of variance.
| Figure 7. Selecting the subspace to project onto |
It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice: it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis.
Principal Components
PCA identifies the axis that accounts for the largest amount of variance in the training set. In Figure 7, it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In this 2D example, there is no choice: it is the dashed line. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on - as many axes as the number of dimensions in the dataset.
The ith axis is called the ith principal component (PC) of the data. In Figure 7, the first PC is the axis on which vector c1 lies and the second PC is the axis on which vector c2 lies.
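To make this concrete, here is a short sketch (not the post's code) that recovers the principal components of a training matrix X with NumPy's SVD and projects onto the plane spanned by the first two PCs, c1 and c2. X is assumed to be a small (n_samples, n_features) matrix, such as the 3D dataset discussed earlier:

import numpy as np

X_centered = X - X.mean(axis=0)                    # PCA assumes the data is centered
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
c1, c2 = Vt[0], Vt[1]                              # unit vectors along the first two PCs
W2 = Vt[:2].T                                      # projection matrix made of the first two PCs
X2D = X_centered @ W2                              # project down to 2 dimensions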
Using Scikit-Learn
The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)      # keep the first two principal components
X2D = pca.fit_transform(X)     # note: PCA centers the data automatically
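After fitting, the components_ attribute holds the principal components as rows (its transpose corresponds to the W2 matrix from the SVD sketch above).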
Explained Variance Ratio
Another useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ attribute. The ratio indicates the proportion of the dataset's variance that lies along each principal component.
For example, let's look at the explained variance ratios of the first two components of the 3D dataset represented in Figure 2.
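Continuing from the PCA fitted above, the ratios can be inspected directly; the printed values here are illustrative, matching the percentages quoted next:

print(pca.explained_variance_ratio_)
# e.g. [0.757 0.151]  -- illustrative output; your exact numbers depend on the data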
This output tells you that 75.7% of the dataset's variance lies along the first PC, and 15.1% lies along the second PC. This leaves less than 10% for the third PC, so it is reasonable to assume that the third PC probably carries little information.
Choosing the Right Number of Dimensions
Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization - in that case, you will want to reduce the dimensionality down to 2 or 3.
The following code performs PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance:
import numpy as np

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1  # d equals 154
You could then set n_components=d and run PCA again. But there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
Yet another option is to plot the explained variance as a function of the number of dimensions, as shown in Figure 8. There will usually be an elbow in the curve, where the explained variance stops growing fast. In this case, you can see that reducing the dimensionality down to about 100 dimensions wouldn't lose too much explained variance.
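A plot like Figure 8 can be produced with a few lines of matplotlib, reusing the cumsum array computed above (a sketch; the styling is arbitrary):

import matplotlib.pyplot as plt

plt.plot(cumsum, linewidth=2)
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance")
plt.grid(True)
plt.show()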
PCA for Compression
After dimensionality reduction, the training set takes up much less space. As an example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You should find that each instance will have just over 150 features, instead of the original 784 features. So, while most of the variance is preserved, the dataset is now less than 20% of its original size! This is a reasonable compression ratio, and you can see how this size reduction can speed up a classification algorithm tremendously.
It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. This won't give you back the original data, since the projection loses a bit of information (the 5% of variance that was dropped), but it will likely be close to the original data.
The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.
The following code compresses the MNIST dataset down to 154 dimensions, then uses the inverse_transform() method to decompress it back to 784 dimensions:
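A minimal version of that code, assuming X_train holds the MNIST training images as 784-dimensional rows, might look like this:

pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)            # compress: 784 -> 154 dimensions
X_recovered = pca.inverse_transform(X_reduced)    # decompress: 154 -> 784 dimensions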
Figure 9 shows a few digits from the original training set (on the left) and the corresponding digits after compression and decompression (on the right). You can see that there is a slight loss of image quality, but the digits are still mostly intact.
5. Kernel PCA
The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), making it possible to capture nonlinear relationships in the data. Applying the kernel trick to PCA is known as Kernel PCA (kPCA). It is often good at preserving clusters of instances after projection, and it can sometimes even unroll datasets that lie close to a twisted manifold.
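The snippet below assumes X_swiss is a Swiss roll dataset, for example the one generated earlier with make_swiss_roll in the Projection section.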
from sklearn.decomposition import KernelPCA
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04, random_state=42)
X_reduced = rbf_pca.fit_transform(X_swiss)
Figure 10 shows the Swiss roll, reduced to two dimensions using a linear kernel (equivalent to simply using the PCA class), an RBF kernel, and a sigmoid kernel.
Selecting a Kernel and Tuning Hyperparameters
As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. That said, dimensionality reduction is often a preparation step for a supervised task, so you can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.
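For example, one way to sketch this is to chain kPCA with a simple classifier and grid-search the kernel and gamma values; the pipeline, parameter grid, and classifier below are illustrative choices, not ones prescribed by the post:

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])
param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"],
}]
# X, y are assumed to be the features and labels of the downstream supervised task.
grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)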
Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error.
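One way to sketch this with scikit-learn: set fit_inverse_transform=True so kPCA learns a mapping from the reduced space back to the original space (the reconstruction pre-image), then measure the mean squared reconstruction error. The hyperparameters below are the same illustrative ones used earlier:

from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04,
                    fit_inverse_transform=True, random_state=42)
X_reduced = rbf_pca.fit_transform(X_swiss)
X_preimage = rbf_pca.inverse_transform(X_reduced)   # reconstruction in the original 3D space
print(mean_squared_error(X_swiss, X_preimage))      # lower is better when comparing hyperparameters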
6. Conclusion
Dimensionality reduction is a valuable tool in the data scientist's toolbox. It allows us to handle high-dimensional data more effectively, visualize complex relationships, and improve the efficiency and accuracy of machine learning models. However, it's essential to choose the right techniques for the specific problem at hand and understand the trade-offs involved. When used wisely, dimensionality reduction can unlock valuable insights from seemingly insurmountable datasets.