
Dimensionality Reduction: A Guide to Complex Datasets

In the age of big data, we're often confronted with datasets that contain an overwhelming number of features. While more data can lead to better insights, it can also make analysis and machine-learning tasks incredibly challenging. This is where dimensionality reduction comes to the rescue. Dimensionality reduction techniques help us simplify complex datasets without losing critical information. 

In this blog, we'll explore what dimensionality reduction is, its importance, common techniques, and visual implementations on the MNIST dataset.

Figure 1. Reducing a higher dimensional dataset into lower dimensions

The Jupyter Notebook implementation can be found here.

Table of Contents:

  1. What is Dimensionality Reduction?
  2. The Curse of Dimensionality
  3. Main Approaches for Dimensionality Reduction
    • Projection
    • Manifold Learning
  4. PCA (Principal Component Analysis)
    • Preserving the Variance
    • Principal Components
    • Using Scikit-Learn
    • Explained Variance Ratio
    • Choosing the Right Number of Dimensions
    • PCA for Compression
  5. Kernel PCA
    • Selecting a Kernel and Tuning Hyperparameters
  6. Conclusion

1. What is Dimensionality Reduction?

Dimensionality reduction is the process of reducing the number of random variables under consideration. In simpler terms, it's about simplifying the complexity of data by transforming it from a high-dimensional space to a lower-dimensional one. This is done while preserving as much relevant information as possible.


Why is dimensionality reduction important?
  1. Curse of Dimensionality - As the number of features in a dataset increases, the volume of the feature space grows exponentially and the data becomes increasingly sparse. This makes computations slower and can lead to overfitting when building machine learning models.
  2. Improved Visualization - Lower-dimensional data is easier to visualize. By reducing data to 2D or 3D, we can visualize relationships between data points more effectively.
  3. Noise Reduction - High-dimensional data often contains noise or irrelevant features. Dimensionality reduction can help filter out this noise, making the data cleaner.

Practical Applications:
  1. Image and Video Processing - In image and video analysis, dimensionality reduction techniques are used to reduce the complexity of the data while preserving essential information. 
  2. Text Analysis - Dimensionality reduction is valuable in natural language processing for tasks like sentiment analysis, text classification, and document clustering. Reducing the dimensionality of word embeddings can improve the efficiency and effectiveness of these tasks.
  3. Genomics - Genomic data often comes with a vast number of features, making analysis and modeling challenging. Dimensionality reduction techniques are employed to simplify this data, aiding in tasks like gene expression analysis and disease classification.
  4. Recommendation Systems - In recommendation systems like those used by Netflix and Amazon, dimensionality reduction helps reduce the complexity of user-product interaction data, making personalized recommendations more efficient. 

2. The Curse of Dimensionality

The curse of dimensionality is a concept that describes the challenges and issues that arise as the dimensionality or the number of features in a dataset increases. 

There is simply a lot of room in high-dimensional space. As a result, high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. The more dimensions the training set has, the greater the risk of overfitting it, and predictions become much less reliable than in lower dimensions.
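
To get a feel for this, the average distance between two random points in a d-dimensional unit hypercube can be estimated with a rough simulation; the average distance grows steadily with d:

import numpy as np

rng = np.random.default_rng(42)

for d in (2, 10, 100, 1000):
    a = rng.random((1_000, d))
    b = rng.random((1_000, d))
    # Average Euclidean distance between pairs of random points in [0, 1]^d
    print(d, np.linalg.norm(a - b, axis=1).mean())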

3. Main Approaches for Dimensionality Reduction

Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated. As a result, all training instances lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. In Figure 2, you can see a 3D dataset represented by circles.
Figure 2. A 3D dataset lying close to a 2D subspace

Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. If we project every training instance perpendicularly onto this subspace, we get the new 2D dataset shown in Figure 3 below. 
Figure 3. The new 2D dataset after projection

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, as with the famous Swiss roll toy dataset shown in Figure 4.
Figure 4. Swiss roll dataset
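
For reference, the Swiss roll used in these figures (and again in the Kernel PCA section below) can be generated with Scikit-Learn's make_swiss_roll; a minimal sketch (the noise and random_state values here are illustrative):

from sklearn.datasets import make_swiss_roll

# X_swiss has shape (n_samples, 3); t is the position of each point along the roll
X_swiss, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)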

Simply projecting onto a plane (e.g., by dropping x3) would squash different layers of the Swiss roll together, as shown on the left side of Figure 5. What you really want is to unroll the Swiss roll to obtain the 2D dataset on the right side of Figure 5.
Figure 5. Squashing by projecting onto a plane (left) versus unrolling the Swiss roll (right)

Manifold Learning

The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d<n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d=2 and n=3: it locally resembles a 2D plane, but it is rolled in the third dimension. Modeling the manifold on which the training instances lie is the idea behind manifold learning.

It relies on the manifold assumption, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold.

An assumption of manifold learning is that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. For example, in the top row of Figure 6, the Swiss roll is split into two classes: in the 3D space (on the left), the decision boundary would be fairly complex, but in the 2D unrolled manifold space (on the right), the decision boundary is a straight line. 

However, this implicit assumption does not always hold. For example, in the bottom row of Figure 6, the decision boundary is located at x1=5.
Figure 6. The decision boundary may not always be simpler with lower dimensions

In short, reducing the dimensionality of your training set before training a model will usually speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.

4. PCA (Principal Component Analysis)

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. PCA transforms high-dimensional data into a new coordinate system, where the first principal component explains the most variance in the data, the second explains the second most, and so on. 

PCA is valuable for simplifying complex datasets, reducing redundancy, and identifying the most important features. It finds applications in data compression, noise reduction, and visualization, making it an essential tool in exploratory data analysis and machine learning.

Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. For example, a simple 2D dataset is represented on the left in Figure 7, along with three different axes (i.e., 1D hyperplanes). On the right is the result of the projection of the dataset onto each of these axes. 

As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance and the projection onto the dashed line preserves an intermediate amount of variance.
Figure 7. Selecting the subspace on which to project

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. This axis is also the one that minimizes the mean squared distance between the original dataset and its projection onto that axis.

Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set. In Figure 7, it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In this 2D example, there is no choice: it is the dashed line. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on - as many axes as the number of dimensions in the dataset.

The ith axis is called the ith principal component (PC) of the data. In Figure 7, the first PC is the axis on which vector c1 lies and the second PC is the axis on which vector c2 lies.
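
For reference, the principal components can be obtained with a standard Singular Value Decomposition (SVD) of the centered training set; a minimal NumPy sketch, assuming X holds the training data:

import numpy as np

X_centered = X - X.mean(axis=0)        # PCA assumes the data is centered
U, s, Vt = np.linalg.svd(X_centered)   # the rows of Vt are the unit vectors of the PCs
c1 = Vt[0]                             # first principal component
c2 = Vt[1]                             # second principal component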

Using Scikit-Learn

The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions:

from sklearn.decomposition import PCA

# Project the dataset X (e.g., the 3D dataset from Figure 2) onto the first two principal components
pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

Explained Variance Ratio

Another useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ variable. The ratio indicates the proportion of the dataset's variance that lies along each principal component. 

For example, let's look at the explained variance ratios of the first two components of the 3D dataset represented in Figure 2.
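
Assuming pca is the two-component PCA fitted in the previous snippet, the ratios can be inspected directly (the values shown are the approximate ones discussed below):

pca.explained_variance_ratio_
# array([0.757, 0.151])  (approximate values for this 3D dataset)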

This output tells you that 75.7% of the dataset's variance lies along the first PC, and 15.1% lies along the second PC. This leaves less than 10% for the third PC, so it is reasonable to assume that the third PC probably carries little information.

Choosing the Right Number of Dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is simpler to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization - in that case, you will want to reduce the dimensionality down to 2 or 3.

The following code performs PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance:
import numpy as np

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1   # d equals 154

You could then set n_components=d and run PCA again. But there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
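
After fitting, you can check how many components were actually kept via the fitted model's n_components_ attribute:

pca.n_components_   # 154 for MNIST when preserving 95% of the variance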

Yet another option is to plot the explained variance as a function of the number of dimensions, as shown in Figure 8. There will usually be an elbow in the curve, where the explained variance stops growing fast. In this case, you can see that reducing the dimensionality down to about 100 dimensions wouldn't lose too much explained variance.
Figure 8. Explained variance as a function of the number of dimensions
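
A minimal plotting sketch (using Matplotlib and the cumsum array computed above) that reproduces this kind of curve:

import matplotlib.pyplot as plt

plt.plot(cumsum)
plt.xlabel("Number of dimensions")
plt.ylabel("Explained variance")
plt.grid(True)
plt.show()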

PCA for Compression

After dimensionality reduction, the training set takes up much less space. As an example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You should find that each instance will have just over 150 features, instead of the original 784 features. So, while most of the variance is preserved, the dataset is now less than 20% of its original size! This is a reasonable compression ratio, and you can see how this size reduction can speed up a classification algorithm tremendously.

It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. This won't give you back the original data, since the projection loses a bit of information (within the 5% of variance that was dropped), but it will likely be close to the original data.

The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error.

The following code compresses the MNIST dataset down to 154 dimensions, then uses the inverse_transform() method to decompress it back to 784 dimensions:
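
# A sketch: assumes X_train holds the flattened 784-feature MNIST images
pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)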

Figure 9 shows a few digits from the original training set (on the left) and the corresponding digits after compression and decompression (on the right). You can see that there is a slight loss of image quality, but the digits are still mostly intact.

Figure 9. MNIST compression that preserves 95% of the variance

5. Kernel PCA

The kernel trick is a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), making it possible to capture nonlinear relationships in the data. Applying the kernel trick to PCA is known as Kernel PCA (kPCA). It is often good at preserving clusters of instances after projection, and it can sometimes even unroll datasets that lie close to a twisted manifold.

from sklearn.decomposition import KernelPCA

# X_swiss is the Swiss roll dataset (e.g., generated earlier with make_swiss_roll)
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04, random_state=42)
X_reduced = rbf_pca.fit_transform(X_swiss)

Figure 10 shows the Swiss roll, reduced to two dimensions using a linear kernel (equivalent to simply using the PCA class), an RBF kernel, and a sigmoid kernel.

Figure 10. Swiss roll reduced to 2D using kPCA with various kernels

Selecting a Kernel and Tuning Hyperparameters

As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. That said, dimensionality reduction is often a preparation step for a supervised task, so you can use grid search to select the kernel and hyperparameters that lead to the best performance on that task.
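
For example, you could chain kPCA with a classifier in a pipeline and use GridSearchCV to find the kernel and gamma value that maximize classification accuracy; a sketch, assuming y holds labels for the downstream task (y is not defined in this post, e.g., two classes obtained by thresholding the Swiss roll position t):

import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("kpca", KernelPCA(n_components=2)),
    ("log_reg", LogisticRegression()),
])

param_grid = [{
    "kpca__gamma": np.linspace(0.03, 0.05, 10),
    "kpca__kernel": ["rbf", "sigmoid"],
}]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X_swiss, y)            # y is the assumed supervised target
print(grid_search.best_params_)        # best kernel and gamma found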

Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error.
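
In Scikit-Learn, setting fit_inverse_transform=True makes KernelPCA learn an approximate inverse mapping, so the reconstruction (pre-image) error can be computed; a sketch:

from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X_swiss)
X_preimage = rbf_pca.inverse_transform(X_reduced)

mean_squared_error(X_swiss, X_preimage)   # lower is better across kernels and hyperparameters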

6. Conclusion

Dimensionality reduction is a valuable tool in the data scientist's toolbox. It allows us to handle high-dimensional data more effectively, visualize complex relationships, and improve the efficiency and accuracy of machine learning models. However, it's essential to choose the right techniques for the specific problem at hand and understand the trade-offs involved. When used wisely, dimensionality reduction can unlock valuable insights from seemingly insurmountable datasets.

Stay tuned for more interesting topics!
