In the world of machine learning, one of the most powerful techniques at your disposal is ensemble learning. An ensemble is simply a group of predictors, and ensemble learning is not a single algorithm but a strategy that combines multiple models into one predictive system. The idea is that by combining the wisdom of multiple models you can reduce errors, improve generalization, and build a more robust model than any of its individual members.
In this blog, we will dive deep into ensemble learning, understand its principles, explore different ensemble methods, and work on real-world problems.
| Figure 1. Decision Tree |
The Jupyter Notebook implementation can be found here.
Table of Contents:
- Principles of Ensemble Learning
- Advantages of Ensemble Learning
- Voting Classifiers
- Bagging and Pasting
- Bagging and Pasting in Scikit-Learn
- Out-of-Bag Evaluation
- Random Patches and Random Subspaces
- Random Forests
- Extra Trees and Feature Importance
- Feature Importance
- Boosting
- AdaBoost
- Conclusion
1. Principles of Ensemble Learning
- Diversity of Models - The key to a successful ensemble is diversity. The individual base models should make different errors on the same data. This diversity can be achieved through various means, such as using different algorithms, subsets of the data, or varying the hyperparameters.
- Independence - The individual models should be as independent as possible. If they make errors due to the same underlying reasons, combining them may not yield significant improvements.
- Combining Predictions - Once you have your diverse set of models, you combine their predictions using a specific technique. This combination can be done through various methods like averaging, weighted averaging, or more sophisticated techniques like boosting and bagging.
2. Advantages of Ensemble Learning
- Improved Accuracy - Ensembles often outperform individual models, especially when dealing with complex, noisy, or imbalanced data.
- Robustness - Ensembles are more robust to overfitting, as the errors of individual models tend to cancel each other out.
- Versatility - Ensemble methods can be applied to a wide range of underlying machine learning algorithms.
- Wide Applicability - Ensembles are used in both classification and regression problems, making them suitable for many real-world scenarios.
3. Voting Classifiers
A voting classifier is a simple yet effective ensemble learning method that combines the predictions of multiple individual classifiers, such as decision trees, support vector machines, or logistic regression models. Voting classifiers are particularly valuable when you have multiple well-performing base models, as they can enhance prediction accuracy and increase the model's reliability.
There are two main types of voting classifiers: hard voting, where the majority vote is chosen; and soft voting, where the class label with the highest average probability is selected. Soft voting tends to perform better when base models provide probability estimates, making it a valuable tool in ensemble learning for tasks like classification.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)
voting_clf.fit(X_train, y_train)
Let's look at each classifier's accuracy on the test set:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))
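And the voting classifier itself (hard voting by default):
print("voting =", voting_clf.score(X_test, y_test))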
There you have it! The voting classifier slightly outperforms all the individual classifiers.
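If you prefer soft voting, as described above, every base estimator must expose class probabilities; for SVC that means setting probability=True. A minimal sketch reusing the same base models:
voting_clf_soft = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(probability=True, random_state=42))  # enable predict_proba for soft voting
    ],
    voting='soft'
)
voting_clf_soft.fit(X_train, y_train)
print("soft voting =", voting_clf_soft.score(X_test, y_test))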
4. Bagging and Pasting
Bagging and Pasting are ensemble learning techniques that help improve the performance of machine learning models by training multiple base models on different subsets of the training data.
Bagging (Bootstrap Aggregating) involves randomly sampling the training data with replacement, allowing the same sample to be included multiple times in various subsets.
Pasting, on the other hand, involves sampling the data without replacement, ensuring that each sample is included only once in different subsets.
Both methods reduce the risk of overfitting and enhance the model's generalization capabilities by combining the predictions of these independently trained models, such as decision trees or neural networks, to make more robust and accurate predictions.
Bagging and Pasting in Scikit-Learn
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)
Figure 3 compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees (from the preceding code), both trained on the moons dataset. As you can see, the ensemble's predictions will likely generalize much better than the single Decision Tree's predictions: the ensemble has a comparable bias but a smaller variance.
| Figure 3. A single Decision Tree (left) versus a bagging ensemble of 500 trees (right) |
Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble's variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred.
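Pasting uses the same BaggingClassifier; you simply sample without replacement by setting bootstrap=False. A minimal sketch on the same moons split:
paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                              max_samples=100, bootstrap=False,  # sample without replacement
                              n_jobs=-1, random_state=42)
paste_clf.fit(X_train, y_train)
print("pasting =", paste_clf.score(X_test, y_test))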
Out-of-Bag Evaluation
With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default, a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set. The training instances that are never sampled for a given predictor are called its out-of-bag (oob) instances.
Since a predictor never sees the oob instances during training, it can be evaluated on these instances without the need for a separate validation set. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.
In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training. The code below demonstrates this (a minimal sketch reusing the moons split from earlier).
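bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            oob_score=True, n_jobs=-1, random_state=42)
bag_clf.fit(X_train, y_train)  # X_train, y_train from the moons split above
print("oob accuracy =", bag_clf.oob_score_)
print("test accuracy =", bag_clf.score(X_test, y_test))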
5. Random Patches and Random Subspaces
Random patches and random subspaces are variations of the bagging ensemble technique that offer more flexibility regarding feature selection and data sampling. These techniques are especially useful when dealing with high-dimensional data or datasets with many irrelevant features, as they promote diversity among base models and help improve model performance by reducing overfitting and increasing predictive accuracy.
In random patches, subsets of both the training data and features are sampled randomly, creating a diverse set of models that learn from different portions of the dataset with varying feature sets.
Random subspaces, on the other hand, randomly select subsets of features while using the entire training dataset.
Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
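In Scikit-Learn, both variants are configured through BaggingClassifier's sampling hyperparameters: max_features and bootstrap_features control feature sampling, while max_samples and bootstrap control instance sampling. A minimal sketch (the exact fractions below are arbitrary choices for illustration):
# Random Subspaces: sample features only, keep every training instance
subspace_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,           # all instances
    bootstrap_features=True, max_features=0.5,  # half the features, with replacement
    n_jobs=-1, random_state=42)

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, max_samples=0.75,           # bootstrap 75% of the instances
    bootstrap_features=True, max_features=0.5,
    n_jobs=-1, random_state=42)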
6. Random Forests
You can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you obtain the predictions of all the individual trees, then predict the class that gets the most votes. Such an ensemble of Decision Trees is called a Random Forest.
The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the best feature when splitting a node, it searches for the best feature among a random subset of features. The algorithm results in greater tree diversity, which again trades a higher bias for a lower variance, generally yielding an overall better model.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16,
                                 n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)
With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.
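To see the connection, here is a BaggingClassifier that is roughly equivalent to the RandomForestClassifier above: bagged Decision Trees that each consider only a random subset of features at every split (a sketch, not an exact one-to-one correspondence):
bag_equiv_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500, n_jobs=-1, random_state=42)
bag_equiv_clf.fit(X_train, y_train)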
Extra Trees and Feature Importance
When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).
A forest of such extremely random trees is called an Extremely Randomized Trees ensemble.
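Scikit-Learn provides this as ExtraTreesClassifier, which has the same API as RandomForestClassifier; a minimal sketch on the same data:
from sklearn.ensemble import ExtraTreesClassifier

ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16,
                               n_jobs=-1, random_state=42)
ext_clf.fit(X_train, y_train)
print("extra trees =", ext_clf.score(X_test, y_test))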
Feature Importance
Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest).
For example, the following code trains a RandomForestClassifier on the iris dataset and outputs each feature's importance. It seems that the most important features are the petal length (44%) and width (43%), while sepal length and width are rather unimportant in comparison (9% and 2% respectively):
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
Similarly, if you train the Random Forest classifier on the MNIST data and plot each pixel's importance, you get the image represented in Figure 4 below.
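A sketch of how such a plot can be produced (assuming the MNIST data is fetched from OpenML; training on the full 70,000 images takes a little while):
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", as_frame=False)
rnd_clf_mnist = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rnd_clf_mnist.fit(mnist.data, mnist.target)

# each of the 784 importances corresponds to one pixel of the 28x28 image
plt.imshow(rnd_clf_mnist.feature_importances_.reshape(28, 28), cmap="hot")
plt.axis("off")
plt.colorbar()
plt.show()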
Random Forests are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection.
7. Boosting
Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.
AdaBoost
One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.
For example, when training an AdaBoost classifier, the algorithm first trains a base classifier (such as a Decision Tree) and uses it to make predictions on the training set. The algorithm increases the relative weight of misclassified training instances. Then, it trains a second classifier, using the updated weights, and again makes predictions on the training set, updates the instance weights, and so on. (Figure 5)
Figure 6 below shows the decision boundaries of five consecutive predictors on the moons dataset. The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on.
Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.
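In Scikit-Learn this is available as AdaBoostClassifier; a minimal sketch using shallow Decision Trees (decision stumps) as the weak learners on the same moons data:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stumps as weak learners
    n_estimators=200, learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
print("AdaBoost =", ada_clf.score(X_test, y_test))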
8. Conclusion
Ensemble learning is a powerful technique that harnesses the collective intelligence of multiple models to make more accurate predictions and increase the overall robustness of machine learning systems. By understanding the principles of diversity and independence and exploring various ensemble methods, you can leverage this approach to tackle complex problems in diverse domains.
As you venture further into the world of machine learning, keep in mind that ensemble learning is not a one-size-fits-all solution. The choice of the ensemble method and the composition of base models depend on the specific problem you aim to solve. So, experiment, learn, and adapt ensemble learning to your unique challenges to unlock its full potential in your machine-learning projects.
Stay tuned for more interesting topics!