Support Vector Machines (SVM) are a powerful and versatile class of supervised machine learning algorithms, capable of performing linear or nonlinear classification, regression, and even outlier detection. They have found applications in various domains, from image recognition to finance. In this blog, we'll dive into the world of SVM, exploring what they are, how they work, their strengths, and some everyday use cases.
We will use the IRIS dataset to visualize how SVMs work in general. The code implementation and graphical representations can be found in the notebook here.
Table of Contents:
- How do SVMs work?
- Strengths of SVM
- Use Cases of SVM
- Linear SVM Classification
  - Soft Margin Classification
- Nonlinear SVM Classification
  - Polynomial Kernel
  - Gaussian RBF Kernel
- SVM Regression
- Conclusion
1. How do SVMs work?
To understand how SVMs work, let's break down the key concepts:
- Feature Space - In SVM, data is represented in a high-dimensional feature space. Each data point corresponds to a vector of features. The goal is to find a hyperplane that best separates these points based on the features.
- Decision Boundary - The decision boundary is the hyperplane that separates the data into different classes. In a 2D feature space, this is a simple line; in a 3D feature space, it is a plane; in higher dimensions, it becomes a hyperplane.
- Support Vectors - Support vectors are the data points closest to the decision boundary. They play a crucial role in defining the margin and the decision boundary itself.
- Margin - The margin is the distance between the decision boundary and the nearest support vectors. SVM aims to maximize this margin while minimizing the classification error.
- Kernel Trick - SVMs can handle both linear and nonlinear data. For nonlinear data, a kernel function is used to map the data into a higher-dimensional space where a linear decision boundary can be found. Common kernels include the linear, polynomial, and radial basis function (RBF) kernels.
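To make these concepts concrete, here is a minimal sketch that fits a linear SVM on the iris data and prints its support vectors. Restricting it to two petal features and two classes is just an illustrative choice for easy visualization, not something required by the algorithm.
from sklearn import datasets
from sklearn.svm import SVC

# Keep only two features (petal length, petal width) and two classes
# so the decision boundary is easy to picture.
iris = datasets.load_iris()
X = iris.data[iris.target != 2, 2:4]
y = iris.target[iris.target != 2]

svm_clf = SVC(kernel="linear", C=1)
svm_clf.fit(X, y)

# The support vectors are the training points closest to the decision boundary.
print(svm_clf.support_vectors_)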
2. Strengths of SVM
SVMs offer several advantages that make them popular in various machine-learning applications:
- Effective in High-Dimensional Spaces - SVMs perform well even when the number of features is much larger than the number of samples.
- Robust to Outliers - SVMs are relatively robust to outliers, thanks to the concept of maximizing the margin. Outliers do not significantly affect the placement of the decision boundary.
- Versatile - SVMs can be used for both classification and regression tasks. The choice of kernel allows for modeling nonlinear relationships.
- Strong Generalization - SVMs aim to maximize the margin, which often leads to better generalization on unseen data.
3. Use Cases of SVM
SVMs find applications in a wide range of domains. Here are some common use cases:
- Image Classification - SVMs are used in recognizing handwritten digits, object detection, and face recognition.
- Text Classification - In natural language processing, SVMs are applied to text classification problems, like spam email detection and sentiment analysis.
- Financial Forecasting - SVMs can be employed in predicting stock prices and financial market trends.
- Medical Diagnosis - SVMs are used for disease diagnosis and the classification of medical images, such as MRI scans.
4. Linear SVM Classification
Figure 1. Large margin classifier
Figure 1 shows part of the iris dataset. The two classes can clearly be separated with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that they will probably not perform as well on new instances.
In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier. The line not only separates the two classes but also stays as far away from the closest training instances as possible.
Notice that the decision boundary is fully determined (or "supported") by the instances located on the edge of both classes. These instances are called the support vectors. So adding more training instances away from the margin will not affect the decision boundary at all.
Figure 2. Sensitivity to feature scales
As Figure 2 suggests, SVMs are sensitive to feature scales, so scaling the features (for example with Scikit-Learn's StandardScaler) generally leads to a much better decision boundary.
Soft Margin Classification
If we strictly impose that all instances must be off the street and on the right side, this is called hard margin classification. There are two main issues with hard margin classification. First, it only works if the data is linearly separable. Second, it is sensitive to outliers.
Figure 3 shows what happens with just one additional outlier: on the left plot, it is impossible to find a hard margin; on the right, the decision boundary ends up very different from the one we would get without the outlier.
Figure 3. Hard margin sensitivity to outliers
To avoid these issues, you can use a more flexible model. The objective is to find a good balance between keeping the separation as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.
When creating an SVM model, we can specify a number of hyperparameters, and C is one of them. If we set it to a low value, we get the model on the left in Figure 4; with a high value, we get the model on the right. Margin violations are bad, and it is usually better to have few of them. However, in this case the model on the left has a lot of margin violations but will probably generalize better.
Figure 4. Large margin (left) vs fewer margin violations (right)
If your SVM model is overfitting, try regularizing it by reducing C.
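As a rough sketch of this trade-off, you can train two soft-margin classifiers that differ only in C and compare their margins. The data and C values below are illustrative; the exact setup behind Figure 4 may differ.
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Iris petal features, Iris virginica vs. the rest
iris = datasets.load_iris()
X = iris.data[:, 2:4]
y = (iris.target == 2).astype(int)

# Low C: wide margin, more margin violations, often better generalization
svm_low_c = make_pipeline(StandardScaler(), LinearSVC(C=1, random_state=42))
svm_low_c.fit(X, y)

# High C: narrow margin, fewer margin violations
svm_high_c = make_pipeline(StandardScaler(), LinearSVC(C=100, random_state=42))
svm_high_c.fit(X, y)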
5. Nonlinear SVM Classification
Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features.
Consider the left plot in Figure 5: it represents a simple dataset with just one feature, x1. This dataset is not linearly separable. But if you add a second feature x2 = (x1)², the resulting 2D dataset (right plot) is perfectly linearly separable.
Figure 5. Adding features to make a dataset linearly separable
To implement this idea, create a pipeline containing PolynomialFeatures, followed by a StandardScaler and a LinearSVC. Let's test this on the moons dataset: a dataset for binary classification in which the data points are shaped as two interleaving half circles.
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Generate the moons dataset
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Polynomial features, then scaling, then a linear SVM classifier
polynomial_svm_clf = make_pipeline(
    PolynomialFeatures(degree=3),
    StandardScaler(),
    LinearSVC(C=10, max_iter=10_000, random_state=42)
)
polynomial_svm_clf.fit(X, y)
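Once fitted, the pipeline can be used like any other Scikit-Learn classifier; for example, predicting the classes of a couple of made-up points:
polynomial_svm_clf.predict([[0.5, 0.2], [-1.0, 1.0]])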
Polynomial Kernel
At a low polynomial degree, the model cannot capture complex patterns in the data, while a high polynomial degree creates a huge number of features and increases the computational cost. The solution is the kernel trick.
The kernel trick is a fundamental concept in machine learning, especially in the context of Support Vector Machines (SVMs). It enables SVMs to handle nonlinear data by transforming it into a high-dimensional space where a linear decision boundary can be found. Instead of explicitly mapping data points to this higher-dimensional space, the kernel function computes the dot product between the transformed points, saving computational resources and making complex tasks feasible.
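As a tiny numerical illustration of this (a hand-rolled example, not part of the notebook), the second-degree polynomial kernel K(a, b) = (a · b)² returns exactly the dot product you would get after explicitly mapping each 2D point with φ(x) = (x1², √2·x1·x2, x2²), without ever computing φ:
import numpy as np

def phi(x):
    # explicit degree-2 feature mapping of a 2D point
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(phi(a) @ phi(b))   # dot product in the mapped 3D space -> 121.0
print((a @ b) ** 2)      # polynomial kernel in the original 2D space -> 121.0
The SVC class applies this trick for us through its kernel hyperparameter: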
from sklearn.svm import SVC

# SVM classifier with a 10th-degree polynomial kernel
poly100_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="poly", degree=10, coef0=100, C=5)
)
poly100_kernel_svm_clf.fit(X, y)
The code above trains an SVM classifier with a 10th-degree polynomial kernel, represented on the right in Figure 7. On the left is another SVM classifier trained with a third-degree polynomial kernel.
If your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it.
Figure 7. SVM classifiers with a polynomial kernel
A common approach to finding the right hyperparameter values is to use grid search. It is often faster to first do a very coarse grid search, then a finer grid search around the best values found.
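For instance, a coarse search over the polynomial kernel's main hyperparameters on the moons dataset could look like this (the grid values below are illustrative, not the ones used for Figure 7):
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Coarse grid first; refine around the best values afterwards
param_grid = {
    "svc__degree": [2, 3, 10],
    "svc__coef0": [1, 10, 100],
    "svc__C": [0.1, 1, 5],
}
grid_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="poly")),
    param_grid,
    cv=3,
)
grid_search.fit(X, y)
print(grid_search.best_params_)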
Gaussian RBF Kernel
Let's try the SVC class with the Gaussian RBF kernel:
rbf_kernel_svm_clf = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", gamma=5, C=0.001)
)
rbf_kernel_svm_clf.fit(X, y)
The plots show models trained with different values of the hyperparameters gamma and C. Increasing gamma makes the bell-shaped curve narrower, so each instance's range of influence is smaller and the decision boundary ends up more irregular. Conversely, a small gamma makes the bell-shaped curve wider, so instances have a larger range of influence and the decision boundary ends up smoother. Gamma therefore acts like a regularization hyperparameter: if your model is overfitting, you should reduce it; if it is underfitting, you should increase it.
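To reproduce such a comparison yourself, you can train one RBF model per (gamma, C) combination on the moons dataset, for example with the illustrative values below:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One classifier per (gamma, C) pair; X, y is the moons dataset from above
rbf_classifiers = {}
for gamma in (0.1, 5):
    for C in (0.001, 1000):
        clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=C))
        clf.fit(X, y)
        rbf_classifiers[(gamma, C)] = clf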
6. SVM Regression
As mentioned earlier, the SVM algorithm is versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression.
To use SVMs for regression instead of classification, the trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street).
The width of the street is controlled by a hyperparameter Ɛ. Figure 9 shows two linear SVM Regression models trained on some random linear data, one with a large margin (Ɛ=1.5) and the other with a small margin (Ɛ=0.5).
Adding more training instances within the margin does not affect the model's prediction; thus, the model is said to be Ɛ-insensitive.
You can use Scikit-Learn's LinearSVR class to perform linear SVM Regression.
from sklearn.svm import LinearSVR

# X, y: the random linear training data shown in Figure 9
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
Figure 10 shows SVM Regression on a random quadratic training set, using a second-degree polynomial kernel. There is little regularization in the left plot (i.e., a large C value), and much regularization in the right plot (i.e., a small C value).
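A minimal sketch of such a model uses Scikit-Learn's SVR class, which supports the kernel trick. The hyperparameter values below are illustrative; the ones behind Figure 10 may differ.
from sklearn.svm import SVR

# X, y: a random quadratic training set like the one in Figure 10
# Small C = more regularization; large C = less regularization
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)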
7. Conclusion
Support Vector Machines are a powerful tool in the machine learning toolbox. Their ability to find effective decision boundaries and their versatility in handling different data types make them valuable for a wide range of applications.