Support Vector Machine for Binary Classification

an illustration of support vector machine and the kernel trick

After some discussions on classification, hopefully you have a good grasp of this task by now. So far, we have only mentioned a few versions of logistic regression, whereas there are many, many other models. So, in this post, let us learn about a very famous and elegant model called the Support Vector Machine (SVM). The model has a few variants; however, we will start with the simplest one: linear SVM for binary classification. So, let us dive in!

Toy example

We can better learn how SVM works with a simple example. This post focuses on explaining the concept, so I am not including any code here. However, you can get the complete notebook here if interested.

Suppose we collect data from 12 students on how they prepared for an exam. The features are their reading time for textbooks and their practice time on lab assignments. The label is whether they passed or failed the exam. The data can be completely visualized in the scatter plot below, with the two axes representing the features and the marker colors representing the results.

scatter plot for example 1

I will reuse this visualization multiple times throughout this post, but without the title and legend, for simplicity and clarity.

Classification decision boundary

A simple yet effective strategy to classify this data is to formulate a decision boundary that separates the two classes. In this case, the boundary can be a straight line going through the middle of the data, as below. Then, any instance above the line is predicted to be in class Passed, and any instance under the line, Failed.

simple classification strategy

However, in case you have not realized it, there is an infinite number of such lines for this data! Different classification models have different strategies for solving the classification task, yet most can be summarized as a decision boundary. In the figures below, from left to right, we have decision boundaries that are 1) drawn randomly, 2) from a decision tree (we will discuss this model in the future), and 3) from logistic regression.

random decision boundaries

Random boundaries

decision boundary of a decision tree

Decision tree

decision boundary of logistic regression

Logistic regression

So, among that infinite number of boundaries, which one is the best? How do we even define “best” here? That is the motivation for our linear support vector machine model.

Linear support vector machine

First, we define the margin as the distance between the two boundaries that are parallel to the decision line and go through the outermost instances of each class. A linear SVM then seeks the decision boundary that maximizes the margin between the two classes. An illustration of the class boundaries, margin, and decision line from SVM is shown below.

the optimization problem of support vector machine
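For readers who want a bit more precision, this margin maximization is usually written as the following optimization problem. This is a sketch in common textbook notation (with a weight vector w and bias b describing the decision line), not something taken from the figure above:

```latex
% Hard-boundary linear SVM: find the weight vector w and bias b of the line
% w^T x + b = 0 that separates the two classes with the widest margin.
\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\lVert \mathbf{w} \rVert^2
\qquad \text{subject to} \qquad
y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) \ge 1
\quad \text{for every training instance } i
```

Here each label y_i is encoded as +1 or -1, and since the margin width equals 2/‖w‖, minimizing ‖w‖ is the same as maximizing the margin.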

This way of optimizing allows a trained SVM to generalize very well to new data. Therefore, despite being a fairly old model, SVM is still among the state-of-the-art models for supervised learning. Now, back to our example: an SVM fits the decision boundary below. The red-bordered instances are those lying on the class boundaries and are called support vectors.

decision boundary of support vector machine
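Although this post stays code-free, here is a minimal sketch (not from the linked notebook, and with made-up study-hour numbers) of how fitting such a model and reading off its support vectors could look in scikit-learn:

```python
from sklearn.svm import SVC

# Hypothetical data: hours spent reading vs. practicing,
# and whether each student passed (1) or failed (0).
X = [[1, 2], [2, 1], [2, 3], [3, 2], [3, 1], [1, 1],
     [6, 5], [7, 7], [5, 6], [8, 5], [6, 7], [7, 6]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# A linear-kernel SVM looks for the maximum-margin separating line.
model = SVC(kernel="linear")
model.fit(X, y)

# The instances lying on the class boundaries are the support vectors.
print(model.support_vectors_)
```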

At first glance, the separation from SVM looks fairly similar to that of the previous logistic model. However, let us compare them all in the figure below. You can see that the class boundaries of logistic regression do not go through all the outermost instances of the classes, whereas those of linear SVM do. Therefore, the decision line of linear SVM is the optimal one here.

comparing decision boundaries of different models

The soft boundary problem

In the above example, any model can perfectly separate the data into the two classes. However, this is not usually the case with real-world data, which could be a bit messier, like the example below. As you can see, there is no straight line that can separate these two classes, yet in a linear SVM, the decision boundary is always a straight line.

non-separable data

This issue motivates the use of soft boundaries. In short, a softer boundary allows more instances to be inside the boundary region and become support vectors. The softness of SVM boundaries is controlled by a hyperparameter that we will discuss later. An example of soft and hard boundaries is shown below.

example of soft and hard boundaries
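To make the idea a bit more concrete, here is a sketch of the usual textbook formulation of the soft boundary (again, standard notation rather than anything specific to this post): each instance gets a slack variable that lets it violate the margin, and a penalty weight, commonly called C, controls the softness.

```latex
% Soft-boundary linear SVM: instances may violate the margin by xi_i, and the
% hyperparameter C controls how heavily those violations are penalized.
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\;
\frac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \xi_i
\qquad \text{subject to} \qquad
y_i \left( \mathbf{w}^\top \mathbf{x}_i + b \right) \ge 1 - \xi_i,
\quad \xi_i \ge 0
```

A smaller C tolerates more violations and gives a softer boundary with more support vectors; a larger C pushes the model back toward the hard boundary described earlier.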

Kernel support vector machine

A soft boundary alone is not enough to ensure the optimality of SVM. At times, the classes are better separated by a curve instead of a straight line, as illustrated below. In this case, we are better off with a kernel SVM.

non linearly separable data

A kernel SVM is a linear SVM that adapts to nonlinearly separable data using the kernel trick. Mathematics aside, roughly speaking, the kernel trick means using a kernel function to map the data into an implicit higher-dimensional space where instances of different classes are supposed to be more linearly separable. Here, “implicit” means that we do not actually know what that higher-dimensional space looks like; we only know the instances’ pairwise similarities, represented by the values of the kernel function. An example of the kernel trick is as follows.

an illustration of the kernel trick
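In symbols (just a sketch of the standard notation), a kernel function K reports the similarity of two instances as an inner product in that implicit space, without ever computing the mapping φ itself:

```latex
% The kernel trick: the mapping phi into the higher-dimensional space is never
% evaluated directly; only inner products in that space are needed, and the
% kernel function provides them.
K(\mathbf{x}, \mathbf{x}') = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x}') \rangle
```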

There are many kernel functions for SVM. Two of the more common ones are the polynomial kernel and the radial basis function (RBF) kernel. Both come with hyperparameters that we need to tune, so we will discuss them at a later time. For illustration purposes though, below are the boundaries and support vectors fitted by a linear SVM, an RBF kernel SVM, and a polynomial kernel SVM. You can see that both kernels successfully separate the two classes in this case, albeit with very different decision boundaries.

decision boundaries from different versions of support vector machine
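For completeness, here is another rough sketch (again, not from the linked notebook, using hypothetical ring-shaped data) of how switching between these variants could look in scikit-learn; degree and gamma are examples of the kernel hyperparameters mentioned above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Hypothetical nonlinearly separable data: two concentric rings of points.
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# The kernel is a single argument; gamma and degree are its hyperparameters.
linear_svm = SVC(kernel="linear")
rbf_svm = SVC(kernel="rbf", gamma="scale")
poly_svm = SVC(kernel="poly", degree=2)

for model in (linear_svm, rbf_svm, poly_svm):
    model.fit(X, y)
    print(model.kernel, "support vectors:", len(model.support_))
```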

The general case

All examples in this post use data with only two features, purely for simplicity and ease of visualization. Of course, SVM can model more dimensions than that. In higher-dimensional data, SVM works exactly the same: it seeks a decision boundary, now called a hyperplane, that optimally separates instances of the two classes while maximizing the margin. The soft boundary technique and the kernel trick also apply in the same way.

Wrapping up

In this post, we have learned about the support vector machine, a very elegant and powerful model for data analytics. Of course, you can see that we have not discussed using SVM on actual data at all. That will be the story for another day, since I just want to focus on explaining the concept of SVM in this post. So, until next time!
