Tuning Support Vector Machine

An illustration of tuning support vector machines and their different hyperparameters

Previously, we discussed support vector machines and related concepts in a fair amount of detail. Now, we will dig a bit deeper into this model. Of course, I will not ramble about all the mathematical details. However, any model must be tuned to achieve its best performance. So, in this post, we will discuss different aspects of tuning support vector machine models.

Data in demonstration

For demonstration, I will use the test_exam.csv data, which is similar to the previous one and includes students’ reading and practicing time for an exam, and whether they passed or failed. However, the data set is bigger and the two classes are not as separable. I am still focusing more on explanation in this post and will not include too much code. However, you can get the complete notebook here if interested. Also, the data is downloadable below.

We first load the important libraries and check a scatter plot of the data. You can see that the two groups are fairly well defined, but they overlap each other in a fairly wide region. Lastly, we extract the features as X and the label as y for convenience later on.

Tuning soft margin hyperparameter

Recall that soft margins allow the boundary area of an SVM to include some instances instead of being completely empty. In SKLearn, we fit an SVM for classification with the SVC model. The degree of soft margin is controlled by the hyperparameter C. By default, C=1, which yields the following boundaries in our data with a linear SVM model.

Lower values of C make the margin softer, i.e., more instances are allowed in the boundary area, while higher values of C make the margin harder, so fewer instances are allowed there. Below is an illustration of the boundaries from the same linear SVM model with different values of C. As you can see, varying C changes the margin, which in turn shifts the boundaries and the model performance as well. In this case, the linear SVM gets the highest CV accuracy, 94%, at C=0.001.
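A sweep over C like the one described above can be sketched as follows. This is only a minimal sketch under assumptions: it uses stand-in data from make_blobs instead of the exam data, and the list of C values is my own choice, so the accuracies it prints will not match the 94% reported here.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Stand-in data (assumption): two 2-D clusters, NOT the exam data from the post.
X, y = make_blobs(n_samples=300, centers=2, cluster_std=2.5, random_state=0)

# Candidate values of the soft margin hyperparameter (assumed for illustration).
c_values = [0.001, 0.01, 0.1, 1, 10, 100]

cv_accuracy = {}
for c in c_values:
    model = SVC(kernel="linear", C=c)
    # Mean accuracy over 5 cross-validation folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_accuracy[c] = scores.mean()

for c, acc in cv_accuracy.items():
    print(f"C={c:<7} CV accuracy={acc:.3f}")

# The C with the highest mean CV accuracy (first one wins on ties).
best_c = max(cv_accuracy, key=cv_accuracy.get)
print("Best C:", best_c)
```

Comparing the printed accuracies across C is the numeric counterpart of looking at how the margins and boundaries shift in the plots.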

Tuning kernel function

A linear SVM only needs its C tuned. However, we have established that, on many occasions, a linear SVM is not enough and we need a kernel SVM. The core of this model is the kernel trick, which uses a kernel function to form a nonlinear decision boundary. A kernel function takes two data instances as inputs and outputs a single number representing their similarity (a dot product in a transformed feature space). We usually denote kernel functions as K(x1,x2), with x1 and x2 being the two instances. There are different kernel functions, and each has its own set of hyperparameters. I will introduce the mathematical formula of each kernel so that you understand why they have such hyperparameters.

Polynomial kernel

Mathematically, the polynomial kernel has the following form

K(x_1, x_2) = (x_1\cdot x_2 + r)^d

Here, r is a constant and d is the polynomial degree. Both are tunable hyperparameters in SKLearn: coef0 for r and degree for d. However, degree is usually the more important hyperparameter to tune. Below is an illustration of the boundaries formed by 16 combinations of degree and coef0. You can see that the complexity of the boundaries increases with either coef0 or degree. The boundaries can become overcomplicated at very high degree and coef0. In this case, our polynomial SVM performs best at degree=2 with any value of coef0.
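To make the formula a bit more concrete, here is a minimal pure-Python sketch of the polynomial kernel. It also checks that, for d=2 and r=0 in two dimensions, the kernel value equals an ordinary dot product after an explicit feature mapping, which is the idea behind the kernel trick. The function names polynomial_kernel and degree2_features and the two sample points are made up for illustration. Note that SKLearn's poly kernel additionally scales the dot product by gamma, which this sketch leaves out to match the formula above.

```python
import math


def polynomial_kernel(x1, x2, r=0.0, d=2):
    """K(x1, x2) = (x1 . x2 + r)^d for two equal-length sequences."""
    dot = sum(a * b for a, b in zip(x1, x2))
    return (dot + r) ** d


def degree2_features(x):
    """Explicit feature map that matches the kernel with d=2, r=0 in 2-D."""
    a, b = x
    return (a * a, math.sqrt(2) * a * b, b * b)


# Two hypothetical instances (assumed values, not from the exam data).
x1 = (1.0, 2.0)
x2 = (3.0, 0.5)

# Kernel value computed directly from the original 2-D instances.
k_value = polynomial_kernel(x1, x2, r=0.0, d=2)

# Same quantity as a plain dot product in the mapped 3-D space.
explicit = sum(p * q for p, q in zip(degree2_features(x1), degree2_features(x2)))

print(k_value, explicit)
print(math.isclose(k_value, explicit))
```

The kernel never builds the mapped features, yet gives the same number, which is why higher degree values can produce more complex boundaries without the data ever being transformed explicitly.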

Of course, besides degree and coef0, we also need to tune C, which applies to all SVC models in SKLearn. At degree=2 and coef0=0, different values of C yield the following boundaries. C behaves similarly to the other hyperparameters in that a higher C raises the boundaries’ complexity.

Radial basis function kernel

Radial basis function (RBF), or Gaussian kernel, has the following formula

K(x_1, x_2) = \exp(-\gamma\left\Vert x_1 - x_2\right\Vert^2)

The only hyperparameter is γ (gamma in SKLearn). In theory, the RBF kernel can map data to an infinite-dimensional space and can handle highly nonlinear patterns. The transition of SVM boundaries with gamma and C is as follows. We can see that increasing gamma yields more complex decision boundaries. In fact, at very high gamma like 10 or 100, the boundaries wrap around every single instance, and the model performance becomes very poor. However, please note that this is the case only in this experiment. If you have data that is more difficult, a high gamma may be better, which again shows the need for model tuning.
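The effect of gamma can also be seen directly in the kernel values. The following is a small pure-Python sketch of the RBF kernel in its standard form, with a negative sign in the exponent. The function name rbf_kernel, the two sample points, and the gamma values are assumptions for illustration only.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """K(x1, x2) = exp(-gamma * ||x1 - x2||^2) for equal-length sequences."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


# Two hypothetical instances whose squared distance is 2 (assumed values).
x1 = (1.0, 2.0)
x2 = (2.0, 3.0)

# Similarity of the same pair of instances under increasing gamma.
similarities = {g: rbf_kernel(x1, x2, g) for g in (0.01, 0.1, 1, 10, 100)}

for g, value in similarities.items():
    print(f"gamma={g:<6} K={value:.6f}")
```

As gamma grows, the similarity between two distinct instances drops toward zero, so each training instance only influences its immediate neighborhood. That is consistent with the boundaries wrapping around individual instances at very high gamma.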

Sigmoid kernel

SKLearn provides another kernel function, the sigmoid kernel. In my opinion, this one is quite uncommon. Personally, I have never used it before. Regardless, let us see what it looks like for completeness. The sigmoid kernel is as follows.

K(x_1, x_2) = \tanh(\gamma (x_1\cdot x_2) + r)

Here, γ and r are hyperparameters, represented by gamma and coef0 in SKLearn. Below are some examples of SVM boundaries from different values of gamma and coef0 with the sigmoid kernel. The boundaries get odd very fast as gamma or coef0 increases. In fact, there are only two combinations of gamma and coef0 that yield good accuracy. So, I keep my opinion that this is not a common kernel, and I will continue not using it in the future.
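For completeness, here is a short pure-Python sketch of the sigmoid kernel formula. The function name sigmoid_kernel and the sample points are hypothetical. The printout only illustrates how the kernel value behaves numerically as gamma and coef0 change. It is not meant as a full explanation of the odd boundaries.

```python
import math


def sigmoid_kernel(x1, x2, gamma, r):
    """K(x1, x2) = tanh(gamma * (x1 . x2) + r) for equal-length sequences."""
    dot = sum(a * b for a, b in zip(x1, x2))
    return math.tanh(gamma * dot + r)


# Two hypothetical instances with dot product 4 (assumed values).
x1 = (1.0, 2.0)
x2 = (3.0, 0.5)

# Kernel values over a small grid of gamma and r (coef0 in SKLearn).
values = {
    (gamma, r): sigmoid_kernel(x1, x2, gamma, r)
    for gamma in (0.01, 0.1, 1, 10)
    for r in (0, 1)
}

for (gamma, r), value in values.items():
    print(f"gamma={gamma:<5} coef0={r}  K={value:.4f}")
```

Because tanh flattens out quickly, the kernel value for this pair is already very close to 1 at moderate gamma or coef0, so the kernel stops distinguishing between different degrees of similarity.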

Tuning support vector machine

Finally, let us see how to tune an SVM with SKLearn. This turns out to be very easy because the process is exactly the same as for any model that we have learned. We create a parameter grid, then a grid search, and finally fit it on the data. The only difference is the grid, which now includes the kernel functions and their hyperparameters. One thing with SVM is that it is a complex model that takes quite long to train, so I would try to keep the number of hyperparameter values down. In fact, most of the time, I will just straight up use the RBF kernel and tune C and gamma. Regardless, if you want to tune kernel functions as well, the code may look like below.
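The following is a minimal sketch of such a grid search, not the exact code from the notebook. It assumes stand-in data from make_blobs in place of the exam data, and the hyperparameter values in the grid are my own choices, kept small on purpose. Using a list of dictionaries lets each kernel carry only its own hyperparameters.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data (assumption): two 2-D clusters, NOT the exam data from the post.
X, y = make_blobs(n_samples=300, centers=2, cluster_std=2.5, random_state=0)

# One sub-grid per kernel, so irrelevant hyperparameters are not combined.
# All values below are assumed for illustration.
c_values = [0.01, 0.1, 1, 10]
param_grid = [
    {"kernel": ["linear"], "C": c_values},
    {"kernel": ["poly"], "C": c_values, "degree": [2, 3], "coef0": [0, 1]},
    {"kernel": ["rbf"], "C": c_values, "gamma": [0.01, 0.1, 1]},
    {"kernel": ["sigmoid"], "C": c_values, "gamma": [0.01, 0.1], "coef0": [0, 1]},
]

# 5-fold cross-validated grid search scored by accuracy.
grid_search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid_search.fit(X, y)

print("Best hyperparameters:", grid_search.best_params_)
print("Best CV accuracy:", round(grid_search.best_score_, 3))
```

When several combinations tie on the best CV accuracy, the grid search reports the one that appears first in the grid, so the order of the sub-grids matters for which model is selected.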

In the previous tests, we have seen that all kernels can reach a maximum of 94% CV accuracy. The same thing happened here. The selected model is the linear SVM simply because it is the first one on the list.

Conclusion

In this post, we discussed tuning support vector machines in more detail. We went through hyperparameters including the soft margin C and different kernel functions. In practice though, you can probably just use the RBF kernel and tune gamma and C for most data sets. One final note is that, due to its complexity, SVM is not a good choice if your data has more than about 10,000 instances. You may have to sample a smaller data set or use a different training strategy so that your SVMs tune in a reasonable time. With that note, I will conclude my post here. See you next time!