Previously, we discussed the support vector machine and related concepts in a fair amount of detail. Now, we will dig a bit deeper into this model. I will not ramble about all the mathematical details, but any model must be tuned to achieve its best performance. So, in this post, we will discuss different aspects of tuning support vector machine models.
Data in demonstration
For demonstration, I will use the test_exam.csv data. It is similar to the previous one and includes students’ reading and practicing time for an exam, and whether they passed or failed. However, the data set is bigger and the two classes are not as separable. I am still focusing more on explanation in this post and will not include too much code. However, you can get the complete notebook here if interested. Also, the data is downloadable below.
We first load the important libraries and check a scatter plot of the two features, colored by the result. You can see that the two groups are fairly well defined, but they overlap each other in a fairly wide region. Lastly, we extract the features as X and the label as y for convenience later on.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
data = pd.read_csv('test_exam.csv')
data.head(2)
| | Reading | Practicing | Result |
|---|---|---|---|
| 0 | 5.773836 | 8.061479 | 1 |
| 1 | 8.658042 | 7.952046 | 1 |
plt.figure(figsize=(5,5))
plt.scatter(data['Reading'],data['Practicing'],c=data['Result'],s=50)
plt.show()
X = data[['Reading','Practicing']].values
y = data['Result'].values
Tuning soft margin hyperparameter
Recall that soft margins allow the boundary area of an SVM to include some instances instead of being completely empty. In SKLearn, we fit an SVM for classification with the SVC model. The degree of soft margin is controlled by the hyperparameter C. By default, C=1, which yields the following boundaries in our data with a linear SVM model.
Lower values of C make the margin softer, i.e., more instances can fall inside it, while higher C means harder margins, so fewer instances are allowed in the boundary area. Below is an illustration of boundaries from the same linear SVM model with different C. As you can see, varying C changes the margin, which in turn shifts the boundaries and changes the model performance as well. In this case, the linear SVM gets the highest CV accuracy, 94%, at C=0.001.
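A comparison like the one above can be sketched with cross-validation. The snippet below is only a minimal sketch: it uses synthetic two-feature data as a stand-in for test_exam.csv (the names X_demo, y_demo, and the cluster centers are my assumptions, not the post's data), so the scores it prints will not match the 94% reported here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for test_exam.csv (an assumption, not the post's data):
# two overlapping Gaussian clouds over "reading" and "practicing" hours.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[4.0, 4.0], scale=1.5, size=(100, 2)),  # class 0 (failed)
    rng.normal(loc=[7.0, 7.0], scale=1.5, size=(100, 2)),  # class 1 (passed)
])
y_demo = np.array([0] * 100 + [1] * 100)

# Mean 5-fold CV accuracy of a linear SVM for each candidate C.
c_values = [0.001, 0.01, 0.1, 1, 10, 100]
cv_scores = {}
for c in c_values:
    model = SVC(kernel='linear', C=c)
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[c] = scores.mean()

# Pick the C with the highest mean CV accuracy (first one wins on ties).
best_c = max(cv_scores, key=cv_scores.get)
for c, score in cv_scores.items():
    print(f"C={c:<7} mean CV accuracy={score:.3f}")
print("best C:", best_c)
```

With the real data, X_demo and y_demo would simply be replaced by the X and y extracted earlier.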
Tuning kernel function
A linear SVM only needs its C tuned. However, we have established that, on multiple occasions, a linear SVM is not enough and we need a kernel SVM. The core of this model is the kernel trick, which uses a kernel function to form a nonlinear decision boundary. These functions take two data instances as inputs and output a single number representing their similarity (a dot product in a transformed feature space). We usually denote kernel functions as K(x1,x2) with x1 and x2 being the two instances. There are different kernel functions, and each has its own set of hyperparameters. I will introduce the mathematical formula of each kernel so that you understand why they have such hyperparameters.
Polynomial kernel
Mathematically, the polynomial kernel has the following form
K(x1,x2) = (x1 · x2 + r)^d
with r being a constant and d being the polynomial degree. Both are tunable parameters in SKLearn: coef0 for r and degree for d. However, degree is usually the more important hyperparameter to tune. Below is an illustration of the boundaries formed by 16 combinations of degree and coef0. You can see that the complexity of the boundaries increases with either coef0 or degree. The boundaries can become overcomplicated at very high degree and coef0. In this case, our polynomial SVM performs the best at degree=2, for any value of coef0.
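To make the roles of r and d concrete, here is a small sketch that evaluates the polynomial kernel K(x1,x2) = (x1 · x2 + r)^d by hand and compares it with scikit-learn's polynomial_kernel helper. The function poly_kernel and the two sample points are made up for illustration. Note that scikit-learn additionally multiplies the dot product by gamma, so gamma=1 is passed to match the simpler formula.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def poly_kernel(x1, x2, r=1.0, d=2):
    """K(x1, x2) = (x1 . x2 + r)^d for two 1-D feature vectors."""
    return (np.dot(x1, x2) + r) ** d

# Two made-up instances: (reading hours, practicing hours).
x1 = np.array([5.0, 8.0])
x2 = np.array([8.0, 7.0])

# By hand: x1 . x2 = 5*8 + 8*7 = 96, so K = (96 + 1)^2.
manual = poly_kernel(x1, x2, r=1.0, d=2)

# scikit-learn computes (gamma * x1 . x2 + coef0)^degree;
# gamma=1 switches off the extra scaling of the dot product.
library = polynomial_kernel(
    x1.reshape(1, -1), x2.reshape(1, -1), degree=2, gamma=1, coef0=1
)[0, 0]

print(manual, library)
```

Raising d or r makes far-apart instances look disproportionately more similar, which is one way to read the growing boundary complexity described above.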
Of course, besides degree and coef0, we also need to tune C, which is universal for SVC models in SKLearn. At degree=2 and coef0=0, different values of C yield the following boundaries. It is similar to the other hyperparameters in that higher C raises the boundaries’ complexity.
Radial basis function kernel
The radial basis function (RBF), or Gaussian kernel, has the following formula
K(x1,x2) = exp(-γ ||x1 - x2||^2)
with only γ as the hyperparameter. In theory, the RBF kernel can map data to an infinite-dimensional space and can handle highly nonlinear patterns. The transition of SVM boundaries with gamma and C is as follows. We can see that increasing gamma yields more complex decision boundaries. In fact, at very high gamma like 10 or 100, the boundaries wrap around every single instance, and the model performance becomes very poor. However, please note that this is the case only in this experiment. If you have data that is more difficult, a high gamma may be better, which again shows the need for model tuning.
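One way to see the effect of gamma without a plot is to compare training accuracy with cross-validated accuracy. The sketch below again runs on synthetic stand-in data (X_demo and y_demo are my assumptions, not the exam data). When the boundaries wrap around individual instances, training accuracy should stay high while CV accuracy drops.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the exam data (an assumption, not the post's data).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[4.0, 4.0], scale=1.5, size=(100, 2)),  # class 0
    rng.normal(loc=[7.0, 7.0], scale=1.5, size=(100, 2)),  # class 1
])
y_demo = np.array([0] * 100 + [1] * 100)

gammas = [0.001, 0.01, 0.1, 1, 10, 100]
train_acc, cv_acc = {}, {}
for g in gammas:
    # return_train_score=True also reports accuracy on the training folds.
    scores = cross_validate(
        SVC(kernel='rbf', gamma=g, C=1), X_demo, y_demo,
        cv=5, scoring='accuracy', return_train_score=True,
    )
    train_acc[g] = scores['train_score'].mean()
    cv_acc[g] = scores['test_score'].mean()
    print(f"gamma={g:<6} train={train_acc[g]:.3f} cv={cv_acc[g]:.3f}")
```

A widening gap between the two columns as gamma grows is the numerical counterpart of the overly wiggly boundaries in the illustration.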
Sigmoid kernel
SKLearn provides another kernel function, which is sigmoid. In my opinion, this one is quite uncommon. Personally, I have never used it before. Regardless, let us see what it looks like for completeness. The sigmoid kernel is as follows.
K(x1,x2) = tanh(γ x1 · x2 + r)
with γ and r being hyperparameters represented by gamma and coef0 in SKLearn. Below are some examples of SVM boundaries from different values of gamma and coef0 with the sigmoid kernel. The boundaries get odd very fast as gamma or coef0 increases. In fact, there are only two combinations of gamma and coef0 that yield good accuracy. So, I keep my opinion that this is not a common kernel, and I will continue not to use it in the future.
Tuning support vector machine
Finally, let us see how to tune an SVM with SKLearn. This turns out to be very easy because the process is exactly the same as for any model that we have learned. We create a parameter grid, then a grid search, and finally fit it on the data. The only difference is the grid, which now includes the kernel functions and their hyperparameters. One thing with SVM is that it is a complex model that takes quite long to train, so I would try to keep the number of hyperparameter values down. In fact, most of the time, I will just straight up use the RBF kernel and tune C and gamma. Regardless, if you want to tune kernel functions as well, the code may look like below.
from sklearn.model_selection import GridSearchCV
# one dictionary per kernel, so each kernel is only combined with its own hyperparameters
param_grid = [
{'kernel':['linear'], 'C' : [0.01, 0.1, 1, 10, 100]},
{'kernel':['poly'], 'degree' : [3, 4], 'coef0' : [1, 10], 'C' : [0.01, 0.1, 1, 10, 100]},
{'kernel':['rbf'], 'gamma' : [0.001, 0.01, 0.1, 1, 10, 100], 'C' : [0.001, 0.01, 0.1, 1, 10, 100]}
]
# cross-validated grid search scored by accuracy
gridsearch = GridSearchCV(SVC(),param_grid,scoring='accuracy')
gridsearch.fit(X,y)
In the previous tests, we have seen that all kernels can reach a maximum of 94% CV accuracy. The same thing happened here. The selected model is a linear SVM simply because, among candidates with tied scores, the grid search keeps the first one on the list.
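If you want to check for such ties yourself, the cv_results_ attribute of a fitted grid search lists every candidate with its rank. The snippet below is a small sketch on synthetic stand-in data with a deliberately tiny grid (X_demo, y_demo, and small_grid are my assumptions), so it may or may not produce a tie. It only shows where to look.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the exam data (an assumption, not the post's data).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[4.0, 4.0], scale=1.5, size=(100, 2)),
    rng.normal(loc=[7.0, 7.0], scale=1.5, size=(100, 2)),
])
y_demo = np.array([0] * 100 + [1] * 100)

# A deliberately tiny grid so the results table stays readable.
small_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1]},
    {'kernel': ['rbf'], 'gamma': [0.01, 0.1], 'C': [0.1, 1]},
]
search = GridSearchCV(SVC(), small_grid, scoring='accuracy')
search.fit(X_demo, y_demo)

# Candidates sharing the best mean CV score all receive rank 1.
results = pd.DataFrame(search.cv_results_)
tied = results[results['rank_test_score'] == 1]
print(tied[['params', 'mean_test_score']])
print('selected:', search.best_params_, 'at row', search.best_index_)
```

If several rows appear in tied, the selected candidate is the one that comes first in the grid, so the order of the dictionaries in the grid decides which kernel wins a tie.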
plt.figure(figsize=(5,5))
# draw_svm is a plotting helper that is not shown in this post
draw_svm(X,y,gridsearch.best_estimator_,'linear kernel')
plt.show()
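The draw_svm helper is not defined in this post. The version below is a hypothetical stand-in that I wrote to match the call above (same argument order), not the author's implementation. It assumes two features and a binary SVC, and it draws the decision boundary together with the two margin lines on the current axes. The fitted model and data at the bottom are synthetic and only there to exercise the function.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

def draw_svm(X, y, model, title=''):
    """Hypothetical stand-in for the post's helper: plot the data, the
    decision boundary, and the margins of a fitted two-feature binary SVC."""
    ax = plt.gca()
    ax.scatter(X[:, 0], X[:, 1], c=y, s=50)

    # Evaluate the decision function on a grid covering the data.
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    grid_points = np.c_[xx.ravel(), yy.ravel()]
    zz = model.decision_function(grid_points).reshape(xx.shape)

    # Level 0 is the decision boundary; levels -1 and +1 are the margins.
    ax.contour(xx, yy, zz, levels=[-1, 0, 1], colors='k',
               linestyles=['--', '-', '--'])
    ax.set_xlabel('Reading')
    ax.set_ylabel('Practicing')
    ax.set_title(title)
    return ax

# Synthetic data and model, only to exercise the helper.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[4.0, 4.0], scale=1.5, size=(100, 2)),
    rng.normal(loc=[7.0, 7.0], scale=1.5, size=(100, 2)),
])
y_demo = np.array([0] * 100 + [1] * 100)
demo_model = SVC(kernel='linear', C=1).fit(X_demo, y_demo)

plt.figure(figsize=(5, 5))
ax = draw_svm(X_demo, y_demo, demo_model, 'linear kernel')
```

Calling plt.show() afterwards, as in the code above, displays the figure.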
Conclusion
In this post, we discussed tuning support vector machines in more detail. We went through hyperparameters including the soft margin C and different kernel functions. In practice though, you can probably just use the RBF kernel and tune gamma and C for most data sets. One final note is that training time for a kernel SVM grows at least quadratically with the number of instances, so if your data has more than about 10,000 instances, it is not a good idea to use SVM directly. You may have to sample a smaller data set or use a different training strategy so that your SVMs tune in a reasonable time. With that note, I will conclude my post here. See you next time!
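As a rough sketch of the sampling idea, the snippet below draws a stratified random subset from a large data set and runs the grid search on that subset only. The data is synthetic, and the subset size of 1,000 and the small grid are arbitrary choices of mine for illustration, not recommendations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic "large" data set with 20,000 instances (an assumption).
rng = np.random.default_rng(0)
n_per_class = 10_000
X_big = np.vstack([
    rng.normal(loc=[4.0, 4.0], scale=1.5, size=(n_per_class, 2)),
    rng.normal(loc=[7.0, 7.0], scale=1.5, size=(n_per_class, 2)),
])
y_big = np.array([0] * n_per_class + [1] * n_per_class)

# Keep a stratified random sample of 1,000 instances for tuning;
# stratify=y_big preserves the class proportions in the sample.
X_small, _, y_small, _ = train_test_split(
    X_big, y_big, train_size=1000, stratify=y_big, random_state=0
)

# Tune the RBF kernel on the small sample only.
grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel='rbf'), grid, scoring='accuracy', cv=3)
search.fit(X_small, y_small)
print(search.best_params_, round(search.best_score_, 3))
```

The hyperparameters found on the sample are only a starting point. Whether they transfer to the full data depends on how representative the sample is.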