Support Vector Regression

(Figure: an illustration of support vector regression models on different data)

We have been discussing support vector machines for a few posts now, and, good news, we will keep doing so for a while. This is just an amazing family of models with so much to talk about. Let us switch gears a bit, though. In this post, I will introduce the regression version of SVM: the support vector regression (SVR) model. SVR is actually quite different from its classification counterpart (SVC) in what it tries to achieve, so let us dive in.

Support vector regression concept

In SVR models, we still have the concepts of a hyperplane, a decision margin, and support vectors. However, they are defined very differently. Conceptually, SVR tries to find a decision margin that
1) contains as many training instances as possible while remaining as flat as possible, and
2) yields minimal prediction errors for the support vectors, which are the instances that lie outside the margin.
The figure below roughly illustrates the optimization goal of an SVR model.

(Figure: an illustration of the optimization goal of support vector regression)
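For reference, this goal corresponds to the standard ε-insensitive formulation of SVR (the form scikit-learn implements). Written out, with slack variables ξ and ξ* measuring how far an instance falls outside the ε-tube, it reads:

$$
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*}\quad & \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right)\\
\text{subject to}\quad & y_i - \left(w^\top x_i + b\right) \le \varepsilon + \xi_i,\\
& \left(w^\top x_i + b\right) - y_i \le \varepsilon + \xi_i^*,\\
& \xi_i,\ \xi_i^* \ge 0 .
\end{aligned}
$$

Minimizing the norm of w keeps the function flat, while the slack terms penalize instances that fall outside the tube, which matches goals 1) and 2) above.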

Toy example

First, let us start with a small example that demonstrates the robustness of SVR. For this, I am using a simple data set that has students' study time for a test as the feature and their grade as the label. You can download the data below and access the complete notebook on my GitHub.

A scatter plot of the data is shown below. Besides the obvious trend of more study time giving better grades, we have some outliers in the top-left and bottom-right corners.
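If you want to follow along, here is a minimal sketch of fitting a linear-kernel SVR with scikit-learn. The file name and column names ("study_time" and "grade") are my assumptions; adjust them to match the downloaded CSV.

```python
# Minimal sketch: fit a linear-kernel SVR to the toy study-time data.
# The CSV name and column names are assumptions; adjust to your download.
import pandas as pd
from sklearn.svm import SVR

df = pd.read_csv("study_time_grade.csv")
X = df[["study_time"]]   # scikit-learn expects a 2-D feature array
y = df["grade"]

svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
print("number of support vectors:", len(svr.support_))
print("predicted grade for 5 hours of study:", svr.predict([[5]])[0])
```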

Next, we will demonstrate the first hyperparameter of SVR – epsilon.

Decision margin and ε

In SVR, the decision margin is no longer the buffer between instances of two classes (we do not have class targets in regression!). It is rather a tube that contains the instances. Also unlike SVC, the width of the margin in SVR is not the target of optimization. Rather, it is set with the hyperparameter ε (the Greek letter epsilon). ε represents the maximum error an instance can have while still being inside the margin. In SVR, only the support vectors, which are the instances outside the margin, contribute to the formulation of the prediction.

Below is an illustration of the SVR margins for different values of ε. In the scatter plots, blue markers are regular instances, support vectors get a red border, red lines represent the SVR's predictions, and dashed lines mark the SVR's margin. As you can see, a very low ε results in a narrow tube that contains very few instances, which in turn makes most instances support vectors. As ε increases, the margin gets wider and contains more data, meaning fewer support vectors. However, with too few support vectors, the decision hyperplane is influenced more by outliers and fails to capture the correct pattern.
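To see this effect in code, the sketch below refits the linear SVR with increasing ε and counts the remaining support vectors (same assumed CSV and column names as before).

```python
# Sketch: the number of support vectors shrinks as epsilon grows.
import pandas as pd
from sklearn.svm import SVR

df = pd.read_csv("study_time_grade.csv")   # assumed file/column names
X, y = df[["study_time"]], df["grade"]

for eps in [0.0, 0.1, 1.0, 5.0, 10.0]:
    svr = SVR(kernel="linear", C=1.0, epsilon=eps).fit(X, y)
    # Only instances outside the epsilon-tube become support vectors.
    print(f"epsilon={eps:>4}: {len(svr.support_)} support vectors")
```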

In general, a low ε is better for error, but it increases the complexity of your model due to the large number of support vectors. Regardless, if you only care about good predictions, an ε of 0 or 0.1 is totally fine.

Kernel SVR

Just like with SVC, we can adapt SVR to handle nonlinear patterns in the data with the kernel trick. Here, the kernel trick aims to map the original data to a feature space where the relationship between the features and the target is "more linear".

Data for the demonstration

For the demonstration of kernel SVR, I will use a weekly version of the daily temperatures in Melbourne data set (available on Kaggle).

The original data include the minimum temperature for every day from 1/1/1981 to 12/31/1990. For simplicity, I aggregate the data to the weekly level. Below is the scatter plot of week number (1 to 53) against temperature. We can see a strong relationship which is obviously not linear.
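A rough sketch of this aggregation with pandas is below; I am assuming the CSV has "Date" and "Temp" columns, so rename them as needed.

```python
# Sketch: aggregate the daily minimum temperatures to one value per
# (year, week) and use the week number (1-53) as the single feature.
# The file and column names ("Date", "Temp") are assumptions.
import pandas as pd

daily = pd.read_csv("daily-minimum-temperatures.csv", parse_dates=["Date"])
iso = daily["Date"].dt.isocalendar()
daily["year"] = iso.year.astype(int)
daily["week"] = iso.week.astype(int)

weekly = daily.groupby(["year", "week"])["Temp"].mean().reset_index()
X = weekly[["week"]]   # feature: week of the year
y = weekly["Temp"]     # target: mean weekly minimum temperature
```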

Kernel SVR and ε

The concept of ε stays the same in kernel SVR models: it is the maximum distance (error) of an instance from its prediction such that the instance is still within the margin. A higher ε means a wider tube that contains more instances and produces fewer support vectors, and vice versa. Below is an illustration of the default kernel SVR with different values of ε. For this data, an ε of 2.5 still maintains an R2 equal to that of lower values while having considerably fewer support vectors. Above that, the SVR gradually fails to capture the correct patterns.
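As a quick sketch, reusing X (week number) and y (temperature) from the aggregation snippet above, you could compare a few ε values like this:

```python
# Sketch: vary epsilon for the default RBF-kernel SVR on the weekly data.
# Reuses X and y from the aggregation sketch above.
from sklearn.svm import SVR
from sklearn.metrics import r2_score

for eps in [0.1, 1.0, 2.5, 5.0, 10.0]:
    svr = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y)
    r2 = r2_score(y, svr.predict(X))
    print(f"epsilon={eps:>4}: R2={r2:.3f}, support vectors={len(svr.support_)}")
```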

Polynomial kernel

Regardless of regression or classification, the polynomial kernel is the same. We still have two hyperparameters, degree and coef0, that need finetuning. As we discussed previously, increasing either one leads to higher model complexity, which is needed if the data is very nonlinear.

For this temperature data, we have to use a fairly complex model at degree=4 and coef0=10 to get the best possible R2, as illustrated below. Of course, this is just my demonstration for degree and coef0 with fixed epsilon and C (yes, C is still a hyperparameter to tune in SVR).
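A sketch of the polynomial-kernel fit with those values, holding epsilon=2.5 and C=1 fixed (my own choices here) and reusing X and y from before, looks like this:

```python
# Sketch: polynomial-kernel SVR with the degree/coef0 pair discussed above.
# Epsilon and C are held fixed; X and y come from the earlier sketch.
from sklearn.svm import SVR
from sklearn.metrics import r2_score

poly_svr = SVR(kernel="poly", degree=4, coef0=10, C=1.0, epsilon=2.5).fit(X, y)
print("training R2:", round(r2_score(y, poly_svr.predict(X)), 3))
```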

RBF kernel

Like the polynomial kernel, nothing changes in the RBF kernel when moving from SVC to SVR. We still have the hyperparameter gamma. Below is an illustration of different gamma and C values on our temperature data. In this demonstration, the RBF kernel yields the best performance at gamma=0.01 and C=1 or 10. It also seems that C impacts the model more than gamma in this case: all models failed at C=0.01, then become acceptable (albeit with some overfitting) at C of 1 and above.
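A small scan over gamma and C in the same spirit as the illustration might look like the sketch below; the grid values are my own picks, and X and y again come from the earlier aggregation sketch.

```python
# Sketch: scan gamma and C for the RBF-kernel SVR on the weekly data.
# The grid values are illustrative assumptions.
from sklearn.svm import SVR
from sklearn.metrics import r2_score

for C in [0.01, 1, 10]:
    for gamma in [0.001, 0.01, 0.1]:
        svr = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=2.5).fit(X, y)
        print(f"C={C:>5}, gamma={gamma}: R2={r2_score(y, svr.predict(X)):.3f}")
```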

Tuning SVR

Finally, let us try tuning an SVR model. The process is virtually identical to SVC, except for the metric used in the grid search, which in this case is r2. I simply reuse the epsilon=2.5 found previously. Tuning with R2 would almost always select the lowest epsilon anyway, since a higher epsilon mostly just improves (lowers) the number of support vectors.
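A grid-search sketch along those lines is below; the parameter grid is an illustrative assumption on my part, and X and y are the weekly data from before.

```python
# Sketch: tune an SVR with GridSearchCV, scoring with R2 and keeping
# epsilon fixed at 2.5. The parameter grid below is an illustrative choice.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "kernel": ["rbf", "poly"],
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1],
}
search = GridSearchCV(SVR(epsilon=2.5), param_grid, scoring="r2", cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best CV R2:", round(search.best_score_, 3))
```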

Conclusion

With all this discussion, I hope you now have a good idea of support vector regression models. As we have seen, they are very powerful models that can capture nonlinearity very well. One final note: SVR, at the end of the day, is still a support vector model. This means that it does not scale too well to big data. So, if you have data at sizes around tens of thousands of instances or more, consider other models or resample your data. And with that, I will stop this post here. See you again!