With the recent posts about support vector machines, you have hopefully gained a good understanding of this elegant model. However, we have not actually performed any modeling with SVM on real data yet. So, in this post, I will walk through two examples of building a complete support vector machine pipeline: one for classification and one for regression. Let us start.
Support vector machine pipeline for classification
We start with the classification example. We will reuse the heart_disease data from Kaggle. We discussed the exploratory analysis of this data before, so I am not including it here. The target in this data is HeartDisease, which indicates whether the patient had a heart issue or not. The complete notebook is available on my GitHub. First, we load the data, check a few rows, and perform a train-test split.
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
data = pd.read_csv('heart_disease.csv')
data.head(n=2)
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.25)
Processing pipeline
The processing pipeline for this data is fairly typical. All numeric columns undergo imputation and standardization. Additionally, Cholesterol and RestingBP contain values of 0, which make no sense medically, so I first change them to missing so they can be imputed later. Categorical columns only go through one-hot encoding.
num_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
cat_cols = ['Sex', 'ChestPainType','RestingECG', 'ExerciseAngina', 'ST_Slope']
target = 'HeartDisease'
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
def remove_0(X):
    # work on a copy so we do not modify the caller's DataFrame in place
    X = X.copy()
    # a value of 0 is not medically plausible here; mark as missing for the imputer
    X.loc[X['Cholesterol'] == 0, 'Cholesterol'] = np.nan
    X.loc[X['RestingBP'] == 0, 'RestingBP'] = np.nan
    return X
num_pipeline = Pipeline([
('remove 0', FunctionTransformer(remove_0, validate=False)),
('impute', SimpleImputer(strategy='median')),
('standardize', StandardScaler())
])
cat_pipeline = Pipeline([
('encode', OneHotEncoder())
])
process_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
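Before wiring the transformer into a model, it does not hurt to sanity-check it on its own. The quick check below is my addition to the walkthrough (not in the original notebook); it fits the processing pipeline on the training set and prints the shape of the resulting design matrix:
# illustrative sanity check: fit the processing pipeline alone
# and confirm the transformed matrix has the expected shape
X_check = process_pipeline.fit_transform(train)
print(X_check.shape)  # (n training rows, numeric columns + one-hot columns)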
Modeling pipeline
After this point, we can reuse the code from the post on tuning SVC. One small difference is that I combine the SVC and the processing step built earlier into a complete modeling pipeline, so everything can be trained and applied at once. Next, we create a parameter grid for a grid search. The hyperparameters below are fairly typical for SVC; you can add more if you like. Finally, we build the grid search object and finish this step.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svc = Pipeline([
('processing', process_pipeline),
('svc', SVC())
])
param_grid = [
{'svc__kernel':['linear'],
'svc__C' : [0.001, 0.1, 1, 10, 100]},
{'svc__kernel':['poly'],
'svc__degree' : [2, 3, 4],
'svc__coef0' : [0, 1, 10],
'svc__C' : [0.001, 0.1, 1, 10, 100]},
{'svc__kernel':['rbf'],
'svc__gamma' : [0.001, 0.01, 0.1, 1, 10, 100, 1000],
'svc__C' : [0.001, 0.01, 0.1, 1, 10, 100]}
]
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', return_train_score=True)
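One thing to keep in mind is that the cost of a grid search grows multiplicatively with the grid. As a rough sketch (my addition, not from the original notebook), you can count the candidate combinations up front, and pass n_jobs=-1 to GridSearchCV if you want to use every CPU core:
from sklearn.model_selection import ParameterGrid
# each candidate combination is fit cv=5 times during the search
print(len(ParameterGrid(param_grid)))  # 92 candidates here, i.e. 460 model fits
# to parallelize the search across all cores, construct it with n_jobs=-1:
# GridSearchCV(svc, param_grid, cv=5, scoring='accuracy',
#              return_train_score=True, n_jobs=-1)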
Training and testing
To perform the grid search, simply call fit(). Depending on how many hyperparameter values you include, this step may take a bit of time. After that though, we have a fully trained SVC pipeline that can be applied to any data (of the same shape as ours, of course). If curious, we can check the best hyperparameter values with best_params_ on the grid search object. Similarly, the best_score_ attribute gives us the CV score (accuracy in this case) of the best model, and best_estimator_ gives us the best pipeline object. For scoring alone, we can call score() or predict() on the grid search directly. Overall, we end up with 87.35% CV training accuracy and 87.8% testing accuracy.
grid_search.fit(train, train[target])
print(grid_search.best_params_)
print(grid_search.best_score_)
{'svc__C': 1, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}
0.8735216333439121
grid_search.score(test,test[target])
0.8782608695652174
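Since the winning pipeline bundles preprocessing and the classifier together, it is convenient to persist the whole thing for later use. A minimal sketch with joblib (the file name is just an illustration):
import joblib
# save the fitted pipeline (preprocessing + SVC) to disk
joblib.dump(grid_search.best_estimator_, 'svc_pipeline.joblib')
# later, reload it and predict on new data; the same preprocessing is applied
# model = joblib.load('svc_pipeline.joblib')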
Support vector machine pipeline for regression
For this example, we will use the auto-mpg data from the UCI Machine Learning Repository. Like the heart_disease data, we discussed an exploratory analysis of this data already, so I am skipping it here. The complete notebook is available on my GitHub. The target in this data is mpg, the miles-per-gallon of the cars. First, we load the data and split it into training and testing sets.
data = pd.read_csv('auto-mpg.csv')
data.head(2)
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.25)
Processing pipeline
Unlike the previous example, this data set needs no special handling, so the processing is entirely typical: we perform imputation and standardization on the numeric columns, and one-hot encoding on the only categorical column. There is nothing else notable here.
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#pipeline for numeric features
#we need to impute horsepower
num_pipeline = Pipeline([
('impute', SimpleImputer(strategy='median')),
('standardize', StandardScaler())
])
#pipeline for class features
cat_pipeline = Pipeline([
('encoder', OneHotEncoder())
])
#full pipeline - combine numeric and class pipelines
process_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
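As before, you can optionally peek at what the transformer produces. The check below is my addition and assumes a recent scikit-learn (roughly 1.1+) so that get_feature_names_out() works through the nested pipelines:
# optional check: fit the transformer alone and list its output features
process_pipeline.fit(train)
print(process_pipeline.get_feature_names_out())
# feature names are prefixed by transformer name, e.g. 'numeric__weight'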
Modeling pipeline
As you may have noticed, the modeling pipeline in this example is 99% identical to the previous one. We only replace the SVC model with SVR, since we are now working with a regression problem. I also removed a few hyperparameter values, since SVR seems to be slower. In practice though, if you are working on a real project, do not skimp on the hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
svr = Pipeline([
('processing', process_pipeline),
('svr',SVR())
])
param_grid = [
{'svr__kernel':['linear'],
'svr__C' : [0.1, 1, 10]},
{'svr__kernel':['poly'],
'svr__degree' : [2, 3, 4],
'svr__coef0' : [0, 1, 10],
'svr__C' : [0.1, 1, 10]},
{'svr__kernel':['rbf'],
'svr__gamma' : [0.001, 0.01, 0.1, 1, 10, 100],
'svr__C' : [0.1, 1, 10]}
]
grid_search = GridSearchCV(svr, param_grid, cv=5, scoring='r2', return_train_score=True)
Training and testing
Finally, we train the grid search and investigate the best model. It achieves a CV training R² of 0.89 and a testing R² of about 0.86, so overall we have a fairly good model. In a future post, we will go through the process of comparing different models as well.
grid_search.fit(train, train[target])
print(grid_search.best_params_)
print(grid_search.best_score_)
{'svr__C': 1, 'svr__coef0': 1, 'svr__degree': 3, 'svr__kernel': 'poly'}
0.8895757539078831
grid_search.score(test, test[target])
0.857216775181356
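R² is convenient for model comparison, but an error measured in the target's own units is often easier to interpret. Here is a small sketch (my addition, not in the original notebook) of computing the test RMSE from the tuned pipeline:
from sklearn.metrics import mean_squared_error
# predict() runs preprocessing + SVR in one call
preds = grid_search.predict(test)
rmse = np.sqrt(mean_squared_error(test[target], preds))
print(rmse)  # root mean squared error, in miles per gallon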
Conclusion
In this post, we have gone through two fairly simple examples of building a complete support vector machine pipeline. I hope you find them useful: I tried to keep them fairly general so they can be reused on different data sets with minimal modification. With this, we are pretty much done with SVM for the time being, so I will end this post here. See you again!