With the recent posts about support vector machines, you have hopefully gained a good understanding of this elegant model. However, we have not actually performed any modeling with SVM on real data yet. So, in this post, I will walk through two examples of building a complete support vector machine pipeline: one for classification and one for regression. Let us start.
Support vector machine pipeline for classification
We start with the classification example. We will reuse the heart_disease data from Kaggle. We discussed the exploratory analysis of this data before, so I am not including it here. The target in this data is HeartDisease, which indicates whether the patient had a heart issue or not. The complete notebook is available on my GitHub. First, we load the data, check a few rows, and perform a train-test split.
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
data = pd.read_csv('heart_disease.csv')
data.head(n=2)
| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.25)
Processing pipeline
The processing pipeline for this data is fairly typical. All numeric columns undergo imputation and standardization. Additionally, Cholesterol and RestingBP contain values of 0, which make no sense medically, so I first change them to missing so they can be imputed later. Categorical columns only go through one-hot encoding.
num_cols = ['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak']
cat_cols = ['Sex', 'ChestPainType','RestingECG', 'ExerciseAngina', 'ST_Slope']
target = 'HeartDisease'
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
def remove_0(X):
    # work on a copy so we do not modify the caller's DataFrame in place
    X = X.copy()
    # a value of 0 is not medically plausible here; mark as missing for the imputer
    X.loc[X['Cholesterol'] == 0, 'Cholesterol'] = np.nan
    X.loc[X['RestingBP'] == 0, 'RestingBP'] = np.nan
    return X
num_pipeline = Pipeline([
('remove 0', FunctionTransformer(remove_0, validate=False)),
('impute', SimpleImputer(strategy='median')),
('standardize', StandardScaler())
])
cat_pipeline = Pipeline([
('encode', OneHotEncoder())
])
process_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
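Before wiring the transformer into a model, it does not hurt to sanity-check it on its own. The quick check below is my addition to the walkthrough (not in the original notebook); it fits the processing pipeline on the training set and prints the shape of the resulting design matrix:
# illustrative sanity check: fit the processing pipeline alone
# and confirm the transformed matrix has the expected shape
X_check = process_pipeline.fit_transform(train)
print(X_check.shape)  # (n training rows, numeric columns + one-hot columns)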
Modeling pipeline
After this point, we can reuse the code from the post on tuning SVC. One small difference is that I combine the SVC and the processing step built earlier into a complete modeling pipeline, so everything can be trained and applied at once. Next, we create a parameter grid for a grid search. The hyperparameters below are fairly typical for SVC; you can add more if you like. Finally, we build the grid search object and finish this step.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
svc = Pipeline([
('processing', process_pipeline),
('svc', SVC())
])
param_grid = [
{'svc__kernel':['linear'],
'svc__C' : [0.001, 0.1, 1, 10, 100]},
{'svc__kernel':['poly'],
'svc__degree' : [2, 3, 4],
'svc__coef0' : [0, 1, 10],
'svc__C' : [0.001, 0.1, 1, 10, 100]},
{'svc__kernel':['rbf'],
'svc__gamma' : [0.001, 0.01, 0.1, 1, 10, 100, 1000],
'svc__C' : [0.001, 0.01, 0.1, 1, 10, 100]}
]
grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', return_train_score=True)
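One thing to keep in mind is that the cost of a grid search grows multiplicatively with the grid. As a rough sketch (my addition, not from the original notebook), you can count the candidate combinations up front, and pass n_jobs=-1 to GridSearchCV if you want to use every CPU core:
from sklearn.model_selection import ParameterGrid
# each candidate combination is fit cv=5 times during the search
print(len(ParameterGrid(param_grid)))  # 92 candidates here, i.e. 460 model fits
# to parallelize the search across all cores, construct it with n_jobs=-1:
# GridSearchCV(svc, param_grid, cv=5, scoring='accuracy',
#              return_train_score=True, n_jobs=-1)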
Training and testing
To perform the grid search, simply call fit(). Depending on how many hyperparameter values you include, this step may take a bit of time. After that though, we have a fully trained SVC pipeline that can be applied to any data (of the same shape as ours, of course). If curious, we can check the best hyperparameter values with best_params_ on the grid search object. Similarly, the best_score_ attribute gives us the CV score (accuracy in this case) of the best model, and best_estimator_ gives us the best pipeline object. For scoring alone, we can call score() or predict() on the grid search directly. Overall, we end up with 87.35% CV training accuracy and 87.8% testing accuracy.
grid_search.fit(train, train[target])
print(grid_search.best_params_)
print(grid_search.best_score_)
{'svc__C': 1, 'svc__gamma': 0.1, 'svc__kernel': 'rbf'}
0.8735216333439121
grid_search.score(test,test[target])
0.8782608695652174
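Since the winning pipeline bundles preprocessing and the classifier together, it is convenient to persist the whole thing for later use. A minimal sketch with joblib (the file name is just an illustration):
import joblib
# save the fitted pipeline (preprocessing + SVC) to disk
joblib.dump(grid_search.best_estimator_, 'svc_pipeline.joblib')
# later, reload it and predict on new data; the same preprocessing is applied
# model = joblib.load('svc_pipeline.joblib')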
Support vector machine pipeline for regression
For this example, we will use the auto-mpg data from the UCI Machine Learning Repository. Like the heart_disease data, we discussed an exploratory analysis of this data already, so I am skipping it here. The complete notebook is available on my GitHub. The target in this data is mpg, the miles-per-gallon of the cars. First, we load the data and split it into training and testing sets.
data = pd.read_csv('auto-mpg.csv')
data.head(2)
| | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin |
|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 |
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.25)
Processing pipeline
Unlike the previous example, this data set needs no special handling, so the processing is entirely typical: we perform imputation and standardization on the numeric columns, and one-hot encoding on the only categorical column. There is nothing else notable here.
num_cols = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
cat_cols = ['origin']
target = 'mpg'
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
#pipeline for numeric features
#we need to impute horsepower
num_pipeline = Pipeline([
('impute', SimpleImputer(strategy='median')),
('standardize', StandardScaler())
])
#pipeline for class features
cat_pipeline = Pipeline([
('encoder', OneHotEncoder())
])
#full pipeline - combine numeric and class pipelines
process_pipeline = ColumnTransformer([
('numeric', num_pipeline, num_cols),
('class', cat_pipeline, cat_cols)
])
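As before, you can optionally peek at what the transformer produces. The check below is my addition and assumes a recent scikit-learn (roughly 1.1+) so that get_feature_names_out() works through the nested pipelines:
# optional check: fit the transformer alone and list its output features
process_pipeline.fit(train)
print(process_pipeline.get_feature_names_out())
# feature names are prefixed by transformer name, e.g. 'numeric__weight'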
Modeling pipeline
As you may have noticed, the modeling pipeline in this example is 99% identical to the previous one. We only replace the SVC model with SVR, since we are now working with a regression problem. I also removed a few hyperparameter values, since SVR seems to be slower. In practice though, if you are working on a real project, do not skimp on the hyperparameters.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
svr = Pipeline([
('processing', process_pipeline),
('svr',SVR())
])
param_grid = [
{'svr__kernel':['linear'],
'svr__C' : [0.1, 1, 10]},
{'svr__kernel':['poly'],
'svr__degree' : [2, 3, 4],
'svr__coef0' : [0, 1, 10],
'svr__C' : [0.1, 1, 10]},
{'svr__kernel':['rbf'],
'svr__gamma' : [0.001, 0.01, 0.1, 1, 10, 100],
'svr__C' : [0.1, 1, 10]}
]
grid_search = GridSearchCV(svr, param_grid, cv=5, scoring='r2', return_train_score=True)
Training and testing
Finally, we train the grid search and investigate the best model. It achieves a CV training R² of 0.89 and a testing R² of about 0.86, so overall we have a fairly good model. In a future post, we will go through the process of comparing different models as well.
grid_search.fit(train, train[target])
print(grid_search.best_params_)
print(grid_search.best_score_)
{'svr__C': 1, 'svr__coef0': 1, 'svr__degree': 3, 'svr__kernel': 'poly'}
0.8895757539078831
grid_search.score(test, test[target])
0.857216775181356
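R² is convenient for model comparison, but an error measured in the target's own units is often easier to interpret. Here is a small sketch (my addition, not in the original notebook) of computing the test RMSE from the tuned pipeline:
from sklearn.metrics import mean_squared_error
# predict() runs preprocessing + SVR in one call
preds = grid_search.predict(test)
rmse = np.sqrt(mean_squared_error(test[target], preds))
print(rmse)  # root mean squared error, in miles per gallon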
Conclusion
In this post, we have gone through two fairly simple examples of building a complete support vector machine pipeline. I hope you find them useful: I tried to keep them fairly general so they can be reused on different data sets with minimal modification. With this, we are pretty much done with SVM for the time being, so I will end this post here. See you again!