Support Vector Machine Pipeline

An example of a support vector machine pipeline

With the recent posts about support vector machines, hopefully you now have a good understanding of this elegant model. One thing we have not done yet, though, is actually perform modeling with SVM on real data. So, in this post, I will go through two examples of building a complete support vector machine pipeline, one for classification and one for regression. Let us start.

Support vector machine pipeline for classification

First, we will go through the classification example. We will reuse the heart_disease data from Kaggle. We discussed the exploratory analysis of this data before, so I am not including it here. The target in this data is HeartDisease, which indicates whether the patient had a heart issue or not. The complete notebook is available in my GitHub. To begin, we load the data, check a few rows, and perform a train-test split.
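
As a minimal sketch, this step might look like the following; the file name heart.csv, the 20% test size, and the random seed are my assumptions, so adjust them to match your copy of the data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# load the data and peek at a few rows
data = pd.read_csv('heart.csv')  # file name assumed; adjust to your download
print(data.head())

# separate the features from the target
X = data.drop(columns='HeartDisease')
y = data['HeartDisease']

# hold out 20% for testing; stratify to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```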

Processing pipeline

The processing pipeline for this data is fairly typical. All numeric columns undergo imputation and standardization. Additionally, Cholesterol and RestingBP have 0 values that do not make sense medically, so I first change them to missing so they can be imputed later. Categorical columns only go through one-hot encoding.
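
A sketch of this processor, assuming median imputation and that the data frame's column dtypes separate numeric from categorical columns:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# zeros in these columns are medically impossible, so treat them as missing
zero_as_missing = ['Cholesterol', 'RestingBP']
X_train[zero_as_missing] = X_train[zero_as_missing].replace(0, np.nan)
X_test[zero_as_missing] = X_test[zero_as_missing].replace(0, np.nan)

numeric_cols = X_train.select_dtypes(include='number').columns
categorical_cols = X_train.select_dtypes(exclude='number').columns

# numeric columns: impute missing values, then standardize
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # strategy assumed
    ('scale', StandardScaler()),
])

# categorical columns: one-hot encode only
processor = ColumnTransformer([
    ('num', numeric_pipe, numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
```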

Modeling pipeline

After this point, we can apply the code from the post on tuning SVC. A small difference is that I will combine the SVC and the processing step built earlier into a complete modeling pipeline, so everything can be trained and applied at once. Next, we create a parameter grid for a grid search. The hyperparameters below are fairly typical for SVC; however, you can add more if you like. Finally, we build the grid search object and finish this step.
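
Here is one way to write this step; the specific grid values below are illustrative rather than the exact ones from the notebook:

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# chain the processing step and the SVC into one pipeline
model_pipe = Pipeline([
    ('process', processor),
    ('svc', SVC()),
])

# a fairly typical grid for SVC; parameter names are prefixed with the step name
param_grid = {
    'svc__C': [0.1, 1, 10, 100],
    'svc__kernel': ['linear', 'rbf'],
    'svc__gamma': ['scale', 0.01, 0.1, 1],
}

# 5-fold CV and accuracy scoring are assumptions here
grid_search = GridSearchCV(model_pipe, param_grid, cv=5, scoring='accuracy')
```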

Training and testing

To perform the grid search, simply call fit(). Depending on the number of hyperparameter values you include, this step may take a bit of time. After that, though, we have a fully trained SVC pipeline to apply to any data (similar to the one we have, of course). If you are curious, you can check the best hyperparameter values with best_params_ from the grid search object. Similarly, the best_score_ property gives us the CV score (accuracy in this case) of the best model, and best_estimator_ gives us the best pipeline object. Just for scoring, though, we can call score() or predict() directly on the grid search object. Overall, we end up with 87.35% CV training accuracy and 87.8% testing accuracy.
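
In code, training and scoring might look like this:

```python
# run the grid search: one fit per hyperparameter combination per CV fold
grid_search.fit(X_train, y_train)

# inspect the winning configuration and its cross-validated accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)   # CV training accuracy

# score the best pipeline on the held-out test set; predict() works the same way
print(grid_search.score(X_test, y_test))
```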

Support vector machine pipeline for regression

For this example, we will use the auto-mpg data from the UCI Machine Learning Repository. Like the heart_disease data, we discussed an exploratory analysis of this data already, so I am skipping it here. The complete notebook is available in my GitHub. The target in this data is mpg, the miles per gallon of the cars. First, we load the data and split it into training and testing sets.
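
A minimal sketch of this step; the file name, CSV format, test size, and random seed are assumptions to adjust to your copy of the data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# load the data and peek at a few rows
data = pd.read_csv('auto-mpg.csv')  # file name and format assumed
print(data.head())

# separate the features from the target
X = data.drop(columns='mpg')
y = data['mpg']

# hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```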

Processing pipeline

Unlike the previous example, processing this data set is completely standard. We perform imputation and standardization on the numeric columns, and one-hot encoding on the only categorical column. There is nothing else notable here.
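
A sketch of the processor, under the same assumptions as before (median imputation, and column dtypes separating numeric from categorical):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# if the categorical column is stored with a numeric dtype, list it explicitly instead
numeric_cols = X_train.select_dtypes(include='number').columns
categorical_cols = X_train.select_dtypes(exclude='number').columns

processor = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])
```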

Modeling pipeline

If you have not noticed, the modeling pipeline of this example is 99% identical to the previous one. We only replace the SVC model with SVR since we are working with a regression problem. Also, I removed a few hyperparameter values since the SVR model seems to be slower. In practice, though, if you are working on a real project, do not skimp on the hyperparameters.
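
A sketch with SVR swapped in and a slimmed-down grid; which values get dropped is my assumption:

```python
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# same structure as before, with SVR in place of SVC
model_pipe = Pipeline([
    ('process', processor),
    ('svr', SVR()),
])

# a smaller grid than before, since SVR is slower to fit
param_grid = {
    'svr__C': [1, 10, 100],
    'svr__kernel': ['linear', 'rbf'],
    'svr__gamma': ['scale', 0.1],
}

grid_search = GridSearchCV(model_pipe, param_grid, cv=5, scoring='r2')
```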

Training and testing

Finally, we train the grid search and investigate the best model. It gets a CV training R2 of 0.89 and a testing R2 of about 0.86. Overall, we have a fairly good model. In the future, we will go through the process of comparing different models as well.
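
The final step mirrors the classification example:

```python
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(grid_search.best_score_)            # CV training R2
print(grid_search.score(X_test, y_test))  # testing R2
```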

Conclusion

In this post, we have gone through two fairly simple examples of building a complete support vector machine pipeline. I hope you find them useful; I tried to keep them fairly general so they can be reused on different data sets with minimal modification. With this, we are pretty much done with SVM for the time being, so I will end this post here. See you again!