The pipeline we learned previously is very useful, but it only performs a fixed sequence of transformations on its input. More often than not, however, we want to apply different processing to different parts of the data. For example, categorical data should undergo encoding, some numerical data just need standardization, while others need a log transformation too, and so on. So, in this post, we will discuss how to construct a complete data pipeline that performs everything at once. Let us dig in right away!
Demonstration data
In this post, we get back to the students1000.csv data, which needs pretty much all the processing we have learned. You can download the data below, and get the complete Jupyter notebook here. As usual, we start by importing libraries, loading the data, and splitting train-test. Next, we check the columns with info(), create lists of numeric and categorical columns, and check their distributions. Based on the results, I decide to have three pipelines:
1. HighSchoolGPA, AvgDailyStudyTime, TotalAbsence, and FirstYearGPA will undergo standardization and imputation.
2. FamilyIncome will be log transformed, standardized, then imputed.
3. Major and State will go through one-hot encoding.
Note that while we do have some missing values in State, SKLearn's OneHotEncoder can simply group them together with infrequent classes, so we do not have to impute them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# load the data and split train-test
data = pd.read_csv('students1000.csv')
train, test = train_test_split(data, test_size=0.2)
train.info()

# lists of numeric and categorical columns
num_cols = ['HighSchoolGPA', 'AvgDailyStudyTime', 'TotalAbsence',
            'FirstYearGPA', 'FamilyIncome']
cat_cols = ['Major', 'State']

# check the distributions of each type of column
train[num_cols].hist(figsize=(6,9))
plt.show()
for col in cat_cols:
    train[col].value_counts().plot.bar(rot=20, figsize=(4,4))
    plt.show()
Data pipeline without log transformation
HighSchoolGPA, AvgDailyStudyTime, TotalAbsence, and FirstYearGPA already have fairly symmetric distributions, so I will just standardize and impute them. Unlike before, we only create their pipeline at this point and do not fit or transform anything yet. Below, we import the needed transformers and create a pipeline with the two mentioned steps.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# standardize first, then fill missing values with the median
num_pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
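If standardizing before imputing looks odd, note that StandardScaler simply disregards missing values when fitting and passes them through, so the imputer then fills them with the median on the standardized scale. Here is a quick illustration with a few made-up GPA values:

# made-up values: StandardScaler ignores the NaN when fitting, and
# SimpleImputer then replaces it with the median of the standardized data
sample = pd.DataFrame({'HighSchoolGPA': [3.2, 3.8, np.nan, 2.9]})
print(num_pipeline.fit_transform(sample))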
Data pipeline with log transformation
FamilyIncome has a very skewed distribution, so we will perform a log transformation in addition to standardizing and imputing. Furthermore, a log transformation cannot handle zero or negative values, and standardized data always contain negative numbers, so the log step must come before standardization. Now, let us talk a bit about the log transformer. The logarithm is a static function that does not require estimating any parameters, so SKLearn does not build a separate model for it. Instead, we use FunctionTransformer, which wraps a static function into a transformer usable in SKLearn pipelines. We define the function log_transform() that simply returns the log of its input. Then, log_transform() is used as input for the FunctionTransformer() in our pipeline. The other two steps are like before.
from sklearn.preprocessing import FunctionTransformer

# a plain function returning the log of the input
def log_transform(data):
    return np.log(data)

# log first, then standardize, then impute
log_pipeline = Pipeline([
    ('log transform', FunctionTransformer(log_transform)),
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
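To see why the log step must come first, here is a quick check with a few made-up incomes. Logging first works fine, but standardizing first produces negative values, and np.log() then only returns NaNs:

# made-up incomes: log first, then standardize and impute, works fine
sample = pd.DataFrame({'FamilyIncome': [30000.0, 45000.0, 60000.0, 250000.0]})
print(log_pipeline.fit_transform(sample))
# standardizing first yields negative values, on which np.log() only
# produces NaNs (with a runtime warning)
print(np.log(StandardScaler().fit_transform(sample)))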
Data pipeline for categories
This pipeline is very simple with only a one-hot encoder. In fact, we do not even need to use Pipeline here. However, I will keep it general, as we may want to add other steps to this pipeline later. Now, take a closer look at OneHotEncoder(). We use the handle_unknown option like before, but max_categories is new. It caps the number of one-hot columns produced for each encoded column, at 5 in this code. This means only the most frequent classes in each column keep their own one-hot code; the rest, including missing values, are grouped into one single infrequent class, which also counts toward the cap. A small demonstration follows the code below.
from sklearn.preprocessing import OneHotEncoder

# one-hot encode, keeping at most 5 output columns per input column
cat_pipeline = Pipeline([
    ('encode', OneHotEncoder(max_categories=5, handle_unknown='infrequent_if_exist'))
])
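To make the grouping concrete, below is a small sketch with made-up State values (it assumes a recent SKLearn, 1.1 or later). Six distinct states plus a missing value go in, but at most five output columns come out, with the leftovers sharing one infrequent column:

# made-up values: 6 distinct states plus a NaN, capped at 5 output columns
demo = pd.DataFrame({'State': ['CA', 'CA', 'CA', 'TX', 'TX', 'NY', 'NY',
                               'FL', 'WA', 'OR', np.nan]})
enc = OneHotEncoder(max_categories=5, handle_unknown='infrequent_if_exist')
enc.fit(demo)
# the infrequent classes, NaN included, share one output column
print(enc.get_feature_names_out())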
Putting it all together
With the pipelines ready, we can now combine everything into one single complete data pipeline using the composer ColumnTransformer() in SKLearn. The syntax to create a ColumnTransformer() is fairly similar to that of a Pipeline in that you provide a list of transformers. However, you now also need to specify the list of columns to apply each transformer to. Overall, the syntax is as below:
full_pipeline = ColumnTransformer([
    ('<pipeline name 1>', <pipeline 1>, <column list 1>),
    ('<pipeline name 2>', <pipeline 2>, <column list 2>),
    ...
])
Back to our students data. First, we create three lists: log_cols with only FamilyIncome, num_cols for the remaining numeric columns, and cat_cols for the categorical ones. Then, we create our ColumnTransformer with the three previously created pipelines and their target columns. After this point, fitting and applying work just like before with fit_transform() and transform(). One more time: do not use fit_transform() on testing data!
from sklearn.compose import ColumnTransformer

# column lists for the three pipelines
log_cols = ['FamilyIncome']
num_cols = ['HighSchoolGPA', 'AvgDailyStudyTime', 'TotalAbsence', 'FirstYearGPA']
cat_cols = ['Major', 'State']

# combine the three pipelines into one complete data pipeline
full_pipeline = ColumnTransformer([
    ('log trans', log_pipeline, log_cols),
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])

# fit on training data; only transform testing data
train_processed = full_pipeline.fit_transform(train)
train_processed.shape
(800, 14)
test_processed = full_pipeline.transform(test)
test_processed.shape
(200, 14)
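One thing to note is that the output is a plain array without column names. If you want to double check which output column is which, a variant like the sketch below works (again assuming SKLearn 1.1 or later): giving FunctionTransformer the feature_names_out='one-to-one' option lets the fitted composer report its output names, which follow the order of the transformers in the list.

# variant of the log pipeline that can report feature names
named_log_pipeline = Pipeline([
    ('log transform', FunctionTransformer(log_transform,
                                          feature_names_out='one-to-one')),
    ('standardize', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
named_pipeline = ColumnTransformer([
    ('log trans', named_log_pipeline, log_cols),
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])
named_pipeline.fit(train)
# output names confirm FamilyIncome is the first processed column
print(named_pipeline.get_feature_names_out())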
After this, we have a training data set and a testing data set that are (supposedly) error free and ready for any further analysis. For now, though, I will just show you their histograms. At first glance, the skewness in FamilyIncome (column 0, since the log pipeline comes first in the composer) has been fixed. There are some differences between training and testing here and there, but nothing serious; this is mostly due to the small data size. You can also see several new binary columns created by the one-hot encoder.
# histograms of the processed training and testing data
pd.DataFrame(train_processed).hist(figsize=(8,10))
plt.show()
pd.DataFrame(test_processed).hist(figsize=(8,10))
plt.show()
Overall, you may think that defining pipelines this way is more complicated since it involves more concepts. However, as you get used to SKLearn, this is among the best ways to preprocess data: pipeline code is nicely organized and easy to maintain, and fitting or transforming with a pipeline is just as easy as before. So, I highly recommend building a deep understanding of this topic.
Wrapping up
With this post, we should now have a concrete understanding of preprocessing (tabular) data. The several methods we learned, while definitely not exhaustive, cover most of the common issues we expect in data. So, be confident: you can process data fairly well now! After this post, we will get into modeling data with more exciting stuff, so see you again.