Data Pipeline

[Figure: an illustration of a complete data pipeline where numeric and categorical data undergo different treatments]

The pipeline we built previously is very useful, but it only performs a fixed sequence of transformations on the input. More often than not, we want to apply different processes to different parts of the data. For example, categorical data should undergo encoding, some numeric data only needs standardization, while other numeric data needs a log transformation as well. So, in this post, we will discuss how to construct a complete data pipeline that performs all of these treatments at once. Let us dig in right away!

Data for demonstration

In this post, we return to the students1000.csv data, which needs pretty much all the processing we have learned. You can download the data below, and get the complete Jupyter notebook here. As usual, we start by importing libraries, loading the data, and performing a train-test split; a sketch of these setup steps appears after the list below. Next, we check the columns with info(), create lists of numeric and categorical columns, and check their distributions. Based on the results, I decided to build three pipelines:
1. HighSchoolGPA, AvgDailyStudyTime, TotalAbsence, and FirstYearGPA will undergo standardization and imputation
2. FamilyIncome will be log transformed, standardized, then imputed
3. Major and State will go through a one-hot encoder.

Note that while we do have some missing values in State, SKLearn's OneHotEncoder can simply group them together with the infrequent classes, so we do not have to impute them.
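Here is that setup as a minimal sketch. The file name comes from the post; the split ratio, random seed, and variable names are my assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# load the students data and hold out a test set
# (the 0.2 split and seed are assumptions, not from the post)
students = pd.read_csv("students1000.csv")
train, test = train_test_split(students, test_size=0.2, random_state=42)

# column types and missing-value counts
train.info()

# distributions of the numeric columns
train.hist(bins=20, figsize=(10, 8))
```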

Data pipeline without log transformation

HighSchoolGPA, AvgDailyStudyTime, TotalAbsence, and FirstYearGPA already have fairly symmetric distributions, so I will just standardize and impute them. Unlike before, we only create their pipeline at this point and do not fit or transform anything yet. Below, we import the needed transformers and create a pipeline with the two mentioned steps.
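A minimal sketch of this pipeline; the step names and the mean imputation strategy are my assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# standardize first (NaNs are ignored during fitting),
# then fill the remaining missing values
num_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer(strategy="mean")),
])
```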

Data pipeline with log transformation

FamilyIncome has a very skewed distribution, so we will perform a log transformation in addition to standardizing and imputing. Furthermore, a log transformation cannot work on negative numbers, and standardization produces negative values, so this step must come before standardization. Now, let us talk a bit about the log transformer. The logarithm is a static function without the need for any estimation, so SKLearn does not build a separate transformer for it. Instead, we use FunctionTransformer, which wraps a plain function into a transformer usable in SKLearn pipelines. We define the function log_transform() to simply return the log of its input. Then, log_transform() is passed to FunctionTransformer() in our pipeline. The other two steps are the same as before.
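A sketch of this pipeline; log_transform() is named in the post, while the step names and imputation strategy are my assumptions:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.impute import SimpleImputer

def log_transform(x):
    # element-wise natural log; assumes strictly positive inputs
    return np.log(x)

# log first (before standardization introduces negatives),
# then scale, then impute
log_pipeline = Pipeline([
    ("log", FunctionTransformer(log_transform)),
    ("scaler", StandardScaler()),
    ("imputer", SimpleImputer(strategy="mean")),
])
```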

Data pipeline for categories

This pipeline is very easy, with only a one-hot encoder. In fact, we do not even need to use Pipeline here. However, I will keep it general, as we may want to add other steps to this pipeline later. Now, take a closer look at OneHotEncoder(). We use the handle_unknown option like before, but max_categories is new. It caps the number of one-hot columns produced for each encoded column, five in this code: the classes with the highest frequencies in each column get their own one-hot codes, while the rest, including missing values, are collapsed into one single infrequent class that counts toward the cap.
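A sketch of this pipeline; the handle_unknown value and the dense output are my assumptions (sparse_output requires scikit-learn 1.2+):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# group unseen and rare classes (and the few missing values in State)
# into one infrequent bin; dense output keeps the combined result
# a plain NumPy array
cat_pipeline = Pipeline([
    ("encoder", OneHotEncoder(handle_unknown="infrequent_if_exist",
                              max_categories=5,
                              sparse_output=False)),
])
```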

Putting it all together

With the pipelines ready, we can now combine everything into one single complete data pipeline using the composer ColumnTransformer() from SKLearn. The syntax for creating a ColumnTransformer() is fairly similar to a Pipeline in that you provide a list of transformers. However, you now also need to specify the list of columns each transformer applies to. Overall, the syntax is as below.
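In schematic form, each entry is a (name, transformer, columns) triple; the names and columns here are pure placeholders:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# each entry is a (name, transformer, columns) triple
ct = ColumnTransformer([
    ("numbers", StandardScaler(), ["colA", "colB"]),
    ("categories", OneHotEncoder(), ["colC"]),
])
```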

Back to our students data: first, we create three lists, log_cols with only FamilyIncome, num_cols with the remaining numeric columns, and cat_cols with the categorical ones. Then we can build our ColumnTransformer from the three previously created pipelines and their target columns. After this point, fitting and applying work just like before with fit_transform() and transform(). One more time: do not use fit_transform() on testing data!
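Putting the pieces together; the column lists come from the post, while the transformer names and result variables are my assumptions:

```python
from sklearn.compose import ColumnTransformer

log_cols = ["FamilyIncome"]
num_cols = ["HighSchoolGPA", "AvgDailyStudyTime", "TotalAbsence", "FirstYearGPA"]
cat_cols = ["Major", "State"]

# route each group of columns through its own pipeline
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("log", log_pipeline, log_cols),
    ("cat", cat_pipeline, cat_cols),
])

# fit on training data only; the test set is only ever transformed
train_processed = preprocessor.fit_transform(train)
test_processed = preprocessor.transform(test)
```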

After this, we have a training set and a testing set that are (supposedly) error free and ready for any further analysis. For now, though, I will just show you their histograms. At first look, the skewness in FamilyIncome (column 4) has been fixed. There are some differences between training and testing here and there, but nothing serious; this is mostly due to the small data size. You can also see the several new binary columns created by the one-hot encoder.
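A quick way to reproduce those histograms, assuming the processed arrays from the sketch above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# compare feature distributions in the processed train and test sets
pd.DataFrame(train_processed).hist(bins=20, figsize=(12, 8))
plt.suptitle("Processed training data")
plt.show()

pd.DataFrame(test_processed).hist(bins=20, figsize=(12, 8))
plt.suptitle("Processed testing data")
plt.show()
```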

Overall, you may think that defining pipelines this way is more complicated since it involves more concepts. However, as you get used to SKLearn, this becomes one of the best ways to work: pipeline code is nicely organized and easy to maintain, and fitting or transforming with a pipeline is just as easy. So, I highly recommend developing a deep understanding of this topic.

Wrapping up

With this post, we should now have a concrete understanding of preprocessing (tabular) data. The several methods that we learned, while definitely not exhaustive, cover most of the common issues we expect in data. So be confident: you can process data fairly well now! After this post, we will get into modeling data with more exciting stuff, so see you again.
