At this point, we have gone through quite a few preprocessing methods for different issues in data, such as handling outliers, scaling, imputation, and encoding. So, it is now time to learn how to put them together. Since each step only deals with one specific problem, we need to perform them sequentially, one by one, to obtain the final result. For example, we first standardize the numeric columns. Next, the standardized data undergoes imputation. The imputed data then becomes the input for the next step, and so on. This sequence is referred to as a processing pipeline, which is the topic of this post. So, let us not wait any longer and start right now!
Getting started
We will use the students-numeric.csv data throughout this post. You can download the data below and access the complete Jupyter notebook here. Now that we have discussed the train-test split and its necessity, let us do that right after reading the data in. Next, we do a quick info() and plot histograms to get an overview of the training data. There seem to be only two issues: some missing values, and largely different column scales. Therefore, we will perform imputation and standardization on this data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read the data, then split off 20% as the testing set
data = pd.read_csv('students-numeric.csv')
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 803 to 695
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   HighSchoolGPA      800 non-null    float64
 1   HSRankPercent      800 non-null    int64
 2   AvgDailyStudyTime  785 non-null    float64
 3   TotalAbsence       791 non-null    float64
 4   SATMath            800 non-null    int64
 5   SATVerbal          800 non-null    int64
 6   FirstYearCredit    800 non-null    int64
 7   FirstYearGPA       800 non-null    float64
dtypes: float64(4), int64(4)
memory usage: 56.2 KB
data.hist(figsize=(8,8))
plt.show()
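If you want to double-check these two issues numerically, here is a small optional check (it uses the train dataframe created above and is just a sanity check, not a required step): count the missing values per column and glance at the column means and standard deviations.
# number of missing values in each training column
print(train.isna().sum())

# column means and standard deviations differ widely, so scaling is warranted
print(train.describe().loc[['mean', 'std']].round(2))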
You may notice one small issue now: which do we do first, imputation or standardization? It really depends on the data and the analyst's interpretation. In this case, I prefer to perform standardization first. The reason is that imputation, regardless of the method, adds synthetic data that could affect the estimates of the means and standard deviations used in standardization. To sum up, our pipeline is now data -> standardization -> imputation.
Building a processing pipeline manually
We can, of course, simply write out each step of our pipeline and make sure to connect them properly. The code can go like below. As you can see, we create scaler for standardization and imputer for imputation, then apply them to the data with fit_transform(). The thing to note here is that the output of the scaler, train_scaled, is the input of the imputer. And if there were another step after imputer, it would take imputer's output as its input. In this case, imputer is the last step, so I save its output to the final dataframe train_processed. Upon checking the processed data, we can see that the issues of missing values and variable scales have been solved.
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# create the two processing models
scaler = StandardScaler()
imputer = SimpleImputer(strategy='median')
# copy the training data to keep its index and column names
train_processed = train.copy()
# standardize first, then impute; the scaler's output feeds the imputer
train_scaled = scaler.fit_transform(train)
train_processed[:] = imputer.fit_transform(train_scaled)
train_processed.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 803 to 695
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   HighSchoolGPA      800 non-null    float64
 1   HSRankPercent      800 non-null    float64
 2   AvgDailyStudyTime  800 non-null    float64
 3   TotalAbsence       800 non-null    float64
 4   SATMath            800 non-null    float64
 5   SATVerbal          800 non-null    float64
 6   FirstYearCredit    800 non-null    float64
 7   FirstYearGPA       800 non-null    float64
dtypes: float64(8)
memory usage: 56.2 KB
train_processed.hist(figsize=(8,8))
plt.show()
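If you would like a numeric confirmation in addition to the histograms, here is an optional check (assuming the train_processed dataframe built above): after standardization the column means should be approximately 0 and the standard deviations approximately 1, and no missing values should remain.
# means should be close to 0 and standard deviations close to 1 after standardization
print(train_processed.describe().loc[['mean', 'std']].round(2))

# total number of remaining missing values should be 0
print(train_processed.isna().sum().sum())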
Now let us say we finish our analysis and want to apply the result to the testing data. The first thing we have to do is transform it exactly like the training data. This is also not difficult, as the code below shows. Do you notice the differences from the training code? First, we no longer create new processing models but reuse the previous ones. Second, we use transform() instead of fit_transform(). This is very important: never use fit_transform() on testing data (or anything besides training data). In SKLearn, calling fit_transform() tells the models to estimate their parameters, so we only do that during the training phase. For testing, we use transform(), which only applies the already estimated parameters without making new estimations.
# reuse the fitted scaler and imputer; transform() only, no fitting on test data
test_processed = test.copy()
test_scaled = scaler.transform(test)
test_processed[:] = imputer.transform(test_scaled)
test_processed.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 398 to 641
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   HighSchoolGPA      200 non-null    float64
 1   HSRankPercent      200 non-null    float64
 2   AvgDailyStudyTime  200 non-null    float64
 3   TotalAbsence       200 non-null    float64
 4   SATMath            200 non-null    float64
 5   SATVerbal          200 non-null    float64
 6   FirstYearCredit    200 non-null    float64
 7   FirstYearGPA       200 non-null    float64
dtypes: float64(8)
memory usage: 14.1 KB
test_processed.hist(figsize=(8,8))
plt.show()
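As a small optional aside: if you run a similar check on test_processed, the column means will be close to, but not exactly, zero. That is expected, because transform() reuses the means and standard deviations estimated from the training data rather than re-estimating them on the test set.
# test means hover near 0 but are not exactly 0, since the scaler was fitted on the training data
print(test_processed.describe().loc[['mean', 'std']].round(2))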
Any issues?
So, the process is correct and works, but it is very inconvenient. This is a small pipeline with only two steps, so the code looks short enough. However, imagine one with more steps, where you also want to apply the transformations to more data sets besides training and testing. In such cases, you would have to repeat the same code several times, which is inefficient and difficult to maintain or update. This brings us to the SKLearn pipeline.
Building a processing pipeline with SKLearn
Fortunately, SKLearn comes with a Pipeline class that allows us to combine all preprocessing steps into one continuous sequence. Fitting and applying them are also unified into a single function call. So, let us take a look. The syntax to create a pipeline is as below.
<pipeline name> = Pipeline([
('<step 1 name>', <step 1 model>),
('<step 2 name>', <step 2 model>),
...
])
Of course, we first give the pipeline a name and use the Pipeline class to create one. The input is a list of steps, each being a tuple of the step's name and its model. Here, the name can be any string, and the model is created just like a standalone transformer, for example StandardScaler() or SimpleImputer(strategy='median').
Now, let us take a look at building a pipeline for our students data. We start by importing Pipeline from sklearn.pipeline. Next, we create one called data_pipeline with two steps: a StandardScaler() named 'scale' and a SimpleImputer() using the median strategy named 'impute'. After that, fitting and applying the pipeline works exactly like any processing model, with fit_transform() and transform(). Again, please do not forget to never use fit_transform() on testing data! I will not draw the histograms again since they are exactly the same as before. Instead, you can try it yourself, or see the result in the provided notebook.
from sklearn.pipeline import Pipeline
# combine the two steps into a single pipeline, in the order they should run
data_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
# fit on the training data, then apply the same transformations to the test data
train_processed = train.copy()
train_processed[:] = data_pipeline.fit_transform(train)
test_processed = test.copy()
test_processed[:] = data_pipeline.transform(test)
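One convenient detail worth knowing (not used in the code above, but part of SKLearn's standard Pipeline API) is that the fitted steps remain accessible through the pipeline's named_steps attribute under the names we gave them, so we can still inspect what each step learned from the training data:
# inspect the parameters each fitted step estimated during fit_transform()
print(data_pipeline.named_steps['scale'].mean_)        # column means learned by StandardScaler
print(data_pipeline.named_steps['impute'].statistics_) # column medians used by SimpleImputer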
Conclusion
With this post, we now know the concept of a very powerful tool: the processing pipeline. Pipelines tremendously ease a lot of tasks for us analysts, so you should really take some time to understand them. With that, I will now conclude this post. See you again soon!