At this point, we have gone through quite a few preprocessing methods for different issues in data, such as handling outliers, scaling, imputation, and encoding. So, it is now time to learn how to put them together. Since each step only deals with one specific problem, we need to perform them sequentially, one by one, to obtain the final result. For example, we first standardize the numeric columns. Next, the standardized data undergoes imputation. The imputed data then becomes the input for the next step, and so on. This sequence is referred to as a processing pipeline, which is the topic of this post. So, let us not wait any longer and start right now!
Getting started
We will use the students-numeric.csv data throughout this post. You can download the data below and access the complete Jupyter notebook here. Now that we have discussed the train-test split and its necessity, let us do that right after reading the data in. Next, we do a quick info() and plot histograms to get an overview of the training data. There seem to be only two issues: some missing values, and largely different column scales. Therefore, we will perform imputation and standardization on this data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# read the data, then split off 20% as the testing set
data = pd.read_csv('students-numeric.csv')
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.2)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 803 to 695
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   HighSchoolGPA      800 non-null    float64
 1   HSRankPercent      800 non-null    int64
 2   AvgDailyStudyTime  785 non-null    float64
 3   TotalAbsence       791 non-null    float64
 4   SATMath            800 non-null    int64
 5   SATVerbal          800 non-null    int64
 6   FirstYearCredit    800 non-null    int64
 7   FirstYearGPA       800 non-null    float64
dtypes: float64(4), int64(4)
memory usage: 56.2 KB
data.hist(figsize=(8,8))
plt.show()
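If you want to double-check these two issues numerically, here is a small optional check (it uses the train dataframe created above and is just a sanity check, not a required step): count the missing values per column and glance at the column means and standard deviations.
# number of missing values in each training column
print(train.isna().sum())

# column means and standard deviations differ widely, so scaling is warranted
print(train.describe().loc[['mean', 'std']].round(2))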
You may notice one small issue now: which do we do first, imputation or standardization? It really depends on the data and the analyst's interpretation. In this case, I prefer to perform standardization first. The reason is that imputation, regardless of the method, adds synthetic data that could affect the estimates of the means and standard deviations used in standardization. To sum up, our pipeline is now data -> standardization -> imputation.
Building a processing pipeline manually
We can, of course, simply write out each step of our pipeline and make sure to connect them properly. The code can go like below. As you can see, we create scaler for standardization and imputer for imputation, then apply them to the data with fit_transform(). The thing to note here is that the output of the scaler, train_scaled, is the input of the imputer. And if there were another step after imputer, it would take imputer's output as its input. In this case, imputer is the last step, so I save its output to the final dataframe train_processed. Upon checking the processed data, we can see that the issues of missing values and variable scales have been solved.
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# create the two processing models
scaler = StandardScaler()
imputer = SimpleImputer(strategy='median')
# copy the training data to keep its index and column names
train_processed = train.copy()
# standardize first, then impute; the scaler's output feeds the imputer
train_scaled = scaler.fit_transform(train)
train_processed[:] = imputer.fit_transform(train_scaled)
train_processed.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 800 entries, 803 to 695
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   HighSchoolGPA      800 non-null    float64
 1   HSRankPercent      800 non-null    float64
 2   AvgDailyStudyTime  800 non-null    float64
 3   TotalAbsence       800 non-null    float64
 4   SATMath            800 non-null    float64
 5   SATVerbal          800 non-null    float64
 6   FirstYearCredit    800 non-null    float64
 7   FirstYearGPA       800 non-null    float64
dtypes: float64(8)
memory usage: 56.2 KB
train_processed.hist(figsize=(8,8))
plt.show()
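If you would like a numeric confirmation in addition to the histograms, here is an optional check (assuming the train_processed dataframe built above): after standardization the column means should be approximately 0 and the standard deviations approximately 1, and no missing values should remain.
# means should be close to 0 and standard deviations close to 1 after standardization
print(train_processed.describe().loc[['mean', 'std']].round(2))

# total number of remaining missing values should be 0
print(train_processed.isna().sum().sum())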
Now let us say we finish our analysis and want to apply the result to the testing data. The first thing we have to do is transform it exactly like the training data. This is also not difficult, as the code below shows. Do you notice the differences from the training code? First, we no longer create new processing models but reuse the previous ones. Second, we use transform() instead of fit_transform(). This is very important: never use fit_transform() on testing data (or anything besides training data). In SKLearn, calling fit_transform() tells the models to estimate their parameters, so we only do that during the training phase. For testing, we use transform(), which only applies the already estimated parameters without making new estimations.
# reuse the fitted scaler and imputer; transform() only, no fitting on test data
test_processed = test.copy()
test_scaled = scaler.transform(test)
test_processed[:] = imputer.transform(test_scaled)
test_processed.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 398 to 641
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   HighSchoolGPA      200 non-null    float64
 1   HSRankPercent      200 non-null    float64
 2   AvgDailyStudyTime  200 non-null    float64
 3   TotalAbsence       200 non-null    float64
 4   SATMath            200 non-null    float64
 5   SATVerbal          200 non-null    float64
 6   FirstYearCredit    200 non-null    float64
 7   FirstYearGPA       200 non-null    float64
dtypes: float64(8)
memory usage: 14.1 KB
test_processed.hist(figsize=(8,8))
plt.show()
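As a small optional aside: if you run a similar check on test_processed, the column means will be close to, but not exactly, zero. That is expected, because transform() reuses the means and standard deviations estimated from the training data rather than re-estimating them on the test set.
# test means hover near 0 but are not exactly 0, since the scaler was fitted on the training data
print(test_processed.describe().loc[['mean', 'std']].round(2))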
Any issues?
So, the process is correct and works, but it is very inconvenient. This is a small pipeline with only two steps, so the code looks short enough. However, imagine one with more steps, where you also want to apply the transformations to more data sets besides training and testing. In such cases, you would have to repeat the same code several times, which is inefficient and difficult to maintain or update. This brings us to the SKLearn pipeline.
Building a processing pipeline with SKLearn
Fortunately, SKLearn comes with a Pipeline class that allows us to combine all preprocessing steps into one continuous sequence. Fitting and applying them are also unified into a single function call. So, let us take a look. The syntax to create a pipeline is as below.
<pipeline name> = Pipeline([
('<step 1 name>', <step 1 model>),
('<step 2 name>', <step 2 model>),
...
])
Of course, we first give the pipeline a name and use the Pipeline class to create one. The input is a list of steps, each being a tuple of the step's name and its model. Here, the name can be any string, and the model is created just like a standalone transformer, for example StandardScaler() or SimpleImputer(strategy='median').
Now, let us take a look at building a pipeline for our students data. We start by importing Pipeline from sklearn.pipeline. Next, we create one called data_pipeline with two steps: a StandardScaler() named 'scale' and a SimpleImputer() using the median strategy named 'impute'. After that, fitting and applying the pipeline works exactly like any processing model, with fit_transform() and transform(). Again, please do not forget to never use fit_transform() on testing data! I will not draw the histograms again since they are exactly the same as before. Instead, you can try it yourself, or see the result in the provided notebook.
from sklearn.pipeline import Pipeline
# combine the two steps into a single pipeline, in the order they should run
data_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('impute', SimpleImputer(strategy='median'))
])
# fit on the training data, then apply the same transformations to the test data
train_processed = train.copy()
train_processed[:] = data_pipeline.fit_transform(train)
test_processed = test.copy()
test_processed[:] = data_pipeline.transform(test)
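One convenient detail worth knowing (not used in the code above, but part of SKLearn's standard Pipeline API) is that the fitted steps remain accessible through the pipeline's named_steps attribute under the names we gave them, so we can still inspect what each step learned from the training data:
# inspect the parameters each fitted step estimated during fit_transform()
print(data_pipeline.named_steps['scale'].mean_)        # column means learned by StandardScaler
print(data_pipeline.named_steps['impute'].statistics_) # column medians used by SimpleImputer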
Conclusion
With this post, we now know the concept of a very powerful tool: the processing pipeline. Pipelines tremendously ease a lot of tasks for us analysts, so you should really take some time to understand them. With that, I will now conclude this post. See you again soon!