Train-test Split

[Figure: an illustration of a train-test split in predictive analysis]

Predictive analysis is a major branch of data analytics in which we apply knowledge learned from historical data to new data. However, there is one potential issue in this type of analysis. Sometimes, the knowledge that we learned is too specific to the historical data and fails to generalize to new data. This effect is called overfitting. Therefore, it is a good idea to verify whether what we learned generalizes well. Unfortunately, we cannot use that new data to evaluate the obtained knowledge, since it only arrives after the analysis. This leads to one approach: simulate that both historical and new data are available by performing a train-test split on all the data we have at the moment. So, let us dive in and see how this works.

Train-test split

Train-test split means splitting all rows in the data into two portions, one for training and one for testing. The training portion simulates the historical data, and the testing portion the new data. After this point, we carry on with the analysis as if we only had the training data on hand. The testing data is left alone until we have obtained the final knowledge. Its purpose is evaluation and nothing else. Of course, we need to apply all preprocessing transformations to both data sets, because they should undergo the same treatments. However, the necessary statistics, for example, the mean and standard deviation for scaling, or the median for imputing, are all calculated from the training data only. We will discuss this in further detail when we get to processing pipelines. For now, let us continue to methods for splitting data.
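To make the "fit on training data only" idea concrete, here is a minimal sketch using scikit-learn's StandardScaler on a synthetic feature matrix (the random data here is just a stand-in, not the post's actual data set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a numeric feature matrix
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(1000, 2))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Calculate the scaling statistics from the training portion only...
scaler = StandardScaler().fit(X_train)

# ...then apply the same transformation to both portions
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The scaler's mean and standard deviation come from the
# training data, not from the full data set
print(scaler.mean_)
```

Note that both portions are transformed with the same scaler, so the testing data gets exactly the treatment that the training data defined.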

Data to demonstrate

In this post, I will use the students-honor.csv data. It has three numeric columns, TutorSessions, AccumCredit, and GPA, and one binary column, isHonor. You can download the data below, and get the complete Jupyter notebook here. A quick check of the data's shape shows that it has 1000 rows. We also draw histograms for all columns as references for later on.
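The loading and inspection steps can be sketched as follows. If you have downloaded the file, a plain pd.read_csv("students-honor.csv") is all you need; since the file may not be at hand, this snippet builds a synthetic stand-in with the same column names (the value ranges are assumptions, not the real data):

```python
import numpy as np
import pandas as pd

# With the downloaded file, simply use:
# df = pd.read_csv("students-honor.csv")

# Synthetic stand-in so the snippet runs without the file
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "TutorSessions": rng.integers(0, 20, n),
    "AccumCredit": rng.integers(0, 120, n),
    "GPA": rng.uniform(0.0, 4.0, n).round(2),
    "isHonor": rng.integers(0, 2, n),
})

# Check the number of rows and columns
print(df.shape)

# Draw histograms of all columns for later reference
# (requires matplotlib):
# df.hist(figsize=(10, 8))
```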

Random split

Just like how it sounds, this method splits the rows randomly. We start by selecting a ratio, for example, 70% training and 30% testing. Then, 70% of the rows are randomly selected for the training set, and the remaining 30% go to the testing set. And that is it!

To perform a random split in Python, we can use the train_test_split() function from sklearn.model_selection. For our current usage, this function takes the whole data set as input and outputs the training and testing portions as two dataframes. We also set test_size=0.2, indicating a split of 80% training and 20% testing. In terms of results, there are 800 training rows and 200 testing rows, which match the defined ratio exactly. We can also compare the histograms between the two portions and verify that they are very similar to those before splitting. While you can see some differences, they are fairly minor. Furthermore, the testing data is fairly small at 200 rows, so its histograms tend to be a bit more “jagged”.
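A minimal version of this random split looks like the snippet below. It uses a synthetic stand-in for students-honor.csv (same columns, assumed value ranges) so that it runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for students-honor.csv
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "TutorSessions": rng.integers(0, 20, 1000),
    "AccumCredit": rng.integers(0, 120, 1000),
    "GPA": rng.uniform(0.0, 4.0, 1000).round(2),
    "isHonor": rng.integers(0, 2, 1000),
})

# 80% training, 20% testing; random_state makes the split reproducible
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))  # 800 200
```

Setting random_state is optional but recommended, since it lets you (and your readers) reproduce the exact same split later.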

Most of the time, a random split is very effective at preserving data distributions (the same before and after splitting, and between training and testing). In some rare cases though, we may want the training and testing distributions to be as similar as possible. This criterion, however, is not guaranteed in random splits, so we must opt for a stratified split.

Stratified split

In this method, we select a stratifying column whose distribution will be kept as similar as possible between the training data and the testing data after splitting. In SKLearn, we still use train_test_split(), with a stratify= option added to define the stratifying column. Note that, at this moment, this function only accepts a single categorical column for stratifying. Below, we perform the split with isHonor as the stratifying column. As you can see, its distributions are now exactly the same between the training and testing data.
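The stratified version only changes one argument. Again using a synthetic stand-in for the data set, a sketch looks like this:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for students-honor.csv
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "TutorSessions": rng.integers(0, 20, 1000),
    "AccumCredit": rng.integers(0, 120, 1000),
    "GPA": rng.uniform(0.0, 4.0, 1000).round(2),
    "isHonor": rng.integers(0, 2, 1000),
})

# stratify= keeps the isHonor class proportions (nearly) identical
# in the training and testing portions
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["isHonor"], random_state=42
)

# Compare the proportion of honor students in each portion
print(train_df["isHonor"].mean(), test_df["isHonor"].mean())
```

Comparing the two printed proportions (or the value_counts() of isHonor in each portion) confirms that the stratification worked.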

Wrapping up

In this post, we have discussed the justification and methods for train-test splitting. While simple in both concept and execution, this is a very important step in any predictive analysis. Personally, I always split my data before any major steps in my analysis, and I would highly recommend that you adopt this practice too. That is all for this post. See you next time!
