Now that we have a good idea of what can be done during an exploratory analysis, it is time to move on to data preprocessing! So, what do we do in data preprocessing, and why? If you remember, my most recent post was about finding issues in your data. Surely we do not do that just for fun: we find issues so that we can fix them, and to do that, we have to perform certain transformations on the data. This step is data cleaning (or data cleansing), one part of data preprocessing. Another task is to engineer new features that are beneficial to our analysis. In this post, we will go through one step of the first task, more specifically, handling outliers. There are a few common ways to do this, and we will discuss them one by one.
Loading and examining data
As usual, we use the students1000.csv file to demonstrate. This time, we will also import numpy for some of the transformations. Finally, because StudentID is not of interest here, we drop it for now. The data is below, and you can download the complete Jupyter notebook here.
```python
import pandas as pd
import numpy as np

data = pd.read_csv('students1000.csv')
data = data.drop('StudentID', axis=1)
data.head(2)
```
| | FirstName | LastName | Major | HighSchoolGPA | FamilyIncome | State | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|---|---|---|---|
0 | Baxter | Dengler | Computer Science | 2.82 | 45013 | WA | 2.01 | 14.0 | 1.93 |
1 | Christian | Wickey | Data Science | 3.07 | 128358 | GA | 5.41 | NaN | 2.76 |
Next, we can take a look at the histograms and descriptive statistics to quickly scan for outliers.
```python
import matplotlib.pyplot as plt

data.hist(bins=20, figsize=(8,8))
plt.show()

data.describe()
```
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
count | 1000.000000 | 1.000000e+03 | 985.000000 | 990.000000 | 1000.000000 |
mean | 3.017420 | 1.393459e+05 | 6.132305 | 17.222222 | 2.702000 |
std | 0.491055 | 2.009381e+05 | 2.346976 | 5.785613 | 0.546332 |
min | 1.390000 | 1.737800e+04 | 0.000000 | 1.000000 | 1.200000 |
25% | 2.670000 | 5.031125e+04 | 4.610000 | 13.000000 | 2.360000 |
50% | 3.030000 | 9.203800e+04 | 6.180000 | 17.000000 | 2.695000 |
75% | 3.340000 | 1.741448e+05 | 7.760000 | 21.000000 | 3.072500 |
max | 4.000000 | 4.125854e+06 | 13.970000 | 37.000000 | 4.000000 |
Based on the above results, I would say only FamilyIncome has clear issues. Next, we will create a list of all numeric columns for a bit more convenience when writing code later on. With that done, we can discuss the methods of handling outliers.
```python
num_cols = ['HighSchoolGPA','FamilyIncome','AvgDailyStudyTime','TotalAbsence','FirstYearGPA']
```
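If you prefer not to type the column names out, one possible alternative (not from the original post) is to build the list from the dataframe's dtypes instead:

```python
# Alternative sketch: collect every numeric column automatically
num_cols = data.select_dtypes(include='number').columns.tolist()
print(num_cols)
```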
Dropping rows with outliers
If you don’t like it, don’t use it. Yes, that spirit applies here: we can simply remove rows that contain outliers from the data. However, I would not recommend this as your default go-to for handling outliers, for a few reasons. First, all the data in those rows, not just the extreme values, is gone, so we lose information; with more columns, you may sacrifice a good chunk of valuable data. Second, outliers, while annoying, are there for a reason. They may come from wrong settings or mistakes during data collection, in which case removing them is fine. However, they may also occur naturally, just uncommonly, and in that case dropping them would bias your analysis more than keeping them. Finally, even when outliers are wrong measurements, there are better ways to handle them. So, I am listing this method here mostly for completeness.
Using mean and standard deviation
Probably the most common way to detect outliers is to use their distances from the mean. In this case, the distance is measured in numbers of standard deviations, for example, 4. In other words, any value more than 4 standard deviations away from the mean is considered an outlier. Common values for this factor are from 3 to 5.
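To make the cutoffs concrete, the short sketch below (not part of the original notebook) computes the per-column bounds for a factor of 4. For FamilyIncome in this data set, the upper bound works out to roughly 943k, which is why the much larger values in that column get flagged.

```python
# Per-column cutoffs at 4 standard deviations from the mean
n_sd = 4
lower = data[num_cols].mean() - n_sd * data[num_cols].std()
upper = data[num_cols].mean() + n_sd * data[num_cols].std()
print(upper['FamilyIncome'])  # roughly 943,098 for this data set
```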
The next code block performs the actual outlier removal using a standard deviation factor of n_sd. You can reuse it for other data sets (assuming they are in a dataframe called data) just by changing the num_cols list. Looking at the describe() results afterward, we can see the number of rows in the dataframe becomes 971, meaning 29 were dropped. The max of FamilyIncome is now about 866k, while the other columns stay largely the same. So, using this method, only FamilyIncome is considered to have outliers.
```python
n_sd = 4
data_nums = data[num_cols]
data_means = data_nums.mean()
data_stds = data_nums.std()
data_drp_outl = data[(np.abs(data_nums - data_means)/data_stds < n_sd).all(axis=1)]
data_drp_outl.describe()
```
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
count | 971.000000 | 971.000000 | 971.000000 | 971.000000 | 971.000000 |
mean | 3.016200 | 129266.330587 | 6.141936 | 17.226571 | 2.703522 |
std | 0.489568 | 108027.666420 | 2.345634 | 5.801542 | 0.546798 |
min | 1.390000 | 17378.000000 | 0.000000 | 1.000000 | 1.200000 |
25% | 2.665000 | 50046.500000 | 4.610000 | 13.000000 | 2.360000 |
50% | 3.030000 | 90722.000000 | 6.190000 | 17.000000 | 2.690000 |
75% | 3.340000 | 173138.000000 | 7.770000 | 21.000000 | 3.080000 |
max | 4.000000 | 866534.000000 | 13.970000 | 37.000000 | 4.000000 |
Code explanation
Line by line, the code does the following (you can skip this part if you are not interested in the numpy technicalities):
1. Set the factor n_sd = 4 (again, changeable to your preference).
2. Create a slice data_nums with only the numeric columns for easier access.
3. Get the means of all numeric columns and store them in data_means.
4. Get the standard deviations of all numeric columns and store them in data_stds.
5. Keep only the rows within n_sd standard deviations of the mean. First, the expression (np.abs(data_nums - data_means)/data_stds < n_sd) returns a 2D boolean result of the same shape as data_nums, with an element being True if the original value is okay and False if it is an outlier. Note that comparisons involving missing values also evaluate to False, so rows containing NaN get dropped along the way. Calling all(axis=1) on that result gives a 1D array where an element is True if the row contains no outliers and False otherwise. Finally, using this 1D array to slice data yields the dataframe with all offending rows dropped. I would recommend running each part separately and observing the results to truly understand what happens here, as sketched below.
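For example, a minimal way to inspect those intermediate pieces (just a sketch reusing the variables defined above) could look like this:

```python
# Element-wise check: True where the value is within n_sd standard deviations of the mean
mask = np.abs(data_nums - data_means) / data_stds < n_sd
print(mask.head())

# Row-wise flag: True only when every numeric value in the row passes the check
row_ok = mask.all(axis=1)
print(row_ok.value_counts())
```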
Using median and interquartile range
As we discussed some time ago, outliers influence the mean and may make it misleading as the true center of the data. Because the standard deviation is calculated using the mean, it gets exaggerated as well. In cases where the data is very skewed with big outliers, like FamilyIncome, we may prefer a distance based on the median and the interquartile range (IQR). The idea, however, is similar to before: values more than a certain number of IQRs away from the median are considered outliers. The IQR factor is usually from 2.2 to 4.
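As a quick illustration (again, not in the original notebook), the corresponding bounds for a factor of 2.5 can be computed the same way; for FamilyIncome, the upper bound comes out to roughly 401k.

```python
# Per-column cutoffs at 2.5 IQRs from the median
n_iqr = 2.5
med = data[num_cols].median()
iqr = data[num_cols].quantile(0.75) - data[num_cols].quantile(0.25)
print((med + n_iqr * iqr)['FamilyIncome'])  # roughly 401,622 for this data set
```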
In terms of code, it is very similar to before; we just calculate the median and IQR instead of the mean and standard deviation. You can again reuse it by changing n_iqr and the num_cols list. Here, I use an n_iqr of 2.5, and even more rows are dropped than with an n_sd of 4, largely because the median and IQR, unlike the mean and standard deviation, are not inflated by the outliers themselves. The max FamilyIncome is now just a bit below 400k.
```python
n_iqr = 2.5
data_nums = data[num_cols]
data_med = data_nums.median()
data_iqr = data_nums.quantile(0.75) - data_nums.quantile(0.25)
data_drp_outl = data[(np.abs(data_nums - data_med)/data_iqr < n_iqr).all(axis=1)]
data_drp_outl.describe()
```
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
count | 942.000000 | 942.000000 | 942.000000 | 942.000000 | 942.000000 |
mean | 3.021964 | 118544.063694 | 6.176327 | 17.233546 | 2.710679 |
std | 0.488518 | 87823.709141 | 2.332453 | 5.726369 | 0.546230 |
min | 1.390000 | 17378.000000 | 0.000000 | 1.000000 | 1.200000 |
25% | 2.670000 | 49324.750000 | 4.630000 | 13.000000 | 2.372500 |
50% | 3.030000 | 87877.000000 | 6.230000 | 17.000000 | 2.695000 |
75% | 3.350000 | 164070.250000 | 7.780000 | 21.000000 | 3.080000 |
max | 4.000000 | 399778.000000 | 13.970000 | 34.000000 | 4.000000 |
Dropping fields with outliers
A less extreme way of removing outliers is to replace them with NaN instead of deleting the whole row. This, of course, turns those values into missing data, which then requires a separate missing-data handling step to be completely fixed. Regardless, it is more acceptable than removing entire rows, so you can try it if interested. As before, we still rely on the mean and standard deviation, or the median and IQR, to determine the outliers before marking them as missing. First is an example with the mean and standard deviation:
```python
n_sd = 4
data_drp_outl = data.copy()
data_nums = data[num_cols]
data_means = data_nums.mean()
data_stds = data_nums.std()
data_drp_outl[np.abs(data_nums - data_means)/data_stds > n_sd] = np.nan
data_drp_outl.describe()
```
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
count | 1000.000000 | 996.000000 | 985.000000 | 990.000000 | 1000.000000 |
mean | 3.017420 | 129729.229920 | 6.132305 | 17.222222 | 2.702000 |
std | 0.491055 | 108286.605765 | 2.346976 | 5.785613 | 0.546332 |
min | 1.390000 | 17378.000000 | 0.000000 | 1.000000 | 1.200000 |
25% | 2.670000 | 50218.250000 | 4.610000 | 13.000000 | 2.360000 |
50% | 3.030000 | 91606.000000 | 6.180000 | 17.000000 | 2.695000 |
75% | 3.340000 | 173164.250000 | 7.760000 | 21.000000 | 3.072500 |
max | 4.000000 | 866534.000000 | 13.970000 | 37.000000 | 4.000000 |
And using median and IQR:
```python
n_iqr = 2.5
data_drp_outl = data.copy()
data_nums = data[num_cols]
data_med = data_nums.median()
data_iqr = data_nums.quantile(0.75) - data_nums.quantile(0.25)
data_drp_outl[np.abs(data_nums - data_med)/data_iqr > n_iqr] = np.nan
data_drp_outl.describe()
```
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
count | 1000.000000 | 966.000000 | 985.000000 | 990.000000 | 1000.000000 |
mean | 3.017420 | 118518.401656 | 6.132305 | 17.222222 | 2.702000 |
std | 0.491055 | 87497.528478 | 2.346976 | 5.785613 | 0.546332 |
min | 1.390000 | 17378.000000 | 0.000000 | 1.000000 | 1.200000 |
25% | 2.670000 | 49324.750000 | 4.610000 | 13.000000 | 2.360000 |
50% | 3.030000 | 88634.000000 | 6.180000 | 17.000000 | 2.695000 |
75% | 3.340000 | 163885.250000 | 7.760000 | 21.000000 | 3.072500 |
max | 4.000000 | 399778.000000 | 13.970000 | 37.000000 | 4.000000 |
Now we can see that, in both cases, only the number of valid values in FamilyIncome drops; the rest of the columns are unchanged. There are just a few changes in the code. First, we now make a copy of the whole dataframe at the beginning, since we will slice and modify it directly at the end. Second, the condition is flipped to filter for outliers instead of regular values, so that we can set them to NaN.
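If you want to confirm exactly how many values were turned into missing ones, one quick check (just a sketch reusing the dataframe from the last block) is to compare the missing-value counts before and after:

```python
# Newly introduced missing values per column = count after marking minus count before
print(data_drp_outl[num_cols].isna().sum() - data[num_cols].isna().sum())
```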
Clipping outliers
Dropping outliers, in general, is not a great way of handling them. If you are sure the outliers come from errors rather than natural variation, you can clip them instead. Clipping means choosing a range; any value outside of that range is set to the closest boundary. For example, if I have an array of [-100, -10, -5, 4, 8, 10, 14, 1000] and choose a clip range of -10 to 10, the clipped array becomes [-10, -10, -5, 4, 8, 10, 10, 10].
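A minimal sketch of that toy example with plain numpy:

```python
import numpy as np

toy = np.array([-100, -10, -5, 4, 8, 10, 14, 1000])
print(np.clip(toy, -10, 10))  # [-10 -10  -5   4   8  10  10  10]
```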
In Python, we use the clip() function from numpy to perform this operation. It takes three main inputs: the array or dataframe to clip, the lower bound, and the upper bound. We will also set axis=1 so that the per-column bounds, which are Series indexed by column name, line up with the matching columns. Similar to dropping, we can set the range using the mean and standard deviation or the median and IQR. The example for the students data using the mean and standard deviation is below.
```python
n_sd = 4
data_clp_outl = data.copy()
data_nums = data[num_cols]
data_means = data_nums.mean()
data_stds = data_nums.std()
data_clp_outl[num_cols] = np.clip(data_nums, data_means - n_sd*data_stds, data_means + n_sd*data_stds, axis=1)
data_clp_outl.describe()
```
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 985.000000 | 990.000000 | 1000.000000 |
mean | 3.017420 | 132982.706401 | 6.132305 | 17.222222 | 2.702000 |
std | 0.491055 | 119655.207186 | 2.346976 | 5.785613 | 0.546332 |
min | 1.390000 | 17378.000000 | 0.000000 | 1.000000 | 1.200000 |
25% | 2.670000 | 50311.250000 | 4.610000 | 13.000000 | 2.360000 |
50% | 3.030000 | 92038.000000 | 6.180000 | 17.000000 | 2.695000 |
75% | 3.340000 | 174144.750000 | 7.760000 | 21.000000 | 3.072500 |
max | 4.000000 | 943098.350343 | 13.970000 | 37.000000 | 4.000000 |
Using median and IQR:
```python
n_iqr = 2.5
data_clp_outl = data.copy()
data_nums = data[num_cols]
data_med = data_nums.median()
data_iqr = data_nums.quantile(0.75) - data_nums.quantile(0.25)
data_clp_outl[num_cols] = np.clip(data_nums, data_med - n_iqr*data_iqr, data_med + n_iqr*data_iqr, axis=1)
data_clp_outl.describe()
```
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 985.000000 | 990.000000 | 1000.000000 |
mean | 3.017420 | 128143.915500 | 6.132305 | 17.222222 | 2.702000 |
std | 0.491055 | 100151.179815 | 2.346976 | 5.785613 | 0.546332 |
min | 1.390000 | 17378.000000 | 0.000000 | 1.000000 | 1.200000 |
25% | 2.670000 | 50311.250000 | 4.610000 | 13.000000 | 2.360000 |
50% | 3.030000 | 92038.000000 | 6.180000 | 17.000000 | 2.695000 |
75% | 3.340000 | 174144.750000 | 7.760000 | 21.000000 | 3.072500 |
max | 4.000000 | 401621.750000 | 13.970000 | 37.000000 | 4.000000 |
In both cases, the dataframe still has 1000 rows after handling outliers. However, we can still see the difference between the two distance methods: the mean approach keeps the max FamilyIncome at over 900k, while the median approach brings it down to about 401k. Which one is better? You decide!
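If you are curious how many values were actually affected by the clipping, a small check (a sketch reusing the variables from the last block, with the IQR bounds) could be:

```python
# Count values per column that fall outside the median +/- 2.5*IQR range
outside = data_nums.lt(data_med - n_iqr * data_iqr) | data_nums.gt(data_med + n_iqr * data_iqr)
print(outside.sum())
```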
Leave them be
Yes, you read that correctly: just leave them be. This is a valid way of handling outliers too, especially in data where outliers are expected, occur naturally, and make up more than just a handful of observations (say, over 2-3% of the data). Furthermore, certain transformations, like the logarithm, deflate the influence of outliers by a lot; however, we will discuss those in another post.
Conclusion
In this post, we have discussed quite a few methods of handling outliers. And just like detecting outliers in the first place, choosing among these methods is fairly subjective to you as the analyst, and it depends on the data set as well. So, it is always good to try to understand your data and its outliers before attempting to treat them. This post is quite long already, so I will see you again, and happy analyzing!