an illustration of handling outliers methods

Now that we have had a good idea on what to do or can be done during an exploratory analysis, it is time to move on to data preprocessing! So, what do we do in data processing, and why is that? If you remember, my most recent post was about analyzing issues with your data. Surely we do not just do that for fun. We find issues to fix them. And to do that, we have to perform certain types of transformations on data. This step is data cleaning (or data cleansing), one part of data preprocessing. Another task we would like to do is to engineer new features that are beneficial to our analysis. In this post, we will go through one step in first task, more specifically, handling outliers. There are three common ways to do this, and we will now discuss them one by one.

Loading and examining data

Like usual, we use the students1000.csv to demonstrate. This time, we will import numpy for some transformation. Finally, because StudentID is not of interests, we drop it for now. The data is below, and you can download the complete Jupyer notebook here.

students1000.csv Download

In [1]:

import pandas as pd
import numpy as np

data = pd.read_csv('students1000.csv')
data = data.drop('StudentID', axis=1)
data.head(2)

Out[1]:

	FirstName	LastName	Major	HighSchoolGPA	FamilyIncome	State	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
0	Baxter	Dengler	Computer Science	2.82	45013	WA	2.01	14.0	1.93
1	Christian	Wickey	Data Science	3.07	128358	GA	5.41	NaN	2.76

Next, we can take a look at the histograms and descriptive statistics to quickly scan for outliers.

In [22]:

import matplotlib.pyplot as plt

data.hist(bins=20, figsize=(8,8))
plt.show()

In [3]:

data.describe()

Out[3]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	1000.000000	1.000000e+03	985.000000	990.000000	1000.000000
mean	3.017420	1.393459e+05	6.132305	17.222222	2.702000
std	0.491055	2.009381e+05	2.346976	5.785613	0.546332
min	1.390000	1.737800e+04	0.000000	1.000000	1.200000
25%	2.670000	5.031125e+04	4.610000	13.000000	2.360000
50%	3.030000	9.203800e+04	6.180000	17.000000	2.695000
75%	3.340000	1.741448e+05	7.760000	21.000000	3.072500
max	4.000000	4.125854e+06	13.970000	37.000000	4.000000

Based on the above results, I would say only FamilyIncome has clear issues. Finally, we will create a list of all numeric columns for a bit more convenience in writing codes later on. Now, we will discuss methods of handling outliers.

In [5]:

num_cols = ['HighSchoolGPA','FamilyIncome','AvgDailyStudyTime','TotalAbsence','FirstYearGPA']

Dropping rows with outliers

If you don’t like it, don’t use it. Yes, that spirit does apply here. We can simply remove rows with outliers from the data. However, I would not recommend this as your default go-to for handling outliers. There are a few reasons. First, doing this, all data in the rows, not just those extreme values, are gone. So, we are losing data by doing this. With more columns, you may potentially sacrifice a good chunk of valuable data. Second, outliers, while annoying, are there for a reason. They may come from wrong data settings or mistakes during data collection which are okay to remove. However, they may also just occur naturally while uncommonly, in which case, dropping them would bias your analysis more than not. Finally, even when outliers are wrong measurement, there are better ways to handle them. So, I am listing this method here for completion purpose only.

Using mean and standard deviation

Probably the most common way to detect outliers is to use their distances from the mean. In this case, the distance is calculated as a number of standard deviation, for example, 4. In other words, any values beyond 4 times standard deviation from the mean is considered outlier. Common values of factor here are from 3 to 5.

The code below performs outlier removal using a standard deviation factor of n_sd. You can reuse this for other data sets (assuming they are in a dataframe called data) just by changing the list num_cols. Looking at describe() results afterward, we can see the number of rows in the dataframe becomes 971, meaning 29 were dropped. The max of FamilyIncome is now 866k, and the rest stay the same. So using this method, only FamilyIncome is considered having outliers.

In [11]:

n_sd = 4 

data_nums = data[num_cols]
data_means = data_nums.mean() 
data_stds = data_nums.std()
data_drp_outl = data[(np.abs(data_nums - data_means)/data_stds < n_sd).all(axis=1)]

In [13]:

data_drp_outl.describe()

Out[13]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	971.000000	971.000000	971.000000	971.000000	971.000000
mean	3.016200	129266.330587	6.141936	17.226571	2.703522
std	0.489568	108027.666420	2.345634	5.801542	0.546798
min	1.390000	17378.000000	0.000000	1.000000	1.200000
25%	2.665000	50046.500000	4.610000	13.000000	2.360000
50%	3.030000	90722.000000	6.190000	17.000000	2.690000
75%	3.340000	173138.000000	7.770000	21.000000	3.080000
max	4.000000	866534.000000	13.970000	37.000000	4.000000

Code explanation

Line by line, the code does the following (you can skip this part if not interested in technical numpy):
1. set the factor n_sd = 4 (again, changeable to your preference)
2. create a slice data_nums with only numeric columns for easier access
3. get the means of all numeric columns and store them in data_means
4. get the standard deviation of all numeric columns and store them in data_stds
5. slice those in between n_sd standard deviations from the mean. First, the expression (np.abs(data_nums - data_means)/data_stds < n_sd) returns an 2D array of the same shape with data_nums with element being True if the original value is okay, and False if outliers. Call all(axis=1) from the array result in a 1D array where an element is True if the row do not contain outliers, and False otherwise. Finally, using the 1D array in slicing yields the dataframe with all rows having outliers dropped. I would recommend running each part separately and observe their results to truly understand what happens here.

Using median and interquartile range

As we discussed some times ago, outliers influence the mean and may make it misleading as the true center in data. Because standard deviation is calculated using the mean, it gets exaggerated as well. In cases where data is very skewed with big outliers like FamilyIncome, we may prefer using the median and interquartile – IQR based distance. The idea, however, is similar as before. Values that are above a certain number of IQR from the median are considered outliers. The IQR factor is usually from 2.2 to 4.

In terms of code, it is very similar like before, we just change to calculate median and IQR instead of mean and standard deviation. You can just reuse by changing n_iqr and the num_cols list. Here, I use n_iqr of 2.5 (which is actually weaker than n_sd of 4), however, we can see more rows are dropped. The max FamilyIncome is now just a bit below 400k.

In [15]:

n_iqr = 2.5                                                      

data_nums = data[num_cols]
data_med = data_nums.median()                                    
data_iqr = data_nums.quantile(0.75) - data_nums.quantile(0.25)   
data_drp_outl = data[(np.abs(data_nums - data_med)/data_iqr < n_iqr).all(axis=1)]

In [16]:

data_drp_outl.describe()

Out[16]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	942.000000	942.000000	942.000000	942.000000	942.000000
mean	3.021964	118544.063694	6.176327	17.233546	2.710679
std	0.488518	87823.709141	2.332453	5.726369	0.546230
min	1.390000	17378.000000	0.000000	1.000000	1.200000
25%	2.670000	49324.750000	4.630000	13.000000	2.372500
50%	3.030000	87877.000000	6.230000	17.000000	2.695000
75%	3.350000	164070.250000	7.780000	21.000000	3.080000
max	4.000000	399778.000000	13.970000	34.000000	4.000000

Dropping fields with outliers

A less extreme way of removing outliers is to replace them with nan instead of deleting the whole row. This method, of course, makes these values become missing which requires another process of handling missing data to be completely fixed. Regardless, it is more acceptable than removing rows, so you can try if interested. Similar like before, we still rely on the mean and standard deviation or median and IQR to determine outliers before marking them as missing. First is an example with mean and standard deviation:

In [25]:

n_sd = 4

data_drp_outl = data.copy()
data_nums = data[num_cols]
data_means = data_nums.mean()
data_stds = data_nums.std()
data_drp_outl[np.abs(data_nums - data_means)/data_stds > n_sd] = np.nan

In [26]:

data_drp_outl.describe()

Out[26]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	1000.000000	996.000000	985.000000	990.000000	1000.000000
mean	3.017420	129729.229920	6.132305	17.222222	2.702000
std	0.491055	108286.605765	2.346976	5.785613	0.546332
min	1.390000	17378.000000	0.000000	1.000000	1.200000
25%	2.670000	50218.250000	4.610000	13.000000	2.360000
50%	3.030000	91606.000000	6.180000	17.000000	2.695000
75%	3.340000	173164.250000	7.760000	21.000000	3.072500
max	4.000000	866534.000000	13.970000	37.000000	4.000000

And using median and IQR:

In [27]:

n_iqr = 2.5                                                     

data_drp_outl = data.copy()
data_nums = data[num_cols]
data_med = data_nums.median()                                   
data_iqr = data_nums.quantile(0.75) - data_nums.quantile(0.25)
data_drp_outl[np.abs(data_nums - data_med)/data_iqr > n_iqr] = np.nan

In [28]:

data_drp_outl.describe()

Out[28]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	1000.000000	966.000000	985.000000	990.000000	1000.000000
mean	3.017420	118518.401656	6.132305	17.222222	2.702000
std	0.491055	87497.528478	2.346976	5.785613	0.546332
min	1.390000	17378.000000	0.000000	1.000000	1.200000
25%	2.670000	49324.750000	4.610000	13.000000	2.360000
50%	3.030000	88634.000000	6.180000	17.000000	2.695000
75%	3.340000	163885.250000	7.760000	21.000000	3.072500
max	4.000000	399778.000000	13.970000	37.000000	4.000000

Now, we can see, only the number of valid values in FamilyIncome drops in both cases, the rest of the columns are unchanged. There are just a few changes in the codes. First, we now make a copy of the whole dataframe from the beginning since we will directly slice and modify it by the end. Second, the slice changed to filter for outliers instead of regular values so that we can set them to nan.

Clipping outliers

Dropping outliers is not a good way of handling outliers in general. If you are sure outliers come from errors and not natural, you can clip them instead. Clipping means to choose a range, and any values outside of the range are set to the closest boundary. For example, if I have an array of [-100, -10, -5, 4, 8, 10, 14, 1000] and choose a clip range of -10 to 10, the clipped array becomes [-10, -10, -5, 4, 8, 10, 10, 10].

In Python, we use the function clip() from numpy to perform this operation. It takes three inputs, the array or dataframe to clip, the lower bound, and the upper bound. We will also set option axis=1 to clip by rows. Similar like dropping, we can set the range use mean and standard deviation or median and IQR. The example for the students data using the mean and standard deviation is as below.

In [18]:

n_sd = 4

data_clp_outl = data.copy()
data_nums = data[num_cols]
data_means = data_nums.mean() 
data_stds = data_nums.std()     
data_clp_outl[num_cols] = np.clip(data_nums, data_means - n_sd*data_stds, data_means + n_sd*data_stds, axis=1)

In [19]:

data_clp_outl.describe()

Out[19]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	1000.000000	1000.000000	985.000000	990.000000	1000.000000
mean	3.017420	132982.706401	6.132305	17.222222	2.702000
std	0.491055	119655.207186	2.346976	5.785613	0.546332
min	1.390000	17378.000000	0.000000	1.000000	1.200000
25%	2.670000	50311.250000	4.610000	13.000000	2.360000
50%	3.030000	92038.000000	6.180000	17.000000	2.695000
75%	3.340000	174144.750000	7.760000	21.000000	3.072500
max	4.000000	943098.350343	13.970000	37.000000	4.000000

Using median and IQR:

In [20]:

n_iqr = 2.5                                                      

data_clp_outl = data.copy()
data_nums = data[num_cols]
data_med = data_nums.median()                                    
data_iqr = data_nums.quantile(0.75) - data_nums.quantile(0.25)
data_clp_outl[num_cols] = np.clip(data_nums, data_med - n_iqr*data_iqr, data_med + n_iqr*data_iqr, axis=1)

In [21]:

data_clp_outl.describe()

Out[21]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	1000.000000	1000.000000	985.000000	990.000000	1000.000000
mean	3.017420	128143.915500	6.132305	17.222222	2.702000
std	0.491055	100151.179815	2.346976	5.785613	0.546332
min	1.390000	17378.000000	0.000000	1.000000	1.200000
25%	2.670000	50311.250000	4.610000	13.000000	2.360000
50%	3.030000	92038.000000	6.180000	17.000000	2.695000
75%	3.340000	174144.750000	7.760000	21.000000	3.072500
max	4.000000	401621.750000	13.970000	37.000000	4.000000

In both cases, the dataframe still has 1000 rows after handling outliers. However, we can still see the difference between the two distance methods. The mean approach keep the max FamilyIncome at over 900k, and the median approach 401k. Which one is better? You decide!

Leave them be

Yes, you see it correctly. Just leave them be. This is a valid way of handling outliers too. Especially in data where outliers are expected, naturally occur, and are more than just a few (over 2-3% data is outliers). Furthermore, there are certain transformations like logarithm that will deflate the influences of outliers by a lot, however we will discuss this in another post.

Conclusion

In this post, we have discussed quite a few methods of handling outliers. And just like determining outliers, these are fairly subjective to you as the analysis, and are situational on the data sets as well. So, it is always good to try to understand your data and outliers before attempting to treat them. This post is very long now, so I will see you again, and happy analyzing!

Handling Outliers