Handle Missing Data

An illustration of handling missing data with different methods: constants, nearest neighbors, and regression

Missing data is prevalent in analytics. Missing values are fields in your data without a valid value, and they must be addressed. Otherwise, most analytical models will simply omit the rows that have missing values. Furthermore, depending on the libraries you use, models may not run at all and throw a bunch of errors at you. So, in this post, let us explore ways to handle missing data.

Data for demonstration

For demonstration purposes, I have modified the students data to have more missing values and call it students_missing.csv. You can also find the Jupyter notebook here. Now back to the code: we first import pandas, numpy, and matplotlib.pyplot right from the beginning. Then, we check info() and the histograms to have a reference of the original data.
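If you want to follow along, a minimal sketch of this setup could look like the following; the variable name data is my own choice, and the file is assumed to sit in the working directory.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the modified students data (assumed file name from the post)
data = pd.read_csv("students_missing.csv")

data.info()                 # non-null counts reveal which columns have missing values
data.hist(figsize=(10, 8))  # reference histograms of the original numeric columns
plt.tight_layout()
plt.show()
```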

The results of info() show that we have a few different levels of missing data here. State, FamilyIncome, and HighSchoolGPA have about 2-3% missing, AvgDailyStudyTime over 6%, and TotalAbsence 18%. These numbers are quite important for choosing methods to handle missing data, as we will see next.
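If you prefer exact percentages over reading non-null counts off info(), a quick one-liner (not from the original notebook, just a convenience) gives the missing fraction per column:

```python
# Fraction of missing values per column, largest first
data.isna().mean().sort_values(ascending=False)
```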

Dropping rows with missing data

Just like with outliers, “don’t like, don’t use” is certainly an option for handling missing data. And also just like with outliers, this option is not a good one most of the time. First, we lose data, and maybe a lot at that. Second, missing data are there for a reason. They may come from totally random events like mistakes in data storage or collection, which are fine to remove. However, they may also relate to patterns in your data, in which case removing them will bias your analysis. For example, in survey data, some people refuse to answer certain questions for some underlying reason. In this case, missing values correlate with a specific problem, and removing them introduces flaws into the analysis.

Even when missing values are completely random, removing them can still cause issues like losing a good chunk of data. For example, in the students data, we can just use one simple call to dropna() on the dataframe with the option axis=0 to drop rows. Look at the result though: we end up with just 701 rows, meaning that we have just lost 30% of our data!
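A sketch of that call, keeping the result in a separate dataframe so the original data stays intact:

```python
# Drop every row that contains at least one missing value (axis=0 means rows)
data_dropped = data.dropna(axis=0)
print(len(data_dropped))  # roughly 701 of the original 1000 rows survive
```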

Data is valuable, so let us try to salvage it instead of throwing it away. For that reason, next, we discuss imputation.

Handle missing data by imputation

Imputation refers to the process of filling missing values in data. What to fill in those fields though? There are several ways, so let us go through them one by one.

Imputation with constants

For categorical data

One fairly common way is to select a “good” value in the data to fill in missing places. In categorical data, it is super easy. You can simply create a new class, call it missing or NA or whatever, and assign it to all missing fields in the column. In Pandas, we can call fillna() from the dataframe and pass the value for the new missing class. For example, in the students data, we perform the imputation as below. You can see that after fillna(), State no longer has missing values. Upon checking its frequency table, we also see a new class missing with a count of 18.
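A sketch of that imputation, assuming a working copy of the data called data_imp_cat:

```python
# Fill missing State values with a new "missing" class
data_imp_cat = data.copy()
data_imp_cat["State"] = data_imp_cat["State"].fillna("missing")

print(data_imp_cat["State"].isna().sum())    # 0: no missing values left
print(data_imp_cat["State"].value_counts())  # the new "missing" class shows up with count 18
```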

We can also use values like the most frequent class (the mode) to fill in missing fields. However, I personally do not like this method since the assumption is a bit too strong.

For numeric data

For numeric data, measures of central tendency like the mean or median are good candidates. However, as we have discussed a few times already, the mean is influenced by outliers and could be misleading. Therefore, a very common filling method you will see is median imputation. In short, we just use the median of a column to replace all of its missing values. This can be done with fillna() like previously. In the example below, we use the columns' medians, calculated with data_imp_num[num_cols].median(), as the filling values. Then we pull up info() and the histograms for some result evaluation.
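A sketch of the median imputation, where num_cols is assumed to list the numeric columns mentioned in this post:

```python
# Numeric columns to impute (assumed list based on the columns discussed here)
num_cols = ["FamilyIncome", "HighSchoolGPA", "AvgDailyStudyTime", "TotalAbsence"]

# Replace missing values in each column with that column's median
data_imp_num = data.copy()
data_imp_num[num_cols] = data_imp_num[num_cols].fillna(data_imp_num[num_cols].median())

data_imp_num.info()                           # non-null counts should now all be 1000
data_imp_num[num_cols].hist(figsize=(10, 8))  # compare shapes against the original histograms
plt.tight_layout()
plt.show()
```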

First, we can see that the non-null counts of all numeric columns are now 1000, indicating no missing values are left. Looking at the histograms, though, we may see some issues. All columns get a higher peak at the median location, which is understandable, because we use the median to fill missing values. However, while the new peaks for HighSchoolGPA and FamilyIncome are okay, those of AvgDailyStudyTime and TotalAbsence are just too high. In the case of TotalAbsence, the distribution looks very different from before.

So, can you guess the issue of median imputation now? It gets worse as the proportion of missing data increases. A column with 2-3% missing can be imputed just fine, as we see with HighSchoolGPA and FamilyIncome. 6% like AvgDailyStudyTime looks a bit odd, but acceptable. However, TotalAbsence just straight up changes to a different distribution, which is highly undesirable. So, my suggestion is: while using the median is safe and will not bias the data, you should watch how much data is being imputed. I personally would not use this method for columns with over 10% missing.

Imputation with models

Using constants is just not good if the proportion of missing data is considerable. In such cases, we may opt for imputation models to handle missing data. These are quite advanced methods, so we need a specialized library, scikit-learn. We will discuss regression imputation and nearest neighbor imputation.

Regression imputation

Regression means using data from some columns to infer the value of a target. In regression imputation, we build a model with the column having missing data as the target. Any missing fields are then replaced by the predictions made by that model. Below is an illustrative example of regression imputation. Assume GPA is the column with a missing value. This method first builds a regression model that predicts GPA using StudyTime, SleepTime, and TotalAbsence based on the valid portion of the data. Then, the model predicts and fills the missing GPA of the incomplete row using its other available information. We will discuss regression in much more detail in a future post.

An illustration of handling missing data with regression models

Back to Python, we need to import two modules for this imputation since it is still experimental in SKLearn. The main class for this method is IterativeImputer. Using models from SKLearn is very easy, though. We first create an empty model; I call it imputer. Then, we call fit_transform() to both train it and apply it to the data in one step to obtain the result. One small note is that regression imputation is only for numeric data, so we need to slice out those columns only. Also, the results of SKLearn models are no longer Pandas dataframes but just NumPy arrays, so we need to manually change them back to use dataframe functions. Now, we can see all missing values get filled, and the distributions stay pretty much the same for all columns.
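A sketch of those steps, reusing the assumed num_cols list from the median example:

```python
# IterativeImputer is still experimental, so this extra import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Create the empty model, then train and apply it in one step on the numeric columns
imputer = IterativeImputer()
imputed_array = imputer.fit_transform(data[num_cols])  # result is a NumPy array

# Wrap the array back into a dataframe to keep using pandas functions
data_imp_reg = pd.DataFrame(imputed_array, columns=num_cols)
data_imp_reg.info()
data_imp_reg.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()
```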

Nearest neighbor imputation

An illustration of nearest neighbors

In data analytics, the nearest neighbors of an instance are those closest to it. How do we define “close”? There are a lot of ways, but we will use the simplest one in this case: “close” simply means having a small Euclidean distance. Instead of talking math, just imagine having a scatter plot of your data (like the one above). From one dot, select the three others that are closest to it, and you have obtained the three nearest neighbors of that instance. While we cannot have a scatter plot of more than three features, the concept is the same. Nearest neighbors are those with the smallest “geographical” distance to an instance.

So how are nearest neighbors related to imputation? Well, the idea is that instances that are closer to each other tend to be more similar. For example, people in the same neighborhood tend to have similar income, demographics, etc. So, if someone knows some of your neighbors' incomes, they can more or less guess yours with some accuracy. Back in analytics, the same concept applies. A missing field of an instance can be guessed using its nearest neighbors' available information, for example, by averaging, or averaging with distances as weights.
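To make the idea concrete, here is a tiny from-scratch sketch (not from the post, with made-up numbers): we fill one missing GPA by averaging the GPAs of the three rows closest in the other features, where “closest” means smallest Euclidean distance.

```python
import numpy as np

# Toy data: two known features (e.g. StudyTime, SleepTime) and a GPA column
features = np.array([[2.0, 7.0], [2.1, 6.5], [5.0, 1.0], [2.3, 6.8], [2.2, 7.2]])
gpa      = np.array([3.1, 3.0, 2.2, 3.3, np.nan])   # last row is missing its GPA

# Euclidean distances from the incomplete row to every complete row
target = features[-1]
dists = np.linalg.norm(features[:-1] - target, axis=1)

# Indices of the 3 nearest neighbors, then average their GPAs as the guess
nearest = np.argsort(dists)[:3]
gpa_guess = gpa[:-1][nearest].mean()
print(gpa_guess)
```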

An illustration of handling missing data with the nearest neighbors method
Nearest Neighbor Imputation with SKLearn

In Python, we still use SKLearn for this method. The model class is now KNNImputer. Like before, we first create an empty model. This time, we have a new argument n_neighbors, which is the number of nearest neighbors you want to use to guess the missing values. In this example, I set it to 10. Having the empty model, we then train and apply it with fit_transform() just like before. The final distributions look quite nice, at least compared to the median method.
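A sketch of the nearest neighbor imputation, again assuming the num_cols list from before:

```python
from sklearn.impute import KNNImputer

# Empty model that will use the 10 nearest neighbors to guess missing values
imputer = KNNImputer(n_neighbors=10)
imputed_array = imputer.fit_transform(data[num_cols])  # NumPy array again

# Back to a dataframe for histograms and other pandas functions
data_imp_knn = pd.DataFrame(imputed_array, columns=num_cols)
data_imp_knn.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()
```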

One important note when using this approach is that finding nearest neighbors is very sensitive to differences in the columns' scales. In this data set, FamilyIncome, having the largest range, will dominate all other columns when finding neighbors of instances. For that reason, we usually need to scale the data before this step. Do not worry though, I will discuss scaling right in the next post!

Wrapping up

In this post, we have discussed several methods to handle missing data. To sum up, you should not drop rows with missing values but rather impute them. Categorical data is easy, as you can just create a new class for the missing values. For numeric data, you can safely use the median when the missing proportion is small. If it gets high, regression or nearest neighbor imputation is probably better. However, beware of these two methods as well, because they inflate the correlations in your data. Now, this post is quite a bit longer than I wanted it to be, so I will stop here. See you again next time!
