Handle Missing Data

An illustration of handling missing data with different methods: constants, nearest neighbors, and regression

Missing data is prevalent in analytics. Missing values are fields in your data without a valid value, and they must be addressed. Otherwise, most analytical models will simply omit the rows that have missing values. Furthermore, depending on the libraries you use, models may not run at all and throw a bunch of errors at you. So, in this post, let us explore ways to handle missing data.

Data for demonstration

For demonstration purposes, I have modified the students data to have more missing values and call it students_missing.csv. You can also find the Jupyter notebook here. Now back to the code: we first import pandas, numpy, and matplotlib.pyplot right from the beginning. Then, we check info() and the histograms to have a reference of the original data.
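If you want to follow along, a minimal sketch of this setup could look like the following; the variable name data is my own choice, and the file is assumed to sit in the working directory.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the modified students data (assumed file name from the post)
data = pd.read_csv("students_missing.csv")

data.info()                 # non-null counts reveal which columns have missing values
data.hist(figsize=(10, 8))  # reference histograms of the original numeric columns
plt.tight_layout()
plt.show()
```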

The results of info() show that we have a few different levels of missing data here. State, FamilyIncome, and HighSchoolGPA have about 2-3% missing, AvgDailyStudyTime over 6%, and TotalAbsence 18%. These numbers are quite important for choosing methods to handle missing data, as we will see next.
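If you prefer exact percentages over reading non-null counts off info(), a quick one-liner (not from the original notebook, just a convenience) gives the missing fraction per column:

```python
# Fraction of missing values per column, largest first
data.isna().mean().sort_values(ascending=False)
```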

Dropping rows with missing data

Just like with outliers, “don’t like, don’t use” is certainly an option for handling missing data. And also just like with outliers, this option is not a good one most of the time. First, we lose data, and maybe a lot at that. Second, missing data are there for a reason. They may come from totally random events like mistakes in data storage or collection, which are fine to remove. However, they may also relate to patterns in your data, in which case removing them will bias your analysis. For example, in survey data, some people refuse to answer certain questions for some underlying reason. In this case, missing values correlate with a specific problem, and removing them introduces flaws into the analysis.

Even when missing values are completely random, removing them can still cause issues like losing a good chunk of data. For example, in the students data, we can just use one simple call to dropna() on the dataframe with the option axis=0 to drop rows. Look at the result though: we end up with just 701 rows, meaning that we have just lost 30% of our data!
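A sketch of that call, keeping the result in a separate dataframe so the original data stays intact:

```python
# Drop every row that contains at least one missing value (axis=0 means rows)
data_dropped = data.dropna(axis=0)
print(len(data_dropped))  # roughly 701 of the original 1000 rows survive
```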

Data is valuable, so let us try to salvage it instead of throwing it away. For that reason, next, we discuss imputation.

Handle missing data by imputation

Imputation refers to the process of filling missing values in data. What to fill in those fields though? There are several ways, so let us go through them one by one.

Imputation with constants

For categorical data

One fairly common way is to select a “good” value in the data to fill in missing places. In categorical data, it is super easy. You can simply create a new class, call it missing or NA or whatever, and assign it to all missing fields in the column. In Pandas, we can call fillna() from the dataframe and pass the value for the new missing class. For example, in the students data, we perform the imputation as below. You can see that after fillna(), State no longer has missing values. Upon checking its frequency table, we also see a new class missing with a count of 18.
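A sketch of that imputation, assuming a working copy of the data called data_imp_cat:

```python
# Fill missing State values with a new "missing" class
data_imp_cat = data.copy()
data_imp_cat["State"] = data_imp_cat["State"].fillna("missing")

print(data_imp_cat["State"].isna().sum())    # 0: no missing values left
print(data_imp_cat["State"].value_counts())  # the new "missing" class shows up with count 18
```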

We can also use values like the most frequent class (the mode) to fill in missing fields. However, I personally do not like this method since the assumption is a bit too strong.

For numeric data

For numeric data, measures of central tendency like the mean or median are good candidates. However, as we have discussed a few times already, the mean is influenced by outliers and could be misleading. Therefore, a very common filling method you will see is median imputation. In short, we just use the median of a column to replace all of its missing values. This can be done with fillna() like previously. In the example below, we use the columns' medians, calculated with data_imp_num[num_cols].median(), as the filling values. Then we pull up info() and the histograms for some result evaluation.
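A sketch of the median imputation, where num_cols is assumed to list the numeric columns mentioned in this post:

```python
# Numeric columns to impute (assumed list based on the columns discussed here)
num_cols = ["FamilyIncome", "HighSchoolGPA", "AvgDailyStudyTime", "TotalAbsence"]

# Replace missing values in each column with that column's median
data_imp_num = data.copy()
data_imp_num[num_cols] = data_imp_num[num_cols].fillna(data_imp_num[num_cols].median())

data_imp_num.info()                           # non-null counts should now all be 1000
data_imp_num[num_cols].hist(figsize=(10, 8))  # compare shapes against the original histograms
plt.tight_layout()
plt.show()
```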

First, we can see that the non-null counts of all numeric columns are now 1000, indicating no missing values are left. Looking at the histograms, though, we may see some issues. All columns get a higher peak at the median location, which is understandable, because we use the median to fill missing values. However, while the new peaks for HighSchoolGPA and FamilyIncome are okay, those of AvgDailyStudyTime and TotalAbsence are just too high. In the case of TotalAbsence, the distribution looks very different from before.

So, can you guess the issue of median imputation now? It gets worse as the proportion of missing data increases. A column with 2-3% missing can be imputed just fine, as we see with HighSchoolGPA and FamilyIncome. 6% like AvgDailyStudyTime looks a bit odd, but acceptable. However, TotalAbsence just straight up changes to a different distribution, which is highly undesirable. So, my suggestion is: while using the median is safe and will not bias the data, you should watch how much data is being imputed. I personally would not use this method for columns with over 10% missing.

Imputation with models

Using constants is just not good if the proportion of missing data is considerable. In such cases, we may opt for imputation models to handle missing data. These are quite advanced methods, so we need a specialized library, scikit-learn. We will discuss regression imputation and nearest neighbor imputation.

Regression imputation

Regression means using data from some columns to infer the value of a target. In regression imputation, we build a model with the column having missing data as the target. Any missing fields are then replaced by the predictions made by that model. Below is an illustrative example of regression imputation. Assume GPA is the column with a missing value. This method first builds a regression model that predicts GPA using StudyTime, SleepTime, and TotalAbsence based on the valid portion of the data. Then, the model predicts and fills the missing GPA of the incomplete row using its other available information. We will discuss regression in much more detail in a future post.

An illustration of handling missing data with regression models

Back to Python, we need to import two modules for this imputation since it is still experimental in SKLearn. The main class for this method is IterativeImputer. Using models from SKLearn is very easy, though. We first create an empty model; I call it imputer. Then, we call fit_transform() to both train it and apply it to the data in one step to obtain the result. One small note is that regression imputation is only for numeric data, so we need to slice out those columns only. Also, the results of SKLearn models are no longer Pandas dataframes but just NumPy arrays, so we need to manually change them back to use dataframe functions. Now, we can see all missing values get filled, and the distributions stay pretty much the same for all columns.
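A sketch of those steps, reusing the assumed num_cols list from the median example:

```python
# IterativeImputer is still experimental, so this extra import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Create the empty model, then train and apply it in one step on the numeric columns
imputer = IterativeImputer()
imputed_array = imputer.fit_transform(data[num_cols])  # result is a NumPy array

# Wrap the array back into a dataframe to keep using pandas functions
data_imp_reg = pd.DataFrame(imputed_array, columns=num_cols)
data_imp_reg.info()
data_imp_reg.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()
```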

Nearest neighbor imputation

An illustration of nearest neighbors

In data analytics, the nearest neighbors of an instance are those closest to it. How do we define “close”? There are a lot of ways, but we will use the simplest one in this case: “close” simply means having a small Euclidean distance. Instead of talking math, just imagine having a scatter plot of your data (like the one above). From one dot, select the three others that are closest to it, and you have obtained the three nearest neighbors of that instance. While we cannot have a scatter plot of more than three features, the concept is the same. Nearest neighbors are those with the smallest “geographical” distance to an instance.

So how are nearest neighbors related to imputation? Well, the idea is that instances that are closer to each other tend to be more similar. For example, people in the same neighborhood tend to have similar income, demographics, etc. So, if someone knows some of your neighbors' incomes, they can more or less guess yours with some accuracy. Back in analytics, the same concept applies. A missing field of an instance can be guessed using its nearest neighbors' available information, for example, by averaging, or averaging with distances as weights.
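To make the idea concrete, here is a tiny from-scratch sketch (not from the post, with made-up numbers): we fill one missing GPA by averaging the GPAs of the three rows closest in the other features, where “closest” means smallest Euclidean distance.

```python
import numpy as np

# Toy data: two known features (e.g. StudyTime, SleepTime) and a GPA column
features = np.array([[2.0, 7.0], [2.1, 6.5], [5.0, 1.0], [2.3, 6.8], [2.2, 7.2]])
gpa      = np.array([3.1, 3.0, 2.2, 3.3, np.nan])   # last row is missing its GPA

# Euclidean distances from the incomplete row to every complete row
target = features[-1]
dists = np.linalg.norm(features[:-1] - target, axis=1)

# Indices of the 3 nearest neighbors, then average their GPAs as the guess
nearest = np.argsort(dists)[:3]
gpa_guess = gpa[:-1][nearest].mean()
print(gpa_guess)
```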

An illustration of handling missing data with the nearest neighbors method
Nearest Neighbor Imputation with SKLearn

In Python, we still use SKLearn for this method. The model class is now KNNImputer. Like before, we first create an empty model. This time, we have a new argument n_neighbors, which is the number of nearest neighbors you want to use to guess the missing values. In this example, I set it to 10. Having the empty model, we then train and apply it with fit_transform() just like before. The final distributions look quite nice, at least compared to the median method.
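A sketch of the nearest neighbor imputation, again assuming the num_cols list from before:

```python
from sklearn.impute import KNNImputer

# Empty model that will use the 10 nearest neighbors to guess missing values
imputer = KNNImputer(n_neighbors=10)
imputed_array = imputer.fit_transform(data[num_cols])  # NumPy array again

# Back to a dataframe for histograms and other pandas functions
data_imp_knn = pd.DataFrame(imputed_array, columns=num_cols)
data_imp_knn.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()
```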

One important note when using this approach is that finding nearest neighbors is very sensitive to differences in the columns' scales. In this data set, FamilyIncome, having the largest range, will dominate all other columns when finding neighbors of instances. For that reason, we usually need to scale the data before this step. Do not worry though, I will discuss scaling right in the next post!

Wrapping up

In this post, we have discussed several methods to handle missing data. To sum up, you should not drop rows with missing values but rather impute them. Categorical data is easy, as you can just create a new class for the missing values. For numeric data, you can safely use the median when the missing proportion is small. If it gets high, regression or nearest neighbor imputation is probably better. However, beware of these two methods as well, because they inflate the correlations in your data. Now, this post is quite a bit longer than I wanted it to be, so I will stop here. See you again next time!
