So far, we have discussed distribution analysis and correlation analysis when initially exploring data. One important task in this phase is to determine whether there is anything that will cause problems in later phases of your analysis. And just like people, every data set has its own quirks. Nevertheless, some issues show up more often than others, namely outliers, missing values, rare values, and coded values. So, in this post, I will discuss some ways to detect them in data. Handling them will be the topic of another post on preprocessing data.
Outliers
Outliers are values that are very different from the rest of the data and can bias your analysis. The outlier concept usually applies to numerical data; for categorical data, the counterpart is rare values, discussed later in this post. Now you may ask, how different is “very”? Actually, it is up to you! As it turns out, defining outliers is fairly subjective and left to the analyst. A quick and easy way to spot them is the histogram. If you see a histogram extending far to one side with seemingly no data there, that is a sign of outliers. The reason is that outliers are very different from regular data, which makes them very distant from the data center. And their frequency is so low that they barely show up on the plot.
Below is one such example. You can see that most of the values in this data lie between 0 and 0.4 to 0.5, and a very small portion goes up to 0.7. However, the horizontal axis extends to 1.0 while there does not seem to be anything there. So, we can be certain that there are outliers around that area. It is also safe to consider values around 0.8 outliers. However, when it comes to those around 0.6, you get to decide for yourself.
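If you want to produce this kind of histogram yourself, here is a minimal sketch with pandas and matplotlib, using the `students1000.csv` data that appears later in this post (whether this particular column actually contains outliers depends, of course, on your data):

```python
import pandas as pd
import matplotlib.pyplot as plt

students = pd.read_csv("students1000.csv")

# Isolated bars far from the bulk of the data, with a long empty
# stretch in between, are the telltale sign of outliers.
students["HighSchoolGPA"].plot.hist(bins=30)
plt.xlabel("HighSchoolGPA")
plt.show()
```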
Missing values
Missing values are also among the common data issues. They are problematic because most analyses cannot utilize rows with anything missing. Unlike outliers, however, they are really easy to detect. The `info()` function of a dataframe gives you the exact non-null count of each column. Just take the total number of entries and subtract each column's non-null count from it. For example, in the `students1000.csv` data, `info()` gives the result below. The total number of entries is 1000, so the first six columns have no missing values. `State` has 1000-982=18 missing values, `AvgDailyStudyTime` 15, and `TotalAbsence` 10. Easy enough, right? Handling them is more troublesome though. However, that is also a story for another day.
```python
students.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   StudentID          1000 non-null   int64
 1   FirstName          1000 non-null   object
 2   LastName           1000 non-null   object
 3   Major              1000 non-null   object
 4   HighSchoolGPA      1000 non-null   float64
 5   FamilyIncome       1000 non-null   int64
 6   State              982 non-null    object
 7   AvgDailyStudyTime  985 non-null    float64
 8   TotalAbsence       990 non-null    float64
 9   FirstYearGPA       1000 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB
```
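By the way, if you would rather not do the subtraction by hand, pandas can count the missing values directly. A small sketch, assuming the dataframe is loaded as `students` like above:

```python
import pandas as pd

students = pd.read_csv("students1000.csv")

# Number of missing values per column, no mental math required.
print(students.isna().sum())
```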
Rare values
Rare values are like outliers, but for categorical data. The issue with rare values is that their frequencies are too low to be representative of a group. For example, if you know 200 people from New York, you may have a general idea of what an average New Yorker is like. But if you only know one or two of them, you cannot conclude anything, right? Now, how low should a frequency be for a value to be rare? As with outliers, you have to make this decision yourself. To detect rare values, just look at the bar chart and find classes whose bars are very low or barely visible. One example is below; you can probably see immediately which classes are rare in this case.
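Besides the bar chart, a frequency table makes rare classes explicit. Here is a small sketch using the `State` column of the same students data; the one-percent threshold is just an assumption you should adjust to your own situation:

```python
import pandas as pd

students = pd.read_csv("students1000.csv")

# Class frequencies, largest first; the rare classes sit at the tail.
counts = students["State"].value_counts()
print(counts.tail(10))

# Or flag every class below a chosen threshold, e.g. 1% of all rows.
rare = counts[counts < 0.01 * len(students)]
print(rare)
```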
Coded values
Coded values are another one of the common data issues. In short, codes are values that look normal, at least in terms of format, but carry a different meaning. For example, in an `age` column of survey data, we may have regular values from 0 to 100, but also some 999's. These 999's are surely not people's ages, yet they are numbers, and they are in the column. So, they are likely codes for things like “not available” or “refused to answer”. Coded values are more troublesome than outliers because their values are usually very extreme so as not to overlap regular values. Furthermore, they are numerically incorrect. And their frequency can be much higher than that of outliers. So, if we do not handle them correctly, coded values may severely bias our analysis.
So, how do we detect coded values? The first thing, and I really mean it, is to read the data documentation if there is any. If your data comes from a good source, any codes should be listed and explained there. In the rare case that you have no documentation, the histogram is a good way to go. Codes may appear as a sudden peak at either end of the histogram. Again, codes are usually selected so that they do not overlap regular values, so they sit at the extremes. And because their frequency can be fairly high, they do show up on the histogram. One example is below. If you see such a peak, it is fairly certain that those values are codes instead of outliers.
Another thing you can do is to pull up a frequency table for that particular range. If all of them carry the same value (or the same few values), they are most definitely codes instead of actual data.
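As a concrete sketch of that check, suppose the survey data from the age example above is loaded as a dataframe (the data here is made up for illustration):

```python
import pandas as pd

# Made-up survey data; in practice, load your own file instead.
survey = pd.DataFrame({"age": [23, 35, 47, 999, 61, 999, 18, 52]})

# Frequency table for the suspicious range only. If everything
# there carries the same value (or the same few values), those
# are almost certainly codes, not actual ages.
suspicious = survey.loc[survey["age"] > 120, "age"]
print(suspicious.value_counts())
```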
Wrong values
Yes, you read that correctly. You may have wrong values in your data. They can arise for any number of reasons, but they are incorrect, and you must detect them. The thing is, there are not really any rules here. You just need to look a bit closer into your data and try to see whether there is anything unusual in the values (besides outliers and codes). Still, any of the tools we have learned may come in handy. For example, in the frequency table below, we can see two categories for "other": one with a lowercase `o` and one with an uppercase `O`. Neither is exactly wrong on its own, but having two categories for the same thing is definitely not correct.
| Purpose | Count |
|---|---|
| Debt Consolidation | 78552 |
| other | 6037 |
| Home Improvements | 5839 |
| Other | 3250 |
| Business Loan | 1569 |
| Buy a Car | 1265 |
| Medical Bills | 1127 |
| Buy House | 678 |
| Take a Trip | 573 |
| major_purchase | 352 |
| small_business | 283 |
| moving | 150 |
| wedding | 115 |
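One quick way to catch this particular kind of issue is to compare the number of distinct categories before and after normalizing letter case. A sketch, assuming the data above is loaded as `loans` with a `Purpose` column (both names are assumptions):

```python
import pandas as pd

# Hypothetical file name; replace with your own data.
loans = pd.read_csv("loans.csv")

# If lowercasing shrinks the number of distinct categories, some
# classes differ only by case, like "other" versus "Other".
print(loans["Purpose"].nunique())
print(loans["Purpose"].str.lower().nunique())
```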
As another example, I once had a histogram of credit scores with values much higher than 1000. In case you do not know, credit scores in the US are capped at 850. So, those values around 6000 to 7000 definitely indicate problems.
Or, I have seen patient data with heart rates of 0 or cholesterol levels of 0. So really, these kinds of issues can be anything. You just need to do your exploratory analysis carefully and keep a keen eye out for them.
Conclusion
In this post, I have discussed several types of common issues in data that you should pay attention to. They are by no means all of the possible issues, but they appear often enough that you should check for them in any data you see. So, keep yourself vigilant, and happy analyzing!