So far, we have discussed distribution analysis and correlation analysis when initially exploring data. One important task in this phase is to determine whether there is anything that will cause problems in later phases of your analysis. And just like people, every data set has its own quirks. Nevertheless, some issues show up more often than others, namely outliers, missing values, rare values, and coded values. So, in this post, I will discuss some ways to detect them in data. Handling them will be the topic of another post on preprocessing data.
Outliers
Outliers are values that are very different from the rest of the data and can bias your analysis. The outlier concept usually applies to numerical data; for categorical data, the counterpart is rare values, discussed later in this post. Now you may ask, how different is “very”? Actually, it is up to you! As it turns out, defining outliers is fairly subjective and left to the analyst. A quick and easy way to spot them is the histogram. If you see a histogram extending far to one side with seemingly no data there, that is a sign of outliers. The reason is that outliers are very different from regular data, which makes them very distant from the data center. And their frequency is so low that they barely show up on the plot.
Below is one such example. You can see that most of the values in this data lie between 0 and 0.4 to 0.5, and a very small portion goes up to 0.7. However, the horizontal axis extends to 1.0 while there does not seem to be anything there. So, we can be certain that there are outliers around that area. It is also safe to consider values around 0.8 outliers. However, when it comes to those around 0.6, you get to decide for yourself.
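If you want to produce this kind of histogram yourself, here is a minimal sketch with pandas and matplotlib, using the `students1000.csv` data that appears later in this post (whether this particular column actually contains outliers depends, of course, on your data):

```python
import pandas as pd
import matplotlib.pyplot as plt

students = pd.read_csv("students1000.csv")

# Isolated bars far from the bulk of the data, with a long empty
# stretch in between, are the telltale sign of outliers.
students["HighSchoolGPA"].plot.hist(bins=30)
plt.xlabel("HighSchoolGPA")
plt.show()
```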
Missing values
Missing values are also among the common data issues. They are problematic because most analyses cannot utilize rows with anything missing. Unlike outliers, however, they are really easy to detect. The `info()` function of a dataframe gives you the exact non-null count of each column. Just take the total number of entries and subtract each column's non-null count from it. For example, in the `students1000.csv` data, `info()` gives the result below. The total number of entries is 1000, so the first six columns have no missing values. `State` has 1000-982=18 missing values, `AvgDailyStudyTime` 15, and `TotalAbsence` 10. Easy enough, right? Handling them is more troublesome though. However, that is also a story for another day.
```python
students.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   StudentID          1000 non-null   int64
 1   FirstName          1000 non-null   object
 2   LastName           1000 non-null   object
 3   Major              1000 non-null   object
 4   HighSchoolGPA      1000 non-null   float64
 5   FamilyIncome       1000 non-null   int64
 6   State              982 non-null    object
 7   AvgDailyStudyTime  985 non-null    float64
 8   TotalAbsence       990 non-null    float64
 9   FirstYearGPA       1000 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB
```
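By the way, if you would rather not do the subtraction by hand, pandas can count the missing values directly. A small sketch, assuming the dataframe is loaded as `students` like above:

```python
import pandas as pd

students = pd.read_csv("students1000.csv")

# Number of missing values per column, no mental math required.
print(students.isna().sum())
```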
Rare values
Rare values are like outliers, but for categorical data. The issue with rare values is that their frequencies are too low to be representative of a group. For example, if you know 200 people from New York, you may have a general idea of what an average New Yorker is like. But if you only know one or two of them, you cannot conclude anything, right? Now, how low should a frequency be for a value to be rare? As with outliers, you have to make this decision yourself. To detect rare values, just look at the bar chart and find classes whose bars are very low or barely visible. One example is below; you can probably see immediately which classes are rare in this case.
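Besides the bar chart, a frequency table makes rare classes explicit. Here is a small sketch using the `State` column of the same students data; the one-percent threshold is just an assumption you should adjust to your own situation:

```python
import pandas as pd

students = pd.read_csv("students1000.csv")

# Class frequencies, largest first; the rare classes sit at the tail.
counts = students["State"].value_counts()
print(counts.tail(10))

# Or flag every class below a chosen threshold, e.g. 1% of all rows.
rare = counts[counts < 0.01 * len(students)]
print(rare)
```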
Coded values
Coded values are another one of the common data issues. In short, codes are values that look normal, at least in terms of format, but carry a different meaning. For example, in an `age` column of survey data, we may have regular values from 0 to 100, but also some 999's. These 999's are surely not people's ages, yet they are numbers, and they are in the column. So, they are likely codes for things like “not available” or “refused to answer”. Coded values are more troublesome than outliers because their values are usually very extreme so as not to overlap regular values. Furthermore, they are numerically incorrect. And their frequency can be much higher than that of outliers. So, if we do not handle them correctly, coded values may severely bias our analysis.
So, how do we detect coded values? The first thing, and I really mean it, is to read the data documentation if there is any. If your data comes from a good source, any codes should be listed and explained there. In the rare case that you have no documentation, the histogram is a good way to go. Codes may appear as a sudden peak at either end of the histogram. Again, codes are usually selected so that they do not overlap regular values, so they sit at the extremes. And because their frequency can be fairly high, they do show up on the histogram. One example is below. If you see such a peak, it is fairly certain that those values are codes instead of outliers.
Another thing you can do is to pull up a frequency table for that particular range. If all of them carry the same value (or the same few values), they are most definitely codes instead of actual data.
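As a concrete sketch of that check, suppose the survey data from the age example above is loaded as a dataframe (the data here is made up for illustration):

```python
import pandas as pd

# Made-up survey data; in practice, load your own file instead.
survey = pd.DataFrame({"age": [23, 35, 47, 999, 61, 999, 18, 52]})

# Frequency table for the suspicious range only. If everything
# there carries the same value (or the same few values), those
# are almost certainly codes, not actual ages.
suspicious = survey.loc[survey["age"] > 120, "age"]
print(suspicious.value_counts())
```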
Wrong values
Yes, you read that correctly. You may have wrong values in your data. They can arise for any number of reasons, but they are incorrect, and you must detect them. The thing is, there are not really any rules here. You just need to look a bit closer into your data and try to see whether there is anything unusual in the values (besides outliers and codes). Still, any of the tools we have learned may come in handy. For example, in the frequency table below, we can see two categories for "other": one with a lowercase `o` and one with an uppercase `O`. Neither is exactly wrong on its own, but having two categories for the same thing is definitely not correct.
| Purpose | Count |
|---|---|
| Debt Consolidation | 78552 |
| other | 6037 |
| Home Improvements | 5839 |
| Other | 3250 |
| Business Loan | 1569 |
| Buy a Car | 1265 |
| Medical Bills | 1127 |
| Buy House | 678 |
| Take a Trip | 573 |
| major_purchase | 352 |
| small_business | 283 |
| moving | 150 |
| wedding | 115 |
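One quick way to catch this particular kind of issue is to compare the number of distinct categories before and after normalizing letter case. A sketch, assuming the data above is loaded as `loans` with a `Purpose` column (both names are assumptions):

```python
import pandas as pd

# Hypothetical file name; replace with your own data.
loans = pd.read_csv("loans.csv")

# If lowercasing shrinks the number of distinct categories, some
# classes differ only by case, like "other" versus "Other".
print(loans["Purpose"].nunique())
print(loans["Purpose"].str.lower().nunique())
```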
As another example, I once had a histogram of credit scores with values much higher than 1000. In case you do not know, credit scores in the US are capped at 850. So, those values around 6000 to 7000 definitely indicate problems.
Or, I have seen patient data with heart rates of 0 or cholesterol levels of 0. So really, these kinds of issues can be anything. You just need to do your exploratory analysis carefully and keep a keen eye out for them.
Conclusion
In this post, I have discussed several types of common issues in data that you should pay attention to. They are by no means all of the possible issues, but they appear often enough that you should check for them in any data you see. So, keep yourself vigilant, and happy analyzing!