At this point, we have built a good understanding of NumPy arrays and Pandas dataframes and gained some hands-on practice with them. We can now start some analysis, and the first kind we will try is exploratory analysis. As previously discussed, exploratory analysis involves getting a general idea of what you have in your data set and what the data distributions are like. Additionally, we can examine the associations among features. Finally, we should determine whether our data has any issues that need fixing later on. Now, let us go through the concepts behind the terms I have just mentioned.
Distribution analysis
Examining distributions is one of the most common tasks in exploratory analysis. In my informal definition, a data distribution refers to how the values in the data are allocated in terms of locations and frequencies. For numeric columns, we are usually interested in where the majority of values are, and how they spread out from that area. For categorical columns, we basically can only look at the frequencies of the classes. Sounds a bit familiar? Yes, because we have discussed a few measurements in descriptive statistics like mean, median, mode, standard deviation, and variance.
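As a quick reminder of those measurements, here is a minimal sketch in Pandas. The `students` dataframe, its column names, and all of its values are made up for illustration; they are not from a real data set.

```python
import pandas as pd

# Hypothetical student data, invented for this example only
students = pd.DataFrame({
    "gpa": [3.0, 3.2, 2.8, 3.0, 3.5, 2.5, 3.9, 1.8],
    "major": ["Math", "Biology", "Math", "History",
              "Biology", "Math", "History", "Math"],
})

# Numeric column: where the values are centered and how far they spread
gpa_mean = students["gpa"].mean()
gpa_median = students["gpa"].median()
gpa_mode = students["gpa"].mode().tolist()  # mode() returns a Series because ties are possible
gpa_std = students["gpa"].std()             # sample standard deviation (ddof=1 by default)
gpa_var = students["gpa"].var()             # sample variance

# Categorical column: all we can really do is count each class
major_counts = students["major"].value_counts()

print(gpa_mean, gpa_median, gpa_mode, gpa_std, gpa_var)
print(major_counts)
```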
However, those numbers alone are hard to picture. They are also not enough to describe a distribution. For example, in a GPA data set, you see that many students have GPAs around 3. The number of students decreases gradually as their GPAs deviate from 3. There are a few students whose GPAs are below 2 or above 3.9. Even if we know that the GPA’s mean is 3 and its standard deviation is 0.6, that is still not enough to represent the transition that I have just described. Sure, there are also measurements like skewness, kurtosis, etc., but they are probably too technical for the level of practical analysis on which this blog focuses. Instead, we can utilize more intuitive tools like figures and charts.
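To make the idea of "looking at the shape" concrete, here is a rough sketch that bins some GPAs and prints a text histogram. The GPA values and the 0.5-wide bins are my own assumptions for illustration; a plotting library would draw the same counts as a bar chart.

```python
import numpy as np
import pandas as pd

# Hypothetical GPAs, invented for this example only
gpa = pd.Series([1.8, 2.4, 2.6, 2.7, 2.9, 2.9, 3.0, 3.0,
                 3.0, 3.1, 3.1, 3.2, 3.3, 3.4, 3.6, 3.95])

# Bin edges 0.0, 0.5, ..., 4.0 (the bin width is an arbitrary choice)
bin_edges = np.arange(0.0, 4.5, 0.5)

# Count how many GPAs fall into each bin
counts, edges = np.histogram(gpa, bins=bin_edges)

# Print one row per bin, with one '#' per student
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f}-{right:.1f} | {'#' * int(count)}")
```

The rows grow toward the 3.0 bin and shrink on both sides of it, which is exactly the transition that a mean and a standard deviation cannot show on their own.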
Feature association analysis
Another common task during exploratory analysis, feature association analysis means determining the correlation between features in your data. As this is a type of analysis that focuses on relationships, it takes multiple features, usually pairs, as input. Correlation between two numeric columns refers to the degree to which they change together, i.e., both increase, or one increases while the other decreases, or they change regardless of each other. For example, GPA may increase as study time increases, while it could decrease as the number of missed lectures increases.
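Here is a small sketch of what this looks like for numeric columns, using the Pearson correlation that `DataFrame.corr` computes by default. The column names and numbers are invented so that the pattern is easy to see; real data is rarely this clean.

```python
import pandas as pd

# Hypothetical data, invented for this example only
students = pd.DataFrame({
    "study_hours": [2, 4, 6, 8, 10, 12],
    "absences":    [9, 8, 6, 5, 3, 1],
    "gpa":         [2.1, 2.5, 2.9, 3.2, 3.6, 3.8],
})

# Pairwise Pearson correlation between all numeric columns
corr = students.corr()

# Close to +1: the two columns increase together
print(corr.loc["study_hours", "gpa"])
# Close to -1: one increases while the other decreases
print(corr.loc["absences", "gpa"])
```

A value near 0 would mean the two columns change regardless of each other, at least in a linear sense.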
Correlations that involve categorical columns usually focus on how much the distribution of values in one column differs across the classes of the other. For instance, GPAs of students in different majors may follow distributions with different means and standard deviations.
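One simple way to see such differences is to summarize the numeric column separately for each class. The sketch below assumes a made-up `students` dataframe with a `major` and a `gpa` column.

```python
import pandas as pd

# Hypothetical data, invented for this example only
students = pd.DataFrame({
    "major": ["Math", "Math", "Math", "Biology", "Biology", "Biology"],
    "gpa":   [3.6, 3.8, 3.4, 2.9, 3.1, 2.7],
})

# Summarize the GPA distribution within each major
gpa_by_major = students.groupby("major")["gpa"].agg(["count", "mean", "std"])
print(gpa_by_major)
```

In this toy data the two majors have the same spread but clearly different means, which hints at an association between major and GPA.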
As with distributions, we have measurements that can summarize correlations. However, intuitive visualizations are probably better in practice.
What kinds of issues
In exploratory analysis, we also try to identify issues in the data, which can be one of several things. First, outliers are extreme values in your data that are not where they should be, and very few instances have such values. Back to the example of GPA, outliers could be students who have a GPA below 0.5 or above 4. There are just a few of them, and their GPAs are very different from those of the rest of the students.
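There are many ways to flag outliers. The sketch below uses the common 1.5 × IQR rule of thumb, which is my choice for illustration and only one of several conventions; the GPA values are made up.

```python
import pandas as pd

# Hypothetical GPAs with two suspicious values, invented for this example only
gpa = pd.Series([2.8, 3.0, 3.1, 0.4, 2.9, 3.2, 3.0, 3.3, 2.7, 4.3, 3.1])

# Interquartile range: the spread of the middle 50% of the values
q1 = gpa.quantile(0.25)
q3 = gpa.quantile(0.75)
iqr = q3 - q1

# Rule of thumb: anything further than 1.5 * IQR from the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = gpa[(gpa < lower) | (gpa > upper)]

print(outliers)
```

Only the two extreme values are flagged here, while the GPAs clustered around 3 all stay inside the bounds.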
Other common issues that we look for during exploratory analysis are missing values and coded values. Missing values are values that are just not there in the data: all those NaN that we saw while working with NumPy and Pandas. In the student data, some students may fail to report their GPA, so you do not have that information for them. You now have missing values in the students’ GPAs.
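Counting missing values is usually the first check. Here is a minimal sketch, assuming a made-up `students` dataframe in which two students did not report a GPA.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this example only
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "gpa": [3.2, np.nan, 2.9, np.nan],
})

# Number of missing values in each column
missing_per_column = students.isna().sum()

# Share of students with a missing GPA
missing_share = students["gpa"].isna().mean()

print(missing_per_column)
print(missing_share)
```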
Coded values, on the other hand, are values that carry meanings different from the usual ones. For example, GPAs of students who refused to answer may be coded as 99, GPAs of students whom you could not contact may be coded as 98, etc. So the values are there and they carry meaning, just not the usual 0-4 meaning of a GPA.
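Coded values are dangerous because they look like ordinary numbers. The sketch below uses the 99 and 98 codes from the example above on a made-up series; turning them into NaN with `mask` is one possible treatment, not the only one.

```python
import pandas as pd

# Hypothetical raw GPAs where 99 and 98 are codes, not real GPAs
gpa_raw = pd.Series([3.1, 99, 2.7, 98, 3.8])

# What each code means, following the example in the text
code_meanings = {99: "refused to answer", 98: "could not be contacted"}

# Flag the coded entries and count how often each code appears
is_coded = gpa_raw.isin(list(code_meanings))
coded_counts = gpa_raw[is_coded].value_counts()

# Replace the codes with NaN so they no longer distort statistics
gpa_clean = gpa_raw.mask(is_coded)

print(gpa_raw.mean())    # inflated by the codes
print(gpa_clean.mean())  # computed on real GPAs only
```

Leaving the codes in place pushes the mean far outside the 0-4 range, which is often how you notice them in the first place.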
What’s next?
This post describes some common things that we look at during an exploratory analysis. In the next few posts, I will go into the details of how to do each type of analysis with the tools we have learned. So, stay tuned, and see you again!