Sun Tzu once said “know your data, know your models, a hundred analyses, a hundred wins”, or something along those lines. See, people in ancient times knew the importance of understanding data before any analysis, and so should you. And the first thing to do to understand your data is to analyze its distributions. So, in this post, let us go through the basic tools that we can use in distribution analysis. One small note: we will mainly stick to practical tools like simple measurements and visualizations instead of going deep into the statistical and technical details. So, let us start!
What is distribution analysis?
As I have discussed previously, distribution analysis means examining how the values of each feature are spread out, in terms of locations and frequencies. For numerical columns (again, features and columns are interchangeable), we mainly look at where the majority of values are, and how they disperse from those locations. The manner of dispersion is also important: gradual or rapid, balanced on both sides or skewed to one side, etc. For categorical columns, we can really only check the frequency of each class, though there is still useful information there.
Distribution analysis informs you on how to process data for more advanced modeling. Issues like outliers, missing values, and coded values are also likely to be detected in this step, though we will leave it to another post to discuss them in detail. Up next, we will look at tools to analyze distributions of numerical and categorical data.
Tools in distribution analysis
Descriptive statistics
To begin with, we can use descriptive statistics such as the mean, median, standard deviation, and variance for numeric columns. We have already discussed the concepts behind these measurements. The mean and median describe where the center of the data is, and the standard deviation and variance describe how the values disperse from the mean. For example, take a look at some descriptive statistics of some GPA data. We can observe that the majority of GPA values are from 2.1 to 3.3, centered around 2.8; some GPAs are from 1.5 to 2.1 and above 3.3; and a small number are below 1.5. This could suggest a distribution where values center around the mean and then gradually disperse to both sides.
Statistics | Value |
Minimum | 0.1 |
Mean | 2.7 |
Median | 2.8 |
Maximum | 4.0 |
Standard deviation | 0.61 |
GPA Range | Frequency |
0 – 1.5 | 12 |
1.5 – 2.1 | 36 |
2.1 – 2.7 | 83 |
2.7 – 3.3 | 95 |
3.3 – 4 | 46 |
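As a rough sketch, numbers like the ones in the two tables above could be computed with Python's standard library. The `gpas` list is a small made-up sample for illustration, not the data behind the tables, and the helper names `describe` and `bin_counts` are my own. The bin edges are the ones used in the frequency table.

```python
import statistics
from bisect import bisect_right

# Hypothetical GPA sample, made up for illustration only.
gpas = [0.9, 1.8, 2.0, 2.3, 2.4, 2.6, 2.8, 2.8, 2.9, 3.0, 3.1, 3.2, 3.4, 3.6, 4.0]


def describe(values):
    """Return the same summary statistics as the first table."""
    return {
        "minimum": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "maximum": max(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }


def bin_counts(values, edges):
    """Count values in each bin [edges[i], edges[i+1]).

    The last bin also includes its upper edge, so a GPA of 4.0 is counted.
    Values outside the overall range are skipped.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue
        i = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[i] += 1
    return counts


edges = [0, 1.5, 2.1, 2.7, 3.3, 4.0]  # same ranges as the frequency table
summary = describe(gpas)
counts = bin_counts(gpas, edges)

for name, value in summary.items():
    print(f"{name:<8} | {value:.2f}")
for lo, hi, n in zip(edges, edges[1:], counts):
    print(f"{lo} - {hi} | {n}")
```

Comparing the mean with the median is a quick first check: when they are close, as here, the values are likely spread fairly evenly around the center.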
So, knowing these numbers does give us some idea of how the data is distributed. But as you can see, numbers alone are sometimes difficult to picture.
Visualization for numeric distributions
Instead of just looking at numbers, we can certainly visualize data distributions. Histograms and boxplots are probably among the most frequently used types of visualizations for numerical distributions. They nicely illustrate all the needed information like the center, the degree of dispersion, the manner of dispersion, and extreme values. Examples of histograms and boxplots are shown below. Boxplots may need a bit more explanation, but I think histograms are very intuitive, and you may already have some idea of what they show. Regardless, we will spend a post discussing them.
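To make the two chart types a bit more concrete, here is a minimal text-only sketch of what each one computes. In practice you would most likely draw them with a plotting library; I use only the standard library here so the example runs anywhere. The `gpas` sample is made up, and `text_histogram` and `box_summary` are hypothetical helper names. The 1.5 × IQR rule for whiskers is a common boxplot convention, assumed here rather than taken from any specific tool.

```python
import statistics

# Hypothetical GPA sample with one unusually low value, for illustration only.
gpas = [0.1, 1.8, 2.0, 2.3, 2.4, 2.6, 2.8, 2.8, 2.9, 3.0, 3.1, 3.2, 3.4, 3.6, 4.0]


def text_histogram(values, edges):
    """Return one line per bin, with one '#' for each value in that bin."""
    lines = []
    for lo, hi in zip(edges, edges[1:]):
        last = hi == edges[-1]  # the last bin also includes its upper edge
        n = sum(1 for v in values if lo <= v < hi or (last and v == hi))
        lines.append(f"{lo:>4.1f} to {hi:<4.1f} | {'#' * n}")
    return lines


def box_summary(values):
    """Return the numbers a boxplot draws: quartiles, whisker ends, outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),   # lowest value that is not an outlier
        "whisker_high": max(inside),  # highest value that is not an outlier
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


hist_lines = text_histogram(gpas, [0, 1.5, 2.1, 2.7, 3.3, 4.0])
box = box_summary(gpas)

print("\n".join(hist_lines))
print(box)
```

The histogram shows the shape of the distribution bin by bin, while the boxplot summary compresses it into a few numbers and flags the very low GPA as a possible outlier.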
Distribution of categorical data
Finally, for categorical data, we have frequency tables, pie charts, and bar charts. Categorical data is (most of the time) not as complicated as numerical data, so its pool of tools for distribution analysis is not as large. Regardless, these tools are important to understand and are worth their own post. I will show a frequency table and its bar chart as examples.
State | Count |
GA | 106 |
TN | 37 |
AL | 23 |
FL | 22 |
NY | 6 |
WA | 6 |
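A frequency table like the one above can be built with `collections.Counter`, as in this small sketch. The counts are copied from the table; the `states` list is reconstructed from them only to imitate what a raw column of labels would look like, and the text bar chart is a crude stand-in for a real one.

```python
from collections import Counter

# Counts copied from the frequency table above.
table = {"GA": 106, "TN": 37, "AL": 23, "FL": 22, "NY": 6, "WA": 6}

# Rebuild a raw column with one label per row (an assumption about
# how the original data would look before counting).
states = [state for state, n in table.items() for _ in range(n)]

freq = Counter(states)      # frequency of each class
total = sum(freq.values())  # number of rows

# Print classes from most to least frequent, with share and a text bar.
for state, count in freq.most_common():
    share = count / total
    bar = "#" * round(share * 50)
    print(f"{state:<3} {count:>4} {share:>6.1%}  {bar}")
```

Sorting by frequency and adding the share of each class makes it easy to spot dominant classes (GA here) and very rare ones (NY and WA).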
Wrapping up
In this post, we have reviewed the concept of distribution analysis in more detail. We have also picked up some initial ideas about the tools to use in this step. Now, the next thing to do is to learn how to use them, right? So, for the next few posts, we will discuss these tools as well as try some hands-on analysis.
See you again!