An illustration of tools for analyzing numerical distributions, such as descriptive statistics, histograms, and boxplots

With an overview of distribution analysis in hand, let us put it into practice, starting with numerical data. We will use a mixture of Pandas and Matplotlib, a powerful Python package for visualization. As introduced previously, we rely on descriptive statistics and certain types of charts to represent numeric distributions. So, let us start!

Data for demonstration

Throughout this post, I will use the students1000.csv data. The complete notebook is available here. Similar to the other students data we have been using, this one contains information about students and their GPAs. I also added a few columns to cover more situations that we may encounter during analysis. As usual, we start the session by importing pandas and reading the data. We then call info() to get more information on the columns.

In [3]:

import pandas as pd

students = pd.read_csv('students1000.csv')
students.head(n=3)

Out[3]:

	StudentID	FirstName	LastName	Major	HighSchoolGPA	FamilyIncome	State	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
0	202303595	Baxter	Dengler	Computer Science	2.82	45013	WA	2.01	14.0	1.93
1	202309162	Christian	Wickey	Data Science	3.07	128358	GA	5.41	NaN	2.76
2	202306337	Lonnie	Wulff	Software Engineering	2.68	112392	GA	9.57	13.0	3.09

In [2]:

students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   StudentID          1000 non-null   int64  
 1   FirstName          1000 non-null   object 
 2   LastName           1000 non-null   object 
 3   Major              1000 non-null   object 
 4   HighSchoolGPA      1000 non-null   float64
 5   FamilyIncome       1000 non-null   int64  
 6   State              982 non-null    object 
 7   AvgDailyStudyTime  985 non-null    float64
 8   TotalAbsence       990 non-null    float64
 9   FirstYearGPA       1000 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB

Most columns have appropriate data types: categorical columns are object, and numerical columns are int64 or float64. The exception is StudentID, which was read as int64 even though it is an ID column and should not be treated as numeric. We will most likely drop it later, but for now, let us change it to object.

In [5]:

students['StudentID'] = students['StudentID'].astype('object')
students['StudentID'].dtype

Out[5]:

dtype('O')
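To see why this conversion matters, here is a minimal sketch on a tiny made-up dataframe (the values are illustrative and not taken from students1000.csv). It shows that describe() summarizes an integer ID column like any other number, but skips it by default once the column is object.

```python
import pandas as pd

# Tiny made-up frame standing in for the students data (illustrative values only).
demo = pd.DataFrame({
    "StudentID": [202300001, 202300002, 202300003],
    "FirstYearGPA": [2.5, 3.0, 3.5],
})

# While StudentID is int64, describe() treats it as a numeric feature.
before = demo.describe().columns.tolist()

# After casting to object, describe() leaves it out by default.
demo["StudentID"] = demo["StudentID"].astype("object")
after = demo.describe().columns.tolist()

print(before)  # ['StudentID', 'FirstYearGPA']
print(after)   # ['FirstYearGPA']
```

Statistics such as the mean of an ID carry no meaning, so keeping the column out of the summary avoids clutter.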

Now, we should be ready to move to the next part.

`describe()` on numeric distributions

The first thing we look at in numerical columns is their descriptive statistics. Luckily, Pandas provides the dataframe method describe(), which does exactly that: it generates a table summarizing the main statistics of all numeric columns. Note that the first statement below only sets the display format so that floats are printed with two decimal digits for readability.

In [7]:

pd.set_option('display.float_format', lambda x: '%.2f' % x)

students.describe()

Out[7]:

	HighSchoolGPA	FamilyIncome	AvgDailyStudyTime	TotalAbsence	FirstYearGPA
count	1000.00	1000.00	985.00	990.00	1000.00
mean	3.02	139345.95	6.13	17.22	2.70
std	0.49	200938.10	2.35	5.79	0.55
min	1.39	17378.00	0.00	1.00	1.20
25%	2.67	50311.25	4.61	13.00	2.36
50%	3.03	92038.00	6.18	17.00	2.70
75%	3.34	174144.75	7.76	21.00	3.07
max	4.00	4125854.00	13.97	37.00	4.00

Interpreting `describe()` results

By default, describe() returns a table whose columns are the numeric features in your data and whose rows are the statistics: count (the number of non-missing values), mean, standard deviation (std), minimum (min), 25%, 50% (the median), 75%, and maximum (max).

And what are the 25% and 75%? They are the 25th and 75th percentiles, which are calculated like the median, except that instead of taking the middle point of the sorted values, we take the 25% point and the 75% point. The difference between the 75th and 25th percentiles is called the interquartile range (IQR) and, like the standard deviation, is a measure of dispersion. By the same logic, min is the 0th percentile and max is the 100th.
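As a small illustration of these definitions, the sketch below computes the percentiles and the IQR with the Pandas method quantile(). The series is made up for the example and is not part of the students data.

```python
import pandas as pd

# Made-up values, already sorted so the percentile positions are easy to see.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = values.quantile(0.25)      # 25th percentile
median = values.quantile(0.50)  # 50th percentile, same as values.median()
q3 = values.quantile(0.75)      # 75th percentile
iqr = q3 - q1                   # interquartile range

print(q1, median, q3, iqr)                         # 3.0 5.0 7.0 4.0
print(values.quantile(0.0), values.quantile(1.0))  # 1.0 9.0 (min and max)
```

When a percentile falls between two data points, quantile() interpolates linearly between them by default.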

There are a few things to look at here:
– mean compared to std. If std is much higher than mean, the distribution could be heavily skewed or have other problems. A skewed distribution is one whose values spread out from the center differently on the two sides, so it is not symmetrical. Here, FamilyIncome is one such column (std of 200938.10 against a mean of 139345.95). The rest seem okay.
– mean compared to median. If mean is very different from median, the distribution could be skewed. In the describe() result, FamilyIncome shows signs of skewness (mean 139345.95, median 92038.00); the other columns look fine in this respect.
– min and max compared to mean, 25%, 50%, and 75%. If min or max is very far from the other four, there may be skewness and/or outliers. In this data, FamilyIncome has a very high max, while HighSchoolGPA has a relatively low min.

Overall, the result suggests that FirstYearGPA, AvgDailyStudyTime, and TotalAbsence seem to have symmetric distributions, HighSchoolGPA is slightly skewed towards the min, and FamilyIncome is heavily skewed towards the max with potential outliers.
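The mean-versus-median check can also be made numerically. The sketch below uses two made-up series (not the students data) and adds the Pandas method skew(), which this post does not otherwise use: it returns a value near zero for symmetric data and a positive value when the long tail is on the right.

```python
import pandas as pd

# Two made-up series: one symmetric, one with a single very large value.
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_skewed = pd.Series([1, 2, 3, 4, 5, 6, 70])

for name, s in [("symmetric", symmetric), ("right_skewed", right_skewed)]:
    # A mean far above the median and a positive skew() both point to a right tail.
    print(f"{name}: mean={s.mean():.2f}, median={s.median():.2f}, skew={s.skew():.2f}")
```

In the right-skewed series, the single large value pulls the mean up to 13 while the median stays at 4, the same pattern FamilyIncome shows in the describe() table.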

As you can see, reading numbers from a table, while not difficult, is not very convenient either. Fortunately, we have more intuitive tools for this task: histograms and boxplots.

Illustrating numeric distributions with histograms

A histogram is a type of figure that illustrates a numeric distribution through the frequencies of bins of values within the column. The data in the column is first split into bins that cover consecutive ranges of values. For each bin, the plot then draws a bar whose height represents the bin’s frequency. The horizontal axis shows the value ranges, and the vertical axis shows the frequencies. In the example histogram for this section, we can observe five values between 0 and 1.1, 14 between 1.1 and 2.2, 18 between 2.2 and 3.3, and so on.
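The binning step can be reproduced without drawing anything. The sketch below uses NumPy's histogram() function on a handful of made-up values (not the students data) to show the equal-width bin edges and the count in each bin, which are the numbers a histogram turns into bar heights.

```python
import numpy as np

# Made-up values for illustration.
values = np.array([0.5, 1.0, 1.5, 2.5, 2.7, 3.2, 3.9, 4.0])

# Split the range [0.5, 4.0] into 4 equal-width bins and count the values in each.
counts, edges = np.histogram(values, bins=4)

print(edges.tolist())   # [0.5, 1.375, 2.25, 3.125, 4.0]
print(counts.tolist())  # [2, 1, 2, 3]
```

Each bin includes its left edge but not its right edge, except for the last bin, which includes both.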

Drawing histograms is easy with Pandas and Matplotlib. Simply call hist() on a dataframe, and Pandas will automatically select all numeric columns and draw a histogram for each. hist() has a few options that I commonly change: bins= sets the number of bins (bars) in each histogram, figsize=(width, height) sets the width and height of the whole figure, and layout=(rows, columns) sets the number of rows and columns in the grid of subplots. In a notebook, calling hist() alone also prints its return value (the subplot objects) above the figure. To suppress that, we can call the show() function from Matplotlib's pyplot module afterwards. And of course, we have to import it before using it.

Putting it all together, let us draw histograms for the students data. The output confirms what we found with descriptive statistics. In all columns, the mean and median fall in areas containing the majority of the data. HighSchoolGPA is a bit left-skewed: its histogram has a small tail on the left side. The histograms of AvgDailyStudyTime, TotalAbsence, and FirstYearGPA look fairly symmetrical. Lastly, FamilyIncome has a very long and narrow tail to the right, indicating that it is heavily right-skewed with outliers.

In [45]:

import matplotlib.pyplot as plt

students.hist(bins=20, figsize=(10,6), layout=(2,3))
plt.show()
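To look at one column more closely, a histogram of a single series can be annotated with its mean and median. The sketch below is only an illustration under stated assumptions: the income values are synthetic (drawn from a log-normal distribution with arbitrary parameters), and the Agg backend line is there so the code runs without a display. In a notebook, drop that line and finish with plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for an income column (not the real FamilyIncome).
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=11.5, sigma=0.8, size=1000), name="income")

# Histogram of a single series, with vertical lines at the mean and the median.
ax = income.plot(kind="hist", bins=20, figsize=(6, 4), title="Synthetic income")
ax.axvline(income.mean(), color="red", linestyle="--", label="mean")
ax.axvline(income.median(), color="black", linestyle="-", label="median")
ax.legend()
```

With right-skewed data like this, the mean line sits to the right of the median line, pulled there by the long tail.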

Illustrating numeric distributions with boxplots

Boxplots are figures that illustrate five descriptive statistics of a numeric distribution: the median, the 25th percentile (Q1), the 75th percentile (Q3), Q1 - 1.5*IQR (the distribution min), and Q3 + 1.5*IQR (the distribution max), where IQR is the interquartile range, Q3 - Q1. In general, a boxplot looks like the illustration accompanying this section. The box spans Q1 to Q3, so 50% of the data always lies within the box. The majority of the data should fall between the distribution min and max, and anything outside that range is considered an outlier. In a boxplot, if the median sits in the middle of both the box and the tails, the distribution is symmetric. On the other hand, if either tail is very long compared to the box, the distribution is skewed. Some boxplots also mark the mean, for example with a diamond shape.
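The quantities behind a boxplot can be computed by hand. The sketch below does so for a small made-up series (not the students data), using the 1.5*IQR rule described above to flag outliers. The names lower_limit and upper_limit are just labels chosen for this example.

```python
import pandas as pd

# Made-up values with one clearly extreme point.
values = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 40])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr  # "distribution min"
upper_limit = q3 + 1.5 * iqr  # "distribution max"

# Anything outside the two limits is treated as an outlier.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, q3, iqr)               # 13.0 17.0 4.0
print(lower_limit, upper_limit)  # 7.0 23.0
print(outliers.tolist())         # [40]
```

Note that Matplotlib, with its default settings, draws each whisker at the most extreme data point that still lies inside these limits, so the whisker ends in a plot may sit slightly inside the computed values.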

Back to Pandas and Matplotlib: we can draw boxplots for all numerical columns using the dataframe method plot() with the options kind='box', subplots=True, and sharey=False. The last two give each column its own subplot with its own vertical scale, which matters because the columns have very different ranges. Optionally, we can set figsize to control the size of the whole figure. We should also adjust the space between the subplots for easier reading with pyplot.subplots_adjust(wspace=<space>). Now, let us apply this to the students data:

In [56]:

import matplotlib.pyplot as plt

students.plot(
    kind='box', 
    subplots=True, 
    sharey=False, 
    figsize=(12, 3)
)

# increase spacing between subplots
plt.subplots_adjust(wspace=0.5) 
plt.show()

Again, the boxplots confirm our findings from the descriptive statistics and histograms. They also show some outliers in every column. But as long as the outliers are not far from the distribution min and max (as they are in FamilyIncome), we do not have to worry.
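One possible way to put a number on "how far" is to measure the distance from a column's maximum to its upper limit in units of IQR. The sketch below does this for a made-up two-column dataframe. The column names echo the students data, but the values are invented, and the IQR-based distance is an illustrative choice rather than a standard threshold.

```python
import pandas as pd

# Invented values: one column with a mild outlier, one with an extreme outlier.
demo = pd.DataFrame({
    "TotalAbsence": [13, 15, 16, 17, 18, 19, 21, 30],
    "FamilyIncome": [40, 50, 60, 70, 80, 90, 100, 4000],  # in thousands
})

q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Count the values outside the limits in each column.
outlier_counts = ((demo < lower_limit) | (demo > upper_limit)).sum()

# Distance from each column's max to its upper limit, measured in IQRs.
excess = (demo.max() - upper_limit) / iqr

print(outlier_counts)
print(excess.round(2))
```

Both columns have exactly one outlier here, but the one in TotalAbsence lies about 1.3 IQRs beyond the limit, while the one in FamilyIncome lies more than 100 IQRs beyond it.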

Conclusion

In this post, we discussed and practiced analyzing numeric distributions. As you can see, after this analysis we have a much better idea of what is in the columns. We can even detect notable issues such as the skewness and outliers in FamilyIncome, which will need to be addressed in a later phase. Coming up next, we will discuss the analysis of categorical distributions. See you there!
