With an overview of distribution analysis in hand, let us put it into practice, starting with numerical data. We will use a mixture of Pandas and Matplotlib, a powerful Python package for visualization. As introduced previously, we use descriptive statistics and certain types of charts to represent numeric distributions. So, let us start!
Data for demonstration
Throughout this post, I will use the students1000.csv data. The complete notebook is available here. Similar to the other students data we have been using, this one consists of information about students and their GPAs. I also added a few columns to cover more situations that we may see during analysis. As usual, we start the session by importing pandas and reading the data. We then call info() to get more information on the columns.
import pandas as pd
students = pd.read_csv('students1000.csv')
students.head(n=3)
| | StudentID | FirstName | LastName | Major | HighSchoolGPA | FamilyIncome | State | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 202303595 | Baxter | Dengler | Computer Science | 2.82 | 45013 | WA | 2.01 | 14.0 | 1.93 |
| 1 | 202309162 | Christian | Wickey | Data Science | 3.07 | 128358 | GA | 5.41 | NaN | 2.76 |
| 2 | 202306337 | Lonnie | Wulff | Software Engineering | 2.68 | 112392 | GA | 9.57 | 13.0 | 3.09 |
students.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   StudentID          1000 non-null   int64
 1   FirstName          1000 non-null   object
 2   LastName           1000 non-null   object
 3   Major              1000 non-null   object
 4   HighSchoolGPA      1000 non-null   float64
 5   FamilyIncome       1000 non-null   int64
 6   State              982 non-null    object
 7   AvgDailyStudyTime  985 non-null    float64
 8   TotalAbsence       990 non-null    float64
 9   FirstYearGPA       1000 non-null   float64
dtypes: float64(4), int64(2), object(4)
memory usage: 78.2+ KB
Almost all columns are in their correct data types: categorical columns are object, and numerical columns are int64 or float64. The exception is StudentID, which was read as int64 but is an ID column and should not be treated as numeric. We will most likely drop it later, but for now, let us change it to object.
students['StudentID'] = students['StudentID'].astype('object')
students['StudentID'].dtype
dtype('O')
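To see why this change matters, here is a small sketch of how the set of numeric columns shrinks once an ID column is cast to object. The frame df and its values are made up for illustration; they are not taken from students1000.csv.

```python
import pandas as pd

# Made-up two-column frame (an assumption, not the real students data)
df = pd.DataFrame({
    "StudentID": [202300001, 202300002, 202300003],
    "FirstYearGPA": [1.93, 2.76, 3.09],
})

# Before the cast, the integer ID is picked up as a numeric column
numeric_before = df.select_dtypes(include="number").columns.tolist()

# After the cast, only the real measurement remains numeric
df["StudentID"] = df["StudentID"].astype("object")
numeric_after = df.select_dtypes(include="number").columns.tolist()

print(numeric_before)
print(numeric_after)
```

Functions that work on numeric columns only, such as describe() with its default settings, will then skip the ID column.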
Now, we should be ready to move to the next part.
describe() on numeric distributions
The first thing that we look at in numerical columns is their descriptive statistics. Luckily, Pandas provides a dataframe function, describe(), that does exactly that for us. It generates a table that summarizes the main statistics for all numeric columns. Note that the first statement below only makes float numbers print with two decimal digits for easier reading.
pd.set_option('display.float_format', lambda x: '%.2f' % x)
students.describe()
| | HighSchoolGPA | FamilyIncome | AvgDailyStudyTime | TotalAbsence | FirstYearGPA |
|---|---|---|---|---|---|
| count | 1000.00 | 1000.00 | 985.00 | 990.00 | 1000.00 |
| mean | 3.02 | 139345.95 | 6.13 | 17.22 | 2.70 |
| std | 0.49 | 200938.10 | 2.35 | 5.79 | 0.55 |
| min | 1.39 | 17378.00 | 0.00 | 1.00 | 1.20 |
| 25% | 2.67 | 50311.25 | 4.61 | 13.00 | 2.36 |
| 50% | 3.03 | 92038.00 | 6.18 | 17.00 | 2.70 |
| 75% | 3.34 | 174144.75 | 7.76 | 21.00 | 3.07 |
| max | 4.00 | 4125854.00 | 13.97 | 37.00 | 4.00 |
Interpreting describe() results
The default result of describe() is a table whose columns are the numeric features in your data and whose rows are the statistics: count, mean, standard deviation (std), minimum (min), 25%, 50% (which is the median), 75%, and maximum (max).
And what are the 25% and 75%? They are the 25th and 75th percentiles, which are calculated similarly to the median, but instead of taking the middle point, we take the 25% point and the 75% point. The difference between the 75th and the 25th percentile is called the Inter-Quartile Range (IQR) of the data and, like the standard deviation, is a measure of dispersion. By the way, min is the 0% point and max is the 100% point.
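As a minimal sketch of these definitions, the percentiles and the IQR can be computed directly with the Pandas quantile() function. The series values below is made up for illustration and is not part of the students data.

```python
import pandas as pd

# Made-up values for illustration only (not from students1000.csv)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# quantile() takes a fraction between 0 and 1
q1 = values.quantile(0.25)      # 25th percentile
median = values.quantile(0.50)  # 50th percentile, same as values.median()
q3 = values.quantile(0.75)      # 75th percentile
iqr = q3 - q1                   # inter-quartile range

print(q1, median, q3, iqr)

# The 0% and 100% points coincide with the min and the max
print(values.quantile(0.0) == values.min(), values.quantile(1.0) == values.max())
```

With the default settings, quantile() interpolates linearly between neighboring values when the requested point falls between two data points.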
There are a few things to look at here:
– mean compared to std. If std is much higher than mean, the distribution could be very skewed or have some other problem. A skewed distribution is one whose values spread out from the center differently on the two sides, so it is not symmetrical. Here, we have one such column, FamilyIncome, whose std (200938.10) exceeds its mean (139345.95). The rest seem okay.
– mean compared to median. If the mean is very different from the median, the distribution could be skewed. From the describe() result, FamilyIncome shows signs of skewness; the other columns look fine on this.
– min and max compared to mean, 25%, 50%, and 75%. If min or max is very far from the other four, there is potential skewness and/or outliers. In this data, FamilyIncome has a very high max, while HighSchoolGPA has a relatively low min.
Overall, the result suggests that FirstYearGPA, AvgDailyStudyTime, and TotalAbsence seem to have symmetric distributions; HighSchoolGPA is slightly skewed towards the min; and FamilyIncome is very skewed towards the max with potential outliers.
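The checks described in this section can also be lined up side by side in code. The sketch below uses a made-up frame with one symmetric column and one column with a single very large value; the column names and numbers are assumptions for illustration. It also uses the Pandas skew() function, which this post does not otherwise cover, as one extra numeric hint.

```python
import pandas as pd

# Made-up data: one symmetric column, one with a single very large value
df = pd.DataFrame({
    "symmetric": [2.0, 2.5, 3.0, 3.5, 4.0],
    "right_skewed": [30, 40, 50, 60, 1000],
})

summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "std": df.std(),
    "min": df.min(),
    "max": df.max(),
})

# A ratio close to 1 means the mean and the median agree
summary["mean_over_median"] = summary["mean"] / summary["median"]

# skew() is about 0 for symmetric data and positive for a long right tail
summary["skew"] = df.skew()

print(summary)
```

In the skewed column, the mean is pulled far above the median, std is larger than mean, and max sits far from the quartiles, which is the same pattern FamilyIncome shows in the describe() table.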
As you can see, reading numbers from a table, while not difficult, is also not too convenient. Fortunately, we have more intuitive tools for this task: histograms and boxplots.
Illustrating numeric distributions with histograms
Histograms are a type of figure that illustrates a numeric distribution based on the frequencies of bins of values within the column. That is, the data in the column is first split into bins, each covering a contiguous range of values. For each bin, the plot then draws a bar whose height represents the bin's frequency. In a histogram, the horizontal axis shows the value ranges, and the vertical axis shows the frequencies. An example histogram is shown below. We can observe that, in this data, there are five values between 0 and 1.1, 14 between 1.1 and 2.2, 18 between 2.2 and 3.3, and so on.
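To make the binning step concrete, here is a small sketch that counts values per bin with NumPy's histogram() function, which returns the counts and the bin edges without drawing anything. The array values is made up and is not the data behind the example figure.

```python
import numpy as np

# Made-up values for illustration (not the data behind the figure)
values = np.array([0.5, 1.2, 1.9, 2.4, 2.8, 3.1, 3.6, 4.4, 5.0, 5.5])

# Split the range [0, 6] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3, range=(0, 6))

# Each bin runs from one edge to the next; the count is the bar height
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {count} values")
```

Changing the number of bins changes the edges and the counts, which is why the same column can look different at different bin settings.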
Drawing histograms is easy with Pandas and Matplotlib. Simply call hist() on a dataframe, and Pandas will automatically select all numeric columns and draw a histogram for each. hist() has a few options that I commonly change: bins= sets the number of bins (bars) in each histogram, figsize=(width, height) sets the width and height of the whole figure, and layout=(rows, columns) sets the number of rows and columns in the grid of subplots. Additionally, calling hist() alone prints some intermediate output along with the drawing. To suppress it, we can call the show() function from Matplotlib's pyplot module, which of course has to be imported first.
Putting it all together, let us draw histograms for the students data. The output below confirms our findings from the descriptive statistics. In all columns, the mean and median are in areas with the majority of data. HighSchoolGPA is a bit left-skewed: its histogram has a small tail on the left side. The histograms of AvgDailyStudyTime, TotalAbsence, and FirstYearGPA look fairly symmetrical. Lastly, FamilyIncome has a very long and narrow tail to the right, indicating that it is heavily right-skewed with outliers.
import matplotlib.pyplot as plt
students.hist(bins=20, figsize=(10,6), layout=(2,3))
plt.show()
Illustrating numeric distributions with boxplots
Boxplots are figures that illustrate five descriptive statistics of a numeric distribution: the median, the 25th percentile or Q1, the 75th percentile or Q3, Q1 - 1.5*IQR (distribution min), and Q3 + 1.5*IQR (distribution max), where IQR is the interquartile range Q3 - Q1. In general, a boxplot looks like the illustration below. 50% of the data is always within the box, and the majority of the data should fall between the distribution min and max; anything outside that range is considered an outlier. Note that Matplotlib draws each whisker only as far as the most extreme data point that is still inside these limits, so a whisker can be shorter than 1.5*IQR. In a boxplot, if the median sits in the middle of the box and the two tails have similar lengths, the distribution is symmetric. On the other hand, if one tail is much longer than the other, or the median sits far off-center in the box, the distribution is skewed. Some boxplots also mark the mean, for example with a diamond shape.
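The limits described above can be computed by hand, which is a handy way to list the outliers that a boxplot would draw as separate points. This is a minimal sketch on made-up values; the names lower_fence and upper_fence are my own labels for what this post calls the distribution min and max.

```python
import pandas as pd

# Made-up values with one clearly extreme point
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# "Distribution min" and "distribution max" in the terms of this post
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the two limits is flagged as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={values.median()}, Q3={q3}, IQR={iqr}")
print(f"limits: [{lower_fence}, {upper_fence}]")
print("outliers:", outliers.tolist())
```

The same steps can be applied to a single dataframe column, for example after dropping its missing values.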
Back to Pandas and Matplotlib, we can draw boxplots for all numerical columns using the dataframe function plot() with the options kind='box', subplots=True, and sharey=False. Optionally, we can set figsize to control the size of the whole figure. We should also adjust the space between the plots for easier reading with pyplot.subplots_adjust(wspace=<space>). Now, let us apply this to the students data:
import matplotlib.pyplot as plt
students.plot(
    kind='box',
    subplots=True,
    sharey=False,
    figsize=(12, 3)
)
# increase spacing between subplots
plt.subplots_adjust(wspace=0.5)
plt.show()
Again, the boxplots confirm our findings from the descriptive statistics and histograms. They also show some outliers in every column. But as long as the outliers are not very far from the distribution min and max, we do not have to worry. FamilyIncome is the exception: its outliers lie far beyond the distribution max.
Conclusion
In this post, we have discussed and practiced analyzing numeric distributions. As you can see, after this analysis, we have a much better idea of what is in the columns. We can even detect some notable issues, like the skewness and outliers in FamilyIncome. Such issues need to be addressed in a later phase. Coming up next, we will discuss the analysis of categorical distributions. See you there!