It is very common to have numeric columns with very different scales in the same data set. For example, a data set may record people's incomes, ranging from hundreds of thousands to millions, alongside their ages, which only span 18 to 100. As it turns out, some analytical models handle these differences very poorly. The nearest neighbor imputation approach we discussed previously is one example: the distance metric it uses to determine the nearest neighbors of an instance is dominated by columns with large scales like income. Nearest neighbor is just one example; many other models suffer from the same issue. Therefore, one of the first steps during preprocessing is to scale numeric data, that is, to transform the numeric columns so that they end up in comparable ranges. In this post, I will go through four common methods to scale numeric data.
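To make this concrete, here is a minimal sketch (my own illustration with made-up numbers, not from the notebook) of how a Euclidean distance between two people is dominated by income when the columns are left on their raw scales, assuming income spans roughly 0 to 1,000,000 and age spans 18 to 100:
import numpy as np
# Two hypothetical people: (income in dollars, age in years)
a = np.array([250_000, 25])
b = np.array([260_000, 70])
# On raw values, the 10,000-dollar income gap swamps the 45-year age gap,
# so income alone decides who is "nearest".
print(np.linalg.norm(a - b))  # about 10000.1
# After min-max scaling both columns to [0, 1], age contributes again.
a_scaled = np.array([0.250, 0.085])
b_scaled = np.array([0.260, 0.634])
print(np.linalg.norm(a_scaled - b_scaled))  # about 0.55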
Data to demonstrate
I will use a different version of the students data in which I created more numeric columns with different ranges so that we can observe the effect of scaling. The complete Jupyter notebook can be obtained here. The data is students-numeric.csv. Most of the scaling methods in this post come from the SKLearn library, so we import it along with Pandas before reading the data in. A quick describe() then reveals that the columns' scales are very different: the GPA columns go up to 4, AvgDailyStudyTime to about 14, FirstYearCredit to 24, TotalAbsence to 37, HSRankPercent reaches 100, and the SAT scores cap at 800. Finally, we draw a boxplot for each column in the same figure to visualize how their ranges differ. And as you can see, they differ a lot! SATMath and SATVerbal completely dominate all other columns in terms of values, and the other boxplots shrink to the point that we can barely see them.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('students-numeric.csv')
data.describe()
|       | HighSchoolGPA | HSRankPercent | AvgDailyStudyTime | TotalAbsence | SATMath | SATVerbal | FirstYearCredit | FirstYearGPA |
|-------|---------------|---------------|-------------------|--------------|---------|-----------|-----------------|--------------|
| count | 1000.000000 | 1000.000000 | 985.000000 | 990.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean  | 3.017420 | 72.967000 | 6.132305 | 17.222222 | 550.862000 | 486.461000 | 18.937000 | 2.702000 |
| std   | 0.491055 | 11.462692 | 2.346976 | 5.785613 | 92.725298 | 133.101286 | 2.894721 | 0.546332 |
| min   | 1.390000 | 36.000000 | 0.000000 | 1.000000 | 266.000000 | 185.000000 | 14.000000 | 1.200000 |
| 25%   | 2.670000 | 65.000000 | 4.610000 | 13.000000 | 482.000000 | 383.000000 | 16.000000 | 2.360000 |
| 50%   | 3.030000 | 73.000000 | 6.180000 | 17.000000 | 554.000000 | 471.000000 | 19.000000 | 2.695000 |
| 75%   | 3.340000 | 81.000000 | 7.760000 | 21.000000 | 618.000000 | 582.000000 | 21.000000 | 3.072500 |
| max   | 4.000000 | 100.000000 | 13.970000 | 37.000000 | 785.000000 | 800.000000 | 24.000000 | 4.000000 |
data.plot.box(rot=20, figsize=(10,4))
plt.show()
Min-max normalization
The first method to scale numeric data that we will discuss is min-max normalization. This approach transforms every column so that its minimum value becomes 0.0 and its maximum value becomes 1.0. In short, each value x in a column is transformed as
x_scaled = (x - min) / (max - min)
where min and max are the column's minimum and maximum values.
In Python, we will use the model class MinMaxScaler from SKLearn to perform this transformation. And just like the other SKLearn models we have discussed so far, it is really easy to use: simply create a scaler and call fit_transform() on the data.
The example with our students data is below. Because all SKLearn transformations return a NumPy array, I use a small workaround to quickly assign the array back to a dataframe while keeping all the column names. First, we create a copy of the original data. Then, we assign the result from fit_transform() to a slice of the new dataframe instead of to the dataframe itself. This way, only the values are copied over, and all other dataframe information is kept. Finally, we draw the boxplots again. As you can observe, every column now has its minimum and maximum at exactly 0 and 1, respectively.
from sklearn.preprocessing import MinMaxScaler
data_normalized = data.copy()
mmscaler = MinMaxScaler()
data_normalized[:] = mmscaler.fit_transform(data)
data_normalized.plot.box(rot=20, figsize=(10,4))
plt.show()
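If you want to confirm the new ranges numerically rather than just visually, a quick check (my own addition, not part of the original notebook) is to print the column minimums and maximums:
# Every column should now span exactly [0, 1]
print(data_normalized.min())
print(data_normalized.max())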
This method has one issue: it is sensitive to outliers. If the column maximum is a very large outlier, min-max scaling may shrink the effective range of the column to very close to 0. Below is one example with FamilyIncome included in the normalization. In the result, the range from about 0.1 to 1 is occupied only by outliers, while the regular data is squeezed into roughly 0 to 0.01 or 0.02. So, be careful to check for very high outliers before using min-max normalization.
Standardization
Standardization is the process of transforming a column so that, afterward, it has a mean of 0 and a standard deviation of 1, using the formula below
z = (x - mean) / std
where mean and std are the column's mean and standard deviation.
To standardize data in Python, we proceed exactly as with min-max normalization and only change the model class to StandardScaler. Unlike with normalization, the minimum and maximum values after transformation are no longer identical across columns. However, you can see that the columns' centers and effective ranges are now much more similar.
from sklearn.preprocessing import StandardScaler
data_standardized = data.copy()
stdscaler = StandardScaler()
data_standardized[:] = stdscaler.fit_transform(data)
data_standardized.plot.box(rot=20, figsize=(10,4))
plt.show()
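As a quick numeric check (again my own addition), the new means should be essentially 0 and the standard deviations essentially 1. Note that StandardScaler divides by the population standard deviation while Pandas reports the sample one, so the printed values are very close to, but not exactly, 1:
# Means should be ~0 and standard deviations ~1 after standardization
print(data_standardized.mean().round(3))
print(data_standardized.std().round(3))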
Robust Scaling
Standardization uses the mean and standard deviation in its transformation, and you probably know where I am going with this. As we have discussed several times, the mean and standard deviation are easily influenced by outliers, and their counterparts for outlier-heavy data are the median and the interquartile range (IQR). Using the latter two instead of the former two is called robust scaling, with the formula below
x_scaled = (x - median) / IQR
In SKLearn, we simply swap the model class to RobustScaler to use this transformation. As you can see, all the medians are now at 0, and all the boxes have the same height, showing the effect of applying the above formula.
from sklearn.preprocessing import RobustScaler
data_rbscaled = data.copy()
rbscaler = RobustScaler()
data_rbscaled[:] = rbscaler.fit_transform(data)
data_rbscaled.plot.box(rot=20, figsize=(10,4))
plt.show()
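A similar sanity check (my own addition) confirms the medians and interquartile ranges after robust scaling:
# Medians should come out at essentially 0 and the IQRs at essentially 1
print(data_rbscaled.median().round(3))
print((data_rbscaled.quantile(0.75) - data_rbscaled.quantile(0.25)).round(3))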
Pareto scaling
The three methods we discussed previously remove all scale information from the data. In general this is not an issue; however, some scientific fields like metabolomics prefer to keep some of the original variability after transformation. For that reason, they opt for Pareto scaling instead. This method is very similar to standardization, except that instead of dividing by the standard deviation, the denominator becomes its square root:
x_scaled = (x - mean) / sqrt(std)
This is not a very common method, so SKLearn does not seem to provide it. Nevertheless, we can write our own transformation using NumPy and Pandas as below; the code mirrors the formula exactly. In the result, all columns have a new mean of 0, but they also keep some of their original scale, which is the purpose of this scaling method.
import numpy as np
data_prtscaled = (data - data.mean()) / np.sqrt(data.std())
data_prtscaled.plot.box(rot=20, figsize=(10,4))
plt.show()
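One way to see that Pareto scaling keeps part of the original scale (my own check, not from the post): each column's new standard deviation equals the square root of its old one, so columns with larger original spreads stay relatively larger after scaling:
# Means are ~0, and each new std equals the square root of the old std
print(data_prtscaled.mean().round(3))
print(data_prtscaled.std().round(3))
print(np.sqrt(data.std()).round(3))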
Wrapping up
Many analytical models are sensitive to big differences among columns' scales. For that reason, we need to scale numeric data at the beginning of an analysis. In this post, I have introduced four methods for data scaling: min-max normalization, standardization, robust scaling, and Pareto scaling. In general, normalization and standardization are the most common. We prefer robust scaling when there are a lot of outliers, and Pareto scaling is mostly used in a few specific fields. Regardless, you can always play around, try a few or even all of these methods, and see which one works best for you. I will conclude this post here. See you again soon!