It is very common to have numeric columns with very different scales in the same data set. For example, a data set may record people's incomes, ranging from hundreds of thousands to millions, alongside their ages, which only span 18 to 100. As it turns out, some analytical models handle these differences very poorly. The nearest neighbor imputation approach we discussed previously is one example: the distance metric it uses to determine the nearest neighbors of an instance is dominated by columns with large scales like income. Nearest neighbor is just one example; many other models suffer from the same issue. Therefore, one of the first steps during preprocessing is to scale numeric data, that is, to transform the numeric columns so that they end up in comparable ranges. In this post, I will go through four common methods to scale numeric data.
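To make this concrete, here is a minimal sketch (my own illustration with made-up numbers, not from the notebook) of how a Euclidean distance between two people is dominated by income when the columns are left on their raw scales, assuming income spans roughly 0 to 1,000,000 and age spans 18 to 100:
import numpy as np
# Two hypothetical people: (income in dollars, age in years)
a = np.array([250_000, 25])
b = np.array([260_000, 70])
# On raw values, the 10,000-dollar income gap swamps the 45-year age gap,
# so income alone decides who is "nearest".
print(np.linalg.norm(a - b))  # about 10000.1
# After min-max scaling both columns to [0, 1], age contributes again.
a_scaled = np.array([0.250, 0.085])
b_scaled = np.array([0.260, 0.634])
print(np.linalg.norm(a_scaled - b_scaled))  # about 0.55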
Data to demonstrate
I will use a different version of the students data in which I created more numeric columns with different ranges so that we can observe the effect of scaling. The complete Jupyter notebook can be obtained here. The data is students-numeric.csv. Most of the scaling methods in this post come from the SKLearn library, so we import it along with Pandas before reading the data in. A quick describe() then reveals that the columns' scales are very different: the GPA columns go up to 4, AvgDailyStudyTime to about 14, FirstYearCredit to 24, TotalAbsence to 37, HSRankPercent reaches 100, and the SAT scores cap at 800. Finally, we draw a boxplot for each column in the same figure to visualize how their ranges differ. And as you can see, they differ a lot! SATMath and SATVerbal completely dominate all other columns in terms of values, and the other boxplots shrink to the point that we can barely see them.
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('students-numeric.csv')
data.describe()
|       | HighSchoolGPA | HSRankPercent | AvgDailyStudyTime | TotalAbsence | SATMath | SATVerbal | FirstYearCredit | FirstYearGPA |
|-------|---------------|---------------|-------------------|--------------|---------|-----------|-----------------|--------------|
| count | 1000.000000 | 1000.000000 | 985.000000 | 990.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
| mean  | 3.017420 | 72.967000 | 6.132305 | 17.222222 | 550.862000 | 486.461000 | 18.937000 | 2.702000 |
| std   | 0.491055 | 11.462692 | 2.346976 | 5.785613 | 92.725298 | 133.101286 | 2.894721 | 0.546332 |
| min   | 1.390000 | 36.000000 | 0.000000 | 1.000000 | 266.000000 | 185.000000 | 14.000000 | 1.200000 |
| 25%   | 2.670000 | 65.000000 | 4.610000 | 13.000000 | 482.000000 | 383.000000 | 16.000000 | 2.360000 |
| 50%   | 3.030000 | 73.000000 | 6.180000 | 17.000000 | 554.000000 | 471.000000 | 19.000000 | 2.695000 |
| 75%   | 3.340000 | 81.000000 | 7.760000 | 21.000000 | 618.000000 | 582.000000 | 21.000000 | 3.072500 |
| max   | 4.000000 | 100.000000 | 13.970000 | 37.000000 | 785.000000 | 800.000000 | 24.000000 | 4.000000 |
data.plot.box(rot=20, figsize=(10,4))
plt.show()
Min-max normalization
The first method to scale numeric data that we will discuss is min-max normalization. This approach transforms every column so that its minimum value becomes 0.0 and its maximum value becomes 1.0. In short, each value x in a column is transformed as
x_scaled = (x - min) / (max - min)
where min and max are the column's minimum and maximum values.
In Python, we will use the model class MinMaxScaler from SKLearn to perform this transformation. And just like the other SKLearn models we have discussed so far, it is really easy to use: simply create a scaler and call fit_transform() on the data.
The example with our students data is below. Because all SKLearn transformations return a NumPy array, I use a small workaround to quickly assign the array back to a dataframe while keeping all the column names. First, we create a copy of the original data. Then, we assign the result from fit_transform() to a slice of the new dataframe instead of to the dataframe itself. This way, only the values are copied over, and all other dataframe information is kept. Finally, we draw the boxplots again. As you can observe, every column now has its minimum and maximum at exactly 0 and 1, respectively.
from sklearn.preprocessing import MinMaxScaler
data_normalized = data.copy()
mmscaler = MinMaxScaler()
data_normalized[:] = mmscaler.fit_transform(data)
data_normalized.plot.box(rot=20, figsize=(10,4))
plt.show()
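If you want to confirm the new ranges numerically rather than just visually, a quick check (my own addition, not part of the original notebook) is to print the column minimums and maximums:
# Every column should now span exactly [0, 1]
print(data_normalized.min())
print(data_normalized.max())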
This method has one issue: it is sensitive to outliers. If the column maximum is a very large outlier, min-max scaling may shrink the effective range of the column to very close to 0. Below is one example with FamilyIncome included in the normalization. In the result, the range from about 0.1 to 1 is occupied only by outliers, while the regular data is squeezed into roughly 0 to 0.01 or 0.02. So, be careful to check for very high outliers before using min-max normalization.
Standardization
Standardization is the process of transforming a column so that, afterward, it has a mean of 0 and a standard deviation of 1, using the formula below
z = (x - mean) / std
where mean and std are the column's mean and standard deviation.
To standardize data in Python, we proceed exactly as with min-max normalization and only change the model class to StandardScaler. Unlike with normalization, the minimum and maximum values after transformation are no longer identical across columns. However, you can see that the columns' centers and effective ranges are now much more similar.
from sklearn.preprocessing import StandardScaler
data_standardized = data.copy()
stdscaler = StandardScaler()
data_standardized[:] = stdscaler.fit_transform(data)
data_standardized.plot.box(rot=20, figsize=(10,4))
plt.show()
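As a quick numeric check (again my own addition), the new means should be essentially 0 and the standard deviations essentially 1. Note that StandardScaler divides by the population standard deviation while Pandas reports the sample one, so the printed values are very close to, but not exactly, 1:
# Means should be ~0 and standard deviations ~1 after standardization
print(data_standardized.mean().round(3))
print(data_standardized.std().round(3))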
Robust Scaling
Standardization uses the mean and standard deviation in its transformation, and you probably know where I am going with this. As we have discussed several times, the mean and standard deviation are easily influenced by outliers, and their counterparts for outlier-heavy data are the median and the interquartile range (IQR). Using the latter two instead of the former two is called robust scaling, with the formula below
x_scaled = (x - median) / IQR
In SKLearn, we simply swap the model class to RobustScaler to use this transformation. As you can see, all the medians are now at 0, and all the boxes have the same height, showing the effect of applying the above formula.
from sklearn.preprocessing import RobustScaler
data_rbscaled = data.copy()
rbscaler = RobustScaler()
data_rbscaled[:] = rbscaler.fit_transform(data)
data_rbscaled.plot.box(rot=20, figsize=(10,4))
plt.show()
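A similar sanity check (my own addition) confirms the medians and interquartile ranges after robust scaling:
# Medians should come out at essentially 0 and the IQRs at essentially 1
print(data_rbscaled.median().round(3))
print((data_rbscaled.quantile(0.75) - data_rbscaled.quantile(0.25)).round(3))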
Pareto scaling
The three methods we discussed previously remove all scale information from the data. In general this is not an issue; however, some scientific fields like metabolomics prefer to keep some of the original variability after transformation. For that reason, they opt for Pareto scaling instead. This method is very similar to standardization, except that instead of dividing by the standard deviation, the denominator becomes its square root:
x_scaled = (x - mean) / sqrt(std)
This is not a very common method, so SKLearn does not seem to provide it. Nevertheless, we can write our own transformation using NumPy and Pandas as below; the code mirrors the formula exactly. In the result, all columns have a new mean of 0, but they also keep some of their original scale, which is the purpose of this scaling method.
import numpy as np
data_prtscaled = (data - data.mean()) / np.sqrt(data.std())
data_prtscaled.plot.box(rot=20, figsize=(10,4))
plt.show()
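One way to see that Pareto scaling keeps part of the original scale (my own check, not from the post): each column's new standard deviation equals the square root of its old one, so columns with larger original spreads stay relatively larger after scaling:
# Means are ~0, and each new std equals the square root of the old std
print(data_prtscaled.mean().round(3))
print(data_prtscaled.std().round(3))
print(np.sqrt(data.std()).round(3))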
Wrapping up
Many analytical models are sensitive to big differences among columns' scales. For that reason, we need to scale numeric data at the beginning of an analysis. In this post, I have introduced four methods for data scaling: min-max normalization, standardization, robust scaling, and Pareto scaling. In general, normalization and standardization are the most common. We prefer robust scaling when there are a lot of outliers, and Pareto scaling is mostly used in a few specific fields. Regardless, you can always play around, try a few or even all of these methods, and see which one works best for you. I will conclude this post here. See you again soon!