Categories are a big part of tabular data. You will see them more often than not; they are simply inevitable. However, many analytical models cannot handle categorical data by themselves. In those cases, you need to find ways to transform classes into meaningful numbers, and that is the topic of this post. We will discuss two different methods to encode categorical data.
Types of categorical data
We discussed the different types of categorical data quite some time ago. Let me quickly summarize them so we can move on. There are three major types: binary, ordinal, and nominal.
Binary data means that there are exactly two unique values in the column. Examples of this type are agree or disagree, good or bad, low or high, etc. The two values in binary categories are usually transformed into 0 and 1; the order does not matter. There are different ways to do this, but a convenient method is to process binary data with an ordinal encoder, as sketched below.
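For instance, here is a minimal sketch using SKLearn's OrdinalEncoder on a hypothetical binary column (the values and names are only for illustration):

from sklearn.preprocessing import OrdinalEncoder

# hypothetical binary column with two unique values
answers = [['agree'], ['disagree'], ['agree'], ['agree']]

# without explicit categories, the classes are sorted alphabetically,
# so 'agree' becomes 0 and 'disagree' becomes 1
encoder = OrdinalEncoder()
print(encoder.fit_transform(answers).ravel())  # [0. 1. 0. 0.]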
Ordinal data means categories that have inherent orders. More specifically, you can sort them in some comparative way. For example, [very bad, bad, okay, good, very good], [low, medium, high], or [very disagree, disagree, neutral, agree, very agree]. Since these categories have orders, we can transform them into numbers like 1, 2, 3... However, you should proceed with care, as we will discuss later in this post.
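As a quick illustration, an explicit mapping in pandas keeps the order intact (a minimal sketch with hypothetical values):

import pandas as pd

# hypothetical ordinal column
levels = pd.Series(['low', 'high', 'medium', 'low'])

# explicit mapping that preserves low < medium < high
order = {'low': 1, 'medium': 2, 'high': 3}
print(levels.map(order).tolist())  # [1, 3, 2, 1]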
Nominal data means categories that do not have inherent orders. Therefore, you cannot compare the categories or sort them (alphabetical order does not count!). Some examples are states, cities, and postal codes, where you cannot compare, say, New York to New Hampshire, or Boston to Seattle. One way of transforming nominal data to numbers is a technique called One Hot Encoder or Dummy Variables.
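As a quick preview, pandas has get_dummies() for this; below is a minimal sketch on a hypothetical city column (we will use the SKLearn version later in this post):

import pandas as pd

# hypothetical nominal column
cities = pd.Series(['Boston', 'Seattle', 'Boston'], name='City')

# one binary dummy column per unique value
print(pd.get_dummies(cities, prefix='City'))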
Data to demonstrate
In this post, I will use the sample-categorical-data.csv file. This is a small synthetic data set with all columns being categorical. The complete Jupyter notebook for this post is available here. To begin, we load the data, print its info(), and draw a bar chart for each column so we can observe their distributions. Based on the classes of each column, we have a binary column, Area; ordinal columns, AgeGroup and Opinion; and nominal ones, State and EmploymentSector. Also note that State and EmploymentSector have some rare values.
import pandas as pd
import matplotlib.pyplot as plt

# load the sample data and check its columns
data = pd.read_csv('sample-categorical-data.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   AgeGroup          33 non-null     object
 1   State             33 non-null     object
 2   Area              33 non-null     object
 3   EmploymentSector  33 non-null     object
 4   Opinion           33 non-null     object
dtypes: object(5)
memory usage: 1.4+ KB
# plot the class distribution of each column
for col in data.columns:
    print(col)
    data[col].value_counts().plot.bar(rot=30, figsize=(4,4))
    plt.show()
[Bar charts of value counts for the AgeGroup, State, Area, EmploymentSector, and Opinion columns]
One hot encoder
We can apply this method to encode categorical data of any type, although with some caveats. First, the encoder will create a new binary column for each distinct value in the original categorical column. For example, if a class column has three unique values, say (low, medium, high), the one hot codes will have three new binary columns. Among these three, the one corresponding to the original class value is set to 1, and the others to 0.
Below is an illustrative example. The State column has three unique values: New York, Massachusetts, and New Hampshire. Its one hot codes would then include three new binary columns: is_NY corresponding to New York, is_MA for Massachusetts, and is_NH for New Hampshire. Now, for each row, the one hot field corresponding to the original class is set to 1, and the rest to 0. Specifically, the first row has State as New York, so its is_NY is 1, and both is_MA and is_NH are 0; the third row is Massachusetts, so its is_MA is 1 and the rest are 0; and so on. Also, the last column can be omitted because we know the state is New Hampshire if both is_NY and is_MA are 0.
| State | is_NY | is_MA | is_NH |
|---|---|---|---|
| New York | 1 | 0 | 0 |
| New York | 1 | 0 | 0 |
| Massachusetts | 0 | 1 | 0 |
| New Hampshire | 0 | 0 | 1 |
| Massachusetts | 0 | 1 | 0 |
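If you want the encoder to do this omission for you, SKLearn's OneHotEncoder accepts drop='first', which removes one redundant column per feature (note that it drops the first category in sorted order rather than the last, but the idea is the same). A minimal sketch on the example above:

from sklearn.preprocessing import OneHotEncoder

states = [['New York'], ['New York'], ['Massachusetts'],
          ['New Hampshire'], ['Massachusetts']]

# drop='first' removes one redundant column per feature
encoder = OneHotEncoder(drop='first')
print(encoder.fit_transform(states).toarray())
print(encoder.get_feature_names_out())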
The sparsity problem
One potential problem with one hot encoder is its susceptibility to a large number of classes. As you can see, this method creates one new column for each unique class in the data, so thousands of classes mean thousands of new columns. Furthermore, each row will only have a single 1 and the rest are 0. So, your data can get very big, yet the majority of the values are 0 and just a small proportion are 1. We call this very sparse data, and it is particularly bad because most of what you store carries no information.
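To get a sense of the scale, below is a minimal sketch with hypothetical sizes (100,000 rows and 5,000 classes) comparing dense storage against a sparse representation that only stores the nonzero entries:

import numpy as np
from scipy import sparse

# hypothetical one hot codes: 100,000 rows over 5,000 classes
n_rows, n_classes = 100_000, 5_000
rng = np.random.default_rng(0)
cols = rng.integers(0, n_classes, size=n_rows)

# each row has a single 1; a sparse matrix only stores those entries
codes = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), cols)),
    shape=(n_rows, n_classes))

dense_mb = n_rows * n_classes * 8 / 1e6   # bytes if stored as dense float64
sparse_mb = (codes.data.nbytes + codes.indices.nbytes + codes.indptr.nbytes) / 1e6
print(f'{dense_mb:.0f} MB dense vs {sparse_mb:.1f} MB sparse')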
So, I would recommend using one hot encoder only if you have at most a few hundred classes. What if you have more? There are a few ways around this, but the easiest is to keep only the frequent classes and collapse everything else into a single other class. For example, you can set a filter so that only classes appearing in more than 1%, 5%, or 10% of rows count as frequent.
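Here is a minimal sketch of doing this manually on the State column (the SKLearn encoder below will handle this for us via min_frequency, so I only work on a copy here):

# collapse classes under a 10% frequency threshold into 'other'
state = data['State'].copy()
freq = state.value_counts(normalize=True)   # class frequencies as proportions
rare = freq[freq < 0.1].index               # classes below the threshold
state = state.mask(state.isin(rare), 'other')
print(state.value_counts())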
In Python
We can use either Pandas or SKLearn for one hot encoder. However, I will use SKLearn since it is easier to incorporate into a pipeline, which we will discuss in the future. In this case, we will use the model class OneHotEncoder. However, there are a few things we have to set up first. We need to create a list of columns on which to perform one hot encoding. We could use it on ordinal data just fine, but I will only include State and EmploymentSector and leave the rest for later. Next, we set a frequency threshold to define rare classes, which I call infreq_threshold in the code; 0.1 means only keeping classes that appear in over 10% of the data.
Now, we can create a one hot encoder. We add two options here. First, min_frequency defines rare classes for the encoder; it is the actual row count corresponding to 10% of the data, calculated from infreq_threshold. Second, handle_unknown is set to infrequent_if_exist so that the rare classes form a new column. Finally, we call fit_transform() to train and apply the encoder. As you can see, the output is just binary data. Each row has exactly two values of 1 because we only transformed two columns.
from sklearn.preprocessing import OneHotEncoder

# nominal columns to encode and the frequency threshold for rare classes
nom_cols = ['State', 'EmploymentSector']
infreq_threshold = 0.1

# classes appearing in under 10% of rows are grouped as infrequent
oh_encoder = OneHotEncoder(min_frequency=int(len(data) * infreq_threshold),
                           handle_unknown='infrequent_if_exist')
oh_codes = oh_encoder.fit_transform(data[nom_cols])
oh_codes.todense()
matrix([[0., 0., 1., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 1., 0., 0., 1., 0., 0.],
        [1., 0., 0., 0., 0., 1., 0., 0., 0.],
        ...
        [0., 0., 0., 0., 1., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1., 0., 1., 0., 0.]])
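If you want readable column names, the fitted encoder can report them via get_feature_names_out(); for example, we can wrap the codes in a DataFrame:

import pandas as pd

# label the one hot columns with their original classes
oh_df = pd.DataFrame(oh_codes.toarray(),
                     columns=oh_encoder.get_feature_names_out(nom_cols))
print(oh_df.head())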
Encoding ordinal data
As I mentioned above, we can totally use one hot encoder to encode categorical data of any type. However, in certain cases, we can instead transform the classes into integer numbers like 0, 1, 2… to keep the order information. In SKLearn, we use the OrdinalEncoder model class.
Like with one hot encoder, we first create a list of columns on which to perform ordinal encoding. We include the binary Area here because we want it encoded as 0 and 1. Next, we need to provide a list of ordered values for each column. In each ordered list, the first class becomes 0, the second class 1, and so on. Without this, OrdinalEncoder will sort your classes alphabetically, so beware! Finally, we create the encoder with the option categories set to the ordered lists, then train and apply it with fit_transform(). As you can see, we have exactly three new columns, and the values in them are numbers 0, 1, 2... corresponding to the values in ord_classes.
from sklearn.preprocessing import OrdinalEncoder

# ordinal columns and their class values in ascending order
ord_cols = ['AgeGroup', 'Area', 'Opinion']
ord_classes = [
    ['20 to 40', '40 to 60', 'over 60'],
    ['Rural', 'Urban'],
    ['Very Disagree', 'Disagree', 'Neutral', 'Agree', 'Very Agree']
]

# each class is encoded as its position in the ordered list
ord_encoder = OrdinalEncoder(categories=ord_classes)
ord_codes = ord_encoder.fit_transform(data[ord_cols])
ord_codes
array([[1., 0., 2.],
       [1., 0., 3.],
       [0., 0., 1.],
       [2., 0., 4.],
       ...
       [0., 1., 2.],
       [0., 1., 0.]])
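A nice property of the fitted encoder is that it can reverse the transformation with inverse_transform(), which is handy for sanity checks:

# recover the original class labels from the first few coded rows
print(ord_encoder.inverse_transform(ord_codes[:3]))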
Now, be very careful when you use this method on actual ordinal data. The thing is, while you can sort these classes, they may still not be actual numbers. For example, Disagree < Neutral < Agree, but does Neutral - Disagree equal Agree - Neutral? Because we imply that relationship when converting them to 1, 2, and 3. So, you need to carefully consider the meaning of the classes before attempting this method.
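If equal spacing does not reflect the meaning of your classes, one option is to assign custom scores by hand. Below is a minimal sketch with hypothetical scores that push the extreme opinions further from the middle:

# hypothetical custom scores with unequal spacing between classes
scores = {'Very Disagree': -3, 'Disagree': -1, 'Neutral': 0,
          'Agree': 1, 'Very Agree': 3}
opinion_scores = data['Opinion'].map(scores)
print(opinion_scores.head())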
Conclusion
In this post, we discussed how to encode categorical data. While it is straightforward most of the time, you still need to be careful, for example, when your data have too many classes, or when the ordinal classes are ambiguous as numbers. In general, one hot encoder with controls on infrequent classes is pretty safe. There are also more advanced ways to transform categorical data, but we will discuss them after getting to regression and classification. For now, happy encoding, and see you again!