What is tabular data?
Tabular data is probably the most basic form of data set we can work with. It is data organized into the rows and columns of a table. In general, a table corresponds to a data set: a collection of information about objects that belong to the same entity. The rows represent specific objects of the entity, and the columns represent features that those objects share.
To make it clearer, let us examine the small table below:
This table contains information on some college students, so it is a data set of the student entity. Each row carries information about one student: for example, the first row is the student Alice Smith, and the second row is the student Bob Menke. Since columns represent features, this data set has nine of them: Student ID, First Name, Last Name, Major, State, Zip Code, Age, GPA, and academic Standing.
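The rows-and-columns structure described above can be sketched in plain Python as a list of dictionaries, where each dictionary is one row and each key is one column. This is only an illustration: the two student names and the nine column names come from the text, while every other value (IDs, majors, ages, and so on) is made up.

```python
# Each dictionary is one row (one student); each key is one column (one feature).
# All values except the two names are hypothetical, for illustration only.
students = [
    {"Student ID": "S001", "First Name": "Alice", "Last Name": "Smith",
     "Major": "Biology", "State": "GA", "Zip Code": "30063",
     "Age": 20, "GPA": 3.6, "Standing": "Good"},
    {"Student ID": "S002", "First Name": "Bob", "Last Name": "Menke",
     "Major": "History", "State": "NY", "Zip Code": "10007",
     "Age": 22, "GPA": 2.9, "Standing": "Average"},
]

n_rows = len(students)               # number of objects (students)
columns = list(students[0].keys())   # features shared by every row
n_columns = len(columns)

print(n_rows, "rows and", n_columns, "columns")
print(columns)
```

Every row has the same keys, which is what makes the collection a table rather than a loose pile of records.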
As you explore data science materials from other sources, you will see some synonyms for rows and columns. Below are the most common ones:
– Synonyms for row: data point, instance, data instance, sample, and record
– Synonyms for column: attribute, variable, and feature
Next comes an important concept to remember: the types of columns. There are two main types of columns in tabular data: numeric and categorical.
Numeric data
Numeric columns are columns whose values are meaningful numbers. This means that mathematical operations such as adding, subtracting, averaging, and comparing, applied to values in these columns, yield meaningful results. For example:
– Age is a numeric column because calculations on it are meaningful: the average age is the average age of all students in the data, and subtracting the ages of two students gives their difference in age.
– Zip code is not a numeric column, even though its values look like numbers. The reason is that if you, for example, calculate the average of the zip codes, the result is meaningless. Mathematically comparing two zip codes, as in 30063 > 10007, is also meaningless.
So, a quick and reliable way to verify whether a column is actually numeric is to try a few mathematical operations on its values and determine if the result is meaningful.
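The check described above can be tried out in a few lines of Python. This is a minimal sketch with made-up ages and zip codes; the point is that both averages can be computed, but only one of them describes anything real.

```python
from statistics import mean

# Hypothetical values, for illustration only.
ages = [20, 22, 19, 21]
zip_codes = [30063, 10007, 37201, 30063]

average_age = mean(ages)       # meaningful: the typical age of these students
average_zip = mean(zip_codes)  # computable, but it does not describe any real place

print("Average age:", average_age)
print("Average zip:", average_zip)

# Storing zip codes as text makes it explicit that they are labels, not quantities.
zip_as_text = [str(z).zfill(5) for z in zip_codes]
print(zip_as_text)
```

The computer will happily average any column of numbers, so the judgment about whether the result means something has to come from you.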
Categorical data
Categorical columns are columns whose values are discrete categories or classes. They are usually text values, but in certain cases they still look like numbers, like the Zip Code column we discussed previously. Categorical columns have two subtypes: ordinal and nominal.
– Ordinal columns are those whose values have an inherent order. This means you can compare them. For example, the values of academic standing are Poor, Average, and Good, which have an order: Poor < Average < Good. So, academic standing is an ordinal column.
– Nominal columns are those whose values have no inherent order. For example, in State or Zip, comparisons of values like TN > GA or 30063 > 10007 are meaningless. Therefore, we conclude that State and Zip are nominal.
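The difference between the two subtypes shows up as soon as you try to sort. Below is a minimal sketch, with made-up values, of one way to express the order Poor < Average < Good in plain Python. The name STANDING_RANK is just an illustrative helper, not part of any library.

```python
from collections import Counter

# Ordinal: the order Poor < Average < Good must be stated explicitly,
# because alphabetical order does not match it.
STANDING_RANK = {"Poor": 0, "Average": 1, "Good": 2}

standings = ["Good", "Poor", "Average", "Good"]  # hypothetical values

by_rank = sorted(standings, key=STANDING_RANK.get)  # respects the inherent order
alphabetical = sorted(standings)                    # plain text order, not the real one

print(by_rank)
print(alphabetical)

# Nominal: there is no order to respect, so equality checks and counting
# are the operations that make sense.
states = ["TN", "GA", "GA", "NY"]  # hypothetical values
state_counts = Counter(states)
print(state_counts.most_common(1))
```

Note that Python will still evaluate "TN" > "GA" as a string comparison without complaint; as with the zip code average, the fact that an operation runs does not make its result meaningful.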
So, what types are Student ID, First Name, and Last Name? They are definitely not numeric. Are they categorical? First, Student ID is not categorical, because its values are unique for every row in the data. Student ID is usually referred to as an identifier because each specific ID identifies a different student. How about names? Technically, they can be categorical. However, they are usually not very useful in most (if not all) analyses. We will get back to these three and discuss how to treat them later on.
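The "unique for every row" property of an identifier is easy to test. Here is a minimal sketch with made-up values; the function name is_identifier_candidate is hypothetical. Uniqueness is only a hint: a column of GPAs could happen to contain no repeats in a small sample without being an identifier.

```python
def is_identifier_candidate(values):
    """Return True if no value repeats, so the column could be an identifier."""
    return len(set(values)) == len(values)

# Hypothetical values, for illustration only.
student_ids = ["S001", "S002", "S003", "S004"]
majors = ["Biology", "History", "Biology", "Math"]

print(is_identifier_candidate(student_ids))  # every ID is distinct
print(is_identifier_candidate(majors))       # "Biology" repeats
```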
Closing words
This post is now quite long, with a fair amount of information, so I will stop here. Hopefully, at this point, you have a basic understanding of tabular data and its related concepts. In the next post, we will discuss the common types of analysis we can do.