
Now that we know how to load and slice Pandas dataframes, it is time to move on to action! There is a saying that goes “if you don’t love data at its worst, you don’t analyze it at its best” (or something like that). And it is true. More often than not, data comes to you in its worst form, and it is your job, and no one else's, to make it better. You may have to fix irregular values, fill in missing data, create new columns, remove irrelevant ones, and so on. To do any of that, you first need to understand the basic column operations in Pandas. So, that is what we will cover in this post.

Elementwise column operations

There are many similarities between Pandas dataframes and NumPy arrays. The first is that you can apply all the elementwise operations we learned in NumPy to dataframes, from mathematical calculations and comparisons to functions. Furthermore, we can now attach the results back to the dataframe as new columns very easily.

So, let us look at some demonstrations. I will use the students_standing.csv data throughout this post. You can also download the complete notebook here. In short, it is a small data set of student information: IDs, names, high school GPA (HSGPA), first year GPA (FYGPA), and first year academic standing (Standing). We begin, just like in any other post, by importing and aliasing Pandas. Then, we read the data in as a dataframe, assign it to students, and view its first five rows with head().


In [7]:

import pandas as pd

students = pd.read_csv('students_standing.csv')
students.head(n=5)

Out[7]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing
0	202005537	Eunice	Ehmann	2.47	2.42	average
1	202008560	Hobert	Schoenberger	2.27	2.05	average
2	202004948	Nicholas	Sizer	4.00	3.96	good
3	202001207	Elvin	Foulks	3.16	2.64	average
4	202000260	Bruno	Viney	3.82	3.99	good

Next, we will go through each type of elementwise column operation.

Mathematical calculations

Applying math calculations to dataframes’ columns is super easy: simply slice them and do whatever you want with them. You can have columns interact with constants or with other columns. For example, you can convert HSGPA to a 100-point system with students['HSGPA'] * 100 / 4. Even better, you can store the result as a new column in the dataframe. The syntax is just like a variable assignment, except that the variable part is the slice of the new column: dataframe['new_column'] = value_to_assign. I will call the new column HSGPA100, so the code goes as below:

In [8]:

students['HSGPA100'] = students['HSGPA'] * 100 / 4
students.head(n=5)

Out[8]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	HSGPA100
0	202005537	Eunice	Ehmann	2.47	2.42	average	61.75
1	202008560	Hobert	Schoenberger	2.27	2.05	average	56.75
2	202004948	Nicholas	Sizer	4.00	3.96	good	100.00
3	202001207	Elvin	Foulks	3.16	2.64	average	79.00
4	202000260	Bruno	Viney	3.82	3.99	good	95.50

As you can see above, HSGPA100 is added directly to the dataframe. You can also create multiple new columns simultaneously in the same way: slicing and assigning. I will demonstrate by converting both HSGPA and FYGPA to a 10-point scale. New columns appear at the right end of the dataframe.

In [9]:

students[['HSGPA10','FYGPA10']] = students[['HSGPA','FYGPA']] * 10 / 4
students.head(n=5)

Out[9]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	HSGPA100	HSGPA10	FYGPA10
0	202005537	Eunice	Ehmann	2.47	2.42	average	61.75	6.175	6.050
1	202008560	Hobert	Schoenberger	2.27	2.05	average	56.75	5.675	5.125
2	202004948	Nicholas	Sizer	4.00	3.96	good	100.00	10.000	9.900
3	202001207	Elvin	Foulks	3.16	2.64	average	79.00	7.900	6.600
4	202000260	Bruno	Viney	3.82	3.99	good	95.50	9.550	9.975
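
The examples above only combine columns with constants, so here is a minimal sketch of columns interacting with other columns. To keep it runnable without the CSV file, I typed in the five rows shown above by hand. The column names GPA_change and GPA_mean are my own hypothetical choices and are not part of the data set.

```python
import pandas as pd

# First five rows of the data set, typed in by hand so the sketch
# runs without students_standing.csv.
students = pd.DataFrame({
    'FirstName': ['Eunice', 'Hobert', 'Nicholas', 'Elvin', 'Bruno'],
    'HSGPA': [2.47, 2.27, 4.00, 3.16, 3.82],
    'FYGPA': [2.42, 2.05, 3.96, 2.64, 3.99],
})

# Column-with-column arithmetic: values are paired up row by row.
# GPA_change and GPA_mean are hypothetical names used for illustration.
students['GPA_change'] = students['FYGPA'] - students['HSGPA']
students['GPA_mean'] = (students['HSGPA'] + students['FYGPA']) / 2

print(students.round(2))
```

A positive GPA_change means the first year GPA is higher than the high school GPA, which is the case only for the last of these five students.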

Elementwise comparisons

Elementwise comparisons are another common type of column operation, and examples abound in practice. Here, we can simply create new columns that check whether the students have GPAs of at least 3, or whether their first year GPAs rose or dropped compared to their high school GPAs (geq means greater than or equal to). I will first reset the dataframe to remove the previously created columns. The result of every comparison is a boolean column.

In [37]:

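# Reload the data to drop the columns created earlier
students = pd.read_csv('students_standing.csv')
students[['HSGPA_geq3','FYGPA_geq3']] = students[['HSGPA','FYGPA']] >= 3
students.head(3)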

Out[37]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	HSGPA_geq3	FYGPA_geq3
0	202005537	Eunice	Ehmann	2.47	2.42	average	False	False
1	202008560	Hobert	Schoenberger	2.27	2.05	average	False	False
2	202004948	Nicholas	Sizer	4.00	3.96	good	True	True

In [38]:

students['FYGPA_geq_HS'] = students['FYGPA'] >= students['HSGPA']
students.head(3)

Out[38]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	HSGPA_geq3	FYGPA_geq3	FYGPA_geq_HS
0	202005537	Eunice	Ehmann	2.47	2.42	average	False	False	False
1	202008560	Hobert	Schoenberger	2.27	2.05	average	False	False	False
2	202004948	Nicholas	Sizer	4.00	3.96	good	True	True	False
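
One handy consequence of getting boolean columns back: True counts as 1 and False as 0, so sum() counts the matching rows and mean() gives their share. Below is a small sketch on the first five rows of the data, typed in by hand. The variable names are hypothetical and only for illustration.

```python
import pandas as pd

# First five rows of the data set, typed in by hand so the sketch
# runs without students_standing.csv.
students = pd.DataFrame({
    'FirstName': ['Eunice', 'Hobert', 'Nicholas', 'Elvin', 'Bruno'],
    'HSGPA': [2.47, 2.27, 4.00, 3.16, 3.82],
    'FYGPA': [2.42, 2.05, 3.96, 2.64, 3.99],
})

# Each comparison yields a boolean Series with one value per row.
hs_geq3 = students['HSGPA'] >= 3
fy_geq_hs = students['FYGPA'] >= students['HSGPA']

# True is treated as 1 and False as 0 in arithmetic.
n_hs_geq3 = hs_geq3.sum()          # number of rows with HSGPA >= 3
share_fy_geq_hs = fy_geq_hs.mean() # fraction of rows with FYGPA >= HSGPA

print(n_hs_geq3, share_fy_geq_hs)
```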

Let us dig a bit deeper into boolean columns. Did you know that they can be used to slice a dataframe? In fact, boolean indexing is one of the slicing mechanisms. When we use a boolean array as a slice, rows associated with True are selected and those associated with False are dropped from the result. For example, if I use students['HSGPA'] >= 3 as the slice, only the rows where HSGPA_geq3 is True remain. This is particularly useful in creating new features, as we will see later on.

In [40]:

students.loc[students['HSGPA'] >= 3, :]

Out[40]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	HSGPA_geq3	FYGPA_geq3	FYGPA_geq_HS
2	202004948	Nicholas	Sizer	4.00	3.96	good	True	True	False
3	202001207	Elvin	Foulks	3.16	2.64	average	True	False	False
…	…	…	…	…	…	…	…	…	…
197	202008725	Thaddeus	Chen	3.59	3.50	good	True	True	False
199	202009418	Sidney	Sienkiewicz	3.19	3.30	good	True	True	True

108 rows × 9 columns
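
To make the mechanism concrete, here is a small sketch on the first five rows of the data, typed in by hand. It stores the boolean array in a variable before slicing, and also shows one way to combine two conditions with & (each condition needs its own parentheses). The variable names mask, high_hs, and both are hypothetical.

```python
import pandas as pd

# First five rows of the data set, typed in by hand so the sketch
# runs without students_standing.csv.
students = pd.DataFrame({
    'FirstName': ['Eunice', 'Hobert', 'Nicholas', 'Elvin', 'Bruno'],
    'HSGPA': [2.47, 2.27, 4.00, 3.16, 3.82],
    'FYGPA': [2.42, 2.05, 3.96, 2.64, 3.99],
})

# A boolean Series: True where the high school GPA is at least 3.
mask = students['HSGPA'] >= 3

# Rows with True are kept, rows with False are dropped.
high_hs = students.loc[mask, :]

# Two conditions combined elementwise with & (and); the parentheses
# around each comparison are required.
both = students.loc[(students['HSGPA'] >= 3) & (students['FYGPA'] >= 3), :]

print(high_hs)
print(both)
```

Note that the selected rows keep their original index labels, just like rows 2, 3, 197, and 199 in the output above.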

Mathematical functions as column operations

NumPy and Pandas play nicely with each other: all the elementwise functions that we discussed before are usable with dataframes! Again, some of them are particularly common, such as the logarithm, square root, or absolute value. So, let us try them. Since we want to use functions from NumPy, we of course have to import (and should alias) the library.

In [44]:

import numpy as np

students = pd.read_csv('students_standing.csv')
students[['HSGPA_sqrt','FYGPA_sqrt']] = np.sqrt(students[['HSGPA','FYGPA']])
students[['HSGPA_log','FYGPA_log']] = np.log(students[['HSGPA','FYGPA']])
students[['HSGPA_log10','FYGPA_log10']] = np.log10(students[['HSGPA','FYGPA']])
students.head(n=3)

Out[44]:

	StudentID	FirstName	LastName	HSGPA	FYGPA	Standing	HSGPA_sqrt	FYGPA_sqrt	HSGPA_log	FYGPA_log	HSGPA_log10	FYGPA_log10
0	202005537	Eunice	Ehmann	2.47	2.42	average	1.571623	1.555635	0.904218	0.883768	0.392697	0.383815
1	202008560	Hobert	Schoenberger	2.27	2.05	average	1.506652	1.431782	0.819780	0.717840	0.356026	0.311754
2	202004948	Nicholas	Sizer	4.00	3.96	good	2.000000	1.989975	1.386294	1.376244	0.602060	0.597695
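
The absolute value was mentioned but not shown, so here is a minimal sketch of it, again on the first five rows typed in by hand. It applies np.abs to the difference between the two GPA columns. The column name GPA_abs_change is a hypothetical choice of mine.

```python
import numpy as np
import pandas as pd

# First five rows of the data set, typed in by hand so the sketch
# runs without students_standing.csv.
students = pd.DataFrame({
    'FirstName': ['Eunice', 'Hobert', 'Nicholas', 'Elvin', 'Bruno'],
    'HSGPA': [2.47, 2.27, 4.00, 3.16, 3.82],
    'FYGPA': [2.42, 2.05, 3.96, 2.64, 3.99],
})

# The difference is negative when the GPA dropped after high school.
change = students['FYGPA'] - students['HSGPA']

# np.abs works elementwise and returns a Pandas Series here,
# so the result can be assigned straight back as a column.
students['GPA_abs_change'] = np.abs(change)

print(students.round(2))
```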

We will surely revisit these functions when we discuss data preprocessing.

Conclusion

In this post, we have discussed three common column operations in Pandas: mathematical calculations, comparisons, and functions. Surely, Pandas still has much to offer. In the next post, I will talk about working with text columns in Pandas. Ciao!