Column operations

Now that we know how to load and slice Pandas dataframes, it is time to move on to actions! There is a saying that goes “if you don’t love data at its worst, you don’t analyze it at its best” (or something like that). And it is true. More often than not, data comes to you in its worst form, and it is your job, and no one else’s, to make it better. You may have to fix irregular values, fill in missing data, create new columns, remove irrelevant ones, and so on. To do any of that, you first need to understand the basic column operations in Pandas. So, that is what we will do in this post.

Elementwise column operations

There are many similarities between Pandas dataframes and NumPy arrays. The first is that you can apply all the elementwise operations that we learned in NumPy to dataframes, from mathematical calculations and comparisons to functions. Furthermore, we can now attach the results back to the dataframe as columns very easily.

So, let us have some demonstrations. I will use the students_standing.csv data throughout this post. You can also download the complete notebook here. In short, it is a small data set of student information: IDs, names, high school GPA (HSGPA), first year GPA (FYGPA), and first year academic standing (standing). We begin, as in any other post, by importing and aliasing Pandas. Then, we read the data in as a dataframe, assign it to students, and view its first five rows with head().
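The loading step can be sketched roughly as follows. Since the real file is not bundled with this text, the sketch reads a small made-up stand-in from a string; the values and the column names ID and name are assumptions, and only HSGPA, FYGPA, and standing come from the description above.

```python
import io

import pandas as pd

# Hypothetical stand-in for students_standing.csv: every value is made up,
# and the column names "ID" and "name" are assumed for illustration.
csv_text = """ID,name,HSGPA,FYGPA,standing
1,Ann,3.6,3.4,good
2,Bob,2.8,3.0,good
3,Cara,3.2,1.9,probation
4,Dan,3.9,3.7,good
5,Eve,2.5,2.6,good
6,Finn,3.0,2.1,good
"""

# With the real file this would be pd.read_csv("students_standing.csv").
students = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default.
print(students.head())
```

The remaining sketches in this post build a similar tiny dataframe inline so that each one can be run on its own.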

Next, we will go through each type of elementwise column operation.

Mathematical calculations

Applying math calculations to a dataframe’s columns is super easy: simply slice them and do whatever you want with them. You can have columns interact with constants or with other columns. For example, you can convert HSGPA to a 100-point system with students['HSGPA'] * 100 / 4. Even better, you can store the result as a new column in the dataframe. The syntax is just like a variable assignment, except that the variable part is the slice of the new column: dataframe['new_column'] = value_to_assign. I will call the new column HSGPA100, so the assignment is students['HSGPA100'] = students['HSGPA'] * 100 / 4.
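Here is a minimal sketch of that assignment. The three-row dataframe is a made-up stand-in for the real data, with an assumed name column.

```python
import pandas as pd

# Made-up stand-in for the students data (values are illustrative only).
students = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "HSGPA": [3.6, 2.8, 3.2],
    "FYGPA": [3.4, 3.0, 1.9],
})

# Slice the column, do the math elementwise, and assign the result
# to a new column slice. HSGPA is assumed to be on a 4-point scale.
students["HSGPA100"] = students["HSGPA"] * 100 / 4

print(students)
```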

HSGPA100 is added directly to the dataframe. You can also create multiple new columns simultaneously in the same way: slicing and assigning. I will demonstrate by converting both HSGPA and FYGPA to a 10-point scale. New columns appear at the right end of the dataframe.
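One way to create several columns in a single statement is to slice with a list of new column names and assign a dataframe with the same number of columns. The names HSGPA10 and FYGPA10 below are assumed for illustration, and the data is again a made-up stand-in.

```python
import pandas as pd

# Made-up stand-in for the students data (values are illustrative only).
students = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "HSGPA": [3.6, 2.8, 3.2],
    "FYGPA": [3.4, 3.0, 1.9],
})

# The right-hand side is a two-column dataframe; its columns are matched
# by position to the two new names on the left-hand side.
students[["HSGPA10", "FYGPA10"]] = students[["HSGPA", "FYGPA"]] * 10 / 4

# The new columns are appended at the right end.
print(students.columns.tolist())
```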

Elementwise comparisons

Elementwise comparisons are another common type of column operation. There are countless examples in practice. Here, we can simply create new columns that check whether the students’ GPAs are at least 3, or whether their first year GPAs increased or dropped compared to their high school GPAs (geq means greater than or equal to). I will first reset the dataframe to remove the previously created columns. The results of all comparisons are boolean columns.
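A small sketch of these comparisons follows. Rebuilding the made-up stand-in dataframe plays the role of the reset here; HSGPA_geq3 follows the naming used in this post, while FYGPA_geq3 and FYGPA_increased are names assumed for illustration.

```python
import pandas as pd

# Made-up stand-in for the students data (values are illustrative only).
students = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "HSGPA": [3.6, 2.8, 3.2],
    "FYGPA": [3.4, 3.0, 1.9],
})

# Column versus constant: True where the GPA is at least 3.
students["HSGPA_geq3"] = students["HSGPA"] >= 3
students["FYGPA_geq3"] = students["FYGPA"] >= 3

# Column versus column: True where first year GPA exceeds high school GPA.
students["FYGPA_increased"] = students["FYGPA"] > students["HSGPA"]

print(students)
print(students.dtypes)
```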

Let us dig a bit deeper into boolean columns. Did you know that they can be used to slice a dataframe? In fact, they are one of the mechanisms of slicing. When we use a boolean array as a slice, rows associated with True are selected and those associated with False are discarded from the result. For example, if I use HSGPA >= 3 as the slice, only rows where HSGPA_geq3 is True remain. This is particularly useful in creating new features, as we will see later on.
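Boolean slicing can be sketched like this, again on a made-up stand-in dataframe; the variable names mask and high_hsgpa are only illustrative.

```python
import pandas as pd

# Made-up stand-in for the students data (values are illustrative only).
students = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "HSGPA": [3.6, 2.8, 3.2],
    "FYGPA": [3.4, 3.0, 1.9],
})

# A boolean Series with one True/False value per row.
mask = students["HSGPA"] >= 3

# Using the boolean Series as the slice keeps only the True rows.
high_hsgpa = students[mask]

print(high_hsgpa)
```

Note that the selected rows keep their original index labels, so the result is not renumbered from zero.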

Mathematical functions as column operations

NumPy and Pandas play so nicely with each other: all the elementwise functions that we discussed before are usable with dataframes! Again, some of them are particularly common, such as logarithm, square root, or absolute value. So, let us try them. Since we now want to use functions from NumPy, we of course have to import (and should alias) the library.
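As a rough sketch, the three functions mentioned above can be applied to columns as shown below. The data is a made-up stand-in, and the new column names (log_HSGPA, sqrt_HSGPA, abs_change) are assumed for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the students data (values are illustrative only).
students = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara"],
    "HSGPA": [3.6, 2.8, 3.2],
    "FYGPA": [3.4, 3.0, 1.9],
})

# NumPy functions act elementwise on a column and return a Pandas Series,
# so the result can be assigned straight back as a new column.
students["log_HSGPA"] = np.log(students["HSGPA"])    # natural logarithm
students["sqrt_HSGPA"] = np.sqrt(students["HSGPA"])  # square root

# Absolute change between high school GPA and first year GPA.
students["abs_change"] = np.abs(students["FYGPA"] - students["HSGPA"])

print(students)
```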

We will surely revisit these functions when we discuss data preprocessing.

Conclusion

In this post, we have discussed three common column operations in Pandas: mathematical calculations, comparisons, and functions. Surely, Pandas still has much to offer. In the next post, I will talk about working with text columns in Pandas. Ciao!