an illustration of pandas dataframe and slicing them

In the previous post, we have learned how to import data from several different formats into Python for analysis. One similarity among those data-loading functions is that they all generate a DataFrame – the main data structure for tabular data in Pandas. Needless to say, DataFrames are highly important to understand to manage and work with data effectively. In this post, I will discuss basic operations with Pandas DataFrame, so let us dive in!

What is a Pandas DataFrame

DataFrame is the main structure that Pandas uses to store tabular data (or in general two-dimensional data). They are similar to a 2D NumPy array in the row-column representation, however, with more flexibility towards data analytics. Rows and columns in DataFrame can be named and indexed for easy accesses and management.

In terms of hands-on, I will use the students.csv data for demonstration purpose in this post. You can get the complete notebook here. As usual, first we import Pandas then load the data into the students variable which makes it a DataFrame. As you can see, printing a DataFrame in Jupyter gives us a pretty nice table view. The the data headers (column names) are bolded and on top, and the bolded numbers on the left are the names of the rows. In this data, row names are indexes, but they can also can be ids, timestamps, etc. depending on the situation.

students.csv Download

In [9]:

import pandas as pd

students = pd.read_csv('students.csv')
students

Out[9]:

	StudentID	FirstName	LastName	HighSchoolGPA	FirstYearGPA
0	202005537	Eunice	Ehmann	2.47	2.42
1	202008560	Hobert	Schoenberger	2.27	2.05
…	…	…	…	…	…
198	202001120	Wright	Marburger	2.41	2.11
199	202009418	Sidney	Sienkiewicz	3.19	3.30

200 rows × 5 columns

DataFrame basics

After importing, we can use several functions to obtain initial impressions about the data (these must be called from a specific DataFrame, not Pandas). A common first function is head(n=<number>) which displays the first number rows in the data. You can use head() to examine how the newly imported data looks like. I also use like to perform head() after processing data (along with others) to verify if the result is correct.

In [7]:

students.head(n=3)

Out[7]:

	StudentID	FirstName	LastName	HighSchoolGPA	FirstYearGPA
0	202005537	Eunice	Ehmann	2.47	2.42
1	202008560	Hobert	Schoenberger	2.27	2.05
2	202004948	Nicholas	Sizer	4.00	3.96

Another important function is info() which outputs important information about the DataFrame. If I call info() from students, the output looks like below. First, we see that the data has 200 entries and five columns. The function also displays details about each column including their indexes, names, non-null count, and data type. You should be familiar with the concepts of indexes and column names already. Next, non-null count, is the number of valid entries in the column. Lastly, data type is the type into which Pandas read the values in the column. Both non-null count and data type are important because they directly relate to the issues of missing data and incorrect data types that we must address in data processing.

In [2]:

students.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   StudentID      200 non-null    int64  
 1   FirstName      200 non-null    object 
 2   LastName       200 non-null    object 
 3   HighSchoolGPA  200 non-null    float64
 4   FirstYearGPA   200 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 7.9+ KB

info() may give too much information at times. If you just want to have a quick count of rows and columns, the property shape may be the better one. We have already seen shape in NumPy array. Here, it behaves exactly the same. A small note is that shape is different from info() and head() in that it is a property, not a function, so it does not come with parentheses ().

In [10]:

students.shape

Out[10]:

(200, 5)

Slicing Pandas DataFrame

Slicing columns

DataFrames are similar to NumPy arrays, we can slice them to get a closer look at some component. However, the slicing syntax is slightly different in Pandas. To begin with, we can now slice a DataFrame using columns’ names. To obtain a single column, use the syntax dataframe[column]. Note that column must be a column’s name and not its index. The result is a Pandas Series which is the data structure for 1D data.

In [12]:

students['HighSchoolGPA']

Out[12]:

0      2.47
1      2.27
2      4.00
       ... 
198    2.41
199    3.19
Name: HighSchoolGPA, Length: 200, dtype: float64

To get multiple columns, we include them in a list dataframe[[column list]]. It is very important to not forget the second set of brackets []. In this case, the result slice is a DataFrame. You can also slice a single column using this syntax to get a DataFrame instead of a Series.

In [13]:

students[['FirstName','LastName','FirstYearGPA']]

Out[13]:

	FirstName	LastName	FirstYearGPA
0	Eunice	Ehmann	2.42
1	Hobert	Schoenberger	2.05
…	…	…	…
198	Wright	Marburger	2.11
199	Sidney	Sienkiewicz	3.30

200 rows × 3 columns

Slicing both rows and columns

Slicing both rows and columns requires the use of the loc property. The syntax is as follows.

In [ ]:

dataframe.loc[<row_slice> , <column_slice>]

where row_slice represents how to select rows, and column_slice columns. The most basic way to write these slices is to use the rows’ names and columns’ names. Similarly like before, to slice multiple rows or columns, we need to put all name values in a list. Both row_slice and column_slice can be replaced with a colon : to represent “everything”. In the example below, row_slice is [10,60,110], and column_slice is ['FirstName','LastName','FirstYearGPA'], so only FirstName, LastName, and FirstYearGPA, of rows whose names are 10, 60, and 110 are included in the result slice.

In [16]:

students.loc[[10,60,110],['FirstName','LastName','FirstYearGPA']]

Out[16]:

	FirstName	LastName	FirstYearGPA
10	Olin	Mcclain	2.09
60	Benjamin	Brinkerhoff	2.44
110	Bessie	Bushman	2.90

Like in NumPy, row_slice and column_slice can also be condition-based. For examples, we can slice all columns from students whose first year GPAs are above 3, or all whose first names are “Elvin”.

In [17]:

students.loc[students['FirstYearGPA']>=3.0,:]

Out[17]:

	StudentID	FirstName	LastName	HighSchoolGPA	FirstYearGPA
2	202004948	Nicholas	Sizer	4.00	3.96
4	202000260	Bruno	Viney	3.82	3.99
5	202003083	Alan	Borg	3.10	3.38
…	…	…	…	…	…
195	202000691	Judge	Grandinetti	3.44	3.35
197	202008725	Thaddeus	Chen	3.59	3.50
199	202009418	Sidney	Sienkiewicz	3.19	3.30

In [22]:

students.loc[students['FirstName']=='Elvin',:]

Out[22]:

	StudentID	FirstName	LastName	HighSchoolGPA	FirstYearGPA
3	202001207	Elvin	Foulks	3.16	2.64
97	202002022	Elvin	Monje	2.95	2.08
133	202007347	Elvin	Stansell	2.82	2.37

We can also slice DataFrame using indexes of rows and columns with the iloc property. Slicing dataframe.iloc is exactly like slicing a 2D NumPy array, so I will not repeat that here.

Wrapping up

DataFrame is indeed important and there are just a lot to talk about them. This post has introduced you to the basics of DataFrame including how to get their initial information and how to slice them. In the next post, I will talk about merging DataFrame. See you again!