In the previous post, we have learned how to import data from several different formats into Python for analysis. One similarity among those data-loading functions is that they all generate a DataFrame – the main data structure for tabular data in Pandas. Needless to say, DataFrames are highly important to understand to manage and work with data effectively. In this post, I will discuss basic operations with Pandas DataFrame, so let us dive in!
What is a Pandas DataFrame
DataFrame is the main structure that Pandas uses to store tabular data (or in general two-dimensional data). They are similar to a 2D NumPy array in the row-column representation, however, with more flexibility towards data analytics. Rows and columns in DataFrame can be named and indexed for easy accesses and management.
In terms of hands-on, I will use the students.csv
data for demonstration purpose in this post. You can get the complete notebook here. As usual, first we import Pandas then load the data into the students
variable which makes it a DataFrame. As you can see, printing a DataFrame in Jupyter gives us a pretty nice table view. The the data headers (column names) are bolded and on top, and the bolded numbers on the left are the names of the rows. In this data, row names are indexes, but they can also can be ids, timestamps, etc. depending on the situation.
import pandas as pd
students = pd.read_csv('students.csv')
students
StudentID | FirstName | LastName | HighSchoolGPA | FirstYearGPA | |
---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 |
… | … | … | … | … | … |
198 | 202001120 | Wright | Marburger | 2.41 | 2.11 |
199 | 202009418 | Sidney | Sienkiewicz | 3.19 | 3.30 |
200 rows × 5 columns
DataFrame basics
After importing, we can use several functions to obtain initial impressions about the data (these must be called from a specific DataFrame, not Pandas). A common first function is head(n=<number>)
which displays the first number
rows in the data. You can use head()
to examine how the newly imported data looks like. I also use like to perform head()
after processing data (along with others) to verify if the result is correct.
students.head(n=3)
StudentID | FirstName | LastName | HighSchoolGPA | FirstYearGPA | |
---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 |
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 |
Another important function is info()
which outputs important information about the DataFrame. If I call info() from students
, the output looks like below. First, we see that the data has 200 entries and five columns. The function also displays details about each column including their indexes, names, non-null count, and data type. You should be familiar with the concepts of indexes and column names already. Next, non-null count, is the number of valid entries in the column. Lastly, data type is the type into which Pandas read the values in the column. Both non-null count and data type are important because they directly relate to the issues of missing data and incorrect data types that we must address in data processing.
students.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 200 entries, 0 to 199 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 StudentID 200 non-null int64 1 FirstName 200 non-null object 2 LastName 200 non-null object 3 HighSchoolGPA 200 non-null float64 4 FirstYearGPA 200 non-null float64 dtypes: float64(2), int64(1), object(2) memory usage: 7.9+ KB
info()
may give too much information at times. If you just want to have a quick count of rows and columns, the property shape
may be the better one. We have already seen shape
in NumPy array. Here, it behaves exactly the same. A small note is that shape
is different from info()
and head()
in that it is a property, not a function, so it does not come with parentheses ()
.
students.shape
(200, 5)
Slicing Pandas DataFrame
Slicing columns
DataFrames are similar to NumPy arrays, we can slice them to get a closer look at some component. However, the slicing syntax is slightly different in Pandas. To begin with, we can now slice a DataFrame using columns’ names. To obtain a single column, use the syntax dataframe[column]
. Note that column
must be a column’s name and not its index. The result is a Pandas Series which is the data structure for 1D data.
students['HighSchoolGPA']
0 2.47 1 2.27 2 4.00 ... 198 2.41 199 3.19 Name: HighSchoolGPA, Length: 200, dtype: float64
To get multiple columns, we include them in a list dataframe[[column list]]
. It is very important to not forget the second set of brackets []
. In this case, the result slice is a DataFrame. You can also slice a single column using this syntax to get a DataFrame instead of a Series.
students[['FirstName','LastName','FirstYearGPA']]
FirstName | LastName | FirstYearGPA | |
---|---|---|---|
0 | Eunice | Ehmann | 2.42 |
1 | Hobert | Schoenberger | 2.05 |
… | … | … | … |
198 | Wright | Marburger | 2.11 |
199 | Sidney | Sienkiewicz | 3.30 |
200 rows × 3 columns
Slicing both rows and columns
Slicing both rows and columns requires the use of the loc property. The syntax is as follows.
dataframe.loc[<row_slice> , <column_slice>]
where row_slice
represents how to select rows, and column_slice
columns. The most basic way to write these slices is to use the rows’ names and columns’ names. Similarly like before, to slice multiple rows or columns, we need to put all name values in a list. Both row_slice
and column_slice
can be replaced with a colon : to represent “everything”. In the example below, row_slice
is [10,60,110]
, and column_slice is ['FirstName','LastName','FirstYearGPA']
, so only FirstName
, LastName
, and FirstYearGPA
, of rows whose names are 10, 60, and 110 are included in the result slice.
students.loc[[10,60,110],['FirstName','LastName','FirstYearGPA']]
FirstName | LastName | FirstYearGPA | |
---|---|---|---|
10 | Olin | Mcclain | 2.09 |
60 | Benjamin | Brinkerhoff | 2.44 |
110 | Bessie | Bushman | 2.90 |
Like in NumPy, row_slice
and column_slice
can also be condition-based. For examples, we can slice all columns from students whose first year GPAs are above 3, or all whose first names are “Elvin”.
students.loc[students['FirstYearGPA']>=3.0,:]
StudentID | FirstName | LastName | HighSchoolGPA | FirstYearGPA | |
---|---|---|---|---|---|
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 |
4 | 202000260 | Bruno | Viney | 3.82 | 3.99 |
5 | 202003083 | Alan | Borg | 3.10 | 3.38 |
… | … | … | … | … | … |
195 | 202000691 | Judge | Grandinetti | 3.44 | 3.35 |
197 | 202008725 | Thaddeus | Chen | 3.59 | 3.50 |
199 | 202009418 | Sidney | Sienkiewicz | 3.19 | 3.30 |
students.loc[students['FirstName']=='Elvin',:]
StudentID | FirstName | LastName | HighSchoolGPA | FirstYearGPA | |
---|---|---|---|---|---|
3 | 202001207 | Elvin | Foulks | 3.16 | 2.64 |
97 | 202002022 | Elvin | Monje | 2.95 | 2.08 |
133 | 202007347 | Elvin | Stansell | 2.82 | 2.37 |
We can also slice DataFrame using indexes of rows and columns with the iloc
property. Slicing dataframe.iloc
is exactly like slicing a 2D NumPy array, so I will not repeat that here.
Wrapping up
DataFrame is indeed important and there are just a lot to talk about them. This post has introduced you to the basics of DataFrame including how to get their initial information and how to slice them. In the next post, I will talk about merging DataFrame. See you again!