an illustration of loading data from different sources into a Pandas DataFrame

NumPy is a very powerful library for numerical manipulations. However, to some extends, NumPy is fairly “low level” in that writing codes with it is not that “friendly” to data analysts without a strong programming background. While understanding basic NumPy operations is important as they are the basics of many libraries, we usually turn to Pandas for data manipulation. Pandas comes with a huge toolset for handling data and simple analysis. But, before any analysis, you need to load data into your Python session. So, in this post, I will introduce some common formats of data sources and how to read them using Pandas. Note that all the data-loading functions from Pandas will generate a DataFrame object, a very important concept that will have its own post later.

Loading data from CSV files

You can get all the codes in this post from this notebook.

Comma Separated Values files

CSV stands for comma separated values, meanings that these files using commas , to split data fields. CSV files are plain text files, so we can open them using any text editors like Notepad, Notepad++, gedit, etc. The csv representation stores each data instance in a different line and delimit their data values using ,. Furthermore, values at the same indexes across the lines are of the same columns. Usually (but not guaranteed), the first line carries the headers (names) of the columns in the data.

Below is an example of data on students stored in a CSV file. As you can see, the first line comprises of headers, namely StudentID, FirstName, LastName, etc. The second lines until the end are specific students. Values in each line follow the same order as the headers, so the first value of every line is StudentID, second field FirstName, and so on.

In [ ]:

StudentID,FirstName,LastName,HighSchoolGPA,FirstYearGPA
202005537,Eunice,Ehmann,2.47,2.42
202008560,Hobert,Schoenberger,2.27,2.05
202004948,Nicholas,Sizer,4,3.96

Loading csv files with Pandas

Loading data from csv files in Pandas can be done with the read_csv() function called directly from Pandas. At minimum, you need to provide the path to the data file to this function as a string. If the file is in the same working folder as your notebook, you can use just its name. Below is an example of loading students.csv in the working folder. Like NumPy, first we import Pandas and give it an alias (pd). Then, we use read_csv() which result in a Pandas DataFrame object. Using print() on a DataFrame will display some of its rows and columns. You can download students.csv at the top of the following cell.

students.csv Download

In [8]:

students_csv = pd.read_csv('students.csv')
students_csv

Out[8]:

	StudentID	FirstName	LastName	HighSchoolGPA	FirstYearGPA
0	202005537	Eunice	Ehmann	2.47	2.42
1	202008560	Hobert	Schoenberger	2.27	2.05
…	…	…	…	…	…
198	202001120	Wright	Marburger	2.41	2.11
199	202009418	Sidney	Sienkiewicz	3.19	3.30

200 rows × 5 columns

read_csv() has many other options, however, two that you most likely have to occasionally use are header=None and sep='<separator>‘. The first option instructs Pandas to read the first line as data and not headers, and is used when your data does not come with header. The second option changes the default separator from , to a character that you set. In general, if you are not sure about the structure of your csv files, just open them in a text editor to verify headers and separators. For example, to open a file data.csv without headers, using | as separator, and is in the current working folder

data = pd.read_csv('data.csv', header=None, sep='|')

Loading data from fixed-width format files

Fixed-width format (fwf) files are similar to csv in that they are plain text files – can be opened with a text editor. Each line in fwf files also represent one data instance. However, they do not use , to separate fields. Rather, each column have their specific character positions in the file. For examples, in the file below, with character index starts from 0, StudentID takes characters 0 to 8 in every lines, FirstName 9 to 21, LastName 22 to 34, HSGPA 35 to 40, and FYGPA 41 to 45.

In [ ]:

StudentID    FirstName     LastName HSGPA FYGPA
202005537       Eunice       Ehmann  2.47  2.42
202008560       Hobert Schoenberger  2.27  2.05
202004948     Nicholas        Sizer   4.0  3.96

Loading data from fwf files with Pandas is done with read_fwf(). Besides the file path, you need to define the index range of each column in the option colspecs = [[start1, end1+1], [start2, end2+1], [start3, end3+1]...]. We can also set header=None if there are no column names. So for the data above, we have [colspecs = [0, 9], ,[9, 22], [22, 35], [35, 41], [41, 46]]. Adding together with read_fwf(), we have the code below for the students.txt file (fwf files can have any extensions).

In [28]:

students_fwf = pd.read_fwf('students.txt', colspecs=[[0, 9], [9, 22], [22, 35], [35, 41], [41, 46]])
students_fwf

Out[28]:

	StudentID	FirstName	LastName	HSGPA	FYGP
0	202005537	Eunice	Ehmann	2.47	2.4
1	202008560	Hobert	Schoenberger	2.27	2.0
…	…	…	…	…	…
198	202001120	Wright	Marburger	2.41	2.1
199	202009418	Sidney	Sienkiewicz	3.19	3.0

200 rows × 5 columns

Loading data from spreadsheets

Even if you are not in data science, you have probably known about spreadsheets of which the most famous is xlsx files used by Microsoft Excel. Pandas attributes a function for loading data from spreadsheets of many types (xls, xlsx, xlsm, xlsb, odf, ods and odt), namely read_excel(). Note that, before we can use this function, we need to install the openpyxl library. So, in a console window, run pip install openpyxl if you use Windows, or sudo pip install openpyxl if you Mac or Linux.

Like read_csv(), we need to at least provide the file path, and we can set header=None if there are no column names. As spreadsheets may have many sheets which we can specify with the sheet_name option. I will showcase one example below with the spreadsheet students.xlsx. The file has header and stores data in the students sheet. Again, you can download the file to practice.

students.xlsx Download

In [10]:

students_excel = pd.read_excel('students.xlsx',sheet_name='students')
students_excel

Out[10]:

	StudentID	FirstName	LastName	HighSchoolGPA	FirstYearGPA
0	202005537	Eunice	Ehmann	2.47	2.42
1	202008560	Hobert	Schoenberger	2.27	2.05
2	202004948	Nicholas	Sizer	4.00	3.96
3	202001207	Elvin	Foulks	3.16	2.64
4	202000260	Bruno	Viney	3.82	3.99
…	…	…	…	…	…
195	202000691	Judge	Grandinetti	3.44	3.35
196	202009695	Winston	Kaneko	2.94	2.41
197	202008725	Thaddeus	Chen	3.59	3.50
198	202001120	Wright	Marburger	2.41	2.11
199	202009418	Sidney	Sienkiewicz	3.19	3.30

200 rows × 5 columns

Loading data from JSON files

The last data format we discuss in this post is JavaScript Object Notation – JSON files. This type of files is also plain text (openable with text editors). JSON files store objects as name:value pairs similar to the Python dictionaries that we discussed a bit ago. In fact, if you view the raw content of a JSON file, you may mistake it with a bunch of dictionaries. To store tabular data in JSON files, each instance is a set of column : value pairs. An example of several data points in JSON format is showed below.

In [ ]:

[
    {"StudentID":202005537, "FirstName":"Eunice", "LastName":"Ehmann", "HighSchoolGPA":2.47, "FirstYearGPA":2.42},
    {"StudentID":202008560, "FirstName":"Hobert", "LastName":"Schoenberger", "HighSchoolGPA":2.27, "FirstYearGPA":2.05},
    {"StudentID":202004948, "FirstName":"Nicholas", "LastName":"Sizer", "HighSchoolGPA":4.0, "FirstYearGPA":3.96},
    {"StudentID":202001207, "FirstName":"Elvin", "LastName":"Foulks", "HighSchoolGPA":3.16, "FirstYearGPA":2.64}
]

Using Pandas, loading data from JSON files requires the use of the function read_json(). Unlike others, read_json() is fairly simple in that you most likely just need to provide the file path. You can find the code to read the complete students.json file as follows.

In [34]:

students_json = pd.read_json('students.json')
students_json

Out[34]:

	StudentID	FirstName	LastName	HighSchoolGPA	FirstYearGPA
0	202005537	Eunice	Ehmann	2.47	2.42
1	202008560	Hobert	Schoenberger	2.27	2.05
2	202004948	Nicholas	Sizer	4.00	3.96
3	202001207	Elvin	Foulks	3.16	2.64
4	202000260	Bruno	Viney	3.82	3.99
…	…	…	…	…	…
195	202000691	Judge	Grandinetti	3.44	3.35
196	202009695	Winston	Kaneko	2.94	2.41
197	202008725	Thaddeus	Chen	3.59	3.50
198	202001120	Wright	Marburger	2.41	2.11
199	202009418	Sidney	Sienkiewicz	3.19	3.30

200 rows × 5 columns

Summary

In this post, we have learned about four common formats for storing data, namely comma-separated-values, fix-width formatted, spreadsheets, and JSON. We also discuss how to load each types into a Python session as a DataFrame object. So, in the next post, we will see what DataFrames are, and what we can do with them. See you again then!

Loading Data with Pandas

Loading data from CSV files

Comma Separated Values files

Loading csv files with Pandas

Loading data from fixed-width format files

Loading data from spreadsheets

Loading data from JSON files

Summary

2 Comments