NumPy is a very powerful library for numerical manipulations. However, to some extends, NumPy is fairly “low level” in that writing codes with it is not that “friendly” to data analysts without a strong programming background. While understanding basic NumPy operations is important as they are the basics of many libraries, we usually turn to Pandas for data manipulation. Pandas comes with a huge toolset for handling data and simple analysis. But, before any analysis, you need to load data into your Python session. So, in this post, I will introduce some common formats of data sources and how to read them using Pandas. Note that all the data-loading functions from Pandas will generate a DataFrame object, a very important concept that will have its own post later.
Loading data from CSV files
You can get all the codes in this post from this notebook.
Comma Separated Values files
CSV stands for comma separated values, meanings that these files using commas ,
to split data fields. CSV files are plain text files, so we can open them using any text editors like Notepad, Notepad++, gedit, etc. The csv representation stores each data instance in a different line and delimit their data values using ,
. Furthermore, values at the same indexes across the lines are of the same columns. Usually (but not guaranteed), the first line carries the headers (names) of the columns in the data.
Below is an example of data on students stored in a CSV file. As you can see, the first line comprises of headers, namely StudentID
, FirstName
, LastName
, etc. The second lines until the end are specific students. Values in each line follow the same order as the headers, so the first value of every line is StudentID
, second field FirstName
, and so on.
StudentID,FirstName,LastName,HighSchoolGPA,FirstYearGPA
202005537,Eunice,Ehmann,2.47,2.42
202008560,Hobert,Schoenberger,2.27,2.05
202004948,Nicholas,Sizer,4,3.96
Loading csv files with Pandas
Loading data from csv files in Pandas can be done with the read_csv()
function called directly from Pandas. At minimum, you need to provide the path to the data file to this function as a string. If the file is in the same working folder as your notebook, you can use just its name. Below is an example of loading students.csv
in the working folder. Like NumPy, first we import Pandas and give it an alias (pd
). Then, we use read_csv()
which result in a Pandas DataFrame object. Using print()
on a DataFrame will display some of its rows and columns. You can download students.csv
at the top of the following cell.
students_csv = pd.read_csv('students.csv')
students_csv
StudentID | FirstName | LastName | HighSchoolGPA | FirstYearGPA | |
---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 |
… | … | … | … | … | … |
198 | 202001120 | Wright | Marburger | 2.41 | 2.11 |
199 | 202009418 | Sidney | Sienkiewicz | 3.19 | 3.30 |
200 rows × 5 columns
read_csv()
has many other options, however, two that you most likely have to occasionally use are header=None
and sep='<separator>
‘. The first option instructs Pandas to read the first line as data and not headers, and is used when your data does not come with header. The second option changes the default separator from ,
to a character that you set. In general, if you are not sure about the structure of your csv files, just open them in a text editor to verify headers and separators. For example, to open a file data.csv
without headers, using |
as separator, and is in the current working folder
data = pd.read_csv('data.csv', header=None, sep='|')
Loading data from fixed-width format files
Fixed-width format (fwf) files are similar to csv in that they are plain text files – can be opened with a text editor. Each line in fwf files also represent one data instance. However, they do not use ,
to separate fields. Rather, each column have their specific character positions in the file. For examples, in the file below, with character index starts from 0
, StudentID
takes characters 0
to 8
in every lines, FirstName
9
to 21
, LastName
22
to 34
, HSGPA
35
to 40
, and FYGPA
41
to 45
.
StudentID FirstName LastName HSGPA FYGPA
202005537 Eunice Ehmann 2.47 2.42
202008560 Hobert Schoenberger 2.27 2.05
202004948 Nicholas Sizer 4.0 3.96
Loading data from fwf files with Pandas is done with read_fwf()
. Besides the file path, you need to define the index range of each column in the option colspecs = [[start1, end1+1], [start2, end2+1], [start3, end3+1]...]
. We can also set header=None
if there are no column names. So for the data above, we have [colspecs = [0, 9], ,[9, 22], [22, 35], [35, 41], [41, 46]]
. Adding together with read_fwf()
, we have the code below for the students.txt file (fwf files can have any extensions).
students_fwf = pd.read_fwf('students.txt', colspecs=[[0, 9], [9, 22], [22, 35], [35, 41], [41, 46]])
students_fwf
StudentID | FirstName | LastName | HSGPA | FYGP | |
---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.4 |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.0 |
… | … | … | … | … | … |
198 | 202001120 | Wright | Marburger | 2.41 | 2.1 |
199 | 202009418 | Sidney | Sienkiewicz | 3.19 | 3.0 |
200 rows × 5 columns
Loading data from spreadsheets
Even if you are not in data science, you have probably known about spreadsheets of which the most famous is xlsx
files used by Microsoft Excel. Pandas attributes a function for loading data from spreadsheets of many types (xls
, xlsx
, xlsm
, xlsb
, odf
, ods
and odt
), namely read_excel()
. Note that, before we can use this function, we need to install the openpyxl
library. So, in a console window, run pip install openpyxl
if you use Windows, or sudo pip install openpyxl
if you Mac or Linux.
Like read_csv()
, we need to at least provide the file path, and we can set header=None
if there are no column names. As spreadsheets may have many sheets which we can specify with the sheet_name
option. I will showcase one example below with the spreadsheet students.xlsx
. The file has header and stores data in the students
sheet. Again, you can download the file to practice.
students_excel = pd.read_excel('students.xlsx',sheet_name='students')
students_excel
StudentID | FirstName | LastName | HighSchoolGPA | FirstYearGPA | |
---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 |
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 |
3 | 202001207 | Elvin | Foulks | 3.16 | 2.64 |
4 | 202000260 | Bruno | Viney | 3.82 | 3.99 |
… | … | … | … | … | … |
195 | 202000691 | Judge | Grandinetti | 3.44 | 3.35 |
196 | 202009695 | Winston | Kaneko | 2.94 | 2.41 |
197 | 202008725 | Thaddeus | Chen | 3.59 | 3.50 |
198 | 202001120 | Wright | Marburger | 2.41 | 2.11 |
199 | 202009418 | Sidney | Sienkiewicz | 3.19 | 3.30 |
200 rows × 5 columns
Loading data from JSON files
The last data format we discuss in this post is JavaScript Object Notation – JSON files. This type of files is also plain text (openable with text editors). JSON files store objects as name:value
pairs similar to the Python dictionaries that we discussed a bit ago. In fact, if you view the raw content of a JSON file, you may mistake it with a bunch of dictionaries. To store tabular data in JSON files, each instance is a set of column : value
pairs. An example of several data points in JSON format is showed below.
[
{"StudentID":202005537, "FirstName":"Eunice", "LastName":"Ehmann", "HighSchoolGPA":2.47, "FirstYearGPA":2.42},
{"StudentID":202008560, "FirstName":"Hobert", "LastName":"Schoenberger", "HighSchoolGPA":2.27, "FirstYearGPA":2.05},
{"StudentID":202004948, "FirstName":"Nicholas", "LastName":"Sizer", "HighSchoolGPA":4.0, "FirstYearGPA":3.96},
{"StudentID":202001207, "FirstName":"Elvin", "LastName":"Foulks", "HighSchoolGPA":3.16, "FirstYearGPA":2.64}
]
Using Pandas, loading data from JSON files requires the use of the function read_json()
. Unlike others, read_json()
is fairly simple in that you most likely just need to provide the file path. You can find the code to read the complete students.json file as follows.
students_json = pd.read_json('students.json')
students_json
StudentID | FirstName | LastName | HighSchoolGPA | FirstYearGPA | |
---|---|---|---|---|---|
0 | 202005537 | Eunice | Ehmann | 2.47 | 2.42 |
1 | 202008560 | Hobert | Schoenberger | 2.27 | 2.05 |
2 | 202004948 | Nicholas | Sizer | 4.00 | 3.96 |
3 | 202001207 | Elvin | Foulks | 3.16 | 2.64 |
4 | 202000260 | Bruno | Viney | 3.82 | 3.99 |
… | … | … | … | … | … |
195 | 202000691 | Judge | Grandinetti | 3.44 | 3.35 |
196 | 202009695 | Winston | Kaneko | 2.94 | 2.41 |
197 | 202008725 | Thaddeus | Chen | 3.59 | 3.50 |
198 | 202001120 | Wright | Marburger | 2.41 | 2.11 |
199 | 202009418 | Sidney | Sienkiewicz | 3.19 | 3.30 |
200 rows × 5 columns
Summary
In this post, we have learned about four common formats for storing data, namely comma-separated-values, fix-width formatted, spreadsheets, and JSON. We also discuss how to load each types into a Python session as a DataFrame object. So, in the next post, we will see what DataFrames are, and what we can do with them. See you again then!
Pingback: Pandas DataFrame - Data Science from a Practical Perspective
Pingback: Column operations - Data Science from a Practical Perspective