The first tool we will learn in the Python data science stack is the NumPy array. NumPy is a Python package for numerical computing. It comes with powerful features such as types for arrays, vectors, and matrices, vectorized mathematical operations, and linear algebra operations, and it is the base of many other data science packages. Depending on the type of analysis, you may use NumPy only occasionally or very intensively. Either way, NumPy is pretty much unavoidable when working with data. For that reason, in this post, I will introduce the basic usage of NumPy arrays, including creation and slicing. As we are focusing on tabular data, in this post the term “array” refers only to arrays that represent tables. The notebook of this post is available here.
Why NumPy
So why do we need NumPy? Can base Python handle data? Actually, yes! Data is a collection of informational objects, and Python certainly has plenty of collection types. For example, the small data set below can be stored in a 2-dimensional list (essentially a list of lists).
Employee ID | Age | Years at Work | Salary |
100320 | 36 | 5 | 110000 |
132201 | 30 | 3 | 105000 |
100212 | 45 | 12 | 133000 |
143695 | 27 | 1 | 80000 |
employees = [
[100320,36,5,110000],
[132201,30,3,105000],
[100212,45,12,133000],
[143695,27,1,80000]
]
for employee in employees:
    print(employee)
[100320, 36, 5, 110000]
[132201, 30, 3, 105000]
[100212, 45, 12, 133000]
[143695, 27, 1, 80000]
So, each employee in the data is a list of four values, and the whole data set is a list of employees. So far so good, right? Sure, if you just want to store data. The issues begin when you want to work with it. For example, to get the column Age, you need to write a loop. To calculate the mean of Age, you need to use two functions.
age = []
for row in employees:
    age.append(row[1])
print(age)
[36, 30, 45, 27]
sum(age) / len(age)
34.5
What happened in the cells above? In the first one, I have to manually iterate through each row in the data and append its age to an initially empty list age. Then, I have to manually calculate the mean of age using the functions sum() and len() (len() gives the number of items in a collection). Pretty inconvenient, right? And those are just two very simple data operations.
For such reasons, we do not write everything ourselves. Instead, we utilize tools that have been developed and tested. And, NumPy is the first one we will discuss.
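As a quick preview of where this post is heading, here is a minimal sketch of the same two operations done with NumPy. It assumes NumPy is already installed; the names employees_array, ages, and mean_age are my own illustrative choices, and the indexing syntax is explained later in the post.

```python
import numpy as np

# Same data as the list-of-lists example above
employees_array = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# Column 1 (Age) is selected without writing a loop
ages = employees_array[:, 1]
print(ages)

# The mean of the Age column takes a single method call
mean_age = ages.mean()
print(mean_age)
```

Both the loop and the sum()/len() pair disappear, which is the convenience the rest of this post builds on.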
Basic NumPy arrays
As NumPy is an external library, you need to install it. If you followed my post on setting up your Python workbench, you should have NumPy already. Of course, to use a library, we first need to import it in the code. When importing, we can give NumPy an alias, for example np, using the as keyword. NumPy can then be referred to as np instead of by its full name.
import numpy as np
Creating the data set from before using NumPy is fairly similar to doing so with a list. We will use np.array(), which stores data in an array object (one of the collection types in NumPy). You can see a difference immediately just from printing. To display the list data nicely, I had to write a loop. A NumPy array, on the other hand, displays nicely by itself, and the columns even align!
employees = np.array([
[100320,36,5,110000],
[132201,30,3,105000],
[100212,45,12,133000],
[143695,27,1,80000]
])
employees
array([[100320,     36,      5, 110000],
       [132201,     30,      3, 105000],
       [100212,     45,     12, 133000],
       [143695,     27,      1,  80000]])
A very useful attribute of an array is shape, which stores the number of rows and columns in the data. To use shape, we access it from the array variable (notice the dot . which indicates that shape belongs to the array object).
employees.shape
(4, 4)
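Since shape is just a tuple, it can be unpacked into separate row and column counts. The sketch below shows that, along with two related attributes, ndim and size; the variable names n_rows and n_cols are illustrative only.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# shape is a tuple: (number of rows, number of columns)
n_rows, n_cols = employees.shape
print(n_rows, n_cols)

# ndim is the number of dimensions (2 for a table)
print(employees.ndim)

# size is the total number of entries (rows * columns)
print(employees.size)
```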
You do not have to manually type all the data entries to create an array. Doing so is very tedious (even small data sets often have a few hundred rows). Instead, you can load the contents from a data file. However, this is usually done in combination with Pandas, a Python library for data manipulation. So, we will get back to reading data from files when we discuss Pandas.
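For the curious, NumPy alone can also read simple, purely numeric text files with np.loadtxt(). The snippet below is only a rough sketch: the CSV text and its column names are made up for illustration, and an in-memory io.StringIO object stands in for a real file so that the example runs by itself.

```python
import io

import numpy as np

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = """employee_id,age,years_at_work,salary
100320,36,5,110000
132201,30,3,105000
100212,45,12,133000
143695,27,1,80000
"""

# skiprows=1 skips the header line; dtype=int parses every entry as an integer
loaded = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1, dtype=int)
print(loaded)
print(loaded.shape)
```

This only works well when every column is numeric, which is one reason Pandas is usually the more practical choice for real data files.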
Slicing NumPy arrays
Remember the slicing of lists? We can also slice a NumPy array, but with much more flexibility. The syntax to slice an array that stores tabular data is shown below. row slice represents how we want to select the rows, and column slice represents how we want to select the columns. column slice can be omitted, in which case we only select rows.
array[<row slice>, <column slice>]
The easiest way to write slices is to use indexes. Similar to lists, indexes in an array are the positional numbers of the rows or columns, starting from 0. To select multiple rows or columns, we put their indexes in a list. For example, the cells below select the first row (index 0), and then the first and fourth (index 3) rows in the array.
employees[0]
array([100320,     36,      5, 110000])
employees[[0,3]]
array([[100320,     36,      5, 110000],
       [143695,     27,      1,  80000]])
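One detail worth noticing in the outputs above is that a single index returns a 1-dimensional array, while a list of indexes returns a 2-dimensional one. The sketch below checks this with shape, and also shows that list-style ranges and negative indexes work on rows; the variable names are illustrative.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# A single index gives one row as a 1-dimensional array
first_row = employees[0]
print(first_row.shape)

# A list of indexes keeps the table form (2 dimensions)
first_and_fourth = employees[[0, 3]]
print(first_and_fourth.shape)

# Ranges work like list slicing: rows 1 and 2 (the end index is excluded)
middle_rows = employees[1:3]
print(middle_rows)

# Negative indexes count from the end: -1 is the last row
last_row = employees[-1]
print(last_row)
```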
Unlike rows, if you want to slice columns only, you need to replace row slice with a colon :. And we can certainly slice both rows and columns at the same time. Below, I select column 0, then columns 1 and 3, and finally rows 0 to 2 together with columns 1 and 3 in the array.
employees[:,0]
array([100320, 132201, 100212, 143695])
employees[:,[1,3]]
array([[    36, 110000],
       [    30, 105000],
       [    45, 133000],
       [    27,  80000]])
employees[:3,[1,3]]
array([[    36, 110000],
       [    30, 105000],
       [    45, 133000]])
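One thing to watch out for when slicing rows and columns together: if both slices are lists of indexes, NumPy pairs them up element by element instead of selecting a block. The sketch below illustrates the difference and shows two ways to get the block, chaining two slices or using the helper np.ix_(); the variable names are illustrative.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# Two index lists are paired: this picks entries (row 0, col 1) and (row 3, col 3)
paired = employees[[0, 3], [1, 3]]
print(paired)

# Chaining: select rows 0 and 3 first, then columns 1 and 3 of the result
block_chained = employees[[0, 3]][:, [1, 3]]
print(block_chained)

# np.ix_ builds an index that selects the full rows-by-columns block in one step
block_ix = employees[np.ix_([0, 3], [1, 3])]
print(block_ix)
```

Mixing a range such as :3 with an index list, as in the last cell above, does not have this issue.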
A bit more advanced, we can slice arrays using conditions. For example, get all rows whose ages are above 32, or get all rows whose salaries are below $120,000. The syntax for these kinds of conditions is array[:,<column index>] <compare> <value> where column index is a single index number, compare is one of the operators <, <=, ==, >=, >, !=, and value is the reference value. Placing such a condition in the row slice keeps only the rows for which it is true. Below, the code in the first cell selects rows whose ages are above 30, and the second cell selects rows whose salaries are below $110,000.
employees[employees[:,1] > 30] #rows in which age (2nd column) > 30
array([[100320,     36,      5, 110000],
       [100212,     45,     12, 133000]])
employees[employees[:,-1] < 110000] #rows in which salary (last column) < 110000
array([[132201,     30,      3, 105000],
       [143695,     27,      1,  80000]])
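To see why this works, it helps to look at the condition by itself: it evaluates to an array of True/False values, one per row, and only the True rows are kept. The sketch below prints such a mask and then combines two conditions with & (and) and | (or). Each condition needs its own parentheses; the thresholds and variable names here are illustrative.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# The condition alone is a boolean array with one value per row
age_above_30 = employees[:, 1] > 30
print(age_above_30)

# & keeps rows where both conditions hold: age > 30 and salary < 120000
both = employees[(employees[:, 1] > 30) & (employees[:, -1] < 120000)]
print(both)

# | keeps rows where at least one condition holds: age > 40 or salary < 100000
either = employees[(employees[:, 1] > 40) | (employees[:, -1] < 100000)]
print(either)
```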
What’s next?
In this post, I have introduced some basic usage of NumPy arrays, including creating and slicing them. NumPy and its array types can do much more than that, however, and we will explore those features step by step. I will stop this post here and discuss operations with NumPy arrays in the next one. See you again!