An illustration of the creation and slicing of NumPy arrays

The first tool we will learn in the Python data science stack is the NumPy array. NumPy is a Python package for numerical manipulations. It comes with powerful features such as types for arrays, vectors, and matrices, vectorized mathematical operations, and linear algebra operations, and it is the base of many other data science packages. Depending on the type of analysis, you may use NumPy anywhere from a little to very intensively. Either way, NumPy is pretty much unavoidable when working with data. For that reason, in this post, I will introduce you to the basic usage of NumPy arrays, including creation and slicing. As we are focusing on tabular data, in this post the term “array” refers only to arrays that represent tables. The notebook of this post is available here.

Why NumPy

So why do we need NumPy? Can base Python handle data? Actually, yes! Data is a collection of informational objects, and Python surely has a lot of collection types. For example, the small data set below can be stored in a 2-dimensional list (essentially a list of lists).

Employee ID	Age	Years at Work	Salary
100320	36	5	110000
132201	30	3	105000
100212	45	12	133000
143695	27	1	80000

In [6]:

employees = [
    [100320,36,5,110000],
    [132201,30,3,105000],
    [100212,45,12,133000],
    [143695,27,1,80000]
]

for employee in employees:
    print(employee)

[100320, 36, 5, 110000]
[132201, 30, 3, 105000]
[100212, 45, 12, 133000]
[143695, 27, 1, 80000]

So, each employee in the data is a list of four values, and the whole data set is a list of employees. So far okay, right? Sure, if you just want to store data. The issue begins when you want to work with it. For example, to get the column Age, you need to write a loop. To calculate the mean of Age, you need to combine two functions.

In [4]:

age = []

for row in employees:
    age.append(row[1])

print(age)

[36, 30, 45, 27]

In [5]:

sum(age) / len(age)

Out[5]:

34.5

What happened in the cells above? In the first one, I have to manually iterate through each row in the data and append its age to an initially empty list, age. Then, I have to manually calculate the mean of age using the functions sum() and len() (len() gives the number of items in a collection). Pretty inconvenient, right? And those are just two very simple data operations.

For such reasons, we do not write everything ourselves. Instead, we utilize tools that have been developed and tested. And, NumPy is the first one we will discuss.
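For comparison, here is a minimal sketch of the same two operations done with NumPy. It uses the column slicing syntax and the mean() method of arrays, so treat it as a preview rather than something you need to fully understand yet.

```python
import numpy as np

# Same data as the list of lists above, stored in a NumPy array
employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

age = employees[:, 1]    # every row, column 1 (Age); no loop needed
mean_age = age.mean()    # mean in one call; no sum() / len() needed

print(age)
print(mean_age)
```

The loop and the two-function calculation each collapse into a single expression.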

Basic NumPy arrays

As NumPy is an external library, you need to install it. If you followed my post on setting up your Python workbench, you should have NumPy already. Of course, to use a library, we first need to import it in the code. When importing, we can give NumPy an alias, for example np, using the as keyword. NumPy can then be referred to as np instead of its full name.

In [11]:

import numpy as np

Creating the data set from before using NumPy is fairly similar to doing so with a list. We will use np.array() which stores data in an array object (one of the collection types in NumPy). Immediately, you can already see the difference just from printing. To display the list data nicely, I have to write a loop. A NumPy array, on the other hand, will appear nicely by itself, and the columns even align!

In [10]:

employees = np.array([
    [100320,36,5,110000],
    [132201,30,3,105000],
    [100212,45,12,133000],
    [143695,27,1,80000]
])

employees

Out[10]:

array([[100320,     36,      5, 110000],
       [132201,     30,      3, 105000],
       [100212,     45,     12, 133000],
       [143695,     27,      1,  80000]])

A very useful property of an array is shape, which stores the number of rows and columns in the data. To use shape, we access it from the array variable (notice the dot . which indicates that shape belongs to the array object).

In [12]:

employees.shape

Out[12]:

(4, 4)
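Since shape is a tuple, it can be unpacked into separate variables. The sketch below also prints two related properties, ndim and size, which are not covered in this post; I include them only to show what else an array knows about itself.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# shape is a tuple (rows, columns), so it can be unpacked directly
n_rows, n_cols = employees.shape
print(n_rows, n_cols)

# ndim is the number of dimensions; size is the total number of entries
print(employees.ndim)
print(employees.size)
```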

You do not have to manually type all the data entries to create an array. That would be very tedious (even small data sets can have a few hundred rows). Instead, you can load the contents from a data file. However, this is usually done in combination with Pandas, a Python library for data manipulation. So, we will get back to reading data from files when we discuss Pandas.

Slicing NumPy arrays

Remember the slicing of lists? We can also slice a NumPy array, but with much more flexibility. The syntax to slice an array that stores tabular data is as follows. row slice represents how we want to select the rows, and column slice represents how to select the columns. column slice can be omitted, in which case we only select rows.

In [ ]:

array[<row slice>, <column slice>]

The easiest way to write slices is to use indexes. Similar to lists, indexes in an array are the positional numbers of the rows or columns, starting from 0. To select multiple row or column indexes, we put them in a list. For example, the cells below select the first row (index 0), and then the first and fourth (index 3) rows in the array.

In [13]:

employees[0]

Out[13]:

array([100320,     36,      5, 110000])

In [14]:

employees[[0,3]]

Out[14]:

array([[100320,     36,      5, 110000],
       [143695,     27,      1,  80000]])
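Row indexes also accept the range and negative forms you may know from list slicing. The sketch below is a small illustration of that on the same data; the variable names middle_rows and last_row are mine, chosen just for this example.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# start:stop selects rows from start up to, but not including, stop
middle_rows = employees[1:3]    # rows 1 and 2

# negative indexes count from the end, as with lists
last_row = employees[-1]        # the last row

print(middle_rows)
print(last_row)
```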

Unlike rows, if you want to slice columns only, you need to replace row slice with a colon :. And we can certainly slice both rows and columns at the same time. Below, I select column 0, then columns 1 and 3, and finally rows 0 to 2 together with columns 1 and 3.

In [15]:

employees[:,0]

Out[15]:

array([100320, 132201, 100212, 143695])

In [16]:

employees[:,[1,3]]

Out[16]:

array([[    36, 110000],
       [    30, 105000],
       [    45, 133000],
       [    27,  80000]])

In [17]:

employees[:3,[1,3]]

Out[17]:

array([[    36, 110000],
       [    30, 105000],
       [    45, 133000]])
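One detail worth noticing in the outputs above is that a single index gives back a 1-dimensional array, while a list of indexes keeps the result 2-dimensional. A quick way to check this is to look at the shape of each result, as in the sketch below.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# A single index drops a dimension: the result is 1-dimensional
print(employees[0].shape)            # one row
print(employees[:, 0].shape)         # one column

# A list of indexes keeps two dimensions, even with one item in the list
print(employees[[0, 3]].shape)       # two rows, all columns
print(employees[:, [0]].shape)       # all rows, one column
print(employees[:3, [1, 3]].shape)   # three rows, two columns
```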

A bit more advanced, we can slice arrays using conditions. For example, get all rows whose ages are above 32, or get all rows whose salaries are below $120,000. The syntax for these kinds of conditions is array[:,<column index>] <compare> <value>, where column index is a single index number, compare is one of the operators <, <=, ==, >=, >, !=, and value is the reference value. The condition is then used as the row slice. Below, the code in the first cell selects rows whose ages are above 30, and the second cell selects rows whose salaries are below $110,000.

In [18]:

employees[employees[:,1] > 30]  #rows in which age (2nd column) > 30

Out[18]:

array([[100320,     36,      5, 110000],
       [100212,     45,     12, 133000]])

In [19]:

employees[employees[:,-1] < 110000]  #rows in which salary (last column) < 110000

Out[19]:

array([[132201,     30,      3, 105000],
       [143695,     27,      1,  80000]])
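To see why this works, it helps to look at the condition by itself. It evaluates to an array of True/False values, one per row, and the array keeps only the rows marked True. The sketch below shows that intermediate step. Its last part combines two conditions with the & operator, which this post does not cover, so take it as an optional extra.

```python
import numpy as np

employees = np.array([
    [100320, 36, 5, 110000],
    [132201, 30, 3, 105000],
    [100212, 45, 12, 133000],
    [143695, 27, 1, 80000],
])

# The condition alone produces one True/False value per row
mask = employees[:, 1] > 30
print(mask.tolist())

# Using the mask as the row slice keeps only the rows marked True
older = employees[mask]
print(older)

# Optional extra: & combines two conditions; each needs its own parentheses
older_lower_paid = employees[(employees[:, 1] > 30) & (employees[:, -1] < 120000)]
print(older_lower_paid)
```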

What’s next?

In this post, I have introduced some basic usage of NumPy arrays, including creating and slicing them. NumPy and its array types offer much more than that, however, and we will explore them step by step. I will stop this post here, and discuss operations with NumPy arrays in the next one. See you again!
