Previously, we talked about creating and slicing NumPy arrays. Now, let us see what else we can do with them. In short, a lot! The library has a huge number of tools for numbers, vectors, matrices, tensors, and more, many of which are useful for tabular data analytics. In this post, I will introduce the basic elementwise operations and some common functions in NumPy. You can access the complete code in this notebook.
Basic elementwise NumPy operations
All the basic arithmetic operators in Python, including +, -, *, /, %, //, and **, are applicable to NumPy arrays. Similarly, the comparisons <, <=, ==, >=, >, and != are also usable. All of these operations are elementwise, meaning that every item in the array goes through the same calculation, and the results are returned as a new array. You can see an illustration in the following figure.
For example (by the way, do not forget to import and alias NumPy in your new Jupyter session), you can see that +10, -5, and **2 are performed on the whole array as below.
import numpy as np
an_array = np.array([1, 5, 4, 9, 7])
an_array + 10
array([11, 15, 14, 19, 17])
an_array - 5
array([-4, 0, -1, 4, 2])
an_array ** 2
array([ 1, 25, 16, 81, 49])
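The remaining arithmetic operators behave the same way. Here is a minimal sketch of /, //, and % on the same array; the variable names halves, quotients, and remainders are mine, chosen only for illustration.

```python
import numpy as np

an_array = np.array([1, 5, 4, 9, 7])

# True division returns floats, even for integer inputs
halves = an_array / 2

# Floor division keeps only the integer part of each quotient
quotients = an_array // 2

# Modulo returns the remainder of each division
remainders = an_array % 2

print(halves)      # [0.5 2.5 2.  4.5 3.5]
print(quotients)   # [0 2 2 4 3]
print(remainders)  # [1 1 0 1 1]
```

Note that / changes the array to a floating-point type, while // and % keep the integer type here.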
Likewise, comparing an array with a single number applies the same expression to each item. The result is now an array of the Boolean type.
an_array > 5
array([False, False, False, True, True])
an_array < 5
array([ True, False, True, False, False])
an_array == 5
array([False, True, False, False, False])
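The other comparison operators work in the same manner. As a small sketch (the names at_least_five and not_five are only illustrative), here are >= and != on the same array, along with a check that the result really is Boolean.

```python
import numpy as np

an_array = np.array([1, 5, 4, 9, 7])

# Each item is compared with 5 separately
at_least_five = an_array >= 5
not_five = an_array != 5

print(at_least_five)        # [False  True False  True  True]
print(not_five)             # [ True False  True  True  True]
print(at_least_five.dtype)  # bool
```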
You can also apply all the previously mentioned operations between two arrays. In the simplest case, the two arrays must have the same shape (the same numbers of rows and columns). In general, this condition is not that strict thanks to NumPy broadcasting, but that is a topic for another day. For now, let us assume that the two arrays have the same shape. This type of NumPy operation then generates a new array in which each item is the result of combining the items at the same position in the two inputs. An illustration and examples are below.
array1 = np.array([5,1,8,6,2])
array2 = np.array([7,4,3,0,5])
array1 + array2
array([12, 5, 11, 6, 7])
array1 * array2
array([35, 4, 24, 0, 10])
array1 > array2
array([False, False, True, True, False])
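The same rule holds for arrays with rows and columns. The sketch below uses two small 2-by-3 arrays that I made up for illustration (matrix1 and matrix2 are not part of the examples above) and checks their shapes before combining them.

```python
import numpy as np

# Two hypothetical arrays with the same shape: 2 rows, 3 columns
matrix1 = np.array([[1, 2, 3],
                    [4, 5, 6]])
matrix2 = np.array([[6, 5, 4],
                    [3, 2, 1]])

# Confirm that the shapes match before operating on them
print(matrix1.shape == matrix2.shape)  # True

# Items at the same row and column are combined
total = matrix1 + matrix2
larger = matrix1 > matrix2

print(total)   # every item is 7
print(larger)  # first row all False, second row all True
```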
Functions in NumPy Operations
Besides regular operations, NumPy also provides a big collection of functions, from power and logarithmic to arithmetic, trigonometric, and much more. You can find a complete list in the library’s documentation. Some of the most common functions, from my perspective, are log(), exp(), sum(), mean(), median(), std(), and var(). All these functions come from NumPy, so we use the library’s name or alias to call them. For example,
an_array = np.array([6,4,3,5,2])
np.log(an_array)
array([1.79175947, 1.38629436, 1.09861229, 1.60943791, 0.69314718])
np.sin(an_array)
array([-0.2794155 , -0.7568025 , 0.14112001, -0.95892427, 0.90929743])
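The function exp() is elementwise as well. Since it is the inverse of the natural logarithm, one quick way to see it in action is to apply it to the output of log() and confirm that we get the original values back. This is only a sketch; the names logs and recovered are illustrative.

```python
import numpy as np

an_array = np.array([6, 4, 3, 5, 2])

# Natural logarithm of each item
logs = np.log(an_array)

# exp() undoes log() item by item, up to floating-point rounding
recovered = np.exp(logs)

# allclose() compares with a small tolerance instead of exact equality
print(np.allclose(recovered, an_array))  # True
```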
In short, trigonometric and other mathematical functions are elementwise: they yield an array holding the results of applying the function to each item, as you can see above with log() and sin(). On the other hand, statistical functions summarize the inputs and generate one result for the whole array, or one result for each row or column, which gives a smaller array. I will first showcase these functions applied to the whole input.
np.sum(an_array)
20
np.mean(an_array)
4.0
np.median(an_array)
4.0
np.var(an_array)
2.0
np.std(an_array)
1.4142135623730951
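To see where the variance and standard deviation come from, we can rebuild them from the elementwise operations introduced earlier. The sketch below assumes the default behavior of np.var() and np.std(), which divide by the number of items (the population version); manual_var and manual_std are illustrative names.

```python
import numpy as np

an_array = np.array([6, 4, 3, 5, 2])

# Distance of each item from the mean (the mean is 4.0)
deviations = an_array - np.mean(an_array)

# Variance: the mean of the squared deviations
manual_var = np.mean(deviations ** 2)

# Standard deviation: the square root of the variance
manual_std = np.sqrt(manual_var)

print(manual_var)  # 2.0
print(np.isclose(manual_var, np.var(an_array)))  # True
print(np.isclose(manual_std, np.std(an_array)))  # True
```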
The axis option
To use these functions to obtain the statistics of items along rows or columns, we need to add the argument axis=. For a two-dimensional array, axis=0 means the function returns one value for each column, and axis=1 yields one value for each row. Below is an illustration with mean() and different options for axis.
Now let us observe the code below. We can see that axis=0 generates four results whose magnitudes match those of the four columns, so these are the column means. In contrast, axis=1 creates five fairly similar numbers, one for each of the five rows, so they are the row means. You can do the same thing with the other statistical functions, so do try that out.
data = np.array([
[3, 20, 100, 392],
[2, 14, 89, 453],
[5, 11, 153, 412],
[1, 24, 121, 312],
[3, 22, 90, 431]
])
np.mean(data, axis=0)
array([ 2.8, 18.2, 110.6, 400. ])
np.mean(data, axis=1)
array([128.75, 139.5 , 145.25, 114.5 , 136.5 ])
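As one way of trying this with another statistical function, here is a sketch using sum() on the same data. The names column_sums and row_sums are only illustrative. Printing the shapes makes it easy to confirm which axis was summarized.

```python
import numpy as np

data = np.array([
    [3, 20, 100, 392],
    [2, 14, 89, 453],
    [5, 11, 153, 412],
    [1, 24, 121, 312],
    [3, 22, 90, 431]
])

# axis=0: one sum per column, so four values
column_sums = np.sum(data, axis=0)

# axis=1: one sum per row, so five values
row_sums = np.sum(data, axis=1)

print(column_sums, column_sums.shape)  # [  14   91  553 2000] (4,)
print(row_sums, row_sums.shape)        # [515 558 581 458 546] (5,)
```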
Applications in data analytics
Needless to say, all the operations and functions that I mentioned, in combination with array slicing, are widely used throughout data analysis. So, let me give a small data example and see what we can do with it. The data below contains three years of GPAs of several students:
Student ID | First year GPA | Second year GPA | Current GPA |
0001252 | 3.12 | 3.31 | 3.54 |
0003215 | 2.57 | 2.59 | 2.55 |
0002324 | 2.39 | 2.78 | 3.11 |
0001012 | 3.21 | 2.91 | 2.73 |
0002151 | 3.52 | 3.55 | 3.62 |
First, we create the data as a NumPy array. Notice that I do not include the student IDs here because they are not of interest at the moment.
gpas = np.array([
[3.12, 3.31, 3.54],
[2.57, 2.59, 2.55],
[2.39, 2.78, 3.11],
[3.21, 2.91, 2.73],
[3.52, 3.55, 3.62]
])
We will talk about exploratory analysis later, but the means and standard deviations of columns are always nice to look at to get a general idea of their values. We can now do that using NumPy. An example of how to interpret them is that the average GPA of students in the first year is 2.962, and a student’s first-year GPA typically deviates from 2.962 by about 0.419. By the way, we can pass the result of one function call as input to another function, as you can see here. I put these two in print() so that their outputs both show up in the same cell.
print('means of GPAs:', gpas.mean(axis=0))
print('standard deviations of GPAs:', gpas.std(axis=0))
means of GPAs: [2.962 3.028 3.11 ]
standard deviations of GPAs: [0.41920878 0.35193181 0.42497059]
To look at the changes in the students’ GPAs throughout the years, we slice the columns and take their differences, for example, the changes from year 1 to year 2 and from year 2 to year 3. We can see which students improved the most and whose performance dropped.
gpas[:,1] - gpas[:,0]
array([ 0.19, 0.02, 0.39, -0.3 , 0.03])
gpas[:,2] - gpas[:,1]
array([ 0.23, -0.04, 0.33, -0.18, 0.07])
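Instead of reading the differences by eye, we could also ask NumPy for the position of the largest and smallest change. The sketch below uses np.argmax() and np.argmin(), two functions not covered in this post, which return the index of the maximum and minimum item. The variable names are illustrative, and the row indices follow the order of the table above.

```python
import numpy as np

gpas = np.array([
    [3.12, 3.31, 3.54],
    [2.57, 2.59, 2.55],
    [2.39, 2.78, 3.11],
    [3.21, 2.91, 2.73],
    [3.52, 3.55, 3.62]
])

# Change from year 1 to year 2 for each student
change_y1_y2 = gpas[:, 1] - gpas[:, 0]

# Row index of the largest gain and the largest drop
most_improved = np.argmax(change_y1_y2)
biggest_drop = np.argmin(change_y1_y2)

print(most_improved)  # 2, the third row (student 0002324)
print(biggest_drop)   # 3, the fourth row (student 0001012)
```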
Finally, a log transformation is very commonly applied to data. We can perform that here very easily.
np.log(gpas)
array([[1.137833  , 1.19694819, 1.26412673],
       [0.9439059 , 0.95165788, 0.93609336],
       [0.87129337, 1.02245093, 1.13462273],
       [1.16627094, 1.06815308, 1.00430161],
       [1.25846099, 1.2669476 , 1.28647403]])
Conclusion
In this post, I briefly introduced the basics of NumPy elementwise operations and some common functions. Of course, the library can do much more. Next, we will discuss the concept of concatenation. So, see you there!