Array Concatenation

an illustration of NumPy array concatenation. Tabular data can be concatenated either horizontally or vertically.

Now that we know a fair bit about NumPy array basics and operations, let us discuss another important operation: array concatenation. In short, concatenation means joining multiple arrays into a single one. Later on, we will see that there is more than one type of joining. Among those, concatenation is the simplest type in that arrays are just put together without any additional steps. So, let us start. You can download the notebook here.

Array dimensionality

an illustration of array dimensionality

So far, we have only been using NumPy arrays to store tabular data. This means the arrays represent tables in a row-column layout. These are two-dimensional (2D) arrays. However, NumPy arrays can have any number of dimensions. For simplicity, I will not discuss the scientific definition of dimensionality. Instead, roughly speaking, the number of dimensions of an array is the number of directions in which its items can expand. Take the image above as an example: a 1D array only expands in one direction, and a 2D array expands both horizontally and vertically. A 3D array then adds another direction along which its 2D structures expand. In other words, a 3D array consists of many 2D ones.

In analyzing tabular data, we mostly work with 1D and 2D arrays. If you work with images and videos, you will see arrays of up to five dimensions. I personally have not worked with any data above five dimensions.

In NumPy, we can easily get the dimensionality of an array from its shape property, which is accessed on the array itself. Note that the dimensionality of an array is not the values shown in the result of shape, but rather the number of values displayed. The values themselves refer to the number of items in each dimension. Consider three arrays named array1d, array2d, and array3d. array1d.shape results in (5,), which has only one value, so array1d has one dimension with five items. array2d has a shape of (2, 3), which means that it has two dimensions: two rows and three columns. Finally, array3d consists of two tables of three rows and three columns, so its shape is (2, 3, 3). The three types are also created differently: the input for a 1D array is a single list, for a 2D array a list of lists, and for a 3D array a list of 2D lists.
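The shapes described above can be checked with a short sketch. The values inside the arrays are made up for illustration; only the shapes follow the text. The ndim property used at the end is a NumPy shortcut that returns the number of dimensions directly.

```python
import numpy as np

# Illustrative values; only the shapes match the description in the text.
array1d = np.array([1, 2, 3, 4, 5])              # a single list
array2d = np.array([[1, 2, 3],
                    [4, 5, 6]])                  # a list of lists
array3d = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    [[10, 11, 12], [13, 14, 15], [16, 17, 18]]])  # a list of 2D lists

print(array1d.shape)  # (5,)
print(array2d.shape)  # (2, 3)
print(array3d.shape)  # (2, 3, 3)

# The number of dimensions is the length of the shape tuple;
# NumPy also exposes it directly as ndim.
print(len(array3d.shape), array3d.ndim)  # 3 3
```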

1D array concatenation

Once you understand the concept of dimensionality, concatenation is actually quite easy with NumPy. To concatenate 1D arrays, you just need to make sure that they are truly one-dimensional, then pass them in a list to NumPy's concatenate() function. The arrays can have different sizes. The result is a new 1D array that holds all items from all input arrays. The items are not sorted by value; they keep the order in which they appear in each array, and the arrays follow their order in the input list. For example, concatenating [1, 2, 3] and [4, 5] gives [1, 2, 3, 4, 5].
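As a minimal sketch of 1D concatenation, the snippet below joins two arrays of different sizes. The variable names and values are made up for illustration, and the values are deliberately unsorted to show that concatenate() preserves the input order.

```python
import numpy as np

# Two 1D arrays of different sizes with made-up, unsorted values.
first = np.array([3, 1, 2])
second = np.array([9, 8])

# The arrays are passed together in a list.
combined = np.concatenate([first, second])

print(combined)        # [3 1 2 9 8]
print(combined.shape)  # (5,)
```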

1D array concatenation is usually used to combine 1D data such as time series or individual columns extracted from data sets.

2D array concatenation

Concatenating 2D arrays is somewhat more common than concatenating 1D ones, because this operation is widely used to combine multiple sources of data. In this case, we have two directions of concatenation: horizontal and vertical.

Horizontally concatenating 2D arrays

an illustration of horizontal array concatenation

We usually perform horizontal concatenation to combine different feature sets of the same instances. In the example above, we have one array storing the students' IDs and names, and another storing their academic information. Horizontally combining these two results in a new array that has IDs, names, years, and GPAs. It is very important to note that the data to combine must come from the exact same set of objects; otherwise, the result is meaningless. Back to the previous example, the joined data is only meaningful if the rows in both arrays belong to Alice, Bob, and Carol, in that exact order. Later on with Pandas, we will have joining methods that use matching keys and do not require rows to be in the same order in different sources.

In NumPy, we add axis=1 to concatenate() to perform a horizontal join of 2D arrays. In this case, the input arrays must have the same number of rows, which is the first value in their shape.
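A horizontal join of the student data could look like the sketch below. The IDs, years, and GPAs are invented for illustration, as are the variable names, and every value is stored as a string so that both arrays share a single data type.

```python
import numpy as np

# Made-up records for Alice, Bob, and Carol, in the same row order
# in both arrays. All values are strings to keep one data type.
names = np.array([["1", "Alice"],
                  ["2", "Bob"],
                  ["3", "Carol"]])
academics = np.array([["2", "3.5"],
                      ["1", "3.8"],
                      ["3", "3.2"]])

# Both arrays have three rows, so they can be joined side by side.
students = np.concatenate([names, academics], axis=1)

print(students.shape)  # (3, 4)
print(students[0])     # ['1' 'Alice' '2' '3.5']
```

Note that NumPy only checks that the row counts match. It cannot tell whether row 0 in both arrays really describes the same student, so keeping the rows aligned is up to you.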

Vertically concatenating 2D arrays

In contrast to horizontal concatenation, vertical concatenation combines rows that have the same set of features into one single data set. For example, suppose I have two lists of students, both consisting of IDs, first names, and last names. A vertical join of these two will yield a new data set with six students and the same three features. Here, you have to make sure the features are in the exact same order in the input data sets.

In NumPy, we write almost the same code as for horizontal joins and only change the argument to axis=0. In this case, though, you have to verify that all input arrays have the same number of columns, which is the second value in their shape.
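The sketch below stacks two made-up student lists vertically, then shows what happens when the column counts do not match. All names, IDs, and variable names are invented for illustration.

```python
import numpy as np

# Two made-up student lists with the same three features, in the
# same order: ID, first name, last name.
class_a = np.array([["1", "Alice", "Adams"],
                    ["2", "Bob", "Brown"],
                    ["3", "Carol", "Clark"]])
class_b = np.array([["4", "Dave", "Davis"],
                    ["5", "Erin", "Evans"],
                    ["6", "Frank", "Foster"]])

# Both arrays have three columns, so they can be stacked.
all_students = np.concatenate([class_a, class_b], axis=0)
print(all_students.shape)  # (6, 3)

# An array with a different number of columns cannot be stacked
# vertically; NumPy raises a ValueError.
two_columns = np.array([["7", "Grace"]])
try:
    np.concatenate([class_a, two_columns], axis=0)
except ValueError as error:
    print("Cannot concatenate:", error)
```

Since axis=0 is the default for concatenate(), leaving the argument out gives the same result for 2D arrays, although writing it explicitly makes the intent clearer.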

Conclusion

Concatenation is important in data analytics because you will have to combine multiple sources of data from time to time. While this is the simplest way of joining data, it is still useful, and in some situations it may be the only way. You can practice concatenation by modifying the values in the examples, such as adding more columns and rows, and observing the resulting arrays. With enough understanding of NumPy, we are now ready to move on to Pandas!
