Previously, we had an introduction to the concept of a decision tree and its operations. As discussed, each internal node in a tree essentially divides its input data into different portions. Now, let us investigate decision tree splits: how they actually happen and how they shape the model's predictions. Since categorical data needs to be encoded as numeric data before training models, we will only consider numeric features here. So, let us start! (The complete notebook for this post is available here, by the way.)
Individual splits
To start, we look at a single split in a tree. The small data set in use has two columns, study_time and passed, for several students. study_time is the number of hours a student studied for a test, and passed records whether they passed (1) or failed (0) the test. I will just show the scatter plot of the data. Like any other model, we can train a decision tree by creating an empty one and then calling fit(). However, we will go into the details of the code in the next post. For now, let us focus on understanding the tree models.
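That said, here is a minimal sketch of the training step. (The data below is made up by me to match the figures; the actual data set lives in the notebook.)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the data set described above:
# hours studied and whether the student passed (1) or failed (0).
X = np.array([[0.5], [0.8], [1.0], [1.2], [1.5], [1.8],
              [2.0], [2.3], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Create an empty tree, then call fit() on the data.
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X, y)
```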
The figure below is a tree plot of the fitted model, drawn by SKLearn using the function plot_tree(). In the root node, we first see the split criterion, x[0] <= 1.9. Here x[0] is the first feature in the node's input, which is study_time, the only feature. Accordingly, data with study_time ≤ 1.9 flows to a different node than data with study_time > 1.9. Next, samples = 12 indicates that 12 instances enter this node, and value = [6, 6] means that six of them belong to the first class and six to the second.
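For reference, the tree plot can be produced roughly like this, continuing from the clf fitted above. The feature_names and class_names arguments are optional extras I added so the plot reads study_time instead of x[0].

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Draw the fitted tree; filled=True colors nodes by majority class.
plt.figure(figsize=(6, 4))
plot_tree(clf,
          feature_names=["study_time"],
          class_names=["failed", "passed"],
          filled=True)
plt.show()
```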
From the root node, two flows lead to two nodes that are leaves, since they have no splitting criteria of their own. In SKLearn tree plots, the left flow is the True direction and the right flow is False. Overall, six students (samples = 6) with study_time ≤ 1.9 reach the left leaf node and are assigned the first class (value = [6, 0]). Similarly, the six students with study_time > 1.9 reach the right leaf node and are assigned the second class.
Finally, the split criterion study_time ≤ 1.9 is represented by a straight line in the scatter plot below. In general, a decision tree split always divides the data using a single feature. In 2D data, these splits are lines parallel to one of the axes; with more dimensions, they become axis-aligned hyperplanes, each perpendicular to the axis of the feature being split.
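As a small sketch, such a split line can be overlaid on the scatter plot as follows (the axis layout is my assumption and may differ from the figure in this post):

```python
import matplotlib.pyplot as plt

# Continuing from the sketch above: study_time on the x-axis,
# the pass/fail label on the y-axis, and the split as a vertical line.
plt.scatter(X[:, 0], y, c=y, cmap="bwr")
plt.axvline(x=1.9, linestyle="--", color="gray")
plt.xlabel("study_time")
plt.ylabel("passed")
plt.show()
```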
Decision tree with multiple splits
In reality, data is surely more complex, and so are tree models. So, let us continue with a slightly more complicated case. The data is shown below. We are still classifying whether students passed or failed an exam, but now based on two features: reading time and practicing time.
The tree plot and the scatter plot with splits for this data are as follows. In this case, besides the root node, we see two internal nodes. The root node splits on x[0], which is reading time, and the other two split on x[1], which is practicing time. You can observe that after the first split, the two data subsets still contain instances from both classes (value = [8, 1] and value = [2, 9]). Therefore, the tree continues with another two splits to completely separate the classes. On the scatter plot, split 1 corresponds to the root node, split 2 to the left internal node, and split 3 to the right one. The combination of all three splits successfully divides the data by class.
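If you prefer to read these splits programmatically rather than off the plot, the fitted estimator exposes them through its tree_ attribute. Here is a rough sketch, with a made-up two-feature data set standing in for the one above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the two-feature data set
# (reading time, practicing time) -> passed / failed.
rng = np.random.default_rng(0)
X2 = rng.uniform(0, 5, size=(20, 2))
y2 = ((X2[:, 0] + X2[:, 1]) > 5).astype(int)

clf2 = DecisionTreeClassifier(random_state=42).fit(X2, y2)

# For each node: the feature it splits on and the threshold it uses.
# Leaf nodes are marked with -2 in both arrays.
print(clf2.tree_.feature)
print(clf2.tree_.threshold)
```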
In the tree structure above, we can also observe that each split yields a “purer” data subset than its predecessor: the further down the tree a node sits, the more its data is dominated by a single class. In general, this is how decision trees are trained. At each node, the tree finds the split that best separates the classes in its input data.
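I will not go deep into the scoring rule here, but SKLearn's default criterion is Gini impurity. As a minimal sketch, a candidate split can be scored like this, using the class counts from the first example:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(y_left, y_right):
    """Weighted impurity of the two children produced by a split."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * gini(y_left) + (len(y_right) / n) * gini(y_right)

# The root of the first example holds value = [6, 6] -> impurity 0.5,
# and the split study_time <= 1.9 produces two pure children.
y = np.array([0] * 6 + [1] * 6)
print(gini(y))                       # 0.5
print(split_impurity(y[:6], y[6:]))  # 0.0 -> a perfect split
```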
Back to visualization, SKLearn also provides a DecisionBoundaryDisplay utility that draws and colors a model's decision boundary. Applying it to the previous data and tree results in the figure below. Note, however, that like the regular scatter plot, it only works on data with two features.
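As a sketch, the call goes through the class's from_estimator() method (here reusing clf2 and the made-up two-feature data from the earlier snippet):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# Color the regions predicted by the tree, then overlay the points.
disp = DecisionBoundaryDisplay.from_estimator(
    clf2, X2, response_method="predict", alpha=0.4)
disp.ax_.scatter(X2[:, 0], X2[:, 1], c=y2, edgecolor="k")
plt.show()
```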
DecisionBoundaryDisplay is also applicable to other models such as logistic regression and SVM. As can be seen, the decision boundary of a decision tree looks very different from the other two, mainly because it is composed of multiple single-feature thresholds rather than a single compound function.
Logistic regression
Support Vector Machine
Non-linearly separable cases
Decision trees are very flexible and can model highly nonlinear patterns in data. Let us start with the first example in the following scatter plot. As you can see, the classes are separable, but not linearly: instances of one class cluster in the middle, while those of the other sit on the two sides.
The tree and its decision boundaries for this data are shown below. We can see that this tree has grown a bit compared to the earlier examples. You should also note that a single feature can be used in multiple splits; in this case, x[1] appears in two nodes.
Of course, more complicated data needs more complex trees. The next example is a data set with a very non-linear separation between the two classes. We can see that the tree is even bigger, and there are quite a few more splits in the decision boundary.
Decision trees and overfitting
A tree can grow indefinitely until it models the training data perfectly. This makes decision trees especially susceptible to overfitting without proper controls. Let us take a look at the next example. The data is fairly similar to the previous one, but some areas contain non-separable instances from both classes.
Fitting a decision tree on this data results in an overly complicated model, as you can see below. Even though it perfectly separates the two classes, the decision boundary becomes very unnatural; it is highly unlikely that any real data has such class boundaries. In this case, our tree overfits the data.
As with other models, the main reason for overfitting is that we let the tree grow indefinitely, without any restrictions. We will discuss tuning trees in the next post. As an illustration, though, I will show a tree with a limited max_depth. In brief, a tree's depth is the distance (number of flows) between the root and the furthest leaf node. Setting the max_depth of the tree to 3 gives the result below. As you can see, the tree becomes much simpler, and its decision boundary “feels more natural”, albeit with some misclassifications.
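Here is a rough sketch of that comparison, with a noisy make_moons data set of my own standing in for the one in the figures:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# A noisy, non-separable stand-in data set (my choice,
# not necessarily what this post's figures use).
X_noisy, y_noisy = make_moons(n_samples=200, noise=0.35, random_state=0)

# Unrestricted tree: keeps splitting until it fits the training data perfectly.
deep = DecisionTreeClassifier(random_state=42).fit(X_noisy, y_noisy)

# Restricted tree: at most 3 levels of splits below the root.
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_noisy, y_noisy)

print(deep.get_depth(), shallow.get_depth())
```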
Conclusion
In this post, we discussed decision trees in quite a bit more detail. Specifically, we investigated what a tree split actually is and how a tree models different patterns in data. Please note that we stuck to 2D data in this post purely for visualization purposes; a tree can certainly work with more dimensions! Finally, trees without tuning are very prone to overfitting. In the next post, we will build a decision tree tuning pipeline for real data. Until next time!