After spending several posts on Support Vector Machines, let us move on to decision trees. This is another analytical model applicable to both classification and regression. Unlike the models we have learned so far, decision trees do not have an equation representing their decision-making process. Rather, they rely on a flow structure to derive outcomes from inputs. Trees are very interesting and are the basis of a number of ensemble models that we will discuss later on. Like with SVM, we start with the easier case: decision trees for classification. With that being said, let us dive in!
An example of decision tree
We will start with a toy example to learn about decision trees with the small data below. There are six rows representing six families, and three columns: Family Income, Family Size, and Financial Standing. The task here is to classify each family's financial standing as Good or Bad.
One tree that classifies the given data is shown below. In a logistic model or a support vector machine, you plug the values of each feature into an equation to compute a score for each label. In contrast, instances fed into a decision tree follow a flow of nodes until they reach one that provides their predicted labels.
In the tree above, a family to be predicted enters the tree through the Data in entry. The first node it encounters checks whether the family's Income is below $70,000 and redirects the instance in the appropriate direction. If the family indeed makes less than $70,000 annually, it follows the left (True) branch; otherwise, the right (False) one. The node on the left branch checks whether the family's Size < 2, and the one on the right checks whether Size < 4. Depending on the outcome of either check, the family continues to the next and final node, which predicts whether its financial standing is Good or Bad. Below are two examples of making predictions for two instances with this tree model.
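The traversal just described can be sketched as plain nested checks. Note that the post's figure determines the actual Good/Bad assignments at each leaf; the labels below are placeholders for illustration:

```python
# A sketch of the example tree as nested checks.
# The leaf labels are hypothetical placeholders; the actual
# Good/Bad assignments come from the tree figure in the post.
def predict(income, size):
    """Route one family through the tree and return a predicted label."""
    if income < 70_000:      # root split: Income < 70000
        if size < 2:         # left internal node: Size < 2
            return "Good"    # hypothetical leaf label
        return "Bad"         # hypothetical leaf label
    if size < 4:             # right internal node: Size < 4
        return "Good"        # hypothetical leaf label
    return "Bad"             # hypothetical leaf label

# A family making $50,000 with 1 member follows the True branch at the root.
print(predict(50_000, 1))
```

Each call walks exactly one root-to-leaf path, which is why tree predictions are cheap even when the tree is large.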
More about decision trees
Node types
There are different types of nodes in a decision tree:
- Leaf nodes are those that assign labels to data instances that reach them. These are terminal nodes in that there are no flows coming out of them. In my visualizations, leaf nodes are those with rounded corners.
- Internal nodes perform checks on specific features of the instances and redirect their flows. For example, when an instance reaches the node Size < 2, if its Size is indeed below 2, the instance follows the True flow; otherwise, it follows the False flow. The internal nodes are rectangles in my visualizations.
- Root nodes are the first nodes of the trees in terms of data flow. All instances in the data must go through the tree root before reaching any other nodes. Each tree has a single root node.
Nodes and splits
We have been discussing decision trees as if they assign labels to individual instances. In practice, trees work on the given data set as a whole. We further refer to the internal nodes as splits, since each of them divides the input data into smaller portions. Below is an example of the data coming into each node in the tree. The root node takes the complete data set and divides it into a portion with Income < 70000 and one with Income ≥ 70000. The portion with income below $70,000 then goes through the next split on Size < 2 and ends up at the leaf nodes of this branch. Similarly, data with income of $70,000 or more reaches the split on Size < 4 and then the leaf nodes for their predictions.
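Assuming some made-up income and size values consistent with the splits (the actual six rows are in the table above), the root split can be sketched with boolean masks in pandas:

```python
import pandas as pd

# Hypothetical values for six families; these are illustrative,
# not the actual rows from the post's table.
data = pd.DataFrame({
    "Income": [45_000, 60_000, 65_000, 75_000, 90_000, 120_000],
    "Size":   [1, 3, 2, 3, 5, 4],
})

# Root split: the whole data set is divided by Income < 70000.
left = data[data["Income"] < 70_000]    # continues to the Size < 2 split
right = data[data["Income"] >= 70_000]  # continues to the Size < 4 split

# Each child split then divides its own portion further.
left_true = left[left["Size"] < 2]      # one leaf of the left branch

print(len(left), len(right), len(left_true))
```

Note that the two portions are disjoint and together cover the full data set, which is true at every split in a tree.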
Trees and uniqueness
Decision tree solutions are not unique. This means that, for the same data, multiple tree structures can result in similar or even identical performance. Below is a different tree model for our toy data that also reaches perfect classification.
For this reason, tree training involves more randomness than the models we have discussed so far; you may get different performances from different runs on the same data. In the next posts, we will discuss ways to address this issue.
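One way to observe this non-uniqueness in scikit-learn is to randomize the split search: with splitter="random", different seeds can grow structurally different trees that still fit the same data. The feature values below are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: [Income, Size] per family, with made-up labels.
X = np.array([[45_000, 1], [60_000, 3], [65_000, 2],
              [75_000, 3], [90_000, 5], [120_000, 4]])
y = np.array(["Good", "Bad", "Bad", "Good", "Bad", "Bad"])

# splitter="random" draws candidate thresholds at random, so different
# seeds may produce different tree structures with similar accuracy.
for seed in (0, 1, 2):
    tree = DecisionTreeClassifier(splitter="random", random_state=seed)
    tree.fit(X, y)
    print(seed, tree.get_depth(), tree.score(X, y))
```

An unpruned tree keeps splitting until its leaves are pure, so every seed here reaches perfect training accuracy even when the trees differ in shape.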
Conclusion
I hope this post is helpful for you to understand more about decision trees. Again, they are very interesting models and the basis of many other powerful models that we will talk about later on. In the next post, we will get hands-on with trees in Python and SKLearn.