Tuning Decision Trees

An illustration of hyperparameters in tuning decision trees

I have been fairly busy, so we have been stuck on classification and regression trees for some months. Let us get back on track. To tune decision trees, we need to understand their many hyperparameters, including
– Max depth
– Min samples split
– Min samples leaf
– Max features
– Max leaf nodes

While we are still not working directly with code at the moment, you can access the code used to draw all the figures here.

Example Data for Tuning Decision Trees

For illustration, we will reuse a small data set from earlier. It simply has two input features and a categorical target. With colors denoting the two classes of the target, the scatter plot of the data is as follows.

data used in examples

Using sklearn, the default tree looks like the one below. Recall that this is the fully-grown tree that reaches 100% training accuracy.

the fully-grown tree
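
Although the original data set and plotting code are only linked rather than shown, a minimal sketch of the setup might look like the following. Note that the two-feature, two-class data generated with make_classification is a stand-in, not the post's actual data.

```python
# A minimal sketch of the setup, using a stand-in data set since the
# original data is not reproduced here: two numeric features, two classes.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Stand-in for the small example data set (two inputs, binary target).
X, y = make_classification(
    n_samples=40, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, random_state=42,
)

# Default tree: no depth or size limits, so it keeps splitting until
# every leaf is pure and the training accuracy reaches 100%.
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, y)
print("Training accuracy:", tree.score(X, y))  # expect 1.0

plot_tree(tree, filled=True)
plt.show()
```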

Next, let us observe how each hyperparameter affects our tree.

Max Depth

The depth of a tree is the number of branches (connections) from the root to the furthest leaf node. Setting a max depth therefore means the tree cannot grow beyond that distance. Lower depths force the tree to be less complex, while higher depths let it grow with fewer restrictions. This hyperparameter takes effect very clearly; for example, at max_depth = 1 and max_depth = 3, the trees look like the following.

max_depth = 1

tree with max_depth = 1

max_depth = 3

tree with max_depth = 3
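
As a rough sketch, and reusing the stand-in X and y from the setup above, you could compare the two settings like this:

```python
# Sketch: restricting depth (same stand-in X, y as above).
from sklearn.tree import DecisionTreeClassifier

for depth in (1, 3):
    limited = DecisionTreeClassifier(max_depth=depth, random_state=42)
    limited.fit(X, y)
    print(f"max_depth={depth}: actual depth={limited.get_depth()}, "
          f"leaves={limited.get_n_leaves()}, "
          f"training accuracy={limited.score(X, y):.2f}")
```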

Min Samples Split

Splits are the core mechanism by which tree models operate. A node splits the data that reach it into portions, sending each portion to a new child node, usually with better predictive power. The min_samples_split hyperparameter determines the minimum number of samples a node must contain before it is allowed to split. For example, with min_samples_split = 10, only nodes with at least 10 samples can continue splitting; nodes with fewer than 10 must remain leaves regardless of their predictive capabilities.

In our small data set, the tree with min_samples_split = 5 is

tree with min_samples_split = 5

Whereas the one with min_samples_split = 10 is

tree with min_samples_split = 10

You can examine the two trees to verify that, in either case, all internal nodes have at least 5 or 10 samples, respectively. A lower min_samples_split lets the tree grow more complicated since it allows more specific splits, while higher values constrain the tree to lower complexity. However, this hyperparameter does not affect the tree structure as obviously as max_depth does. That is not to say it is less important, though!
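
If you would rather verify the constraint programmatically than by eye, a small check on the fitted tree (again with the stand-in X and y) could look like this:

```python
# Sketch: min_samples_split (same stand-in X, y as above).
from sklearn.tree import DecisionTreeClassifier

for mss in (5, 10):
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=42)
    tree.fit(X, y)
    # n_node_samples holds the sample count of every node in the fitted tree;
    # internal nodes (children_left != -1) should all have at least mss samples.
    internal = tree.tree_.children_left != -1
    smallest = tree.tree_.n_node_samples[internal].min()
    print(f"min_samples_split={mss}: smallest internal node has {smallest} samples")
```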

Min Samples Leaf

While it sounds like min_samples_split, min_samples_leaf is rather different: it sets the minimum size a leaf node is allowed to have. More specifically, no leaf in the tree can have fewer samples than the selected min_samples_leaf. Similar to the previous hyperparameter, a lower min_samples_leaf allows more complicated trees since the leaves can form more freely, while higher values lead to more regularized trees. You can check the sample counts in the leaves of the two trees below to verify its impact.

min_samples_leaf = 5

tree with min_samples_leaf = 5

min_samples_leaf = 10

tree with min_samples_leaf = 10
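
A similar check on the leaf sizes, with the stand-in data, might look like:

```python
# Sketch: min_samples_leaf (same stand-in X, y as above).
from sklearn.tree import DecisionTreeClassifier

for msl in (5, 10):
    tree = DecisionTreeClassifier(min_samples_leaf=msl, random_state=42)
    tree.fit(X, y)
    leaves = tree.tree_.children_left == -1  # leaf nodes have no children
    smallest = tree.tree_.n_node_samples[leaves].min()
    print(f"min_samples_leaf={msl}: smallest leaf has {smallest} samples, "
          f"number of leaves = {tree.get_n_leaves()}")
```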

Max Leaf Nodes

This hyperparameter is very self-descriptive. It sets the highest number of leaves that the tree can have. Naturally, a higher max_leaf_nodes means a less restricted tree, and vice versa. The impact of this hyperparameter is also quite clear and easy to verify.

max_leaf_nodes = 5

tree with max_leaf_nodes = 5

max_leaf_nodes = 8

tree with max_leaf_nodes = 8
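
A quick sketch with the stand-in data lets you confirm that the leaf count stays within the limit:

```python
# Sketch: max_leaf_nodes (same stand-in X, y as above).
from sklearn.tree import DecisionTreeClassifier

for mln in (5, 8):
    tree = DecisionTreeClassifier(max_leaf_nodes=mln, random_state=42)
    tree.fit(X, y)
    print(f"max_leaf_nodes={mln}: the tree grew {tree.get_n_leaves()} leaves")
```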

Max Features

max_features limits how many input features the tree considers when searching for the best split. However, as sklearn specifies, “the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features”. This means that a too-small max_features may not take effect at all, since the tree will fall back to other features if the ones it currently considers cannot produce a valid split. Our data is a good example of this case: the tree below still uses both inputs even with max_features = 1.

a too-small max_features may not take effect
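
To see which features the fitted tree actually ends up using, a small sketch with the stand-in data could inspect the split features directly:

```python
# Sketch: max_features=1 (same stand-in X, y as above). Only one feature is
# sampled per split search, yet the fitted tree may still use both features
# across its different splits.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_features=1, random_state=42)
tree.fit(X, y)
# Feature indices used by internal nodes (leaf nodes are marked with -2).
used = sorted({int(f) for f in tree.tree_.feature if f >= 0})
print("Features used in splits:", used)  # likely [0, 1]
```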

Conclusion

In this post, we have discussed several hyperparameters that are important in tuning decision trees. While sklearn still has a few others, I feel they are more specialized and not necessary for now. In my next post, we will write pipelines for decision trees. So, see you again!