I have been fairly busy, so we have been stuck on classification and regression trees for some months. Let us get back on track. To tune decision trees, we need to understand the many hyperparameters they have, including
– Max depth
– Min samples split
– Min samples leaf
– Max features
– Max leaf nodes
While we are still not working directly with code at the moment, you can access the code used to draw all the figures here.
Example Data for Tuning Decision Trees
For illustration, we will reuse a small data set from earlier. It has just two input features and a categorical target. With colors denoting the two classes of the target, the scatter plot of the data looks as follows.
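(If you want to follow along without the original data, the minimal sketch below builds a comparable stand-in with make_blobs; the figure itself uses the actual data set, so your plot will differ.)

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: two input features, two classes
X, y = make_blobs(n_samples=50, n_features=2, centers=2, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y)   # color the points by their class
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.show()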
Using sklearn, the default tree looks like the one below. Recall that this is the fully grown tree that reaches 100% accuracy on the training data.
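As a rough sketch (using the stand-in X and y from above, not the exact data behind the figures), the default tree can be grown and drawn like this:

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# X, y: the two input features and class labels defined earlier
tree = DecisionTreeClassifier(random_state=0)   # every hyperparameter left at its default
tree.fit(X, y)

print(tree.score(X, y))   # the fully grown tree fits its training data perfectly
plot_tree(tree, filled=True)
plt.show()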
Next, let us observe how each hyperparameter affects our tree.
Max Depth
The depth of a tree is the number of branches (connections) from the root to the furthest leaf node. Setting a max depth therefore means the tree cannot grow beyond that distance. Lower depths force the tree to be less complex, while higher depths let it grow with fewer restrictions. The effect of this hyperparameter is very clear: for example, at max_depth = 1 and max_depth = 3, the trees become the following.
max_depth = 1
max_depth = 3
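A quick sketch to reproduce this comparison (again with the stand-in X and y, so the exact trees will differ from the figures):

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

for depth in (1, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X, y)
    # shallower trees stay simple; deeper ones are allowed to keep growing
    plot_tree(tree, filled=True)
    plt.title(f"max_depth = {depth}")
    plt.show()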
Min Samples Split
Splits are the core mechanism that makes tree models work. A node splits the data that reach it into separate flows, sending each portion to a new node, usually with better predictive power. The min_samples_split hyperparameter determines the minimum number of samples a node must have before it is allowed to split. For example, with min_samples_split = 10, only nodes with at least 10 samples can continue splitting; nodes with fewer than 10 must remain leaves regardless of their predictive capabilities.
In our small data set, the tree with min_samples_split = 5 is:
Whereas the one with min_samples_split = 10 is:
You can examine the two trees to verify that, in either case, all internal nodes have no fewer than 5 or 10 samples, respectively. A lower min_samples_split lets the tree grow more complicated, as it allows more specific splits, while higher values constrain the tree to lower complexity. However, this hyperparameter does not affect the tree structure as visibly as max_depth does. That is not to say it is less important, though!
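As a sketch (still on the stand-in X and y), you can grow and inspect the two trees yourself:

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

for split in (5, 10):
    tree = DecisionTreeClassifier(min_samples_split=split, random_state=0)
    tree.fit(X, y)
    # every node that actually splits holds at least `split` samples
    plot_tree(tree, filled=True)
    plt.title(f"min_samples_split = {split}")
    plt.show()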
Min Samples Leaf
While it sounds like min_samples_split, min_samples_leaf is rather different: it sets the minimum size a leaf node is allowed to have. More specifically, no leaf in the tree can contain fewer samples than the selected min_samples_leaf. Similar to the previous hyperparameter, a lower min_samples_leaf allows more complicated trees, as the leaves can grow more freely, while higher values lead to more regularized trees. You can check the sample sizes in the leaves of the two trees below to verify its impact.
min_samples_leaf = 5
min_samples_leaf = 10
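Here is the corresponding sketch (stand-in X and y again); the sample counts that plot_tree prints inside each leaf should never fall below the chosen value:

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

for leaf in (5, 10):
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    tree.fit(X, y)
    # no leaf in the fitted tree contains fewer than `leaf` samples
    plot_tree(tree, filled=True)
    plt.title(f"min_samples_leaf = {leaf}")
    plt.show()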
Max Leaf Nodes
This hyperparameter is very self-descriptive: it sets the maximum number of leaves the tree can have. A higher max_leaf_nodes means a less restricted tree, and vice versa. The impact of this hyperparameter is also quite clear and easy to verify.
max_leaf_nodes = 5
max_leaf_nodes = 8
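A small sketch to verify the cap on the number of leaves (stand-in X and y as before):

from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

for n_leaves in (5, 8):
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X, y)
    print(n_leaves, tree.get_n_leaves())   # the fitted tree never exceeds the cap
    plot_tree(tree, filled=True)
    plt.show()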
Max Features
max_features constrains how many input features the tree can consider when searching for each split. However, as sklearn specifies, “the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features”. This means that a max_features value that is too small may not take effect at all, since the tree will fall back to other features if those it currently considers cannot produce a valid split. Our data is a good example of this: the tree below still uses both inputs even with max_features = 1.
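A sketch to check this behavior (stand-in X and y; with only two features, the effect of max_features = 1 is easy to see):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_features=1, random_state=0)
tree.fit(X, y)

# Only one randomly chosen feature is considered at each split at first, but
# sklearn keeps searching until it finds a valid partition, so both features
# can still end up being used somewhere in the tree.
print(tree.feature_importances_)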
Conclusion
In this post, we have discussed several hyperparameters that are important in tuning decision trees. While sklearn still has a few others, I feel they are more specialized and not necessary for now. In my next post, we will write pipelines for decision trees. So, see you again!