So far, we have discussed decision trees in fair detail, including how they split, how they apply to regression, and their main hyperparameters. While we omitted a few technical details, what we have covered is certainly enough to start writing a decision tree pipeline. As with any other model, we can now easily build pipelines for either classification or regression. So, let us dive in.
Data and Preprocessing
Like in previous pipeline posts, I will keep using the heart disease data and the auto-mpg data to demonstrate classification and regression, respectively. The preprocessing is also identical, so there is no point in showing it again here. The only thing to note is that the final object of this step is process_pipeline, which we will combine with the model in the next step to form the complete pipeline.
from sklearn.compose import ColumnTransformer

# Combine the numeric and categorical preprocessing pipelines
process_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])
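For completeness, here is a minimal sketch of what num_pipeline and cat_pipeline could look like, assuming median imputation plus standard scaling for numeric columns and most-frequent imputation plus one-hot encoding for categorical ones; the actual pipelines from the earlier preprocessing posts may differ.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assumed numeric pipeline: impute missing values, then scale
num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

# Assumed categorical pipeline: impute, then one-hot encode
cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])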
Decision Tree Pipeline
After preprocessing, we can now write our modeling pipeline. We will tune all the hyperparameters discussed previously: max_depth, min_samples_split, min_samples_leaf, max_features, and max_leaf_nodes. All five hyperparameters take integer values here, so we can either hard-code them, as with max_depth and max_leaf_nodes, or base them on the data size, as with the other three. The decision tree pipeline below is for classification, so note that the scoring I use is accuracy. We can switch the scoring to f1 if the data is imbalanced.
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Chain preprocessing and the classifier into one pipeline
dtc = Pipeline([
    ('processing', process_pipeline),
    ('dtc', DecisionTreeClassifier())
])
data_size = train.shape[0]                                    # number of training rows
n_features = process_pipeline.fit_transform(train).shape[1]   # columns after preprocessing

# Depth and leaf counts are hard-coded; the sample- and
# feature-based hyperparameters scale with the data size
param_grid = {
    'dtc__max_depth' : [3, 4, 5],
    'dtc__min_samples_split' : [data_size//20, data_size//15, data_size//10, data_size//5],
    'dtc__min_samples_leaf' : [data_size//20, data_size//15, data_size//10, data_size//5],
    'dtc__max_features' : [n_features//4, n_features//3, n_features//2, n_features],
    'dtc__max_leaf_nodes' : [5, 10, 15, 20]
}
grid_search = GridSearchCV(dtc, param_grid, cv=5, scoring='accuracy', return_train_score=True)
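As a quick usage sketch, assuming train holds the training features and train_labels the targets (the latter is a hypothetical name here), fitting the grid search and inspecting the winning hyperparameters looks like this.

# Run the 5-fold grid search on the training data
grid_search.fit(train, train_labels)

# Inspect the best hyperparameter combination and its CV accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)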
The decision tree pipeline for regression is almost identical to the classification one, except that the tree class is now a regressor and the scoring should be r2. Also, you can see that I now use decimal numbers as the values of min_samples_split, min_samples_leaf, and max_features. This is an alternative way to make these hyperparameters dynamic with respect to the data size: for example, 0.2 means 20% of the number of rows or columns. Finally, the None value means no limit, i.e., using all features or an uncapped number of leaf nodes.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Chain preprocessing and the regressor into one pipeline
dtr = Pipeline([
    ('processing', process_pipeline),
    ('dtr', DecisionTreeRegressor())
])
# Fractional values scale with the data size; None removes the limit
param_grid = {
    'dtr__max_depth' : [3, 4, 5],
    'dtr__min_samples_split' : [0.05, 0.1, 0.2, 0.3],
    'dtr__min_samples_leaf' : [0.05, 0.1, 0.2, 0.3],
    'dtr__max_features' : [0.25, 0.5, 0.75, None],
    'dtr__max_leaf_nodes' : [5, 10, 15, None]
}
grid_search = GridSearchCV(dtr, param_grid, cv=5, scoring='r2', return_train_score=True)
After this step, we can fit() our pipelines and then use them to score() or predict() like any other modeling pipeline.
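For illustration, assuming test and test_labels are a held-out split (hypothetical names), the end-to-end usage could look like the following.

# After fitting, GridSearchCV refits the best pipeline on the full
# training set, so we can score and predict with it directly
best_model = grid_search.best_estimator_
print(best_model.score(test, test_labels))   # accuracy or r2, depending on the model
predictions = best_model.predict(test)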
Conclusion
This post wraps up the content on decision tree models. In the next one, we will discuss ensembles of trees, which are basically… many trees instead of one. So, see you again!