So far, we have discussed decision trees in fair detail, including how they split, how they apply to regression, and their main hyperparameters. While we omitted a few technical details, what we have covered is certainly enough to start writing a decision tree pipeline. As with any other model, we can now easily build pipelines for either classification or regression. So, let us dive in.
Data and Preprocessing
Like in previous pipeline posts, I will keep using the heart disease data and the auto-mpg data to demonstrate classification and regression, respectively. The preprocessing is also identical, so there is no point in showing it again here. The only thing to note is that the final object of this step is process_pipeline, which we will combine with the model in the next step to form the complete pipeline.
from sklearn.compose import ColumnTransformer

# Combine the numeric and categorical preprocessing pipelines
process_pipeline = ColumnTransformer([
    ('numeric', num_pipeline, num_cols),
    ('class', cat_pipeline, cat_cols)
])
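For completeness, here is a minimal sketch of what num_pipeline and cat_pipeline could look like, assuming median imputation plus standard scaling for numeric columns and most-frequent imputation plus one-hot encoding for categorical ones; the actual pipelines from the earlier preprocessing posts may differ.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assumed numeric pipeline: impute missing values, then scale
num_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])

# Assumed categorical pipeline: impute, then one-hot encode
cat_pipeline = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])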
Decision Tree Pipeline
After preprocessing, we can now write our modeling pipeline. We will tune all the hyperparameters discussed previously: max_depth, min_samples_split, min_samples_leaf, max_features, and max_leaf_nodes. All five hyperparameters take integer values here, so we can either hard-code them, as with max_depth and max_leaf_nodes, or base them on the data size, as with the other three. The decision tree pipeline below is for classification, so note that the scoring I use is accuracy. We can switch the scoring to f1 if the data is imbalanced.
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Chain preprocessing and the classifier into one pipeline
dtc = Pipeline([
    ('processing', process_pipeline),
    ('dtc', DecisionTreeClassifier())
])
data_size = train.shape[0]                                    # number of training rows
n_features = process_pipeline.fit_transform(train).shape[1]   # columns after preprocessing

# Depth and leaf counts are hard-coded; the sample- and
# feature-based hyperparameters scale with the data size
param_grid = {
    'dtc__max_depth' : [3, 4, 5],
    'dtc__min_samples_split' : [data_size//20, data_size//15, data_size//10, data_size//5],
    'dtc__min_samples_leaf' : [data_size//20, data_size//15, data_size//10, data_size//5],
    'dtc__max_features' : [n_features//4, n_features//3, n_features//2, n_features],
    'dtc__max_leaf_nodes' : [5, 10, 15, 20]
}
grid_search = GridSearchCV(dtc, param_grid, cv=5, scoring='accuracy', return_train_score=True)
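As a quick usage sketch, assuming train holds the training features and train_labels the targets (the latter is a hypothetical name here), fitting the grid search and inspecting the winning hyperparameters looks like this.

# Run the 5-fold grid search on the training data
grid_search.fit(train, train_labels)

# Inspect the best hyperparameter combination and its CV accuracy
print(grid_search.best_params_)
print(grid_search.best_score_)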
The decision tree pipeline for regression is almost identical to the classification one, except that the tree class is now a regressor and the scoring should be r2. Also, you can see that I now use decimal numbers as the values of min_samples_split, min_samples_leaf, and max_features. This is an alternative way to make these hyperparameters dynamic with respect to the data size: for example, 0.2 means 20% of the number of rows or columns. Finally, the None value means no limit, i.e., using all features or an uncapped number of leaf nodes.
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Chain preprocessing and the regressor into one pipeline
dtr = Pipeline([
    ('processing', process_pipeline),
    ('dtr', DecisionTreeRegressor())
])
# Fractional values scale with the data size; None removes the limit
param_grid = {
    'dtr__max_depth' : [3, 4, 5],
    'dtr__min_samples_split' : [0.05, 0.1, 0.2, 0.3],
    'dtr__min_samples_leaf' : [0.05, 0.1, 0.2, 0.3],
    'dtr__max_features' : [0.25, 0.5, 0.75, None],
    'dtr__max_leaf_nodes' : [5, 10, 15, None]
}
grid_search = GridSearchCV(dtr, param_grid, cv=5, scoring='r2', return_train_score=True)
After this step, we can fit() our pipelines and then use them to score() or predict() like any other modeling pipeline.
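For illustration, assuming test and test_labels are a held-out split (hypothetical names), the end-to-end usage could look like the following.

# After fitting, GridSearchCV refits the best pipeline on the full
# training set, so we can score and predict with it directly
best_model = grid_search.best_estimator_
print(best_model.score(test, test_labels))   # accuracy or r2, depending on the model
predictions = best_model.predict(test)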
Conclusion
This post wraps up the content on decision tree models. In the next one, we will discuss ensembles of trees, which are basically… many trees instead of one. So, see you again!