After multiple discussions on decision trees, we are almost done with this model family. Decision trees are very flexible, but they are also fairly unstable: their performance can vary quite a lot across different runs on the same training data. You can rerun the previous tree pipelines to observe this phenomenon in both classification and regression. One way to address this instability is to use a tree ensemble instead of a single tree, so let us talk about this model in this post.
Tree Ensemble
Just like how it sounds, a tree ensemble is a group of individual trees. To make a prediction, the input data is fed to every tree in the ensemble, yielding one prediction per tree. The model then aggregates these predictions to obtain the final output. In classification, this can be done in a “voting” manner, i.e., the label predicted by the most trees is selected. In regression, aggregation can be as simple as averaging all the predicted values.
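To make the aggregation step concrete, here is a minimal sketch with NumPy; the per-tree predictions below are made-up numbers purely for illustration.
import numpy as np

# hypothetical predictions from a 5-tree ensemble for one input
class_votes = np.array(['cat', 'dog', 'cat', 'cat', 'dog'])   # classification
reg_preds = np.array([3.2, 2.9, 3.5, 3.1, 3.0])               # regression

# classification: majority vote across the trees
labels, counts = np.unique(class_votes, return_counts=True)
final_label = labels[np.argmax(counts)]        # 'cat'

# regression: average of the individual predictions
final_value = reg_preds.mean()                 # 3.14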
To train the ensemble, there are multiple strategies. The most common ones are random forests and gradient boosting models, both of which are available in sklearn. Being ensembles of trees, these two models have all the hyperparameters of individual trees, as well as an additional one called n_estimators, which is the number of trees. Next, we will discuss each ensemble in more detail. In my GitHub, you will find the complete notebooks for training tree ensembles for classification and regression.
Random Forest
A random forest is an ensemble of independent trees, each of which is trained on a random subset of the data. Here, the subset is randomized in both rows and columns. On each subset, a decision tree is trained exactly as we discussed in the previous posts. The trained trees then form the random forest model, which makes predictions as presented earlier.
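The following is a minimal sketch of this idea for classification, assuming a feature array X and a label array y are already available; sklearn's RandomForestClassifier does all of this internally, with max_features controlling the random feature subset considered at each split.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
trees = []
for _ in range(100):                                # n_estimators = 100
    # random rows: a bootstrap sample of the training data
    rows = rng.integers(0, len(X), size=len(X))
    # random columns: each split only considers a random subset of features
    tree = DecisionTreeClassifier(max_features='sqrt')
    tree.fit(X[rows], y[rows])
    trees.append(tree)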
In terms of code, the pipeline for a random forest is practically identical to that of a decision tree. The only differences are the model class (RandomForestClassifier or RandomForestRegressor) and the addition of the n_estimators hyperparameter to the value grid. An example of the forest pipeline for classification is shown below.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# preprocessing followed by a random forest classifier
rfc = Pipeline([
    ('processing', process_pipeline),
    ('rfc', RandomForestClassifier())
])

# dataset size and number of processed features, for reference when picking grid values
data_size = train.shape[0]
n_features = process_pipeline.fit_transform(train).shape[1]

# hyperparameter grid: the individual-tree hyperparameters plus n_estimators
param_grid = {
    'rfc__n_estimators' : [25, 50, 100, 200],
    'rfc__max_depth' : [3, 4],
    'rfc__min_samples_split' : [0.05, 0.1, 0.2, 0.3],
    'rfc__min_samples_leaf' : [0.05, 0.1, 0.2, 0.3],
    'rfc__max_features' : [0.25, 0.5, 0.75, None],
    'rfc__max_leaf_nodes' : [5, 10, 20, None]
}

# 5-fold cross-validated grid search, scored by accuracy
grid_search = GridSearchCV(rfc, param_grid, cv=5, scoring='accuracy', return_train_score=True)
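From here, fitting the search works exactly as in the earlier posts. A brief sketch, assuming the training labels live in a variable called train_labels (the name is illustrative):
# run the search and inspect the best ensemble found
grid_search.fit(train, train_labels)
print(grid_search.best_params_)
print(grid_search.best_score_)
best_forest = grid_search.best_estimator_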
Gradient Boosting Model
Unlike a random forest, a gradient boosting model does not build its trees independently at random, but sequentially, with each new tree added to improve the current model. More specifically, in the first iteration, the model trains a single tree that achieves the best error it can. In each subsequent iteration, the model fits a new tree to correct the errors of the current ensemble, so that adding it yields a better error than the previous iteration and the best overall. The process repeats with the same selection criterion until the ensemble reaches n_estimators trees.
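For intuition, here is a minimal sketch of this sequential idea for regression with a squared-error loss, assuming arrays X and y are available; GradientBoostingRegressor implements a more general and better-tuned version of this loop.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.1
prediction = np.full(len(y), y.mean())        # start from a constant prediction
trees = []
for _ in range(100):                          # n_estimators = 100
    residuals = y - prediction                # what the current model still gets wrong
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                    # the next tree learns to predict the errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)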
Writing a pipeline for a gradient boosting model is identical to that of a random forest. The only thing to change is the model class, which is now GradientBoostingClassifier or GradientBoostingRegressor. An example of the modeling pipeline for regression is shown below.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# preprocessing followed by a gradient boosting regressor
gbr = Pipeline([
    ('processing', process_pipeline),
    ('gbr', GradientBoostingRegressor())
])

# dataset size and number of processed features, for reference when picking grid values
data_size = train.shape[0]
n_features = process_pipeline.fit_transform(train).shape[1]

# hyperparameter grid: the individual-tree hyperparameters plus n_estimators
param_grid = {
    'gbr__n_estimators' : [25, 50, 100, 200],
    'gbr__max_depth' : [3, 4],
    'gbr__min_samples_split' : [0.05, 0.1, 0.2, 0.3],
    'gbr__min_samples_leaf' : [0.05, 0.1, 0.2, 0.3],
    'gbr__max_features' : [0.25, 0.5, 0.75, None],
    'gbr__max_leaf_nodes' : [5, 10, 20, None]
}

# 5-fold cross-validated grid search, scored by R^2
grid_search = GridSearchCV(gbr, param_grid, cv=5, scoring='r2', return_train_score=True)
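As with the forest, the search is fitted the same way. A sketch of evaluating the winning model on held-out data, assuming train_labels, test, and test_labels exist under those (illustrative) names:
from sklearn.metrics import r2_score

grid_search.fit(train, train_labels)
best_gbr = grid_search.best_estimator_
print(r2_score(test_labels, best_gbr.predict(test)))   # R^2 on unseen data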
Conclusion
With tree ensembles, we now wrap up our coverage of the family of tree models. Unless you have a specific reason to use an individual tree (e.g., interpretability), an ensemble will most likely give better performance and stability. In the next posts, we will start discussing neural network models, so stay tuned!