Why Decision Trees Fail (and How to Fix Them)




In this article, you will learn why decision trees sometimes fail in practice and how to correct the most common issues with simple, effective techniques.

Topics we will cover include:

  • How to spot and reduce overfitting in decision trees.
  • How to recognize and fix underfitting by tuning model capacity.
  • How noisy or redundant features mislead trees and how feature selection helps.

Let’s not waste any more time.


Decision tree models for predictive machine learning tasks like classification and regression are undoubtedly rich in advantages, such as their ability to capture nonlinear relationships among features and an intuitive interpretability that makes decisions easy to trace. However, they are not perfect and can fail, especially when trained on datasets of moderate to high complexity, where issues like overfitting, underfitting, or sensitivity to noisy features typically arise.

In this article, we examine three common reasons why a trained decision tree model may fail, and we outline simple yet effective strategies to cope with these issues. The discussion is accompanied by Python examples ready for you to try yourself.

1. Overfitting: Memorizing the Data Rather Than Learning from It

Scikit-learn's simplicity and intuitiveness in building machine learning models can be tempting, and one may think that simply building a model "by default" should yield satisfactory results. However, a common problem in many machine learning models is overfitting: the model learns too much from the data, to the point that it nearly memorizes every single training example it has been exposed to. As a result, as soon as the trained model is exposed to new, unseen examples, it struggles to predict the correct output.

This example trains a decision tree on the popular, publicly available California Housing dataset: a common dataset of intermediate complexity and size used for regression tasks, namely predicting the median house price in a California district based on demographic features and average house characteristics in that district.
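A minimal sketch of this experiment follows; the 80/20 split and the random seed are illustrative choices, not prescriptions:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the dataset (downloaded automatically on first use)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a regressor with all-default hyperparameters:
# no limit on depth, no minimum leaf size
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, tree.predict(X_train))
test_mse = mean_squared_error(y_test, tree.predict(X_test))
print(f"Train MSE: {train_mse}")
print(f"Test MSE:  {test_mse}")
```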

Note that we trained a decision tree regressor without specifying any hyperparameters, including constraints on the shape and size of the tree. That has consequences: a drastic gap between the near-zero error on the training examples (on the order of 1e-16, or even exactly zero) and the much higher error on the test set. This is a clear sign of overfitting.


To address overfitting, a frequent strategy is regularization, which consists of reducing the model's complexity. While for other models this entails a somewhat intricate mathematical approach, for decision trees in scikit-learn it is as simple as constraining aspects like the maximum depth the tree can grow to, or the minimum number of samples a leaf node must contain: both hyperparameters are designed to control and prevent overgrown trees.
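A sketch of a regularized tree on the same data; the specific values `max_depth=8` and `min_samples_leaf=10` are illustrative assumptions, and in practice you would tune them (e.g. with cross-validation):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Constrain tree growth: cap the depth and require a
# minimum number of training samples in each leaf
regularized_tree = DecisionTreeRegressor(
    max_depth=8, min_samples_leaf=10, random_state=42
)
regularized_tree.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, regularized_tree.predict(X_train))
test_mse = mean_squared_error(y_test, regularized_tree.predict(X_test))
print(f"Train MSE: {train_mse:.4f}")  # no longer near zero
print(f"Test MSE:  {test_mse:.4f}")   # lower than the unconstrained tree's
```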

Overall, the second tree is preferred over the first, even though the error on the training set increased. The key lies in the error on the test data, which is normally a better indicator of how the model might behave in the real world, and that error has indeed decreased relative to the first tree.

2. Underfitting: The Tree Is Too Simple to Work Well

At the opposite end of the spectrum from overfitting lies the underfitting problem: the model has learned so little from the training data that even when evaluated on that same data, its performance falls below expectations.

While overfit trees are normally overgrown and deep, underfitting is usually associated with shallow tree structures.

One way to address underfitting is to carefully increase the model's complexity, taking care not to overshoot into the overfitting problem described earlier. Here's an example (try it yourself in a Colab notebook or similar to see results):
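A minimal sketch of an underfit tree on the California Housing data; the depth limit of 1 (a "decision stump") is an illustrative assumption chosen to make the problem obvious:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A depth-1 tree makes a single split: far too simple for this data
stump = DecisionTreeRegressor(max_depth=1, random_state=42)
stump.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, stump.predict(X_train))
test_mse = mean_squared_error(y_test, stump.predict(X_test))
# Both errors are high: the model underperforms even on its own training data
print(f"Train MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
```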

And a version that reduces the error and alleviates underfitting:
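A sketch of the same setup with a modestly deeper tree; `max_depth=6` is an illustrative choice that sits between the underfit stump and an unconstrained, overgrown tree:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# More capacity than a stump, but still constrained enough to avoid overfitting
better_tree = DecisionTreeRegressor(max_depth=6, random_state=42)
better_tree.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, better_tree.predict(X_train))
test_mse = mean_squared_error(y_test, better_tree.predict(X_test))
print(f"Train MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
```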

3. Misleading Training Features: Inducing Distraction

Decision trees can also be very sensitive to features that are irrelevant, or redundant when put together with other existing features. This is associated with the "signal-to-noise ratio": the more signal (valuable information for predictions) and less noise your data contains, the better the model performs. Imagine a tourist who got lost near Kyoto Station and asks for directions to Kiyomizu-dera Temple, located several kilometres away. Given instructions like "take bus EX101, get off at Gojozaka, and walk up the street leading uphill," the tourist will probably reach the destination easily; but if she is told to walk all the way there, following dozens of turns and street names, she might end up lost again. This is a metaphor for the signal-to-noise ratio in models like decision trees.

Careful, strategic feature selection is typically the way to address this issue. This slightly more elaborate example compares a baseline tree model, a model trained after intentionally injecting artificial noise into the dataset to simulate poor-quality training data, and a model trained after feature selection to recover performance.
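A sketch of this three-way comparison, using a synthetic classification dataset (`make_classification`) as a stand-in; the sample count, the number of injected noise features, and the selection score (`f_classif`) are all illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Synthetic dataset: 20 features, of which 10 are informative and 5 redundant
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_redundant=5, random_state=42
)

# 1) Baseline tree on the clean features
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
baseline = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
acc_baseline = accuracy_score(y_te, baseline.predict(X_te))

# 2) Inject 80 pure-noise features to simulate poor-quality data
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 80))])
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(
    X_noisy, y, test_size=0.3, random_state=42
)
noisy = DecisionTreeClassifier(random_state=42).fit(Xn_tr, yn_tr)
acc_noisy = accuracy_score(yn_te, noisy.predict(Xn_te))

# 3) Keep only the k=20 highest-scoring features before training
selector = SelectKBest(f_classif, k=20).fit(Xn_tr, yn_tr)
selected = DecisionTreeClassifier(random_state=42).fit(
    selector.transform(Xn_tr), yn_tr
)
acc_selected = accuracy_score(yn_te, selected.predict(selector.transform(Xn_te)))

print(f"Baseline accuracy:       {acc_baseline:.3f}")
print(f"With noise features:     {acc_noisy:.3f}")
print(f"After feature selection: {acc_selected:.3f}")
```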

If everything went well, the model built after feature selection should yield the best results. Try playing with the k for feature selection (set as 20 in the example) and see if you can further improve the last model’s performance.

Conclusion

In this article, we explored and illustrated three common issues that may lead trained decision tree models to behave poorly: overfitting, underfitting, and misleading features. We also showed simple yet effective strategies to mitigate these problems.




