Decision Trees

What

A flowchart-like model that splits data on feature thresholds to make predictions.

Is age > 30?
├── Yes: Is income > 50k?
│   ├── Yes → approve loan
│   └── No  → deny loan
└── No:  Is employed?
    ├── Yes → approve loan
    └── No  → deny loan
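
A fitted tree is just nested conditionals. A literal hand-written translation of the flowchart above (feature names and the boolean return are illustrative) looks like:

```python
def approve_loan(age: int, income: float, employed: bool) -> bool:
    """Hand-coded version of the loan-approval tree above."""
    if age > 30:
        # Over 30: decision hinges on income
        return income > 50_000
    # 30 or under: decision hinges on employment
    return employed
```

In practice the learning algorithm chooses these features and thresholds from data rather than having them written by hand.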

Why they matter

  • Interpretable: you can read and explain the decision logic
  • No scaling needed: splits compare one feature to a threshold, so ranges and units don’t matter
  • Handle mixed types: numeric and categorical features
  • Foundation for Random Forests and Gradient Boosting

How splits work

At each node, find the feature + threshold that best separates the data:

  • Classification: maximize information gain (i.e., reduce entropy) or minimize Gini impurity
  • Regression: minimize MSE of the resulting groups
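
A minimal NumPy sketch of these criteria on toy data (classification case; the candidate split is scored by the size-weighted impurity of the two resulting groups):

```python
import numpy as np

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(y):
    # Shannon entropy in bits
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def split_quality(y, mask):
    # Size-weighted Gini impurity of the two groups a boolean mask produces;
    # the best split is the one that minimizes this
    left, right = y[mask], y[~mask]
    n = len(y)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

y = np.array([0, 0, 0, 1, 1, 1])
x = np.array([10, 20, 30, 40, 50, 60])
# A perfect split (x > 35) drives the weighted impurity to zero
print(split_quality(y, x > 35))  # 0.0
```

For regression, the same scheme applies with the MSE of each group around its mean in place of the impurity measure.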

The overfitting problem

An unrestricted tree will keep splitting until every leaf has one sample → perfect training accuracy, terrible generalization. Solutions:

  • max_depth: limit tree depth
  • min_samples_split: require minimum samples to split
  • min_samples_leaf: require minimum samples in each leaf
  • Pruning: grow full tree, then cut branches that don’t help

from sklearn.tree import DecisionTreeClassifier

# Cap depth and leaf size so the tree can't memorize the training set
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
model.fit(X_train, y_train)
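
The pruning route can be sketched with scikit-learn's cost-complexity pruning API. This grows a full tree, asks for its pruning path, and refits with a mid-path `ccp_alpha` (the dataset and the choice of alpha here are illustrative; in practice you'd pick alpha by cross-validation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: fits the training data (nearly) perfectly
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Cost-complexity pruning path: candidate alphas, weakest links first
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # mid-path alpha, for illustration

# Larger alpha -> more aggressive pruning -> smaller tree
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

print("train/test (full):  ", full.score(X_train, y_train), full.score(X_test, y_test))
print("train/test (pruned):", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
print("depth:", pruned.get_depth(), "<=", full.get_depth())
```

The pruned tree trades a little training accuracy for a much smaller model, which is the whole point: the branches it cuts were fitting noise, not signal.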