Decision Trees
What
A flowchart-like model that splits data on feature thresholds to make predictions.
Is age > 30?
├── Yes: Is income > 50k?
│   ├── Yes → approve loan
│   └── No → deny loan
└── No: Is employed?
    ├── Yes → approve loan
    └── No → deny loan
Why they matter
- Interpretable: you can read and explain the decision logic
- No scaling needed: splits are threshold comparisons, so feature ranges don’t matter
- Handle mixed types: numeric and categorical features
- Foundation for Random Forests and Gradient Boosting
How splits work
At each node, find the feature + threshold that best separates the data:
- Classification: maximize information gain (reduce Entropy) or minimize Gini impurity
- Regression: minimize MSE of the resulting groups
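The split criterion can be computed in a few lines. A sketch for the classification case, using Gini impurity and a toy age/approval dataset (the data and function names are made up for illustration):

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_score(feature, labels, threshold):
    # Weighted Gini of the two groups produced by the split;
    # lower is better, and the tree greedily picks the best threshold
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

age = np.array([22, 25, 35, 40, 48, 52])
approved = np.array([0, 0, 1, 1, 1, 1])
print(split_score(age, approved, 30))  # 0.0 — both groups are pure
print(split_score(age, approved, 45))  # ~0.33 — left group is mixed
```

At each node the learner evaluates candidate thresholds for every feature and keeps the one with the lowest weighted impurity.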
The overfitting problem
An unrestricted tree will keep splitting until every leaf has one sample → perfect training accuracy, terrible generalization. Solutions:
- max_depth: limit tree depth
- min_samples_split: require minimum samples to split
- min_samples_leaf: require minimum samples in each leaf
- Pruning: grow full tree, then cut branches that don’t help
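You can see the overfitting gap directly by comparing an unrestricted tree against a regularized one. A sketch on synthetic data (dataset parameters are arbitrary, chosen just to make the effect visible):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic problem: 20 features, only 5 actually informative
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unrestricted tree: grows until every leaf is pure
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Regularized tree: depth and leaf-size limits from the list above
small = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10,
                               random_state=0).fit(X_train, y_train)

for name, m in [("full", full), ("regularized", small)]:
    print(name, "train:", m.score(X_train, y_train),
          "test:", m.score(X_test, y_test))
```

The unrestricted tree scores 1.0 on training data but worse on test data; the regularized tree gives up some training accuracy to generalize better.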
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10)
model.fit(X_train, y_train)
Links
- Random Forests — ensemble of trees
- Gradient Boosting — sequential tree building
- Entropy — the math behind splits