Random Forests
What
An ensemble of decision trees whose predictions are combined (majority vote for classification, averaging for regression). Each tree sees a random subset of data and features → diverse trees → robust predictions.
Why they work
- Individual trees overfit, but averaging many decorrelated trees reduces the ensemble's variance without adding much bias
- Bagging (Bootstrap Aggregating): each tree trains on a random sample with replacement
- Feature randomness: each split considers a random subset of features
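The two sources of randomness above can be sketched in a few lines of NumPy. This is an illustration, not how any particular library implements it; the array shapes and the square-root feature count (a common default for classification) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 16
X = rng.normal(size=(n_samples, n_features))

# Bagging: each tree trains on a bootstrap sample drawn with replacement,
# so on average only ~63% of the rows are unique per tree.
boot_idx = rng.integers(0, n_samples, size=n_samples)
X_boot = X[boot_idx]

# Feature randomness: each split considers only a random subset of the
# features (sqrt(n_features) is a common default for classification).
split_features = rng.choice(n_features, size=int(np.sqrt(n_features)), replace=False)
```

Because different trees see different rows and consider different features at each split, their individual errors are less correlated, which is what makes the averaging effective.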
In practice
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,    # number of trees
    max_depth=10,        # limit tree depth
    min_samples_leaf=5,  # prevent tiny leaves
    n_jobs=-1,           # use all CPU cores
)
model.fit(X_train, y_train)

# Feature importance — which features matter most
importances = model.feature_importances_
```

Strengths and weaknesses
Strengths:
- Works well out of the box with minimal tuning
- Handles mixed feature types; many implementations (including recent scikit-learn versions) also accept missing values natively
- Gives feature importance for free
- Hard to overfit with enough trees
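The "works out of the box" claim is easy to check on a toy dataset. A quick sketch using sklearn's built-in iris data with all-default hyperparameters (the dataset choice and random seeds are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# No tuning at all: every hyperparameter left at its default.
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)

# Feature importances come for free after fitting.
importances = clf.feature_importances_
```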
Weaknesses:
- Slow to train with many trees
- Not great for very high-dimensional sparse data (text)
- Can’t extrapolate beyond training data range
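The extrapolation weakness is worth seeing once. A tree's prediction is an average of training targets in a leaf, so outside the training range the forest plateaus instead of following a trend. A minimal sketch with a linear target (synthetic data, for illustration only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on a simple linear trend y = 3x over x in [0, 10].
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X.ravel()

reg = RandomForestRegressor(random_state=0).fit(X, y)

# Inside the training range, predictions track the trend (close to 15)...
inside = reg.predict([[5.0]])[0]
# ...but outside it the forest can only return leaf averages it saw during
# training, so it plateaus near max(y) ≈ 30 instead of the true value 60.
outside = reg.predict([[20.0]])[0]
```

Linear models or gradient boosting with a linear base learner are better choices when the target keeps trending outside the observed range.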
Links
- Decision Trees — the building block
- Gradient Boosting — the other major tree ensemble
- Hyperparameter Tuning