Random Forests

What

An ensemble of decision trees that vote on the prediction (or average it, for regression). Each tree sees a random subset of data and features → diverse trees → robust predictions.

Why they work

  • Individual trees overfit (high variance), but averaging many decorrelated trees reduces that variance
  • Bagging (Bootstrap Aggregating): each tree trains on a random sample with replacement
  • Feature randomness: each split considers a random subset of features
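The two sources of randomness above can be sketched from scratch: bootstrap-sample the rows for each tree, restrict each split to a random feature subset, and take a majority vote. A minimal illustration using scikit-learn's DecisionTreeClassifier on a synthetic dataset (dataset and ensemble size are arbitrary choices here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

trees = []
for _ in range(25):
    # Bagging: sample rows with replacement (bootstrap sample)
    idx = rng.integers(0, len(X), size=len(X))
    # Feature randomness: each split considers only sqrt(n_features) features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Majority vote across the ensemble
votes = np.stack([t.predict(X) for t in trees])
pred = (votes.mean(axis=0) > 0.5).astype(int)
```

This is essentially what RandomForestClassifier does internally, minus the parallelism and bookkeeping.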

In practice

from sklearn.ensemble import RandomForestClassifier
 
model = RandomForestClassifier(
    n_estimators=100,    # number of trees
    max_depth=10,        # limit tree depth
    min_samples_leaf=5,  # prevent tiny leaves
    n_jobs=-1,           # use all CPU cores
)
model.fit(X_train, y_train)
 
# Feature importance — which features matter most
importances = model.feature_importances_
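The importances come back as a bare array in column order, so it helps to pair them with feature names and sort. A short sketch using the iris dataset as a stand-in (any dataset with named features works the same way):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
model.fit(data.data, data.target)

# Pair each importance with its feature name, sort descending
ranked = sorted(zip(model.feature_importances_, data.feature_names), reverse=True)
for imp, name in ranked:
    print(f"{name}: {imp:.3f}")
```

Note that impurity-based importances are normalized to sum to 1 and can be biased toward high-cardinality features; permutation importance is a common cross-check.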

Strengths and weaknesses

Strengths:

  • Works well out of the box with minimal tuning
  • Handles mixed feature types; many implementations also tolerate missing values
  • Gives feature importance for free
  • Adding more trees doesn't increase overfitting; it only stabilizes the average

Weaknesses:

  • Slow to train with many trees
  • Not great for very high-dimensional sparse data (e.g., bag-of-words text features)
  • Can’t extrapolate beyond training data range
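The extrapolation weakness is easy to demonstrate: a forest predicts by averaging leaf values it saw during training, so outside the training range its output plateaus instead of following the trend. A sketch with a simple linear target (the specific numbers are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(0, 100, dtype=float).reshape(-1, 1)
y = X.ravel() * 2.0  # linear trend: y = 2x, max training target is 198

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Inside the training range: prediction tracks the trend (~100 at x=50)
inside = model.predict([[50.0]])[0]
# Outside the range: prediction plateaus near the largest target seen (~198),
# nowhere near the true value of 1000 at x=500
outside = model.predict([[500.0]])[0]
```

A linear model would extrapolate this trend exactly; the forest cannot, which is worth remembering for time series or any target that drifts outside historical bounds.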