Ensemble Methods

What

Combine multiple models to get better predictions than any single model. The core idea: individual models make different errors, and combining them cancels those errors out, provided the errors are not perfectly correlated.

Bagging (Bootstrap Aggregating)

Train multiple models, each on a bootstrap sample of the training data (drawn at random with replacement, usually the same size as the original set), then average their predictions (regression) or take a majority vote (classification). Each model sees a different slice of the data, so they make different mistakes.

Random Forests are the classic example: a bag of decision trees in which each split also considers only a random subset of the features, which makes the trees even more diverse.

Boosting

Train models sequentially. Each new model focuses on the mistakes of the previous ones. The final prediction is a weighted sum of all models.

  • AdaBoost: increase weight of misclassified samples, so next model focuses on hard cases
  • Gradient Boosting: each new model fits the residual errors (gradient of the loss)
  • XGBoost / LightGBM: optimized gradient boosting with regularization and speed tricks
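The residual-fitting idea can be sketched with scikit-learn's `GradientBoostingClassifier` (dataset and hyperparameters here are illustrative defaults, not a recommendation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 shallow trees trained sequentially; each new tree fits the gradient
# of the loss (the "residuals") left by the ensemble so far
boost = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # shrinks each tree's contribution
    max_depth=3,
    random_state=0,
)
boost.fit(X_train, y_train)
print(boost.score(X_test, y_test))
```

Note the trees are deliberately shallow: boosting builds a strong model out of weak learners, whereas bagging starts from strong (deep) ones.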

Stacking

Train several different models (e.g., SVM, Random Forest, KNN), then use their predictions as features for a “meta-model” (often logistic regression). To avoid leakage, the meta-model is trained on out-of-fold predictions rather than on predictions over the base models' own training data. The meta-model learns which base models to trust for which inputs.
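A minimal sketch with scikit-learn's `StackingClassifier`, which handles the out-of-fold bookkeeping internally (the dataset is synthetic and the base models are just the examples named above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# base models' cross-validated predictions become the meta-model's features
stack = StackingClassifier(
    estimators=[
        ("svc", SVC()),
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # out-of-fold predictions prevent the meta-model seeing leaked labels
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```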

Comparison

| Aspect           | Bagging                | Boosting                   | Stacking               |
|------------------|------------------------|----------------------------|------------------------|
| Training         | Parallel               | Sequential                 | Two stages             |
| Reduces          | Variance               | Bias                       | Both                   |
| Overfitting risk | Low                    | Higher (can overfit noise) | Medium                 |
| Speed            | Fast (parallelizable)  | Slower (sequential)        | Depends on base models |
| Example          | Random Forest          | XGBoost                    | Blending diverse models |

Why ensembles work

  • Variance reduction (bagging): averaging noisy models smooths out random errors
  • Bias reduction (boosting): iteratively correcting errors lets simple models capture complex patterns
  • Diversity matters: ensembles of identical models don’t help. You need models that disagree on different inputs
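The variance-reduction claim is easy to check numerically. A toy sketch (no real models, just unbiased noisy estimates of a true value, which is the idealized independent-errors case):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 1.0

# 10 "models", each an unbiased but noisy estimate of the true value,
# sampled over 10,000 independent trials
single = true_value + rng.normal(0.0, 1.0, size=(10_000, 10))

# variance of one model's estimate vs. variance of the 10-model average
single_var = single[:, 0].var()
ensemble_var = single.mean(axis=1).var()
print(single_var, ensemble_var)
```

For independent errors, averaging n models divides the variance by n; correlated errors (identical models) cancel nothing, which is why diversity matters.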

Code example

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic data so the example runs standalone
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# soft voting = average the predicted class probabilities
# (SVC needs probability=True to expose predict_proba)
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", SVC(probability=True)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))