Feature Engineering

What

Creating new input features from raw data. Often the single biggest lever for improving model performance.

Common techniques

Numeric

import numpy as np
import pandas as pd

# Log transform for right-skewed, non-negative data (log1p handles zeros)
df["log_income"] = np.log1p(df["income"])
 
# Binning (include_lowest=True keeps age 0 in the first bin; ages above 100 become NaN)
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 100], include_lowest=True)
 
# Interactions (here a ratio; a zero population yields inf or NaN)
df["rooms_per_person"] = df["rooms"] / df["population"]
 
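# Polynomial features (interaction_only=True keeps only cross terms such as x1*x2;
# set it to False to also get squared terms)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_poly = poly.fit_transform(X)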

Categorical

# One-hot encoding (few categories); get_dummies returns a new DataFrame
df = pd.get_dummies(df, columns=["color"])
 
# Ordinal (label) encoding for ordered categories; values missing from the map become NaN
df["size"] = df["size"].map({"S": 0, "M": 1, "L": 2, "XL": 3})
 
# Target encoding (many categories). Leakage risk: df must hold training rows only
# here; reuse the same means to encode validation and test rows
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
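
One common way to reduce target-encoding leakage inside the training set is out-of-fold encoding: each row is encoded with means computed from the other folds, so its own target never feeds its feature. The following is a minimal sketch on toy data, not a definitive implementation. The fold count, the random seed, and the fallback to the fitting rows' overall mean are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Toy training data; "city" and "target" mirror the column names used above.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "target": [1, 0, 1, 0, 0, 1, 1, 1],
})
df["city_encoded"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df):
    fit_rows = df.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded.
    fold_means = fit_rows.groupby("city")["target"].mean()
    encoded = df.iloc[enc_idx]["city"].map(fold_means)
    # A city absent from the fitting rows falls back to their overall mean.
    encoded = encoded.fillna(fit_rows["target"].mean())
    df.loc[df.index[enc_idx], "city_encoded"] = encoded.to_numpy()

print(df)
```

For validation or test rows, the usual approach is to map the per-city means computed on the full training set, since those rows' targets are not used at all.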

Datetime

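df["timestamp"] = pd.to_datetime(df["timestamp"])  # .dt requires a datetime column
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)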
df["month"] = df["timestamp"].dt.month

Text

df["text_length"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()
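# regex=False matches the literal substring; na=False treats missing text as "no URL"
df["has_url"] = df["text"].str.contains("http", regex=False, na=False).astype(int)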

What makes a good feature

  • Predictive: actually relates to the target
  • Available at prediction time: don’t use future data
  • Not leaky: doesn’t contain the answer (or a proxy for it)
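
The "available at prediction time" rule can be illustrated with a per-group history feature. This is a sketch under stated assumptions: the columns "store", "date", and "sales" are hypothetical, and the data is made up. The first feature averages over all rows, including future ones; the second shifts by one row so each value uses only earlier days.

```python
import pandas as pd

# Hypothetical daily sales for two stores, sorted by store and date.
df = pd.DataFrame({
    "store": ["s1"] * 4 + ["s2"] * 4,
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"] * 2
    ),
    "sales": [10, 12, 11, 15, 20, 18, 22, 21],
}).sort_values(["store", "date"]).reset_index(drop=True)

# Leaky: the per-store mean includes the current row and all future rows.
df["store_mean_all"] = df.groupby("store")["sales"].transform("mean")

# Safer: shift by one row within each store, then take an expanding mean,
# so each value is built only from strictly earlier days.
df["store_mean_past"] = df.groupby("store")["sales"].transform(
    lambda s: s.shift(1).expanding().mean()
)

print(df)
```

The first row of each store has no history, so its past-only feature is NaN. How to fill it (a global prior, a sentinel value, or dropping the row) is a modelling choice.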