Feature Engineering
What
Creating new input features from raw data. Often the single biggest lever for improving model performance.
Common techniques
Numeric
import numpy as np
import pandas as pd
# Log transform for skewed, non-negative data (log1p handles zeros)
df["log_income"] = np.log1p(df["income"])
# Binning into right-closed intervals; ages outside 0-100 become NaN
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 50, 65, 100], include_lowest=True)
# Interactions (here, the ratio of two columns)
df["rooms_per_person"] = df["rooms"] / df["population"]
# Polynomial features (interaction_only=True keeps cross terms such as a*b and drops squares)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True)
X_poly = poly.fit_transform(X)
Categorical
# One-hot encoding (few categories); get_dummies returns a new DataFrame
df = pd.get_dummies(df, columns=["color"])
# Ordinal encoding (categories with a natural order); unmapped values become NaN
df["size"] = df["size"].map({"S": 0, "M": 1, "L": 2, "XL": 3})
# Target encoding (many categories); compute the means on training rows only, or the target leaks into the feature
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
Datetime
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
df["month"] = df["timestamp"].dt.month
Text
df["text_length"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()
df["has_url"] = df["text"].str.contains("http").astype(int)
What makes a good feature
- Predictive: actually relates to the target
- Available at prediction time: don’t use future data
- Not leaky: doesn’t contain the answer (or a proxy for it)
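The target encoding snippet under Categorical takes category means over every row, so each row's own target ends up inside its feature, which is exactly the kind of leak the last bullet warns about. One common way around this is out-of-fold encoding. The sketch below is a minimal illustration under stated assumptions, not a definitive implementation: the helper name oof_target_encode, its parameters, and the toy data are hypothetical, while the city and target column names follow the earlier snippet.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Hypothetical helper: out-of-fold target encoding of df[col]."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fit_rows = df.iloc[fit_idx]
        # Category means come only from rows outside the current fold.
        fold_means = fit_rows.groupby(col)[target].mean()
        fold_values = df.iloc[enc_idx][col].map(fold_means)
        # Categories unseen in the fitting rows fall back to the
        # overall target mean of those same fitting rows.
        fold_values = fold_values.fillna(fit_rows[target].mean())
        encoded.iloc[enc_idx] = fold_values.to_numpy()
    return encoded


# Toy data, made up for illustration only.
toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "target": [1, 0, 1, 0, 0, 1, 1, 1],
})
toy["city_encoded"] = oof_target_encode(toy, "city", "target", n_splits=4)
```

At prediction time there is no target to leak, so the category means would typically be computed once from the full training set and mapped onto the new rows.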