Feature Engineering Cookbook
Goal: Practical, copy-paste-ready recipes for the most common feature transforms. Real data, real code, real results.
Prerequisites: Feature Engineering, Feature Scaling, Exploratory Data Analysis, Data Cleaning
Setup
We’ll use the Ames Housing dataset — rich enough to show all transforms:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
# Load Ames Housing
housing = fetch_openml(name="house_prices", as_frame=True)
df = housing.frame
print(f"Shape: {df.shape}")
print(f"Target: SalePrice")
df.head()

Baseline
Before any engineering — just throw numeric columns at a model:
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop("SalePrice")
X_base = df[numeric_cols].fillna(0)
y = np.log1p(df["SalePrice"]) # log-transform target for normality
# Scale inside the CV loop (via a pipeline) so the scaler never sees held-out folds
baseline = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X_base, y, cv=5, scoring="r2")
print(f"Baseline R²: {baseline.mean():.4f} ± {baseline.std():.4f}")

Recipe 1: Log Transform Skewed Features
Log-transforming right-skewed features (long tail on the right) compresses the scale and helps linear models:
from scipy.stats import skew
# Find skewed numeric features
skewness = X_base.apply(skew).sort_values(ascending=False)
skewed = skewness[skewness > 1].index
print(f"Highly skewed features ({len(skewed)}):")
print(skewness[skewness > 1].head(10))
# Log-transform them
X_log = X_base.copy()
X_log[skewed] = np.log1p(X_base[skewed].clip(lower=0))
# Compare distributions
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for ax, col in zip(axes[0], skewed[:3]):
    ax.hist(X_base[col], bins=50)
    ax.set_title(f"{col} (original, skew={skew(X_base[col]):.1f})")
for ax, col in zip(axes[1], skewed[:3]):
    ax.hist(X_log[col], bins=50)
    ax.set_title(f"{col} (log-transformed)")
plt.tight_layout()
plt.show()

When to use: Linear models, neural nets. Not needed for tree models.
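If a feature contains zeros or negative values, a plain log transform breaks down. scikit-learn's PowerTransformer with the Yeo-Johnson method learns a per-feature power transform that handles both cases. A minimal sketch on a synthetic skewed feature (the data here is illustrative, not from Ames):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature (lognormal)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Yeo-Johnson learns a per-feature power transform; unlike log1p it also
# handles negative values out of the box
pt = PowerTransformer(method="yeo-johnson")
x_t = pt.fit_transform(x)

print(f"skew before: {skew(x[:, 0]):.2f}, after: {skew(x_t[:, 0]):.2f}")
```

Because the transform is fitted, put it in a pipeline so it is learned on training folds only, just like a scaler.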
Recipe 2: Categorical Encoding
One-hot encoding (few categories)
# Identify categorical columns with reasonable cardinality
cat_cols = df.select_dtypes(include=["object", "category"]).columns
for col in cat_cols[:5]:
    print(f"{col}: {df[col].nunique()} unique values")
# One-hot encode low-cardinality features
low_card = [col for col in cat_cols if df[col].nunique() <= 10]
X_onehot = pd.get_dummies(df[low_card], drop_first=True)
print(f"One-hot features: {X_onehot.shape[1]}")

Ordinal encoding (ordered categories)
# Quality features have natural order
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
quality_cols = ["ExterQual", "ExterCond", "BsmtQual", "KitchenQual", "HeatingQC"]
X_ordinal = pd.DataFrame()
for col in quality_cols:
    if col in df.columns:
        X_ordinal[col] = df[col].map(quality_map).fillna(0)
print(X_ordinal.head())

Target encoding (many categories)
def target_encode(df, col, target, smoothing=10):
    """Encode categorical with mean target value, smoothed toward global mean."""
    global_mean = target.mean()
    agg = target.groupby(df[col]).agg(["mean", "count"])
    smooth = (agg["count"] * agg["mean"] + smoothing * global_mean) / (agg["count"] + smoothing)
    return df[col].map(smooth).fillna(global_mean)
# Only do this on training data to avoid leakage!
X_train_df, X_test_df, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)
neighborhood_encoded = target_encode(X_train_df, "Neighborhood", y_train)
print(neighborhood_encoded.head())

Leakage warning: Target encoding uses the target — always fit on train only.
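Even within the training set, each row still contributes to its own encoding. Out-of-fold target encoding removes that as well: each row is encoded with smoothed category means computed on the other folds. A sketch (the helper name and toy data are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, smoothing=10, n_splits=5, seed=42):
    """Out-of-fold target encoding: each row is encoded with smoothed
    category means from the other folds, so no row sees its own target."""
    global_mean = target.mean()
    encoded = pd.Series(global_mean, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(df):
        tr_target = target.iloc[tr_idx]
        agg = tr_target.groupby(df[col].iloc[tr_idx]).agg(["mean", "count"])
        smooth = (agg["count"] * agg["mean"] + smoothing * global_mean) / (agg["count"] + smoothing)
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(smooth).fillna(global_mean).values
    return encoded

# Toy demo: category "a" is cheap, "c" is expensive
toy = pd.DataFrame({"cat": ["a", "a", "b", "b", "c"] * 20})
y_toy = pd.Series(np.tile([1.0, 2.0, 3.0, 4.0, 5.0], 20))
enc = target_encode_oof(toy, "cat", y_toy, smoothing=1)
print(enc.head())
```

For held-out data you would still encode with statistics fitted on the full training set, as in the recipe above.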
Recipe 3: Interaction Features
Combine features that have a multiplicative relationship:
X_eng = X_base.copy()
# Total area
X_eng["TotalSF"] = df["TotalBsmtSF"].fillna(0) + df["1stFlrSF"] + df["2ndFlrSF"]
# Total bathrooms
X_eng["TotalBath"] = (df["FullBath"] + 0.5 * df["HalfBath"] +
                      df["BsmtFullBath"].fillna(0) + 0.5 * df["BsmtHalfBath"].fillna(0))
# Quality × Size
X_eng["QualxSF"] = df["OverallQual"] * X_eng["TotalSF"]
# Age
X_eng["Age"] = df["YrSold"] - df["YearBuilt"]
X_eng["RemodAge"] = df["YrSold"] - df["YearRemodAdd"]
print("Features added: TotalSF, TotalBath, QualxSF, Age, RemodAge")

Recipe 4: Binning
Turn continuous features into categories when the relationship is non-linear:
# Age bins — value drops nonlinearly with age
X_eng["AgeBin"] = pd.cut(X_eng["Age"], bins=[0, 5, 15, 30, 60, 200],
                         labels=[0, 1, 2, 3, 4], include_lowest=True).astype(float)
# include_lowest=True keeps houses sold the year they were built (Age=0) in the first bin
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(X_eng["Age"], y, s=5, alpha=0.3)
axes[0].set_xlabel("Age"); axes[0].set_ylabel("log(SalePrice)")
axes[0].set_title("Continuous age vs price")
axes[1].scatter(X_eng["AgeBin"], y, s=5, alpha=0.3)
axes[1].set_xlabel("Age bin"); axes[1].set_ylabel("log(SalePrice)")
axes[1].set_title("Binned age vs price")
plt.tight_layout()
plt.show()

When to use: Linear models that can’t capture non-linearity. Trees don’t need binning.
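pd.cut makes equal-width bins, which can leave some bins nearly empty on skewed features; pd.qcut makes equal-population bins instead. A sketch on a synthetic skewed age feature:

```python
import numpy as np
import pandas as pd

# Synthetic skewed "age" feature
rng = np.random.default_rng(0)
age = pd.Series(rng.exponential(scale=30, size=1000))

# Equal-width bins: most rows pile into the first bin
width_bins = pd.cut(age, bins=5, labels=False)
# Quantile bins: each bin holds ~200 rows by construction
quant_bins = pd.qcut(age, q=5, labels=False)

print("equal-width counts:\n", width_bins.value_counts().sort_index())
print("quantile counts:\n", quant_bins.value_counts().sort_index())
```

The hand-picked bin edges used for AgeBin above are a third option: they encode domain knowledge rather than the data's shape.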
Recipe 5: Missing Value Indicators
Sometimes the fact that a value is missing is itself informative:
# Columns with missing values
missing = df.isnull().sum()
missing_cols = missing[missing > 0].index
print(f"Columns with missing values: {len(missing_cols)}")
# Create indicator columns
for col in missing_cols:
    if col in X_eng.columns:
        X_eng[f"{col}_missing"] = df[col].isnull().astype(int)
# Example: GarageYrBlt missing means no garage
print(f"Missing GarageYrBlt → no garage. Mean log-price with garage: "
      f"{y[df['GarageYrBlt'].notna()].mean():.3f}, without: "
      f"{y[df['GarageYrBlt'].isna()].mean():.3f}")

Recipe 6: Polynomial Features (Targeted)
Don’t use PolynomialFeatures blindly — pick features where you expect non-linearity:
# OverallQual has a non-linear relationship with price
X_eng["OverallQual_sq"] = df["OverallQual"] ** 2
X_eng["GrLivArea_sq"] = df["GrLivArea"] ** 2
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(df["OverallQual"], y, s=5, alpha=0.3)
axes[0].set_title("Qual vs Price (quadratic?)")
axes[1].scatter(df["GrLivArea"], y, s=5, alpha=0.3)
axes[1].set_title("Living Area vs Price")
plt.tight_layout()
plt.show()

Measure the Impact
# Combine all engineered features
X_final = pd.concat([X_eng, X_ordinal, X_onehot], axis=1).fillna(0)
X_final = X_final.select_dtypes(include=[np.number])
# Compare baseline vs engineered
scores_base = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X_base, y, cv=5, scoring="r2")
scores_eng = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X_final, y, cv=5, scoring="r2")
print(f"Baseline R²: {scores_base.mean():.4f} ± {scores_base.std():.4f}")
print(f"Engineered R²: {scores_eng.mean():.4f} ± {scores_eng.std():.4f}")
print(f"Improvement: {scores_eng.mean() - scores_base.mean():.4f}")

Feature Importance Sanity Check
Which features matter most?
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_final, y)  # tree models don't need feature scaling
importances = pd.Series(model.feature_importances_, index=X_final.columns)
importances.nlargest(15).plot.barh(figsize=(10, 6))
plt.xlabel("Feature importance")
plt.title("Top 15 features")
plt.gca().invert_yaxis()
plt.show()

Exercises
- Ratio features: Create BathPerRoom = TotalBath / TotRmsAbvGrd. Do ratio features improve performance?
- Rare category grouping: For Neighborhood, group neighborhoods with < 20 houses into “Other”. Does this help or hurt?
- Temporal features: If you have a timestamp, extract day_of_week, month, is_weekend, quarter. Apply this to a time-series dataset.
- Automatic feature selection: After engineering 50+ features, use SelectKBest or LASSO to drop the worst. What’s the optimal feature count?
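As a starting point for the feature-selection exercise, a minimal SelectKBest sketch on synthetic data (the dataset and the choice of k here are illustrative, not tuned):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: only the first three of ten columns carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + 2 * X[:, 1] - X[:, 2] + rng.normal(0, 0.5, size=200)

# Keep the k features with the strongest univariate F-score
selector = SelectKBest(score_func=f_regression, k=3)
X_sel = selector.fit_transform(X, y)
print("kept columns:", np.flatnonzero(selector.get_support()))
```

Note that univariate scores can miss features that only matter in combination (like the interaction features from Recipe 3), which is one reason to also try LASSO.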
Next: 10 - Cross-Validation Done Right — evaluate models without fooling yourself.