Feature Engineering Cookbook

Goal: Practical, copy-paste-ready recipes for the most common feature transforms. Real data, real code, real results.

Prerequisites: Feature Engineering, Feature Scaling, Exploratory Data Analysis, Data Cleaning


Setup

We’ll use the Ames Housing dataset — rich enough to show all transforms:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
 
# Load Ames Housing
housing = fetch_openml(name="house_prices", as_frame=True)
df = housing.frame
print(f"Shape: {df.shape}")
print("Target: SalePrice")
df.head()

Baseline

Before any engineering — just throw numeric columns at a model:

numeric_cols = df.select_dtypes(include=[np.number]).columns.drop("SalePrice")
X_base = df[numeric_cols].fillna(0)
y = np.log1p(df["SalePrice"])  # log-transform target for normality
 
# Scale inside the CV loop so test-fold statistics never leak into the fit
baseline = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X_base, y, cv=5, scoring="r2")
print(f"Baseline R²: {baseline.mean():.4f} ± {baseline.std():.4f}")

Recipe 1: Log Transform Skewed Features

Log-transforming right-skewed features (long tail on the right) compresses their scale and helps linear models:

from scipy.stats import skew
 
# Find skewed numeric features
skewness = X_base.apply(skew).sort_values(ascending=False)
skewed = skewness[skewness > 1].index
print(f"Highly skewed features ({len(skewed)}):")
print(skewness[skewness > 1].head(10))
 
# Log-transform them
X_log = X_base.copy()
X_log[skewed] = np.log1p(X_base[skewed].clip(lower=0))
 
# Compare distributions
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
for ax, col in zip(axes[0], skewed[:3]):
    ax.hist(X_base[col], bins=50)
    ax.set_title(f"{col} (original, skew={skew(X_base[col]):.1f})")
for ax, col in zip(axes[1], skewed[:3]):
    ax.hist(X_log[col], bins=50)
    ax.set_title(f"{col} (log-transformed)")
plt.tight_layout()
plt.show()

When to use: Linear models, neural nets. Not needed for tree models.
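As a quick sanity check that `log1p` really pulls in a long right tail, here is a self-contained sketch on synthetic lognormal data (not the Ames columns):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0, sigma=1, size=10_000)  # heavily right-skewed

print(f"skew before: {skew(x):.2f}")            # strongly positive
print(f"skew after:  {skew(np.log1p(x)):.2f}")  # much closer to 0
```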


Recipe 2: Categorical Encoding

One-hot encoding (few categories)

# Identify categorical columns with reasonable cardinality
cat_cols = df.select_dtypes(include=["object", "category"]).columns
for col in cat_cols[:5]:
    print(f"{col}: {df[col].nunique()} unique values")
 
# One-hot encode low-cardinality features
low_card = [col for col in cat_cols if df[col].nunique() <= 10]
X_onehot = pd.get_dummies(df[low_card], drop_first=True)
print(f"One-hot features: {X_onehot.shape[1]}")

Ordinal encoding (ordered categories)

# Quality features have natural order
quality_map = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
quality_cols = ["ExterQual", "ExterCond", "BsmtQual", "KitchenQual", "HeatingQC"]
 
X_ordinal = pd.DataFrame()
for col in quality_cols:
    if col in df.columns:
        X_ordinal[col] = df[col].map(quality_map).fillna(0)
print(X_ordinal.head())

Target encoding (many categories)

def target_encode(df, col, target, smoothing=10):
    """Encode categorical with mean target value, smoothed toward global mean."""
    global_mean = target.mean()
    agg = target.groupby(df[col]).agg(["mean", "count"])
    smooth = (agg["count"] * agg["mean"] + smoothing * global_mean) / (agg["count"] + smoothing)
    return df[col].map(smooth).fillna(global_mean)
 
# Only do this on training data to avoid leakage!
X_train_df, X_test_df, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)
 
neighborhood_encoded = target_encode(X_train_df, "Neighborhood", y_train)
print(neighborhood_encoded.head())

Leakage warning: Target encoding uses the target — always fit on train only.
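One standard way to act on that warning is out-of-fold target encoding: each training row is encoded using statistics from the other folds only, so no row ever sees its own target. A minimal sketch on synthetic data (the helper name `target_encode_oof` is ours, not a library function; it reuses the same smoothing formula as `target_encode` above):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5, smoothing=10):
    """Out-of-fold target encoding: each row is encoded with smoothed
    category means computed from the other folds only."""
    global_mean = target.mean()
    encoded = pd.Series(global_mean, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        tr_col = df[col].iloc[train_idx]
        tr_y = target.iloc[train_idx]
        agg = tr_y.groupby(tr_col.values).agg(["mean", "count"])
        smooth = ((agg["count"] * agg["mean"] + smoothing * global_mean)
                  / (agg["count"] + smoothing))
        encoded.iloc[val_idx] = (df[col].iloc[val_idx]
                                 .map(smooth).fillna(global_mean).values)
    return encoded

# Tiny synthetic demo: category "A" carries a +2 signal
rng = np.random.default_rng(0)
demo = pd.DataFrame({"city": rng.choice(list("ABC"), size=100)})
y_demo = pd.Series(rng.normal(size=100)) + (demo["city"] == "A") * 2.0
enc = target_encode_oof(demo, "city", y_demo)
print(enc.head())
```

Because every row's encoding excludes its own fold, the result can be used directly as a training feature; at test time, fit the smoothed means once on the full training set.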


Recipe 3: Interaction Features

Combine features that have a multiplicative relationship:

X_eng = X_base.copy()
 
# Total area
X_eng["TotalSF"] = df["TotalBsmtSF"].fillna(0) + df["1stFlrSF"] + df["2ndFlrSF"]
 
# Total bathrooms
X_eng["TotalBath"] = (df["FullBath"] + 0.5 * df["HalfBath"] +
                       df["BsmtFullBath"].fillna(0) + 0.5 * df["BsmtHalfBath"].fillna(0))
 
# Quality × Size
X_eng["QualxSF"] = df["OverallQual"] * X_eng["TotalSF"]
 
# Age
X_eng["Age"] = df["YrSold"] - df["YearBuilt"]
X_eng["RemodAge"] = df["YrSold"] - df["YearRemodAdd"]
 
print(f"Features added: TotalSF, TotalBath, QualxSF, Age, RemodAge")

Recipe 4: Binning

Turn continuous features into categories when the relationship is non-linear:

# Age bins — value drops nonlinearly with age
# include_lowest keeps Age == 0 (built and sold the same year) in the first bin
X_eng["AgeBin"] = pd.cut(X_eng["Age"], bins=[0, 5, 15, 30, 60, 200],
                          labels=[0, 1, 2, 3, 4], include_lowest=True).astype(float)
 
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(X_eng["Age"], y, s=5, alpha=0.3)
axes[0].set_xlabel("Age"); axes[0].set_ylabel("log(SalePrice)")
axes[0].set_title("Continuous age vs price")
 
axes[1].scatter(X_eng["AgeBin"], y, s=5, alpha=0.3)
axes[1].set_xlabel("Age bin"); axes[1].set_ylabel("log(SalePrice)")
axes[1].set_title("Binned age vs price")
plt.tight_layout()
plt.show()

When to use: Linear models that can’t capture non-linearity. Trees don’t need binning.
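If you'd rather not hand-pick edges, `pd.qcut` derives them from the data's quantiles, so each bin holds roughly the same number of rows. A self-contained sketch on synthetic ages (not the Ames `Age` column):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series(rng.integers(0, 120, size=1_000))

# Five quantile bins: edges chosen so each bin gets ~20% of the rows
age_qbin = pd.qcut(age, q=5, labels=False)

print(age_qbin.value_counts().sort_index())
```

Quantile bins are a good default when you have no domain knowledge about where the breakpoints should sit; fixed edges (as above) are better when you do.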


Recipe 5: Missing Value Indicators

Sometimes the fact that a value is missing is itself informative:

# Columns with missing values
missing = df.isnull().sum()
missing_cols = missing[missing > 0].index
print(f"Columns with missing values: {len(missing_cols)}")
 
# Create indicator columns
for col in missing_cols:
    if col in X_eng.columns:
        X_eng[f"{col}_missing"] = df[col].isnull().astype(int)
 
# Example: GarageYrBlt missing means no garage
print(f"Missing GarageYrBlt → no garage. Mean price with garage: "
      f"{y[df['GarageYrBlt'].notna()].mean():.3f}, without: "
      f"{y[df['GarageYrBlt'].isna()].mean():.3f}")
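To generate these flags inside a pipeline rather than by hand, scikit-learn's `SimpleImputer(add_indicator=True)` appends one 0/1 indicator per column that had missing values at fit time. A sketch on a toy frame (the column names here are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X_toy = pd.DataFrame({"garage_yr": [1995.0, np.nan, 2003.0, np.nan],
                      "area": [850.0, 1200.0, np.nan, 900.0]})

# Median-impute, then append an indicator column per feature with missing values
imp = SimpleImputer(strategy="median", add_indicator=True)
X_imp = imp.fit_transform(X_toy)
print(X_imp.shape)  # (4, 4): two imputed columns + two missing indicators
```

Since the indicator columns are determined during `fit`, the same transform applies consistently to train and test data.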

Recipe 6: Polynomial Features (Targeted)

Don’t use PolynomialFeatures blindly — pick features where you expect non-linearity:

# OverallQual has a non-linear relationship with price
X_eng["OverallQual_sq"] = df["OverallQual"] ** 2
X_eng["GrLivArea_sq"] = df["GrLivArea"] ** 2
 
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(df["OverallQual"], y, s=5, alpha=0.3)
axes[0].set_title("Qual vs Price (quadratic?)")
axes[1].scatter(df["GrLivArea"], y, s=5, alpha=0.3)
axes[1].set_title("Living Area vs Price")
plt.tight_layout()
plt.show()
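The same targeted approach works with sklearn machinery: restrict `PolynomialFeatures` to the chosen columns via a `ColumnTransformer` and pass everything else through untouched. A sketch on a synthetic frame (column names are placeholders, not Ames columns):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "qual": rng.integers(1, 11, size=50).astype(float),
    "area": rng.uniform(500, 4000, size=50),
    "other": rng.normal(size=50),
})

# Degree-2 terms (x, x^2, and the cross term) for the two chosen columns only
ct = ColumnTransformer(
    [("poly", PolynomialFeatures(degree=2, include_bias=False), ["qual", "area"])],
    remainder="passthrough",
)
X_poly = ct.fit_transform(X)
print(X_poly.shape)  # (50, 6): qual, area, qual^2, qual*area, area^2, other
```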

Measure the Impact

# Combine all engineered features
X_final = pd.concat([X_eng, X_ordinal, X_onehot], axis=1).fillna(0)
X_final = X_final.select_dtypes(include=[np.number])
 
# Compare baseline vs engineered
# Fit the scaler inside each CV fold (no leakage)
scores_base = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X_base, y, cv=5, scoring="r2")
scores_eng = cross_val_score(make_pipeline(StandardScaler(), Ridge()), X_final, y, cv=5, scoring="r2")
 
print(f"Baseline R²:    {scores_base.mean():.4f} ± {scores_base.std():.4f}")
print(f"Engineered R²:  {scores_eng.mean():.4f} ± {scores_eng.std():.4f}")
print(f"Improvement:    {scores_eng.mean() - scores_base.mean():.4f}")

Feature Importance Sanity Check

Which features matter most?

model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_final, y)  # trees are scale-invariant, so no scaler needed
 
importances = pd.Series(model.feature_importances_, index=X_final.columns)
importances.nlargest(15).plot.barh(figsize=(10, 6))
plt.xlabel("Feature importance")
plt.title("Top 15 features")
plt.gca().invert_yaxis()
plt.show()
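Impurity-based importances like these can be biased toward features with many possible split points, so it's worth cross-checking with `sklearn.inspection.permutation_importance`. A self-contained sketch on synthetic data where only the first feature matters:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters

model = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```

If the two rankings disagree sharply on your engineered features, trust the permutation version.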

Exercises

  1. Ratio features: Create BathPerRoom = TotalBath / TotRmsAbvGrd. Do ratio features improve performance?

  2. Rare category grouping: For Neighborhood, group neighborhoods with < 20 houses into “Other”. Does this help or hurt?

  3. Temporal features: If you have a timestamp, extract day_of_week, month, is_weekend, quarter. Apply this to a time-series dataset.

  4. Automatic feature selection: After engineering 50+ features, use SelectKBest or LASSO to drop the worst. What’s the optimal feature count?


Next: 10 - Cross-Validation Done Right — evaluate models without fooling yourself.