Data Cleaning

What

Fixing data quality issues before modeling. In practice, this is where most of the time goes.

Common issues and fixes

Missing values

# How much is missing?
df.isnull().sum()
 
# Options (each returns a new object; reassign to keep the result):
df = df.dropna()                                  # drop rows (if few missing)
df["col"] = df["col"].fillna(df["col"].median())  # fill numeric with median
df["col"] = df["col"].fillna(df["col"].mode()[0]) # fill categorical with mode
df["col"] = df["col"].fillna("unknown")           # explicit "unknown" category

Rule of thumb: if a column is more than ~50% missing, consider dropping the column; if a single row is missing many values, consider dropping the row; otherwise, impute.
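
The rule of thumb can be sketched with `isnull().mean()`, which gives the missing fraction per column (or per row with `axis=1`). The 50% thresholds and the toy DataFrame below are illustrative choices, not fixed rules:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, np.nan, 4],
    "b": [np.nan, np.nan, np.nan, 1],  # 75% missing -> drop the column
    "c": ["x", None, "y", "z"],
})

# Drop columns that are more than 50% missing
df = df.loc[:, df.isnull().mean() <= 0.5]

# Drop rows missing more than half of the remaining columns
df = df[df.isnull().mean(axis=1) <= 0.5]
```

The cells that survive can then be imputed with one of the fill strategies above.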

Duplicates

df.duplicated().sum()        # count duplicates
df.drop_duplicates()         # remove exact duplicates
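
Exact duplicates are only part of the story: often a record is a duplicate by key even when other columns differ. A sketch using the `subset` and `keep` parameters (the `user_id` column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "email": ["a@x.com", "a@x.com", "b@x.com", "c@x.com"],
})

n_dupes = df.duplicated().sum()  # exact duplicates: every column identical

# Key-based dedup: one row per user_id, keeping the last occurrence
df_unique = df.drop_duplicates(subset=["user_id"], keep="last")
```

If "last occurrence" should mean "most recent", sort by a timestamp column first.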

Wrong types

df["date"] = pd.to_datetime(df["date"])
df["price"] = pd.to_numeric(df["price"], errors="coerce")  # invalid → NaN
df["category"] = df["category"].astype("category")

Outliers

# Z-score method (nan_policy="omit" scores around missing values;
# rows that are NaN in "col" then fail the filter and are dropped)
from scipy import stats
z = stats.zscore(df["col"], nan_policy="omit")
df_clean = df[abs(z) < 3]
 
# IQR method
Q1, Q3 = df["col"].quantile([0.25, 0.75])
IQR = Q3 - Q1
df_clean = df[(df["col"] >= Q1 - 1.5*IQR) & (df["col"] <= Q3 + 1.5*IQR)]
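The IQR method on a concrete series, to show the arithmetic end to end (the values are made up; 100 is the planted outlier):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 100])

Q1, Q3 = s.quantile([0.25, 0.75])  # 11.0 and 12.5 here
IQR = Q3 - Q1                      # 1.5
mask = (s >= Q1 - 1.5 * IQR) & (s <= Q3 + 1.5 * IQR)
s_clean = s[mask]                  # 100 falls outside [8.75, 14.75]
```

The z-score method assumes roughly normal data; IQR is more robust because quantiles are barely affected by the extreme values being hunted.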

Inconsistent categories

df["col"] = df["col"].str.lower().str.strip()
df["col"] = df["col"].replace({"ny": "new york", "NYC": "new york"})