Exploratory Data Analysis

What

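EDA means getting to know your data through summary statistics and visualization before you model it. What you find guides downstream decisions: how to split the data, which features to transform or drop, and which models are worth trying.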

The checklist

1. Target variable

# Assumes df is a pandas DataFrame with a "target" column.

# Classification: is it balanced? (shows the fraction of rows per class)
df["target"].value_counts(normalize=True)

# Regression: what's the distribution?
df["target"].hist(bins=50)
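
If the class fractions turn out uneven, a stratified split keeps them the same in every subset. The following is a minimal pandas-only sketch under stated assumptions: the `toy` DataFrame, its 90/10 class ratio, and the 20% test fraction are all made up for illustration, and a library splitter may be more convenient in practice.

```python
import pandas as pd

# Toy imbalanced data (made up for illustration): 90% class 0, 10% class 1
toy = pd.DataFrame({"feature": range(100), "target": [0] * 90 + [1] * 10})

# Draw 20% of the rows *within each class* for the test set,
# so both subsets keep the original class proportions.
test = toy.groupby("target", group_keys=False).sample(frac=0.2, random_state=0)
train = toy.drop(test.index)

print(train["target"].value_counts(normalize=True))
print(test["target"].value_counts(normalize=True))
```

Sampling per class rather than over the whole table is what prevents a rare class from landing entirely in one subset by chance.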

2. Feature distributions

import seaborn as sns

# Numeric features: one histogram per numeric column
df.hist(figsize=(12, 8), bins=30)

# Categorical features: most frequent values in each column
for col in df.select_dtypes(include=["object", "category"]):
    print(df[col].value_counts().head())

3. Relationships

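# Correlation matrix (numeric columns only; in pandas 2.0+, df.corr() raises
# an error on non-numeric columns unless numeric_only=True is passed)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", center=0)

# Feature vs target ("feature", "feature1", "feature2" are placeholder column names)
sns.boxplot(data=df, x="target", y="feature")
sns.scatterplot(data=df, x="feature1", y="feature2", hue="target")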
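
Plots show the shape of a relationship; a rough numeric companion is to compare Pearson correlation (linear association) with Spearman correlation (monotonic association) for each feature against the target. The sketch below runs on made-up data: the column names `lin`, `expo`, and `noise` are hypothetical, and it assumes pandas 1.5 or later for the `numeric_only` argument.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
rng = np.random.default_rng(0)
t = rng.uniform(0, 3, size=300)
toy = pd.DataFrame({
    "target": t,
    "lin": 2 * t + rng.normal(scale=0.1, size=300),  # linear in the target
    "expo": np.exp(3 * t),                           # monotonic but nonlinear
    "noise": rng.normal(size=300),                   # unrelated to the target
})

# Correlation of every feature with the target, under both methods
summary = pd.DataFrame({
    "pearson": toy.corr(numeric_only=True)["target"].drop("target"),
    "spearman": toy.corr(method="spearman", numeric_only=True)["target"].drop("target"),
})
print(summary.round(2))
```

A feature whose Spearman value is clearly higher than its Pearson value, like `expo` here, points to a monotonic but nonlinear relationship. Both values near zero rule out a monotonic trend, though a non-monotonic relationship could still be present, so the plots remain necessary.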

4. Things to look for

  • Imbalanced classes: use stratified splits, and consider oversampling the minority class
  • Highly correlated features: largely redundant, and can make coefficient estimates in linear models unstable
  • Skewed distributions: a log transform may help when values are non-negative and right-skewed
  • Feature-target relationship: is it linear, nonlinear, or absent?
  • Clusters or groups: may indicate distinct populations in the data
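
Two of the checks above, correlated features and skewed distributions, are easy to screen for programmatically. The helpers below are a hedged sketch, not a standard API: `correlated_pairs`, `skewed_columns`, both thresholds (0.9 and 1.0), and the toy data are assumptions chosen for illustration and should be tuned per dataset.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.corr(numeric_only=True)
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only: each pair once
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


def skewed_columns(df: pd.DataFrame, threshold: float = 1.0) -> pd.Series:
    """Return the skewness of numeric columns whose |skew| exceeds threshold."""
    skew = df.select_dtypes("number").skew()
    return skew[skew.abs() > threshold]


# Toy data (made up for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
toy = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + rng.normal(scale=0.01, size=500),      # nearly redundant with x
    "income": rng.lognormal(mean=0.0, sigma=1.0, size=500),  # right-skewed
    "noise": rng.normal(size=500),
})

print(correlated_pairs(toy))
print(skewed_columns(toy))

# log1p tolerates zeros but assumes the column is non-negative
toy["income_log"] = np.log1p(toy["income"])
print(toy[["income", "income_log"]].skew())
```

The last line compares skewness before and after the transform, which is a quick way to confirm the log actually helped before committing to it.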