Exploratory Data Analysis
What
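EDA means getting to know your data through summary statistics and visualization before you model it. What you find guides the decisions that follow: how to split the data, which transforms to apply, which features to keep, and which models are worth trying.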
The checklist
1. Target variable
# Classification: is it balanced?
df["target"].value_counts(normalize=True)
# Regression: what's the distribution?
df["target"].hist(bins=50)
2. Feature distributions
import seaborn as sns
# Numeric features
df.hist(figsize=(12, 8), bins=30)
# Categorical features
for col in df.select_dtypes("object"):
    print(df[col].value_counts().head())
3. Relationships
# Correlation matrix (numeric columns only; df.corr() raises on string columns in pandas 2.x)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", center=0)
# Feature vs target
sns.boxplot(data=df, x="target", y="feature")
sns.scatterplot(data=df, x="feature1", y="feature2", hue="target")
4. Things to look for
- Imbalanced classes: need stratified splits, possibly oversampling
- Highly correlated features: largely redundant, and can make linear model coefficients unstable
- Skewed distributions: may benefit from a log transform (positive values only)
- Feature-target relationship: is it linear? nonlinear? no relationship?
- Clusters or groups: may suggest different populations in the data
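
Some of the items above can be checked numerically as well as by eye. The following is a minimal sketch, not a standard recipe: the helper name eda_flags, the thresholds (0.9 for correlation, 1.0 for skewness), and the synthetic demo frame are all assumptions made for illustration, and the class-share check only makes sense for a classification target.

```python
import numpy as np
import pandas as pd


def eda_flags(df, target="target", corr_threshold=0.9, skew_threshold=1.0):
    """Collect simple warning signs. Hypothetical helper; thresholds are illustrative."""
    # Numeric features only, with the target column left out.
    numeric = df.select_dtypes("number").drop(columns=[target], errors="ignore")

    # Imbalanced classes: share of the rarest class.
    class_share = df[target].value_counts(normalize=True)

    # Highly correlated features: visit each unordered pair once.
    corr = numeric.corr().abs()
    cols = list(corr.columns)
    correlated = [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > corr_threshold
    ]

    # Skewed distributions: sample skewness beyond the threshold in either direction.
    skew = numeric.skew()
    skewed = skew[skew.abs() > skew_threshold]

    return {
        "min_class_share": float(class_share.min()),
        "correlated_pairs": correlated,
        "skewed_features": skewed,
    }


# Synthetic data (assumed, not from any real dataset) that triggers each flag.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
demo = pd.DataFrame({
    "feature1": x,
    "feature2": 2 * x + rng.normal(scale=0.01, size=500),  # near-duplicate of feature1
    "feature3": rng.lognormal(size=500),                     # right-skewed
    "target": (rng.random(500) < 0.1).astype(int),           # roughly 10% positives
})

flags = eda_flags(demo)
print(flags["min_class_share"])
print(flags["correlated_pairs"])
print(flags["skewed_features"])

# A log transform (log1p here, so zeros are safe) should reduce the skew of feature3.
skew_before = demo["feature3"].skew()
skew_after = np.log1p(demo["feature3"]).skew()
print(skew_before, skew_after)
```

Treat the output as a list of things to go and plot, not as a verdict: a flagged pair or a skewed column still needs a look at the actual distribution before you drop or transform anything.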