Exploratory Data Analysis

What

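Understanding your data through summary statistics and visualization before modeling. What EDA reveals guides downstream decisions such as how to split the data, which features to transform, and which models to try.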

The checklist

1. Target variable

# Classification: is it balanced?
df["target"].value_counts(normalize=True)
 
# Regression: what's the distribution?
df["target"].hist(bins=50)
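
To put numbers on the two checks above, here is a minimal sketch. The toy DataFrame, the `price` column, and the variable names are assumptions made up for illustration; they are not from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a 90/10 binary target and a right-skewed numeric target.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "target": np.array([0] * 90 + [1] * 10),
    "price": rng.lognormal(mean=3.0, sigma=1.0, size=100),
})

# Classification: share of each class and the majority/minority ratio.
class_share = df["target"].value_counts(normalize=True)
imbalance_ratio = class_share.max() / class_share.min()

# Regression: sample skewness; a clearly positive value means a long right tail.
target_skew = df["price"].skew()

print(class_share)
print(f"imbalance ratio: {imbalance_ratio:.1f}")
print(f"skewness: {target_skew:.2f}")
```

What counts as "too imbalanced" or "too skewed" depends on the problem, so treat these numbers as prompts for a closer look rather than hard thresholds.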

2. Feature distributions

import seaborn as sns
 
# Numeric features
df.hist(figsize=(12, 8), bins=30)
 
# Categorical features (string and pandas "category" columns): most frequent values
for col in df.select_dtypes(include=["object", "category"]):
    print(df[col].value_counts().head())
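
Plots are easier to read alongside a compact numeric summary. The sketch below assumes a small made-up DataFrame (`age`, `income`, `city` are hypothetical columns) and adds skewness for numeric features plus cardinality and top-category share for categorical ones.

```python
import pandas as pd

# Hypothetical toy data with one extreme income value.
df = pd.DataFrame({
    "age": [22, 35, 41, 29, 63, 35],
    "income": [28_000, 52_000, 61_000, 39_000, 410_000, 50_000],
    "city": ["Oslo", "Oslo", "Bergen", "Oslo", "Bergen", "Tromso"],
})

# Numeric features: count, mean, std, quartiles, plus sample skewness.
numeric = df.select_dtypes(include="number")
numeric_summary = numeric.describe().T
numeric_summary["skew"] = numeric.skew()

# Categorical features: number of distinct values and share of the most common one.
categorical = df.select_dtypes(exclude="number")
cardinality = categorical.nunique()
top_share = categorical.apply(lambda s: s.value_counts(normalize=True).iloc[0])

print(numeric_summary)
print(cardinality)
print(top_share)
```

A very high cardinality or a top share close to 1 both suggest a column may carry little usable signal as-is.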

3. Relationships

# Correlation matrix
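# numeric_only=True skips non-numeric columns, which otherwise raise an error in pandas 2.0+
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", center=0)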
 
# Feature vs target
sns.boxplot(data=df, x="target", y="feature")
sns.scatterplot(data=df, x="feature1", y="feature2", hue="target")
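
With many features a heatmap gets hard to read, so it can help to list the most correlated pairs directly. This is one possible sketch on synthetic data; the 0.9 cutoff and the column names are illustrative assumptions, not recommendations.

```python
import numpy as np
import pandas as pd

# Synthetic data: feature2 is nearly a rescaled copy of feature1, feature3 is independent.
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
df = pd.DataFrame({
    "feature1": x,
    "feature2": 2 * x + rng.normal(scale=0.1, size=n),
    "feature3": rng.normal(size=n),
})

# Absolute Pearson correlations between numeric columns.
corr = df.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature_a, feature_b) -> correlation, strongest first.
pairs = upper.stack().dropna().sort_values(ascending=False)
high_pairs = pairs[pairs > 0.9]  # 0.9 is an arbitrary illustrative cutoff

print(high_pairs)
```

Pearson correlation only captures linear association, so a low value here does not rule out a nonlinear relationship.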

4. Things to look for

Imbalanced classes: use stratified splits so every split keeps the class ratio; possibly oversample the minority class
Highly correlated features: largely redundant, and can make linear-model coefficients unstable
Skewed distributions: a log transform may help for positive, right-skewed values
Feature-target relationship: is it linear, nonlinear, or absent?
Clusters or groups: may suggest different populations in the data
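
Two of the remedies above, a stratified split and a log transform, can be sketched with pandas and NumPy alone. The data, the `income` column, and the 20% test fraction are assumptions for illustration; libraries such as scikit-learn offer their own stratified splitting, which is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a right-skewed feature and a 90/10 binary target.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10.0, sigma=1.0, size=1000),
    "target": np.array([0] * 900 + [1] * 100),
})

# Stratified split: draw 20% of the rows within each class for the test set,
# so train and test both keep the original class ratio.
test = df.groupby("target").sample(frac=0.2, random_state=0)
train = df.drop(test.index)

# Log transform: log1p(x) = log(1 + x) compresses the long right tail
# and is defined at zero.
df["log_income"] = np.log1p(df["income"])

print(train["target"].mean(), test["target"].mean())
print(df[["income", "log_income"]].skew())
```

Splitting before fitting any transform that learns from the data (scaling, imputation) avoids leaking test information; a fixed function like `log1p` has no fitted parameters, so it can be applied to all rows.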
