Data Fundamentals Roadmap
Garbage in, garbage out. Most ML work is data work. A clean, well-understood dataset usually beats a fancier model trained on messy data.
Topics
- Loading and Inspecting Data — CSV, JSON, parquet, first look
- Data Cleaning — missing values, duplicates, outliers, types
- Exploratory Data Analysis — distributions, correlations, patterns
- Feature Engineering — creating useful inputs from raw data
- Feature Scaling — normalization, standardization, when and why
- Train-Test Split — why you must separate data, how to avoid leakage
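
As a taste of the first two topics, here is a minimal first-look-and-clean sketch. It uses only the standard library and a made-up inline CSV (`RAW`) in place of a real file; the helper names (`load_rows`, `missing_counts`, `drop_duplicates`, `parse_age`) are illustrative, not from any library.

```python
import csv
import io
from collections import Counter

# Hypothetical inline CSV standing in for a real file on disk.
RAW = """id,age,city
1,34,Lisbon
2,,Porto
2,,Porto
3,29,
4,abc,Faro
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def missing_counts(rows):
    """Count empty-string cells per column."""
    counts = Counter()
    for row in rows:
        for col, val in row.items():
            if val == "":
                counts[col] += 1
    return dict(counts)

def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row.items())
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def parse_age(value):
    """Return age as an int, or None when missing or not a number."""
    try:
        return int(value)
    except ValueError:
        return None

# First look: size, columns, missing values.
rows = load_rows(RAW)
print(len(rows), "rows; columns:", list(rows[0].keys()))
print("missing per column:", missing_counts(rows))

# Cleaning: remove exact duplicates, then fix the type of the age column.
clean = drop_duplicates(rows)
for row in clean:
    row["age"] = parse_age(row["age"])
print(len(clean), "rows after de-duplication")
```

With pandas, the usual equivalents would be `read_csv`, `isna().sum()`, and `drop_duplicates`; the point here is only the order of operations: look first, then clean.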
The workflow
raw data → inspect → clean → explore → engineer features → split → scale (fit on train only) → model
Every step feeds into the next. Shortcuts here create bugs that are invisible until production.
This isn’t a one-pass pipeline. You’ll loop: explore the data, realize a feature needs different cleaning, go back, re-explore. The diagram is linear but the real process is iterative. Expect to revisit earlier stages after you see model results.
Common pitfalls
These are subtle bugs: the model looks great during development, then fails in production.
| Pitfall | What happens | How to avoid |
|---|---|---|
| Data leakage | Information from test set leaks into training (e.g., fitting a scaler on full data before splitting) | Always split first, then fit transforms on train only |
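| Target leakage | A feature encodes the target or is only known after it (e.g., “treatment outcome” when predicting “will patient be treated”) | For each feature, ask whether it would be available at prediction time; drop any that are recorded after, or as a result of, the target |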
| Look-ahead bias | Using future data to predict the past (time-series: training on Tuesday to predict Monday) | Respect temporal order, use time-based splits |
| Class imbalance | 99% negative, 1% positive — model learns to always predict negative and gets 99% accuracy | Use stratified splits, appropriate metrics (F1, PR-AUC), resampling, or class weights |
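
The class-imbalance row can be checked with a few lines of arithmetic. This is a minimal sketch using made-up labels (10 positives in 1,000 examples, so 1% positive) and a "model" that always predicts negative; the helper functions are written out by hand for illustration rather than taken from a library.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Fraction of actual positives that were found (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical data: 1% positive, and a model that always says "negative".
y_true = [1] * 10 + [0] * 990
y_pred = [0] * len(y_true)

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("recall:  ", recall(tp, fn))
print("F1:      ", f1_score(tp, fp, fn))
```

Accuracy comes out at 0.99 while recall and F1 are both 0: the model never finds a single positive, which is why accuracy alone is the wrong metric here.
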
The worst part: all of these can produce excellent validation metrics for a model that is completely useless. Always sanity-check your results: if they look too good, something is probably leaking.
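For the scaler and look-ahead cases in the table, a leak-free ordering can be sketched as follows: split chronologically first, learn the scaling statistics from the training rows only, then reuse those statistics on the test rows. The data and the helper names (`time_split`, `fit_scaler`, `transform`) are made up for illustration, and only the standard library is used.

```python
from statistics import mean, pstdev

# Hypothetical daily readings as (day_index, value), already sorted by time.
values = [10, 14, 10, 14, 10, 14, 10, 14, 20, 22]
series = [(day, float(v)) for day, v in enumerate(values)]

def time_split(rows, n_test):
    """Split chronologically: the last n_test rows become the test set."""
    return rows[:-n_test], rows[-n_test:]

def fit_scaler(xs):
    """Learn mean and standard deviation from the given values only."""
    mu = mean(xs)
    sigma = pstdev(xs) or 1.0  # fall back to 1.0 if the values are constant
    return mu, sigma

def transform(xs, mu, sigma):
    """Standardize values using previously learned statistics."""
    return [(x - mu) / sigma for x in xs]

# 1. Split first, respecting time order.
train, test = time_split(series, n_test=2)
train_vals = [v for _, v in train]
test_vals = [v for _, v in test]

# 2. Fit on train only, then apply the same statistics to both sets.
mu, sigma = fit_scaler(train_vals)
train_scaled = transform(train_vals, mu, sigma)
test_scaled = transform(test_vals, mu, sigma)

# For contrast, the leaky variant: statistics computed over all rows,
# so the test values influence how the training data is scaled.
leaky_mu, leaky_sigma = fit_scaler(train_vals + test_vals)

print("train-only mean/std:", mu, sigma)
print("leaky mean/std:     ", leaky_mu, leaky_sigma)
print("scaled test values: ", test_scaled)
```

In scikit-learn the same pattern is usually written as fitting a `StandardScaler` on the training split and calling `transform` on the test split. The gap between the train-only and leaky statistics is exactly the information that would have leaked.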
Links
- Python for ML Roadmap — tools for working with data
- Machine Learning Roadmap — what you do after data is ready
- Train-Test Split — where most leakage bugs live