The AI/ML Mind Map
Everything else is detail. This page is the thinking framework — the patterns that repeat across every model, every algorithm, every technique.
The Three Core Ideas
1. Learning is optimization
Every ML algorithm follows the same loop:
- Define a loss function (how wrong am I?)
- Compute the gradient (which direction reduces the loss?)
- Update the parameters (step in that direction)
- Repeat
Linear regression, neural networks, gradient boosting, RLHF — all are gradient descent on different loss functions with different parameterizations. The math is the same. The architecture changes.
When understanding any model, ask: what is the loss function, and what is being optimized?
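A minimal sketch of the loop, as plain-numpy gradient descent on a one-parameter least-squares fit (the data, learning rate, and step count are made up for illustration):

```python
import numpy as np

# Toy data: y = 3x + noise. The only parameter to learn is the slope w.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + 0.1 * rng.normal(size=50)

w = 0.0      # initial guess
lr = 0.1     # learning rate (step size)
for _ in range(100):
    pred = w * x
    loss = np.mean((pred - y) ** 2)      # 1. loss: how wrong am I?
    grad = np.mean(2 * (pred - y) * x)   # 2. gradient: which direction reduces it?
    w -= lr * grad                       # 3. step in that direction
                                         # 4. repeat

print(f"learned w = {w:.2f}")  # close to the true slope of 3
```

Swap in a different loss or a different parameterization and the loop is unchanged — which is the point of idea 1.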
2. All models trade bias against variance
Simple model (few parameters): high bias, low variance → underfits
Misses real patterns. Predictions are wrong but stable.
Complex model (many parameters): low bias, high variance → overfits
Memorizes noise. Predictions are accurate on training data, wild on new data.
The art: find the complexity that captures the signal without fitting the noise.
Regularization, dropout, early stopping, cross-validation, ensemble methods — all are techniques for navigating this tradeoff. They look different but solve the same problem.
When a model fails, ask: is it underfitting (need more capacity) or overfitting (need more constraint)?
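Both failure modes are easy to see with numpy polynomial fits on made-up quadratic data (degrees and noise level are arbitrary):

```python
import numpy as np

# Quadratic signal plus noise; fit polynomials of increasing degree.
rng = np.random.default_rng(1)
x_train = rng.uniform(-1, 1, 20)
y_train = x_train**2 + 0.1 * rng.normal(size=20)
x_test = rng.uniform(-1, 1, 200)
y_test = x_test**2 + 0.1 * rng.normal(size=200)

def train_test_mse(degree):
    coefs = np.polyfit(x_train, y_train, degree)
    train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train, test

for degree in (1, 2, 10):
    train, test = train_test_mse(degree)
    print(f"degree {degree:2d}: train MSE {train:.4f}, test MSE {test:.4f}")
# Typical outcome: degree 1 underfits (both errors high), degree 10 overfits
# (train error driven down, test error worse), degree 2 matches the signal.
```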
3. Representation is everything
Raw data → useful representation → simple model works
Bad representation: try to classify images from raw pixel values with logistic regression → fails
Good representation: extract features (edges, textures, shapes) → logistic regression works
The revolution: deep learning LEARNS the representation
Raw pixels → conv layers learn edges → deeper layers learn shapes → final layers learn objects
Raw text → embedding layers learn word meaning → attention learns relationships
Feature engineering (classical ML) and architecture design (deep learning) are both about finding the right representation. PCA, embeddings, attention, convolution — all are representation transformers.
When building any model, ask: does the model see the data in a form where the pattern is obvious?
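A toy version of the same point: a circle-shaped class defeats any threshold on a raw coordinate, but becomes trivial after one engineered feature (the data and thresholds below are invented for the sketch):

```python
import numpy as np

# Two classes: points inside vs outside the unit circle.
rng = np.random.default_rng(2)
pts = rng.uniform(-2, 2, size=(500, 2))
labels = pts[:, 0] ** 2 + pts[:, 1] ** 2 < 1.0

# Raw representation: the best single threshold on the x coordinate
# still misclassifies a large chunk of points.
best_axis_acc = max(
    np.mean((pts[:, 0] < t) == labels) for t in np.linspace(-2, 2, 81)
)

# Better representation: one engineered feature, the squared radius.
r2 = (pts ** 2).sum(axis=1)
radius_acc = np.mean((r2 < 1.0) == labels)  # a simple threshold now separates perfectly

print(best_axis_acc, radius_acc)
```

Same linear decision rule in both cases; only the representation changed.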
The Universal ML Pipeline
Every ML project — from Kaggle competition to production system — follows this structure:
UNDERSTAND → PREPARE → MODEL → EVALUATE → DEPLOY → MONITOR
1. UNDERSTAND the problem
- What are you predicting? Why?
- What data exists? What's missing?
- What would a human expert do?
- What's the baseline? (always start with the simplest possible approach)
2. PREPARE the data
- Explore: distributions, correlations, anomalies
- Clean: missing values, outliers, inconsistencies
- Engineer features: create useful inputs from raw data
- Split: train/validation/test (NEVER leak between them)
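The split step, sketched with plain index arithmetic (the 60/20/20 sizes and data are illustrative; time series would need a chronological split instead):

```python
import numpy as np

# 100 rows of made-up data; carve out a 60/20/20 train/val/test split.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

idx = rng.permutation(len(X))   # shuffle once, up front
train_idx = idx[:60]
val_idx = idx[60:80]
test_idx = idx[80:]

# The "never leak" invariant: no row appears in more than one split.
assert not set(train_idx) & set(val_idx)
assert not set(train_idx) & set(test_idx)
assert not set(val_idx) & set(test_idx)
```

Note the invariant applies to preprocessing too: fit scalers and encoders on `train_idx` rows only, then apply them to the other splits.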
3. MODEL
- Start simple (linear model, decision tree) → establish baseline
- Increase complexity only if needed (random forest → gradient boosting → neural net)
- Tune hyperparameters (but only after the approach is right)
4. EVALUATE honestly
- Right metric for the problem (accuracy is usually wrong for imbalanced data)
- Cross-validation (not a single split)
- Test set touched ONCE at the very end
- Sanity check: does the model make sense? Feature importances reasonable?
5. DEPLOY (if applicable)
- Model serving (API, batch, edge)
- Monitoring for drift
6. MONITOR
- Data drift: is the input distribution changing?
- Concept drift: is the relationship between inputs and outputs changing?
- Performance degradation: retrain when metrics drop
The Five Types of ML Problems
Every ML task maps to one of these. Recognize the type and you know which techniques apply.
1. Classification (predict a category)
Input: features → Output: class label
Binary: spam/not-spam, fraud/legitimate
Multi-class: digit 0-9, animal species
Multi-label: image tags (can have multiple)
Loss: cross-entropy
Metrics: accuracy, precision, recall, F1, AUC-ROC
Models: logistic regression → random forest → gradient boosting → neural net
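The binary case of cross-entropy fits in a few lines of numpy (the labels and probabilities below are made up):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of the true labels under predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
confident_right = np.array([0.9, 0.1, 0.8, 0.95])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.05])
print(binary_cross_entropy(y, confident_right))  # small
print(binary_cross_entropy(y, confident_wrong))  # large: confident mistakes are punished hard
```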
2. Regression (predict a number)
Input: features → Output: continuous value
House price, temperature, stock price, age
Loss: MSE, MAE, Huber
Metrics: RMSE, MAE, R²
Models: linear regression → random forest → gradient boosting → neural net
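The three losses side by side on made-up predictions with one outlier, to show why the choice matters:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 100.0])   # last point is an outlier
y_pred = np.array([1.1, 1.9, 3.2, 10.0])

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))

def huber(y, p, delta=1.0):
    # Quadratic near zero, linear in the tails: MSE's smoothness, MAE's robustness.
    r = np.abs(y - p)
    return np.mean(np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)))

print(mse, mae, huber(y_true, y_pred))
# The outlier dominates MSE far more than MAE or Huber.
```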
3. Sequence modeling (predict next in sequence)
Input: sequence → Output: next element or transformed sequence
Language modeling, translation, speech recognition, time series
Loss: cross-entropy (per token), MSE (regression)
Models: RNN/LSTM (legacy) → Transformer (current standard)
Key: attention mechanism
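The attention mechanism itself is small enough to sketch in numpy (shapes and inputs are arbitrary; real implementations add masking, multiple heads, and learned projections):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # similarity of each query to each key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted mix of values

rng = np.random.default_rng(3)
seq_len, d = 4, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))
V = rng.normal(size=(seq_len, d))
out = attention(Q, K, V)
print(out.shape)  # one context-mixed vector per position
```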
4. Representation learning (learn a useful embedding)
Input: raw data → Output: dense vector in meaningful space
Word embeddings, image features, speaker embeddings, sentence encodings
Loss: contrastive (similar things close, different things far)
Models: Word2Vec, BERT, CLIP, autoencoders, SimCLR
Application: similarity search, transfer learning, clustering
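A contrastive loss in the InfoNCE style can be sketched as follows (the embeddings, temperature, and batch size are invented; in practice an encoder network produces the vectors):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Contrastive loss: each anchor should be most similar to its own positive
    and dissimilar to everyone else's (the in-batch negatives)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature            # cosine similarities, sharpened
    logits -= logits.max(axis=1, keepdims=True)
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))     # correct pair sits on the diagonal

rng = np.random.default_rng(4)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z + 0.01 * rng.normal(size=(8, 16)))   # positives match anchors
shuffled = info_nce(z, rng.normal(size=(8, 16)))             # positives unrelated
print(aligned, shuffled)  # aligned pairs give a much lower loss
```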
5. Generation (create new data)
Input: noise / prompt / condition → Output: new data (image, text, audio)
Text generation, image synthesis, music, voice cloning
Loss: varies (adversarial, diffusion, autoregressive likelihood)
Models: GPT (autoregressive), Diffusion (denoising), GAN (adversarial)
Key challenge: quality + diversity + controllability
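Autoregressive generation in miniature: sample a token, feed it back in, repeat. The bigram "model" below is a hand-written probability table, purely for illustration; a real language model would produce these probabilities from the whole context:

```python
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "."]
# transition[i][j] = P(next token = j | current token = i)
transition = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # <s>  -> the
    [0.0, 0.0, 0.9, 0.1, 0.0],   # the  -> cat (mostly)
    [0.0, 0.0, 0.0, 0.9, 0.1],   # cat  -> sat (mostly)
    [0.1, 0.4, 0.0, 0.0, 0.5],   # sat  -> "." half the time
    [0.0, 0.0, 0.0, 0.0, 1.0],   # "."  -> stop
])

rng = np.random.default_rng(5)
tokens = [0]                         # start token
while tokens[-1] != 4 and len(tokens) < 10:
    probs = transition[tokens[-1]]
    tokens.append(rng.choice(len(vocab), p=probs))  # sample, then feed it back in
print(" ".join(vocab[t] for t in tokens[1:]))
```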
The Recurring Patterns
Pattern: Compression is understanding
A model that predicts well has learned to compress the data.
Compression = discarding irrelevant information, keeping structure.
PCA compresses by finding principal directions.
Autoencoders compress through a bottleneck.
Language models compress by predicting the next token.
Neural nets compress by learning hierarchical features.
If you can compress it, you understand it.
If you can predict it, you've captured its structure.
Pattern: The unreasonable effectiveness of simple baselines
Before building a complex model:
- Classification: what does "predict the majority class" give you?
- Regression: what does "predict the mean" give you?
- NLP: what does TF-IDF + logistic regression give you?
- Vision: what does a pretrained ResNet give you?
Often: 80% of the final performance with 5% of the complexity.
The gap between baseline and SOTA is where you decide if complexity is worth it.
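The first two baselines take one line each (the class imbalance and data below are invented):

```python
import numpy as np

# Imbalanced labels: roughly 90% negative.
rng = np.random.default_rng(6)
y = (rng.random(1000) < 0.1).astype(int)

majority = np.bincount(y).argmax()
baseline_acc = np.mean(y == majority)
print(baseline_acc)  # around 0.9 -- any real model must beat this to matter

# Regression analogue: always predict the mean.
y_reg = rng.normal(5.0, 2.0, size=1000)
baseline_rmse = np.sqrt(np.mean((y_reg - y_reg.mean()) ** 2))  # equals the std of y
print(baseline_rmse)
```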
Pattern: Regularization is everywhere (just in different disguises)
L1/L2 penalties on weights = explicit regularization
Dropout = implicit regularization (ensemble of subnetworks)
Early stopping = regularization by limiting training
Data augmentation = regularization by expanding apparent data
Batch normalization = regularization via minibatch noise
Smaller model = regularization by limiting capacity
Ensemble methods = regularization by averaging
All do the same thing: prevent the model from fitting noise.
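One concrete instance: ridge regression, where the L2 penalty visibly shrinks the weights on noise-only features (the data and penalty strength are made up):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 20))                    # 20 features, but only one matters
y = X[:, 0] * 2.0 + 0.5 * rng.normal(size=30)

def ridge(X, y, lam):
    # Closed form: w = (X^T X + lam * I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_free = ridge(X, y, 0.0)
w_reg = ridge(X, y, 10.0)
# Unregularized weights spread mass over the 19 noise features;
# the penalty suppresses them while keeping most of the true signal.
print(np.abs(w_free[1:]).sum(), np.abs(w_reg[1:]).sum())
```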
Pattern: More data beats better algorithms
A simple model on lots of data usually beats a complex model on little data.
Scaling laws (Kaplan et al.): loss falls as a power law as data, compute, and parameters grow.
This is why foundation models (trained on internet-scale data) are so powerful — and why fine-tuning beats training from scratch.
Pattern: The feature importance hierarchy
In tabular data: feature engineering > model choice > hyperparameter tuning
In NLP: pretraining data > architecture > fine-tuning > prompting
In vision: data quality > augmentation > architecture > training tricks
In all domains: data quality > everything else
Pattern: Everything is a vector
Words → vectors (embeddings)
Images → vectors (CNN features, CLIP embeddings)
Audio → vectors (speaker embeddings, MFCCs)
Users → vectors (collaborative filtering)
Graphs → vectors (node embeddings)
Once everything is a vector, the same math works everywhere:
cosine similarity, nearest neighbors, clustering, classification
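Once items are vectors, nearest-neighbor search is the same few lines whatever the domain (the "embeddings" below are random stand-ins for real word/image/user vectors):

```python
import numpy as np

def cosine_sim(query, B):
    """Cosine similarity between one query vector and each row of B."""
    q = query / np.linalg.norm(query)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ q

rng = np.random.default_rng(8)
items = rng.normal(size=(100, 32))               # 100 items embedded in 32 dims
query = items[17] + 0.05 * rng.normal(size=32)   # something close to item 17

sims = cosine_sim(query, items)
print(int(np.argmax(sims)))  # finds item 17 -- same math for words, images, or users
```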
The Meta-Questions
When studying ANY ML topic, always ask:
- What is the loss function? (what is being optimized)
- What is the inductive bias? (what assumptions does the architecture encode)
- What would the simplest baseline be? (before getting fancy)
- Where could data leak? (train/test contamination)
- What representation does the model learn? (look at embeddings/features)
- What fails at scale? (data size, latency, cost)
- What does the model NOT capture? (limitations, failure modes)
Map to the Vault
| Pattern | Where to study it |
|---|---|
| Optimization / gradient descent | Gradient Descent, Loss Functions, Optimizers, Backpropagation |
| Bias-variance tradeoff | Bias-Variance Tradeoff, Regularization, Cross-Validation |
| Representation learning | Embeddings, Transfer Learning, Autoencoders, PCA |
| The ML pipeline | Data Fundamentals Roadmap, Feature Engineering, Train-Test Split |
| Classification | Logistic Regression, Random Forests, Gradient Boosting |
| Sequence modeling | Transformers, Attention Mechanism, Recurrent Neural Networks |
| Generation | Diffusion Models, Text Generation, Image Generation |
| Scaling / foundation models | Transfer Learning, Fine-Tuning LLMs, Language Models |
| Everything is a vector | Embeddings, Dot Product, Cosine Similarity and Distance Metrics |
| Math foundations | Math Roadmap, Gradient, Probability Basics, Bayes Theorem |