Embeddings
What
Dense vector representations of discrete things (words, items, users). Maps high-dimensional sparse data (one-hot) into low-dimensional continuous space where similar things are close together.
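A minimal NumPy sketch of why this is "just a lookup table": multiplying a one-hot vector by an embedding matrix selects one row. The sizes and random matrix are toy values, not from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10_000, 256           # sparse one-hot space -> dense space
E = rng.normal(size=(vocab_size, dim))  # embedding matrix: one row per token

token_id = 42
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# one_hot @ E picks out row 42 -- an embedding layer performs this
# lookup directly, skipping the wasteful matrix multiply.
assert np.allclose(one_hot @ E, E[token_id])
```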
Why it matters
- “King” - “Man” + “Woman” ≈ “Queen” — arithmetic on word embeddings captures meaning
- Similar words have similar vectors → models generalize across synonyms
- Used everywhere: words, sentences, images, products, users, graph nodes
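The analogy arithmetic can be sketched with hand-made toy vectors. Real embeddings are learned and high-dimensional; here the three axes (royalty, gender, plurality) are invented so the effect is visible:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors with made-up axes: (royalty, gender, plurality)
king  = np.array([0.9,  0.8, 0.1])
man   = np.array([0.1,  0.8, 0.1])
woman = np.array([0.1, -0.8, 0.1])
queen = np.array([0.9, -0.8, 0.1])

result = king - man + woman
# The nearest vector (by cosine similarity) to king - man + woman is queen
sims = {w: cosine(result, v) for w, v in
        {"king": king, "man": man, "woman": woman, "queen": queen}.items()}
assert max(sims, key=sims.get) == "queen"
```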
Word embeddings
| Method | How | Notes |
|---|---|---|
| Word2Vec | Predict word from context (or vice versa) | Classic, fast to train |
| GloVe | Matrix factorization of co-occurrence matrix | Good quality, pretrained available |
| FastText | Word2Vec + subword information | Handles rare/misspelled words |
| Contextual (BERT, GPT) | Same word gets different embeddings in different contexts | State of the art |
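To make the "predict word from context" row concrete, here is a sketch of how skip-gram Word2Vec frames its training data: each center word is paired with every word inside a context window, and the embeddings are trained to make those pairs likely (the toy corpus and window size are illustrative):

```python
corpus = "the cat sat on the mat".split()
window = 2

# (center, context) training pairs: every word within `window`
# positions of the center word, excluding the center itself
pairs = [(corpus[i], corpus[j])
         for i in range(len(corpus))
         for j in range(max(0, i - window), min(len(corpus), i + window + 1))
         if i != j]

# e.g. ('cat', 'the'), ('cat', 'sat'), ('cat', 'on'), ...
```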
In PyTorch
```python
import torch
import torch.nn as nn

# Embedding layer: a lookup table of learnable vectors
embed = nn.Embedding(num_embeddings=10000, embedding_dim=256)

# Input: token IDs → Output: dense vectors
token_ids = torch.tensor([42, 7, 1337])
vectors = embed(token_ids)  # shape: (3, 256)
```
Sentence/document embeddings
For comparing or searching text, embed entire sentences into a single vector:
```python
from sentence_transformers import SentenceTransformer

# Good for learning/prototyping. For production, consider bge-m3 or all-mpnet-base-v2
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["Hello world", "Hi there"])  # shape: (2, 384)
```
Links
- Dot Product — similarity between embeddings
- Transformers — contextual embeddings
- NLP Roadmap