Embeddings

What

Dense vector representations of discrete things (words, items, users). An embedding maps high-dimensional sparse data (one-hot vectors) into a low-dimensional continuous space where similar things sit close together. For a 10,000-word vocabulary, a one-hot vector has 10,000 entries with a single 1; its embedding might be just 256 learned floats.

Why it matters

  • “King” - “Man” + “Woman” ≈ “Queen”: arithmetic on word embeddings captures meaning (see the sketch after this list)
  • Similar words have similar vectors → models generalize across synonyms
  • Used everywhere: words, sentences, images, products, users, graph nodes
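
A minimal sketch of that arithmetic, using tiny made-up vectors (a real demo needs trained embeddings such as GloVe; the 4-dim values below are toy placeholders):

import torch
import torch.nn.functional as F

# Fabricated 4-dim "embeddings" for illustration only, not trained values
vec = {
    "king":  torch.tensor([0.9, 0.8, 0.1, 0.2]),
    "man":   torch.tensor([0.5, 0.1, 0.1, 0.3]),
    "woman": torch.tensor([0.5, 0.1, 0.9, 0.3]),
    "queen": torch.tensor([0.9, 0.8, 0.9, 0.2]),
}

# king - man + woman should land closest to queen
target = vec["king"] - vec["man"] + vec["woman"]
for word, v in vec.items():
    sim = F.cosine_similarity(target, v, dim=0).item()
    print(f"{word:>5}  cos = {sim:.3f}")  # "queen" scores highest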

Word embeddings

Method                 | How                                                        | Notes
Word2Vec               | Predict word from context (or vice versa)                 | Classic, fast to train
GloVe                  | Matrix factorization of the co-occurrence matrix          | Good quality, pretrained vectors available
FastText               | Word2Vec + subword information                             | Handles rare/misspelled words
Contextual (BERT, GPT) | Same word gets different embeddings in different contexts | State of the art
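
Training Word2Vec yourself takes a few lines with gensim (a sketch assuming gensim 4.x; the two-sentence corpus and the parameters are toy placeholders):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (placeholder data)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

# vector_size = embedding dim, window = context radius, min_count=1 keeps rare words
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["cat"].shape)         # (50,)
print(model.wv.most_similar("cat"))  # nearest neighbors by cosine similarity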

In PyTorch

import torch
import torch.nn as nn
 
# Embedding layer: lookup table of learnable vectors
embed = nn.Embedding(num_embeddings=10000, embedding_dim=256)
 
# Input: token IDs → Output: dense vectors
token_ids = torch.tensor([42, 7, 1337])
vectors = embed(token_ids)  # shape: (3, 256)
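
The lookup table starts out random and is trained by backpropagation along with the rest of the model. Passing padding_idx to nn.Embedding reserves one slot whose vector is initialized to zeros and receives no gradient updates, the usual way to handle padding tokens.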

Sentence/document embeddings

For comparing or searching text, embed entire sentences into a single vector:

from sentence_transformers import SentenceTransformer
 
# Good for learning/prototyping. For production, consider bge-m3 or all-mpnet-base-v2
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["Hello world", "Hi there"])
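
Continuing that example, the vectors drop straight into similarity comparison; the cos_sim helper in sentence_transformers.util computes it directly:

from sentence_transformers import util

# Cosine similarity between the two sentence vectors (1x1 tensor; higher = more similar)
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)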