Word Embeddings
What
Dense vector representations of words where similar words have similar vectors. Unlike one-hot encoding (sparse, and every pair of words equally dissimilar), each word becomes a vector of 50-300 floats that captures semantic relationships.
Word2Vec
Two architectures, both trained on a word-prediction task over a sliding context window:
- CBOW (Continuous Bag of Words): predict center word from surrounding words. Faster, better for frequent words
- Skip-gram: predict surrounding words from center word. Better for rare words, works well with small datasets
Both learn vectors where relationships are encoded as directions in vector space.
Key property: vector arithmetic
king - man + woman ≈ queen
paris - france + japan ≈ tokyo
The model learns these relationships purely from co-occurrence patterns in text — nobody labeled them.
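The arithmetic itself is just vector addition plus a nearest-neighbor search by cosine similarity. A toy sketch with hand-made 3-d vectors (purely illustrative; real embeddings are learned, not designed):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-crafted toy vectors: dim 0 ~ "royalty", dim 1 ~ "male", dim 2 ~ "female"
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "apple": np.array([0.0, 0.1, 0.1]),
}

# king - man + woman, then find the nearest remaining word
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # queen
```

Libraries like gensim wrap exactly this logic behind `most_similar(positive=..., negative=...)`.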
GloVe (Global Vectors)
Instead of sliding a window through text, GloVe builds a global word co-occurrence matrix and factorizes it. Combines the advantages of count-based methods (use global statistics) with prediction-based methods (learn dense vectors).
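The co-occurrence matrix GloVe starts from is simple to build. A minimal counting sketch (window size and corpus are illustrative; GloVe additionally weights pairs by distance and then factorizes the matrix):

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Count (word, context_word) pairs within a symmetric window."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(w, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence(tokens, window=1)
print(counts[("cat", "sat")])  # 1
```

In practice this is accumulated over the whole corpus once, which is what lets GloVe use global statistics instead of individual window samples.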
FastText
Extends Word2Vec by representing words as bags of character n-grams. “playing” = {pla, lay, ayi, yin, ing, play, layi, …}. This means it can generate vectors for words it has never seen before by combining subword pieces.
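The subword decomposition can be sketched in a few lines. A simplified version (real FastText wraps the word in `<` `>` boundary markers and defaults to n = 3..6):

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Simplified FastText-style character n-grams, without boundary markers."""
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

print(sorted(char_ngrams("playing")))
# ['ayi', 'ayin', 'ing', 'lay', 'layi', 'pla', 'play', 'yin', 'ying']
```

A word's vector is the sum of its n-gram vectors, which is why an unseen word still gets a sensible vector as long as its pieces were seen.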
Pretrained vs training your own
| Approach | When | Source |
|---|---|---|
| Pretrained (GloVe, FastText) | General text, quick start | 6B-840B token corpora |
| Train your own | Domain-specific vocabulary (medical, legal) | Your corpus |
| Fine-tune pretrained | Best of both | Pretrained + your corpus |
Python example

```python
import gensim.downloader as api

# load pretrained GloVe vectors (downloads on first use)
model = api.load("glove-wiki-gigaword-100")  # 100-dim vectors

# find similar words
model.most_similar("python")  # [('java', 0.75), ('perl', 0.69), ...]

# vector arithmetic
model.most_similar(positive=["king", "woman"], negative=["man"])
# [('queen', 0.73), ...]
```

Limitations
- Static: one vector per word regardless of context. “bank” (river) and “bank” (financial) share the same vector
- Out-of-vocabulary: Word2Vec and GloVe can’t handle words not in training data (FastText partially solves this)
- Superseded by contextual embeddings: models like BERT generate different vectors for the same word depending on context
Still worth understanding — contextual embeddings build on the same intuitions.
Links
- Embeddings — the general concept beyond words
- Text Preprocessing — what happens before embedding
- Bag of Words and TF-IDF — the sparse representations embeddings replace
- BERT and Masked Language Models — contextual embeddings that supersede static ones