Word Embeddings

What

Dense vector representations of words where similar words have similar vectors. Instead of one-hot encoding (sparse, and carrying no notion of similarity), each word becomes a vector of 50–300 floats that captures semantic relationships.
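Similarity between dense vectors is usually measured with cosine similarity. A minimal sketch in plain Python — the three-dimensional vectors here are made-up toy values, not real embeddings:

```python
import math

def cosine(u, v):
    # cosine similarity: dot(u, v) / (|u| * |v|), in [-1, 1]
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# toy 3-d "embeddings" (real ones have 50-300 dims)
cat = [0.7, 0.5, 0.1]
dog = [0.6, 0.6, 0.2]
car = [0.1, 0.0, 0.9]

print(cosine(cat, dog))  # close to 1: similar words
print(cosine(cat, car))  # much lower: unrelated words
```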

Word2Vec

Two architectures, both trained by predicting words from their context:

  • CBOW (Continuous Bag of Words): predict center word from surrounding words. Faster, better for frequent words
  • Skip-gram: predict surrounding words from center word. Better for rare words, works well with small datasets

Both learn vectors where relationships are encoded as directions in vector space.

Key property: vector arithmetic

king - man + woman ≈ queen
paris - france + japan ≈ tokyo

The model learns these relationships purely from co-occurrence patterns in text — nobody labeled them.
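The arithmetic can be made concrete with hand-built toy vectors whose two dimensions stand for "royalty" and "maleness" — these numbers are invented for illustration; real embeddings distribute such features across hundreds of dimensions:

```python
import math

# toy 2-d vectors: [royalty, maleness] (invented values)
vecs = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.02, 0.03],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(a * a for a in v)))

# king - man + woman, component-wise
target = [k - m + w for k, m, w in
          zip(vecs["king"], vecs["man"], vecs["woman"])]

# nearest remaining word (excluding the three inputs) is "queen"
best = max((w for w in vecs if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```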

GloVe (Global Vectors)

Instead of sliding a window through text, GloVe builds a global word co-occurrence matrix and factorizes it. Combines the advantages of count-based methods (use global statistics) with prediction-based methods (learn dense vectors).
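The co-occurrence counts GloVe starts from can be sketched in a few lines of plain Python (simplified: GloVe additionally down-weights distant pairs by 1/distance, and the factorization step is omitted here):

```python
from collections import Counter

def cooccurrence(sentences, window=2):
    # counts[(w1, w2)] = times w2 appears within `window` words
    # of w1, aggregated over the whole corpus
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - window)
            hi = min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, sent[j])] += 1
    return counts

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = cooccurrence(corpus, window=1)
print(counts[("cat", "the")])  # 1
print(counts[("sat", "cat")])  # 1
```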

FastText

Extends Word2Vec by representing words as bags of character n-grams. “playing” = {pla, lay, ayi, yin, ing, play, layi, …}. This means it can generate vectors for words it has never seen before by combining subword pieces.
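Extracting those n-grams is straightforward. A simplified sketch matching the example above (the real FastText also wraps each word in boundary markers, so "playing" becomes "<playing>" before n-grams of length 3–6 are taken):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # all character n-grams of length n_min..n_max, plus the word itself
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    grams.add(word)
    return grams

print(sorted(g for g in char_ngrams("playing") if len(g) == 3))
# ['ayi', 'ing', 'lay', 'pla', 'yin']
```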

Pretrained vs training your own

  Approach                      When                                         Source
  Pretrained (GloVe, FastText)  General text, quick start                    6B–840B token corpora
  Train your own                Domain-specific vocabulary (medical, legal)  Your corpus
  Fine-tune pretrained          Best of both                                 Pretrained + your corpus

Python example

import gensim.downloader as api
 
# load pretrained GloVe vectors
model = api.load("glove-wiki-gigaword-100")  # 100-dim vectors
 
# find similar words
model.most_similar("python")  # [('java', 0.75), ('perl', 0.69), ...]
 
# vector arithmetic
model.most_similar(positive=["king", "woman"], negative=["man"])
# [('queen', 0.73), ...]

Limitations

  • Static: one vector per word regardless of context. “bank” (river) and “bank” (financial) share the same vector
  • Out-of-vocabulary: Word2Vec and GloVe can’t handle words not in training data (FastText partially solves this)
  • Superseded by contextual embeddings: models like BERT generate different vectors for the same word depending on context

Still worth understanding — contextual embeddings build on the same intuitions.