Bag of Words and TF-IDF
Bag of Words (BoW)
Represent text as a vector of word counts. Ignores word order entirely.
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love cats", "I love dogs", "cats and dogs"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
# Sparse matrix: each row is a document, each column is a word index
# Shape: (3, 4): "I" is dropped by the default token pattern (tokens need
# 2+ characters), leaving 4 unique words: and, cats, dogs, love
```

Mathematically: document d is represented as vector v where v_i = count of word i in d.
TF-IDF (Term Frequency - Inverse Document Frequency)
Like BoW, but weights words by how informative they are:
- TF: how often a word appears in this document
- IDF: how rare/distinctive the word is across all documents
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
IDF(t, D) = log(N / df(t)) + 1
where N = total documents, df(t) = number of documents containing term t.
Words that appear everywhere (“the”, “is”, “and”) get low IDF weight. Words unique to a document get high weight.
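A quick worked example of the IDF formula above, with made-up document frequencies for a 100-document corpus:

```python
import math

def idf(N, df):
    # IDF(t, D) = log(N / df(t)) + 1, as defined above
    return math.log(N / df) + 1

# In a hypothetical 100-document corpus:
print(round(idf(100, 95), 3))  # a stopword in 95 docs -> ~1.051 (low weight)
print(round(idf(100, 2), 3))   # a rare term in 2 docs -> ~4.912 (high weight)
```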
Variants
Sublinear TF: 1 + log(tf) instead of raw count — dampens the effect of high-frequency terms.
Smooth IDF: log((N + 1) / (df(t) + 1)) + 1 — adds one to both counts, as if an extra document contained every term; this prevents division by zero for terms with df = 0, i.e. terms seen at transform time but not in the training corpus.
BM25: The production-grade version used in search engines:
score(D, Q) = Σ IDF(qi) × (tf(qi, D) × (k1 + 1)) / (tf(qi, D) + k1 × (1 − b + b × |D| / avgdl))
where k1 (typically 1.2–2.0) controls term-frequency saturation, b (typically 0.75) controls document-length normalization, |D| is the document's length, and avgdl is the average document length in the corpus.
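A minimal BM25 sketch of the formula above, assuming pre-tokenized documents. Note it reuses the IDF form defined earlier for consistency with this page; production implementations (e.g. Lucene) use a slightly different IDF:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    # corpus: list of tokenized documents; doc: the tokenized document to score
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue  # term absent from corpus contributes nothing
        idf = math.log(N / df) + 1  # IDF as defined on this page
        tf = doc.count(term)
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [["cats", "purr"], ["dogs", "bark"], ["cats", "and", "dogs"]]
print(bm25_score(["cats"], corpus[0], corpus))  # ≈ 1.50
print(bm25_score(["cats"], corpus[1], corpus))  # 0.0 — term absent from doc
```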
N-gram extension
BoW/TF-IDF can capture multi-word phrases via n-grams:
```python
vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
# "not good" as a bigram captures negation that unigrams miss
```

Be careful — n-gram vocabulary grows quickly. Use max_features or min_df to prune.
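A small demonstration of both points — bigrams capture negation as a unit, and the vocabulary grows sharply (the corpus here is made up):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "good movie", "not bad"]
uni = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(uni.vocabulary_))          # 6 unigrams
print(len(bi.vocabulary_))           # 11: vocabulary nearly doubles with bigrams
print("not good" in bi.vocabulary_)  # True: negation kept as a single feature
```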
Document frequency filtering
Reduce vocabulary size and noise:
```python
TfidfVectorizer(
    max_features=10000,   # keep top N by frequency
    min_df=2,             # drop terms appearing in < 2 docs
    max_df=0.95,          # drop terms appearing in > 95% of docs
    ngram_range=(1, 2),   # unigrams + bigrams
    sublinear_tf=True,    # use 1 + log(tf)
    stop_words='english'  # remove common English words
)
```

Low df (rare terms): may be typos, noise, or highly specific vocabulary. High df (ubiquitous terms): stopwords, generic domain language. Low signal.
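The effect of min_df on a toy corpus (documents invented for illustration): terms occurring in only one document are pruned from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "cats purr softly",
    "cats sleep softly",
    "dogs bark loudly",
    "dogs sleep",
]
# min_df=2 keeps only terms occurring in at least two documents;
# "purr", "bark", and "loudly" each appear once and are dropped
vec = TfidfVectorizer(min_df=2)
vec.fit(docs)
print(sorted(vec.vocabulary_))  # ['cats', 'dogs', 'sleep', 'softly']
```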
Implementation
Manual TF-IDF for understanding:
```python
import math

def tf_idf(corpus):
    """corpus: list of tokenized documents, e.g. [["i", "love", "cats"], ...]"""
    N = len(corpus)
    # Step 1: document frequencies
    df = {}
    for doc in corpus:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1
    # Step 2: IDF
    idf = {t: math.log(N / df[t]) + 1 for t in df}
    # Step 3: TF-IDF vectors
    vectors = []
    for doc in corpus:
        tf = {}
        for token in doc:
            tf[token] = tf.get(token, 0) + 1
        # Normalize by doc length
        vec = {t: (count / len(doc)) * idf[t] for t, count in tf.items()}
        vectors.append(vec)
    return vectors
```

When to use BoW/TF-IDF
Good for:
- Quick baselines for text classification (spam detection, topic classification)
- When you don’t have GPU or pretrained models
- Classical ML pipelines (Naive Bayes, Logistic Regression, SVMs)
- Large-scale information retrieval where sparse matrices are acceptable
Bad for:
- Tasks requiring word order (“not good” vs “good not”)
- Tasks requiring semantics (“happy” and “glad” have different BoW vectors but similar meaning)
- Small datasets with large vocabularies
Rule of thumb: If a pretrained transformer gives similar performance, use the transformer. BoW/TF-IDF is a strong baseline when compute is constrained or interpretability matters.
BoW + ML classics
BoW combined with Naive Bayes or Logistic Regression is a surprisingly strong baseline:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

model = make_pipeline(
    TfidfVectorizer(max_features=10000, sublinear_tf=True),
    MultinomialNB()
)
model.fit(texts_train, labels_train)
```

MultinomialNB with TF-IDF is often the first thing to try for document classification before going to deep learning.
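The same pipeline exercised end-to-end; texts_train and labels_train here are made-up stand-ins for a real labeled dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data
texts_train = [
    "win a free prize now", "free money click here",
    "meeting at noon tomorrow", "lunch with the team today",
]
labels_train = ["spam", "spam", "ham", "ham"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts_train, labels_train)
print(model.predict(["free prize money"])[0])  # 'spam'
```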
Limitations
- No word order: “not good” and “good” have identical BoW vectors
- No semantics: "happy" and "glad" are orthogonal vectors (cosine similarity 0), despite meaning nearly the same thing
- High-dimensional sparse vectors: memory inefficient, curse of dimensionality
- No out-of-vocabulary handling: words not in training vocabulary are lost
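The out-of-vocabulary limitation is easy to demonstrate: words unseen at fit time are silently dropped at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["cats and dogs"])  # vocabulary: and, cats, dogs

# "love" and "hamsters" were never seen during fit, so they vanish
X = vec.transform(["cats love hamsters"])
print(X.toarray())  # [[0 1 0]] -- only "cats" is counted
```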
For any task requiring semantic understanding, use Word Embeddings or Transformers.
Links
- Text Preprocessing
- Text Classification
- Word Embeddings — the dense alternative
- Tokenization — tokenization strategy affects vocabulary