Bag of Words and TF-IDF

Bag of Words (BoW)

Represent text as a vector of word counts. Ignores word order entirely.

from sklearn.feature_extraction.text import CountVectorizer
 
docs = ["I love cats", "I love dogs", "cats and dogs"]
vec = CountVectorizer()
X = vec.fit_transform(docs)
# Sparse matrix: each row is a document, each column is a word index
# Shape: (3, 4): 3 docs, 4 unique tokens (the default tokenizer keeps
# only tokens of 2+ characters, so "I" is dropped)

Mathematically: document d is represented as vector v where v_i = count of word i in d.

TF-IDF (Term Frequency - Inverse Document Frequency)

Like BoW, but weights words by how informative they are:

  • TF: how often a word appears in this document
  • IDF: how rare/distinctive the word is across all documents

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

IDF(t, D) = log(N / df(t)) + 1

where N = total documents, df(t) = number of documents containing term t.

Words that appear everywhere (“the”, “is”, “and”) get low IDF weight. Words unique to a document get high weight.

Variants

Sublinear TF: 1 + log(tf) instead of raw count — dampens the effect of high-frequency terms.

Smooth IDF: log((N + 1) / (df(t) + 1)) + 1 — acts as if one extra document containing every term were added, preventing division by zero for terms with df(t) = 0 (i.e. terms absent from the training corpus).
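A tiny numeric illustration of both variants (values chosen for illustration, natural log):

```python
import math

# Sublinear TF: a term repeated 100 times contributes ~5.6, not 100
tf = 100
print(1 + math.log(tf))                    # ≈ 5.605

# Smooth IDF: an unseen term (df = 0) still gets a finite, positive weight
N, df = 1000, 0
print(math.log((N + 1) / (df + 1)) + 1)    # ≈ 7.909
```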

BM25: The production-grade version used in search engines:

score(D, Q) = Σ IDF(qi) × (tf(qi, D) × (k1 + 1)) / (tf(qi, D) + k1 × (1 - b + b × |D|/avgdl))

where k1 (typically 1.2–2.0) controls term-frequency saturation, b (typically 0.75) controls document-length normalization, tf(qi, D) is the count of query term qi in D, and avgdl is the average document length over the corpus.
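A minimal BM25 sketch over pre-tokenized documents. Note the IDF term differs from the plain TF-IDF one above: common implementations (e.g. Lucene) use log((N - df + 0.5) / (df + 0.5) + 1), which stays non-negative; that variant is assumed here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against tokenized `query`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()
    for d in docs:
        df.update(set(d))                      # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for q in query:
            if tf[q] == 0:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            norm = 1 - b + b * len(d) / avgdl  # document-length normalization
            score += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * norm)
        scores.append(score)
    return scores
```

With query ["cats"], a document mentioning cats twice outranks one mentioning it once, and documents without it score zero.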

N-gram extension

BoW/TF-IDF can capture multi-word phrases via n-grams:

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
# "not good" as a bigram captures negation that unigrams miss

Be careful — n-gram vocabulary grows quickly. Use max_features or min_df to prune.
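A quick look at how fast the vocabulary grows (toy sentences, just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not good at all", "very good indeed"]
uni = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bi = CountVectorizer(ngram_range=(1, 2)).fit(docs)

print(len(uni.vocabulary_))          # 6 unigrams
print(len(bi.vocabulary_))           # 11: 6 unigrams + 5 bigrams
print("not good" in bi.vocabulary_)  # True: negation is now a feature
```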

Document frequency filtering

Reduce vocabulary size and noise:

TfidfVectorizer(
    max_features=10000,      # keep top N by frequency
    min_df=2,                # drop terms appearing in < 2 docs
    max_df=0.95,             # drop terms appearing in > 95% of docs
    ngram_range=(1, 2),      # unigrams + bigrams
    sublinear_tf=True,       # use 1 + log(tf)
    stop_words='english'     # remove common English words
)

Low-df (rare) terms are often typos, noise, or highly specific vocabulary; high-df (ubiquitous) terms are usually stopwords or generic domain language. Both ends carry little discriminative signal.

Implementation

Manual TF-IDF for understanding:

import math
 
def tf_idf(corpus):
    """corpus: list of documents, each given as a list of tokens."""
    N = len(corpus)
    # Step 1: document frequencies
    df = {}
    for doc in corpus:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1
    
    # Step 2: IDF
    idf = {t: math.log(N / df[t]) + 1 for t in df}
    
    # Step 3: TF-IDF vectors
    vectors = []
    for doc in corpus:
        tf = {}
        for token in doc:
            tf[token] = tf.get(token, 0) + 1
        # Normalize by doc length
        vec = {t: (count/len(doc)) * idf[t] for t, count in tf.items()}
        vectors.append(vec)
    return vectors
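A quick sanity check on a toy pre-tokenized corpus ("and" appears in only one document, so it carries the largest IDF; the function is repeated here so the snippet runs standalone):

```python
import math

def tf_idf(corpus):
    """Same routine as above, repeated for a standalone run."""
    N = len(corpus)
    df = {}
    for doc in corpus:
        for token in set(doc):
            df[token] = df.get(token, 0) + 1
    idf = {t: math.log(N / df[t]) + 1 for t in df}
    vectors = []
    for doc in corpus:
        tf = {}
        for token in doc:
            tf[token] = tf.get(token, 0) + 1
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors

corpus = [["love", "cats"], ["love", "dogs"], ["cats", "and", "dogs"]]
vectors = tf_idf(corpus)
# "and" in doc 2 (df = 1):  (1/3) × (ln 3 + 1)   ≈ 0.700
# "cats" in doc 0 (df = 2): (1/2) × (ln 1.5 + 1) ≈ 0.703 (shorter doc,
# so length normalization nearly cancels the lower IDF)
print(vectors[2]["and"], vectors[0]["cats"])
```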

When to use BoW/TF-IDF

Good for:

  • Quick baselines for text classification (spam detection, topic classification)
  • When you don’t have GPU or pretrained models
  • Classical ML pipelines (Naive Bayes, Logistic Regression, SVMs)
  • Large-scale information retrieval where sparse matrices are acceptable

Bad for:

  • Tasks requiring word order (“not good” vs “good not”)
  • Tasks requiring semantics (“happy” and “glad” have different BoW vectors but similar meaning)
  • Small datasets with large vocabularies

Rule of thumb: If a pretrained transformer gives similar performance, use the transformer. BoW/TF-IDF is a strong baseline when compute is constrained or interpretability matters.

BoW + ML classics

BoW combined with Naive Bayes or Logistic Regression is a surprisingly strong baseline:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
 
model = make_pipeline(
    TfidfVectorizer(max_features=10000, sublinear_tf=True),
    MultinomialNB()
)
model.fit(texts_train, labels_train)

MultinomialNB with TF-IDF is often the first thing to try for document classification before going to deep learning.
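Plugging in a toy dataset (hypothetical examples, standing in for texts_train and labels_train, just to make the pipeline above runnable end to end):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data for illustration
texts_train = [
    "win a free prize now", "free money click here",
    "meeting at noon tomorrow", "project status report attached",
]
labels_train = ["spam", "spam", "ham", "ham"]

model = make_pipeline(
    TfidfVectorizer(max_features=10000, sublinear_tf=True),
    MultinomialNB(),
)
model.fit(texts_train, labels_train)
print(model.predict(["free prize money"]))  # predicts "spam" on this toy data
```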

Limitations

  • No word order: “not good” and “good” have identical BoW vectors
  • No semantics: “happy” and “glad” are orthogonal vectors (cosine similarity of zero)
  • High-dimensional sparse vectors: memory inefficient, curse of dimensionality
  • No out-of-vocabulary handling: words not in training vocabulary are lost

For any task requiring semantic understanding, use Word Embeddings or Transformers.