TF-IDF

The problem TF-IDF solves

Bag-of-Words counts every word equally. In a corpus of cooking recipes, “the” appears in every document and tells you nothing. “saffron” appears in five and screams “Spanish paella”.

TF-IDF (Term Frequency – Inverse Document Frequency) is a recipe for boosting rare-but-present words and squashing universal ones. Two factors multiplied together — that’s all it is.

flowchart LR
  TF["TF<br/>(Term Frequency)<br/>How often does the word<br/>appear in THIS document?"]
  IDF["IDF<br/>(Inverse Document Frequency)<br/>How rare is the word<br/>across ALL documents?"]
  TF --> M["TF × IDF<br/>(weight of the word<br/>in the document)"]
  IDF --> M
  classDef factor fill:#dbeafe,stroke:#2563eb
  classDef out fill:#d1fae5,stroke:#047857
  TF:::factor
  IDF:::factor
  M:::out

TF-IDF is just two numbers multiplied. The art is in the choice of TF and IDF.

The formula, no surprises

For a word t in a document d inside a corpus of N documents:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Term Frequency — how often t appears in d:

              count of t in d
TF(t, d) = ─────────────────────
            total tokens in d

Inverse Document Frequency — how unusual t is across the corpus:

                          N
IDF(t) = log ─────────────────────────────
              1 + number of docs with t

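The log makes the curve gentle. Without it, a word that appears in 1 document out of 1000 would weigh about 50 times more than one that appears in 100 of 1000 (500 vs. roughly 9.9); with the natural log, the gap shrinks to under 3 times (about 6.2 vs. 2.3).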
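To see the damping for more than one pair of values, here is a minimal sketch that prints the raw ratio N / (1 + df) next to its natural log for a hypothetical 1000-document corpus. The function names `raw_ratio` and `idf` are illustrative, not part of any library.

```python
import math

N = 1000  # hypothetical corpus size


def raw_ratio(df, n=N):
    """Inverse document frequency without the log: N / (1 + df)."""
    return n / (1 + df)


def idf(df, n=N):
    """IDF as defined in this lesson: natural log of N / (1 + df)."""
    return math.log(n / (1 + df))


# Compare the two scales for rare, medium and near-universal terms.
for df in (1, 10, 100, 999):
    print(f"df={df:>3}  raw={raw_ratio(df):8.2f}  idf={idf(df):5.2f}")
```

Under this definition a term found in 999 of 1000 documents gets an IDF of exactly 0, and a term found in all 1000 would get a slightly negative one. That is one reason libraries usually apply a smoothed variant of the formula.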

Intuition: if a word is frequent in this document (high TF) but rare in the corpus (high IDF), it is probably what the document is about. That’s exactly what TF-IDF captures.

A worked example, by hand

Three documents:

D1: “the cat sat on the mat”
D2: “the dog sat on the mat”
D3: “cats and dogs are pets”

Look at three words:

Word	TF in D1	Docs containing it	IDF = log(3/(1+df))	TF-IDF in D1
the	2/6 = 0.33	2 of 3	log(3/3) = 0.0	0.0
cat	1/6 = 0.17	1 of 3	log(3/2) ≈ 0.41	0.068
mat	1/6 = 0.17	2 of 3	log(3/3) = 0.0	0.0

Observation: "the" and "mat" both get a TF-IDF of 0 in D1. Each appears in two of the three documents, so with the +1 in the denominator the IDF is log(3/3) = 0 (natural log throughout). Only "cat", the word that distinguishes D1 from the others, survives.

That is the whole point of TF-IDF.
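The table can be reproduced in a few lines of plain Python. This is a sketch that follows the formulas of this lesson literally (whitespace tokenisation, natural log, +1 in the denominator); the helper names are made up for the example.

```python
import math

docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the mat",
    "D3": "cats and dogs are pets",
}
# Naive whitespace tokenisation: "cat" and "cats" stay different tokens.
tokenized = {name: text.split() for name, text in docs.items()}
N = len(tokenized)


def tf(term, tokens):
    """Share of the document's tokens that are `term`."""
    return tokens.count(term) / len(tokens)


def df(term):
    """Number of documents that contain `term` at least once."""
    return sum(term in tokens for tokens in tokenized.values())


def idf(term):
    """log(N / (1 + df)), as in the formula above."""
    return math.log(N / (1 + df(term)))


def tf_idf(term, doc_name):
    return tf(term, tokenized[doc_name]) * idf(term)


for term in ("the", "cat", "mat"):
    print(
        f"{term}: tf={tf(term, tokenized['D1']):.2f} df={df(term)} "
        f"idf={idf(term):.2f} tf-idf={tf_idf(term, 'D1'):.3f}"
    )
```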

TF-IDF in scikit-learn — one line

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat",
    "The dog sat on the mat",
    "Cats and dogs are pets",
]

vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())
print(X.toarray().round(2))

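The output is a sparse matrix (converted to a dense array above only for printing) where each row is a document and each cell is a TF-IDF weight. The numbers will not match the hand calculation: by default scikit-learn uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalises each row. Drop it into any classifier: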

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2)),
    ('model', LogisticRegression(max_iter=1000)),
])
# X_train: list of raw text strings; y_train: their labels (both assumed to exist)
pipe.fit(X_train, y_train)

This pipeline is a strong 5-line baseline. On many real-world text classification problems (support tickets, reviews, news), TF-IDF + Logistic Regression is competitive with far more complex models, trains quickly on a CPU, and is easy to interpret: each coefficient is the weight of one word or n-gram.

Important parameters

Parameter	What it does	Suggested value
`stop_words`	Drop common words	`'english'` for English
`ngram_range`	Range of n-gram lengths to extract	`(1, 2)` (unigrams + bigrams)
`min_df`	Drop terms that appear in fewer than N docs	`2` (kills typos)
`max_df`	Drop terms that appear in more than X% of docs	`0.95` (auto stop words)
`max_features`	Keep only the top-K terms by corpus frequency	`20_000` for memory
`sublinear_tf`	Use `1 + log(tf)` instead of raw tf	`True` for long documents
`norm`	Length normalisation per row	`'l2'` (the library default)

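In practice, min_df=2 and max_df=0.95 alone often remove most of the noise in a corpus: typos and one-off tokens at the rare end, near-universal words at the common end.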
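The effect of the two thresholds is easy to check on a small example. The four-sentence corpus below is invented for illustration, and "recieve" is a deliberate typo that occurs once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; "recieve" is a one-off typo.
corpus = [
    "please send the invoice today",
    "please send the report today",
    "please recieve the parcel today",
    "please check the invoice again",
]

baseline = TfidfVectorizer().fit(corpus)
pruned = TfidfVectorizer(min_df=2, max_df=0.95).fit(corpus)

kept = sorted(pruned.vocabulary_)
dropped = sorted(set(baseline.vocabulary_) - set(pruned.vocabulary_))

print("kept:   ", kept)
print("dropped:", dropped)
```

Here `max_df=0.95` removes "please" and "the" (present in every document), while `min_df=2` removes every term seen only once, including the typo.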

TF-IDF is the engine of classical search

How does a classical search engine rank documents for the query “cheap red wine”?

1. Compute the TF-IDF vector of the query.
2. Compute the TF-IDF vector of each document.
3. Compute the cosine similarity between the query and each document.
4. Sort documents by similarity and return the top K.

from sklearn.metrics.pairwise import cosine_similarity

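# `vec` and `X` come from the TfidfVectorizer example above.
query = "cat sleeping on a rug"
q_vec = vec.transform([query])
similarities = cosine_similarity(q_vec, X).ravel()
ranked = sorted(enumerate(similarities), key=lambda pair: -pair[1])
print([(i, round(float(s), 2)) for i, s in ranked])
# D1 (index 0) ranks first: it is the only document sharing a term ("cat") with the query.
# D2 and D3 share no terms with it, so they score 0.0.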

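Lucene, Elasticsearch and OpenSearch are built on the same idea. Their default scoring function today is BM25, a refinement of TF-IDF that adds term-frequency saturation and document-length normalisation. Before modern vector search (Lesson 6), this was the way to do search.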
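The four ranking steps can be wrapped into a small reusable function. The sketch below uses the "cheap red wine" query on an invented four-item catalogue; `catalog` and `search` are hypothetical names, and a production engine would use an inverted index rather than scoring every document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue to search over.
catalog = [
    "cheap red wine from Spain",
    "expensive vintage red wine",
    "cheap white wine for cooking",
    "craft beer sampler pack",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(catalog)  # step 2: one vector per document


def search(query, k=2):
    """Return the k best-matching (index, text) pairs for `query`."""
    q_vec = vectorizer.transform([query])                   # step 1: query vector
    scores = cosine_similarity(q_vec, doc_matrix).ravel()   # step 3: similarities
    top = np.argsort(-scores)[:k]                           # step 4: sort, keep top K
    return [(int(i), catalog[i]) for i in top]


print(search("cheap red wine"))
```

The query is transformed with the vectorizer that was fitted on the documents, so query and documents share one vocabulary and one set of IDF weights.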

What TF-IDF still cannot do

Limitation	Example
No synonyms	“car” and “automobile” are unrelated vectors.
No paraphrase	“high price” and “expensive” have zero similarity.
No context	“bank” (river) and “bank” (money) share a single feature and weight.
No order beyond n-grams	With unigrams, “man bites dog” and “dog bites man” get identical vectors.
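Two of these limitations can be checked directly. The helper below is a sketch written for this lesson (the name `tfidf_similarity` is made up); it fits a vectorizer on just the two texts being compared.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_similarity(a, b, **vectorizer_kwargs):
    """Cosine similarity of two texts under a TF-IDF model fitted on both."""
    X = TfidfVectorizer(**vectorizer_kwargs).fit_transform([a, b])
    return float(cosine_similarity(X[0], X[1])[0, 0])


# Synonyms: no shared tokens, so the vectors are orthogonal.
synonyms = tfidf_similarity("fast car", "quick automobile")

# Word order: with unigrams both sentences contain the same tokens.
order_unigrams = tfidf_similarity("man bites dog", "dog bites man")

# Adding bigrams recovers some order information.
order_bigrams = tfidf_similarity("man bites dog", "dog bites man", ngram_range=(1, 2))

print(synonyms, order_unigrams, order_bigrams)
```

The first pair scores zero despite meaning nearly the same thing, and the second pair is treated as identical until bigrams are added.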

To fix these, we need to learn vectors that understand meaning — word embeddings and Transformers. That’s the bridge to Course 2.

Key takeaways

TF-IDF = TF × IDF: frequent in this doc, rare in the corpus.
IDF automatically down-weights common words, and max_df drops the most extreme ones, so a manual stop-word list is often unnecessary.
TfidfVectorizer + LogisticRegression is the 5-line baseline to beat.
Cosine similarity on TF-IDF vectors powers classical search.
TF-IDF cannot capture synonyms, paraphrase, or context — that’s what embeddings solve.

Next: Embeddings & Transformers — the bridge from classical NLP to modern LLMs.