
TF-IDF

Bag-of-Words counts every word equally. In a corpus of cooking recipes, “the” appears in every document and tells you nothing. “saffron” appears in five and screams “Spanish paella”.

TF-IDF (Term Frequency – Inverse Document Frequency) is a recipe for boosting rare-but-present words and squashing universal ones. Two factors multiplied together — that’s all it is.

Figure (flowchart): TF (Term Frequency: how often does the word appear in THIS document?) and IDF (Inverse Document Frequency: how rare is the word across ALL documents?) both feed into TF × IDF, the weight of the word in the document.

TF-IDF is just two numbers multiplied. The art is in the choice of TF and IDF.

For a word t in a document d inside a corpus of N documents:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Term Frequency — how often t appears in d:

$$\mathrm{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total tokens in } d}$$

Inverse Document Frequency — how unusual t is across the corpus:

$$\mathrm{IDF}(t) = \log \frac{N}{1 + \text{number of docs with } t}$$

The log makes the curve gentle: with N = 1000, a word found in 1 document gets an IDF only about 2.7× larger than a word found in 100 documents, even though the raw ratios N / (1 + df) differ by a factor of about 50.
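
To make the damping effect concrete, here is a minimal sketch that evaluates the IDF formula above for the two cases in the example. The natural log is an assumption (the text does not name a base), and the variable names are only illustrative.

```python
import math

N = 1000  # corpus size used in the example

results = {}
for df in (1, 100):
    raw_ratio = N / (1 + df)      # what the weight would be without the log
    idf = math.log(N / (1 + df))  # IDF as defined above (natural log assumed)
    results[df] = (raw_ratio, idf)
    print(f"df={df:>3}  N/(1+df)={raw_ratio:7.1f}  IDF={idf:.2f}")

# How much larger the rare word's weight is, with and without the log.
print("raw gap:", round(results[1][0] / results[100][0], 1))
print("log gap:", round(results[1][1] / results[100][1], 1))
```

The raw ratios differ by a large factor, while the logged values stay within a small multiple of each other.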

Intuition: if a word is frequent in this document (high TF) but rare in the corpus (high IDF), it is probably what the document is about. That’s exactly what TF-IDF captures.

Three documents:

  • D1: “the cat sat on the mat”
  • D2: “the dog sat on the mat”
  • D3: “cats and dogs are pets”

Look at three words:

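| Word | TF in D1 | Docs containing it | IDF = log(3/(1+df)) | TF-IDF in D1 |
|------|----------|--------------------|---------------------|--------------|
| the | 2/6 = 0.33 | 2 of 3 | log(3/3) = 0.0 | 0.0 |
| cat | 1/6 = 0.17 | 1 of 3 | log(3/2) ≈ 0.41 | 0.068 |
| mat | 1/6 = 0.17 | 2 of 3 | log(3/3) = 0.0 | 0.0 |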

Observation: "the" and "mat" both get a TF-IDF of 0 in D1 because they appear in (almost) every document — they carry no information. Only "cat", the word that distinguishes D1 from the others, survives.

That is the whole point of TF-IDF.
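
The worked example can be reproduced with a few lines of plain Python. This is a sketch under simple assumptions: tokens are obtained by splitting on whitespace (so "cat" and "cats" are different terms), the log is natural, and the helper names `tf`, `df`, `idf` and `tf_idf` are made up for illustration.

```python
import math

docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the mat",
    "D3": "cats and dogs are pets",
}
tokens = {name: text.split() for name, text in docs.items()}
N = len(tokens)

def tf(term, doc_name):
    """Share of the document's tokens that are `term`."""
    doc_tokens = tokens[doc_name]
    return doc_tokens.count(term) / len(doc_tokens)

def df(term):
    """Number of documents that contain `term` at least once."""
    return sum(term in doc_tokens for doc_tokens in tokens.values())

def idf(term):
    """Inverse document frequency with the +1 in the denominator."""
    return math.log(N / (1 + df(term)))

def tf_idf(term, doc_name):
    return tf(term, doc_name) * idf(term)

for term in ("the", "cat", "mat"):
    print(term, round(tf(term, "D1"), 2), df(term),
          round(idf(term), 2), round(tf_idf(term, "D1"), 3))
```

One side effect of this particular formula: a term that occurs in all three documents would get log(3/4), which is negative. Smoothed variants of IDF avoid that.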

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat",
    "The dog sat on the mat",
    "Cats and dogs are pets",
]

# Drop English stop words and use unigrams plus bigrams as features.
vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
X = vec.fit_transform(docs)  # one row per document

print(vec.get_feature_names_out())  # the learned vocabulary
print(X.toarray().round(2))         # dense copy, for display only
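
A caveat when comparing these numbers with the hand calculation: according to the scikit-learn documentation, `TfidfVectorizer` with its default `smooth_idf=True` and `norm='l2'` uses ln((1 + N) / (1 + df)) + 1 as the IDF and then scales every row to unit length, so its weights will not match the textbook formula. The sketch below checks that reading against the fitted `idf_` attribute, on the same three documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The cat sat on the mat",
    "The dog sat on the mat",
    "Cats and dogs are pets",
]
vec = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
dense = vec.fit_transform(docs).toarray()

n_docs = len(docs)
# Document frequency per vocabulary term: in how many rows is the cell non-zero?
doc_freq = (dense > 0).sum(axis=0)

# Smoothed IDF as described in the scikit-learn docs (default smooth_idf=True).
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(np.allclose(manual_idf, vec.idf_))

# With the default norm='l2', every row has Euclidean length 1.
row_norms = np.linalg.norm(dense, axis=1)
print(row_norms.round(6))
```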

X is a sparse matrix where each row is a document and each non-zero cell is a TF-IDF weight (.toarray() only makes a dense copy for printing). Drop it into any classifier:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2)),
    ('model', LogisticRegression(max_iter=1000)),
])
# X_train is a list of raw text strings and y_train their labels (not defined here).
pipe.fit(X_train, y_train)

This pipeline is the 5-line baseline to beat in NLP. On many real-world text classification problems (support tickets, reviews, news), TF-IDF + Logistic Regression is competitive with fancier models, predicts in milliseconds, and is interpretable: every term gets a coefficient you can inspect.

| Parameter | What it does | Sensible default |
|-----------|--------------|------------------|
| stop_words | Drop common words | 'english' for English |
| ngram_range | Use bigrams / trigrams | (1, 2) |
| min_df | Drop terms in fewer than N docs | 2 (kills typos) |
| max_df | Drop terms in more than X% of docs | 0.95 (auto stop words) |
| max_features | Keep only top-K terms | 20_000 for memory |
| sublinear_tf | 1 + log(tf) instead of raw tf | True for long documents |
| norm | Length normalisation per row | 'l2' (default, good) |

min_df=2 and max_df=0.95 alone remove much of the corpus noise: one-off typos at the rare end and near-universal words at the common end.
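
The effect of the two thresholds is easy to observe on a small corpus. The four sentences below are invented for this sketch, and "suger" is a deliberate typo.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; "suger" is a deliberate typo.
corpus = [
    "the pasta recipe needs fresh basil",
    "the soup recipe needs saffron",
    "the cake recipe needs butter and suger",
    "the salad recipe uses fresh tomatoes",
]

full = TfidfVectorizer().fit(corpus)
pruned = TfidfVectorizer(min_df=2, max_df=0.95).fit(corpus)

kept = sorted(pruned.vocabulary_)
dropped = sorted(set(full.vocabulary_) - set(pruned.vocabulary_))

print(len(full.vocabulary_), "terms without pruning")
print("kept:", kept)
print("dropped:", dropped)
```

The typo disappears (it occurs in one document only), and so do "the" and "recipe" (they occur in every document). On a corpus this tiny, `min_df=2` also removes legitimate rare words such as "saffron", so these thresholds make more sense on corpora with many documents.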

How does a classical search engine rank documents for the query “cheap red wine”?

  1. Compute the TF-IDF vector of the query.
  2. Compute the TF-IDF vector of each document.
  3. Compute the cosine similarity between query and each document.
  4. Sort documents by similarity, return top-K.
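
from sklearn.metrics.pairwise import cosine_similarity

# Reuses `vec` and `X` fitted on the three example documents earlier.
query = "cat sleeping on a rug"
q_vec = vec.transform([query])                       # step 1
similarities = cosine_similarity(q_vec, X).ravel()   # step 3
print(sorted(enumerate(similarities), key=lambda x: -x[1]))  # step 4
# D1 (index 0) ranks first: it is the only document that contains "cat".
# D2 and D3 share no vocabulary term with the query and score 0.0.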

This is essentially what Lucene, Elasticsearch and the classical part of OpenSearch do; their default scoring today is BM25, a refinement of TF-IDF. Before modern vector search (Lesson 6), this was the way to do search.
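
The four steps can be wrapped into a small reusable function. This is a sketch, not how a production search engine is built: the four-item catalogue is hypothetical, and `search` is a made-up helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical product descriptions, for illustration only.
catalog = [
    "cheap red wine from Spain",
    "expensive red wine from France",
    "cheap white wine",
    "red running shoes",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(catalog)  # step 2: vectorise every document once

def search(query, top_k=3):
    """Return the top_k (document, score) pairs for the query."""
    query_vec = vectorizer.transform([query])                   # step 1
    scores = cosine_similarity(query_vec, doc_matrix).ravel()   # step 3
    best = np.argsort(scores)[::-1][:top_k]                     # step 4
    return [(catalog[i], float(scores[i])) for i in best]

results = search("cheap red wine")
for text, score in results:
    print(f"{score:.2f}  {text}")
```

The document vectors are computed once at indexing time; only the query is vectorised per search, using the vocabulary and IDF values learned from the catalogue.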

| Limitation | Example |
|------------|---------|
| No synonyms | “car” and “automobile” are unrelated vectors. |
| No paraphrase | “high price” and “expensive” have zero similarity. |
| No context | “bank” (river) and “bank” (money) share the same feature. |
| No order beyond n-grams | “man bites dog” and “dog bites man” get identical unigram vectors. |
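
Three of these limitations can be checked directly. The sketch below fits a default, unigram `TfidfVectorizer` on each pair of strings and measures their cosine similarity; `similarity` is a hypothetical helper written for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(a, b):
    """Cosine similarity of two texts under a unigram TF-IDF model fitted on both."""
    vectors = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

synonyms = similarity("car", "automobile")
paraphrase = similarity("high price", "expensive")
word_order = similarity("man bites dog", "dog bites man")

print("synonyms:  ", synonyms)
print("paraphrase:", paraphrase)
print("word order:", round(word_order, 6))
```

Texts with no shared token are orthogonal regardless of meaning, while texts with the same tokens in a different order are indistinguishable.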

To fix these, we need to learn vectors that understand meaning: word embeddings and Transformers. That’s the bridge to Course 2.

  • TF-IDF = TF × IDF: frequent in this doc, rare in the corpus.
  • It automatically down-weights common words, and max_df removes the most extreme ones without a manual stop-word list.
  • TfidfVectorizer + LogisticRegression is the 5-line baseline to beat.
  • Cosine similarity on TF-IDF vectors powers classical search.
  • TF-IDF cannot capture synonyms, paraphrase, or context — that’s what embeddings solve.

Next: Embeddings & Transformers — the bridge from classical NLP to modern LLMs.