TF-IDF
The problem TF-IDF solves
Bag-of-Words counts every word equally. In a corpus of cooking recipes, “the” appears in every document and tells you nothing. “saffron” appears in five and screams “Spanish paella”.
TF-IDF (Term Frequency – Inverse Document Frequency) is a recipe for boosting rare-but-present words and squashing universal ones. Two factors multiplied together — that’s all it is.
    flowchart LR
        TF["TF<br/>(Term Frequency)<br/>How often does the word<br/>appear in THIS document?"]
        IDF["IDF<br/>(Inverse Document Frequency)<br/>How rare is the word<br/>across ALL documents?"]
        TF --> M["TF × IDF<br/>(weight of the word<br/>in the document)"]
        IDF --> M
        classDef factor fill:#dbeafe,stroke:#2563eb
        classDef out fill:#d1fae5,stroke:#047857
        TF:::factor
        IDF:::factor
        M:::out
The formula, no surprises
For a word t in a document d inside a corpus of N documents:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Term Frequency: how often t appears in d.

$$\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total tokens in } d}$$

Inverse Document Frequency: how unusual t is across the corpus.

$$\text{IDF}(t) = \log \frac{N}{1 + \text{number of docs with } t}$$

The log makes the curve gentle. Without it, a word that appears in 1 doc out of 1000 would get a weight about 50× bigger than one that appears in 100 of 1000 (500 vs. about 9.9). With it, the gap shrinks to under 3× (about 6.2 vs. 2.3, using the natural log).
Intuition: if a word is frequent in this document (high TF) but rare in the corpus (high IDF), it is probably what the document is about. That’s exactly what TF-IDF captures.
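To make the dampening effect of the log concrete, here is a minimal sketch of the IDF formula above in plain Python. The function name `idf` and the corpus size of 1000 are illustrative choices, and the natural log is an assumption (it is the base that matches the worked numbers in this lesson).

```python
import math

def idf(n_docs: int, doc_freq: int) -> float:
    """IDF as defined above: log(N / (1 + number of docs containing the term))."""
    return math.log(n_docs / (1 + doc_freq))

N = 1000
rare = idf(N, 1)      # term found in 1 document
common = idf(N, 100)  # term found in 100 documents

# Ratio of the raw quantities N / (1 + df), i.e. what the weights
# would look like without the log.
raw_ratio = (N / (1 + 1)) / (N / (1 + 100))

print(f"idf(rare)   = {rare:.2f}")
print(f"idf(common) = {common:.2f}")
print(f"ratio with log = {rare / common:.2f}, without log = {raw_ratio:.1f}")
```

The rare term still wins, but by a factor of a few rather than a factor of fifty, which keeps a single unusual word from dominating the whole vector.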
A worked example, by hand
Three documents:
- D1: “the cat sat on the mat”
- D2: “the dog sat on the mat”
- D3: “cats and dogs are pets”
Look at three words:
| Word | TF in D1 | Docs containing it | IDF = log(3/(1+df)) | TF-IDF in D1 |
|---|---|---|---|---|
| the | 2/6 = 0.33 | 2 of 3 | log(3/3) = 0.0 | 0.0 |
| cat | 1/6 = 0.17 | 1 of 3 | log(3/2) ≈ 0.41 | 0.068 |
| mat | 1/6 = 0.17 | 2 of 3 | log(3/3) = 0.0 | 0.0 |
Observation: "the" and "mat" both get a TF-IDF of 0 in D1 because they appear in (almost) every document — they carry no information. Only "cat", the word that distinguishes D1 from the others, survives.
That is the whole point of TF-IDF.
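The table can be checked with a few lines of plain Python. This is a sketch under two assumptions: tokens are obtained by splitting on whitespace (enough for this toy corpus), and the log is the natural log. The helper names `tf`, `idf` and `tf_idf` are illustrative, not a library API.

```python
import math

docs = {
    "D1": "the cat sat on the mat",
    "D2": "the dog sat on the mat",
    "D3": "cats and dogs are pets",
}

# Whitespace tokenisation: "cat" and "cats" stay distinct tokens, as in the table.
tokens = {name: text.split() for name, text in docs.items()}
n_docs = len(tokens)

def tf(term, doc_tokens):
    """Share of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term):
    """log(N / (1 + df)), where df is the number of documents containing `term`."""
    df = sum(1 for doc_tokens in tokens.values() if term in doc_tokens)
    return math.log(n_docs / (1 + df))

def tf_idf(term, doc_name):
    return tf(term, tokens[doc_name]) * idf(term)

scores = {term: tf_idf(term, "D1") for term in ("the", "cat", "mat")}
for term, score in scores.items():
    print(f"{term:>3}: {score:.3f}")
# the: 0.000
# cat: 0.068
# mat: 0.000
```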
TF-IDF in scikit-learn — one line
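    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "The cat sat on the mat",
        "The dog sat on the mat",
        "Cats and dogs are pets",
    ]

    # Drop English stop words and index both single words and bigrams
    vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    X = vec.fit_transform(docs)

    print(vec.get_feature_names_out())
    print(X.toarray().round(2))

`X` is a sparse matrix in which each row is a document and each cell is a TF-IDF weight; `toarray()` converts it to a dense array for printing. The numbers will not match the hand-computed table: by default scikit-learn uses raw counts for TF, a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalises each row. Drop it into any classifier: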
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # X_train: list of raw text strings, y_train: the matching labels (your own data)
    pipe = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2)),
        ('model', LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)

This pipeline is one of the strongest 5-line baselines in NLP. On many real-world text classification problems (support tickets, reviews, news), TF-IDF + Logistic Regression matches or beats fancier models, trains quickly on a CPU, predicts in milliseconds, and stays interpretable: each coefficient is the weight of one term.
Important parameters
| Parameter | What it does | Sensible default |
|---|---|---|
| `stop_words` | Drop common words | `'english'` for English |
| `ngram_range` | Use bigrams / trigrams | `(1, 2)` |
| `min_df` | Drop terms in fewer than N docs | `2` (kills typos) |
| `max_df` | Drop terms in more than X% of docs | `0.95` (auto stop words) |
| `max_features` | Keep only top-K terms | `20_000` for memory |
| `sublinear_tf` | `1 + log(tf)` instead of raw tf | `True` for long documents |
| `norm` | Length normalisation per row | `'l2'` (default, good) |
`min_df=2` and `max_df=0.95` alone remove much of the noise in a typical corpus: one-off typos at the rare end, near-universal filler words at the common end.
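What the two document-frequency cut-offs do can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's implementation: `prune_vocabulary` is a hypothetical helper, `min_df` is treated as an absolute document count and `max_df` as a fraction of the corpus, which is how the defaults in the table above are written.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=2, max_df=0.95):
    """Keep terms found in at least `min_df` docs and at most `max_df` of all docs."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it repeats inside it.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

corpus = [
    "the pasta needs garlic".split(),
    "the soup needs salt".split(),
    "the salad needs garlc".split(),  # "garlc" is a deliberate typo
    "the stew needs garlic".split(),
]

kept = prune_vocabulary(corpus)
print(kept)  # ['garlic']
```

"the" and "needs" occur in every document and fall to `max_df`; the typo and the other one-off words fall to `min_df`.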
TF-IDF is the engine of classical search
How does a classical search engine rank documents for the query “cheap red wine”?
- Compute the TF-IDF vector of the query.
- Compute the TF-IDF vector of each document.
- Compute the cosine similarity between query and each document.
- Sort documents by similarity, return top-K.
    from sklearn.metrics.pairwise import cosine_similarity

    query = "cat sleeping on a rug"
    # Reuse the vectorizer fitted on the documents; do not refit it on the query
    q_vec = vec.transform([query])
    similarities = cosine_similarity(q_vec, X).ravel()
    print(sorted(enumerate(similarities), key=lambda x: -x[1]))
    # D1 (index 0) ranks first: it is the only document sharing a term ("cat") with the query.
    # D2 and D3 score 0.0.

This is the idea behind Lucene, Elasticsearch and the classical (keyword) part of OpenSearch, which today score with BM25, a refinement of TF-IDF. Before modern vector search (Lesson 6), this was the way to do search.
What TF-IDF still cannot do
| Limitation | Example |
|---|---|
| No synonyms | “car” and “automobile” are unrelated vectors. |
| No paraphrase | “high price” and “expensive” have zero similarity. |
| No context | “bank” (river) and “bank” (money) share the same weight. |
| No order beyond n-grams | “man bites dog” ≈ “dog bites man”. |
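Two of these limitations are easy to reproduce with plain word-count vectors. The sketch below uses unigram counts and whitespace tokenisation, and the helper name `cosine` is illustrative. IDF weighting is left out on purpose: it rescales each word's dimension but cannot make two different words overlap, nor tell apart two sentences made of the same words.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the unigram count vectors of two strings."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[term] * vb[term] for term in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

synonyms = cosine("car", "automobile")
reordered = cosine("man bites dog", "dog bites man")

print(synonyms)             # 0.0: no shared word, so no similarity at all
print(round(reordered, 6))  # 1.0: same words, so the two vectors are identical
```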
To fix these, we need to learn vectors that understand meaning — word embeddings and Transformers. That’s the bridge to Course 2.
Key takeaways
- TF-IDF = TF × IDF: frequent in this doc, rare in the corpus.
- It automatically down-weights common words, so no manual stop-word list is needed for the most extreme cases (`max_df`).
- `TfidfVectorizer` + `LogisticRegression` is the 5-line baseline to beat.
- Cosine similarity on TF-IDF vectors powers classical search.
- TF-IDF cannot capture synonyms, paraphrase, or context; that’s what embeddings solve.
Next: Embeddings & Transformers — the bridge from classical NLP to modern LLMs.