Bag-of-Words
Why text needs to become numbers
Every ML model, from logistic regression to a Transformer, only knows how to do arithmetic on numbers. Text is not numbers. The first job of any NLP pipeline, before the model sees anything, is to turn a document into a vector.
The simplest answer: count the words.
The Bag-of-Words idea
Take three short documents:
- D1: “The cat sat on the mat.”
- D2: “The dog sat on the mat.”
- D3: “Cats and dogs are pets.”
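Build the vocabulary (every distinct token, after cleaning: lowercasing, stripping punctuation, and folding the plurals “cats” and “dogs” into “cat” and “dog”):

```text
[the, cat, sat, on, mat, dog, and, are, pets]
```

For each document, count how many times each vocabulary word appears: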
| Document | the | cat | sat | on | mat | dog | and | are | pets |
|---|---|---|---|---|---|---|---|---|---|
| D1 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| D2 | 2 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| D3 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Three documents → three vectors of length 9. Now any classifier can eat them.
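The table can be reproduced in a few lines of plain Python. This is a minimal sketch, not how any particular library does it: the `normalize` helper and its tiny `SINGULAR` lookup are hypothetical stand-ins for a real cleaning step, chosen only so that “cats” and “dogs” land in the same columns as “cat” and “dog”.

```python
import re
from collections import Counter

docs = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Cats and dogs are pets.",
]

# Hypothetical stand-in for a lemmatiser: just enough to merge the plurals
# that the table treats as the same word.
SINGULAR = {"cats": "cat", "dogs": "dog"}

def normalize(text):
    """Lowercase, keep alphabetic tokens only, and merge known plurals."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [SINGULAR.get(tok, tok) for tok in tokens]

tokenized = [normalize(doc) for doc in docs]

# Vocabulary in first-seen order, so the columns match the table.
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# One row of counts per document, one column per vocabulary word.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each printed row corresponds to one line of the table: D1, D2, then D3.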
```mermaid
flowchart LR
    C["Corpus<br/>of N documents"] --> V["Vocabulary<br/>(all distinct tokens)"]
    V --> M["N × V<br/>count matrix"]
    M --> Cl["Classifier<br/>(logreg, NB, SVM)"]
    classDef src fill:#fde68a,stroke:#c2410c
    classDef build fill:#dbeafe,stroke:#2563eb
    classDef out fill:#d1fae5,stroke:#047857
    C:::src
    V:::build
    M:::build
    Cl:::out
```
The name “bag” of words is literal: we throw away word order. “The cat ate the dog” and “The dog ate the cat” produce the same vector. That’s a limitation we’ll partly address below with n-grams.
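The order-blindness is easy to verify. The sketch below uses only the standard library; `bag` is a hypothetical helper name, and the tokenisation is a plain lowercase split.

```python
from collections import Counter

def bag(text):
    """Return word counts for a lowercased, whitespace-split sentence."""
    return Counter(text.lower().split())

a = bag("The cat ate the dog")
b = bag("The dog ate the cat")

# Both sentences contain the same words the same number of times,
# so their bags are equal even though the meanings are opposite.
print(a == b)  # True
```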
Using these vectors to find similar documents — cosine similarity by hand
Once each document is a vector, you can compute how similar two documents are by measuring the cosine of the angle between them. Cosine similarity is (A · B) / (‖A‖ × ‖B‖). For count vectors, whose entries are never negative, it ranges from 0 (orthogonal, no shared words) to 1 (same direction, as for identical vectors).
Using the three vectors from the table above:
| Pair | Dot product A · B | ‖A‖ | ‖B‖ | Cosine similarity | Interpretation |
|---|---|---|---|---|---|
| D1 vs D2 | 2·2 + 1·0 + 1·1 + 1·1 + 1·1 + 0·1 = 7 | √8 ≈ 2.83 | √8 ≈ 2.83 | 0.88 | Very similar — they share the, sat, on, mat |
| D1 vs D3 | 2·0 + 1·1 + 0 + 0 + 0 + 0 = 1 | √8 ≈ 2.83 | √5 ≈ 2.24 | 0.16 | Almost unrelated — only cat is shared |
| D2 vs D3 | 2·0 + 0 + 1·0 + 0 + 0 + 1·1 = 1 | √8 ≈ 2.83 | √5 ≈ 2.24 | 0.16 | Same — only dog is shared |
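The arithmetic in the table can be checked with a small standard-library sketch. The vectors are copied from the count table earlier in the lesson; `cosine` is an illustrative helper, not a library function.

```python
import math

# Rows of the count table. Columns: the, cat, sat, on, mat, dog, and, are, pets
d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [2, 0, 1, 1, 1, 1, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 1, 1, 1, 1]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

pairs = {"D1 vs D2": (d1, d2), "D1 vs D3": (d1, d3), "D2 vs D3": (d2, d3)}
for name, (a, b) in pairs.items():
    print(name, f"{cosine(a, b):.3f}")
```

The three values come out at roughly 0.875, 0.158 and 0.158, which round to the figures in the table.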
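```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = CountVectorizer()
X = vec.fit_transform([
    "The cat sat on the mat",
    "The dog sat on the mat",
    "Cats and dogs are pets",
])
print(cosine_similarity(X).round(2))
# Approximate output (0.875 may display as 0.87 or 0.88 after rounding):
# [[1.   0.88 0.  ]
#  [0.88 1.   0.  ]
#  [0.   0.   1.  ]]
# D3 scores 0 here, not the 0.16 from the table: CountVectorizer does not
# lemmatise, so "cats"/"dogs" are different tokens from "cat"/"dog".
```

Two takeaways: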
- Counting words turns text comparison into geometry. Many early search engines, duplicate detectors, and recommendation systems were built on this kind of vector comparison.
- D1 and D2 score 0.88, yet they differ in meaning (cat vs dog). BoW captures surface similarity, not semantic similarity. That gap is a major motivation for embeddings (Lesson 6).
Bag-of-Words in scikit-learn
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Cats and dogs are pets.",
]

vec = CountVectorizer(lowercase=True, stop_words='english')
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())
# ['cat' 'cats' 'dog' 'dogs' 'mat' 'pets' 'sat']

print(X.toarray())
# [[1 0 0 0 1 0 1]
#  [0 0 1 0 1 0 1]
#  [0 1 0 1 0 1 0]]
```

Two things to notice:

- `cat` and `cats` are different features. Run a lemmatiser (Lesson 3) before the vectoriser if you want them merged.
- `the`, `on`, `and`, `are` disappeared thanks to `stop_words='english'`.
X is a sparse matrix: most cells are 0. Real corpora have vocabularies of 50k+ words, while a single document contains at most a few hundred distinct words, so storing every 0 explicitly would be wasteful.
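To make the saving concrete, here is a rough sketch of the idea behind sparse storage. It uses a plain dictionary per row and is not how scipy's sparse formats are actually laid out; `to_sparse_row` is a hypothetical helper, and the dense rows are the ones from the `X.toarray()` output in this section.

```python
# Dense rows from the CountVectorizer example (3 documents x 7 features).
dense = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
]

def to_sparse_row(row):
    """Keep only the non-zero cells as {column_index: count}."""
    return {col: value for col, value in enumerate(row) if value != 0}

sparse = [to_sparse_row(row) for row in dense]

total_cells = sum(len(row) for row in dense)
stored_cells = sum(len(row) for row in sparse)

print(sparse)
print(f"stored {stored_cells} of {total_cells} cells")
```

Here 9 of 21 cells are stored. With a 50k-word vocabulary and documents of a few hundred distinct words, the stored fraction falls to roughly 1% or less.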
Fixing the “order doesn’t matter” problem — n-grams
A small but powerful trick: don’t count only single words, also count pairs (bigrams) or triplets (trigrams).

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vec.fit_transform(["not good at all"])
print(vec.get_feature_names_out())
# ['all' 'at' 'at all' 'good' 'good at' 'not' 'not good']
```

Now `"not good"` is a single feature. A classifier that has seen "not good" in negative reviews can pick up on it, something pure single-word BoW cannot do.
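Trade-off: adding bigrams inflates the vocabulary. In theory there are up to V² possible pairs; real corpora contain far fewer, but the feature count typically still grows many times over. Use min_df=2 to drop features that appear in fewer than 2 documents.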
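The growth, and the effect of a document-frequency cut-off, can be seen on the three example documents. The sketch below is hand-rolled with the standard library and only mirrors the idea behind `ngram_range=(1, 2)` and `min_df=2`. `ngrams` and `vocabulary` are hypothetical helper names, and the tokenisation is a simple whitespace split, not scikit-learn's tokenizer.

```python
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "cats and dogs are pets",
]

def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary(docs, max_n, min_df=1):
    """Collect 1..max_n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        tokens = doc.split()
        features = set()
        for n in range(1, max_n + 1):
            features.update(ngrams(tokens, n))
        doc_freq.update(features)  # each document counts a feature once
    return sorted(f for f, df in doc_freq.items() if df >= min_df)

unigrams = vocabulary(docs, max_n=1)
with_bigrams = vocabulary(docs, max_n=2)
pruned = vocabulary(docs, max_n=2, min_df=2)

print(len(unigrams), len(with_bigrams), len(pruned))  # 11 22 7
print(pruned)
```

On this tiny corpus the bigrams double the vocabulary from 11 to 22 features, and the cut-off keeps only the 7 that D1 and D2 share. Larger corpora show a much steeper increase, which is where pruning starts to matter.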
A complete classification example
End-to-end spam detection on a tiny toy dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_train = [
    "Win a free iPhone now click here",
    "URGENT: claim your prize today",
    "Hi mom, see you at dinner",
    "Meeting moved to 3pm thanks",
]
y_train = [1, 1, 0, 0]  # 1 = spam

pipe = Pipeline([
    ('bow', CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)

print(pipe.predict(["Free iPhone giveaway click", "see you tomorrow"]))
# [1 0]
```

With only four training examples the pipeline already labels these two messages as expected, though a dataset this small says nothing about generalisation. With 10k examples this approach is often competitive with much fancier models, and it trains in seconds.
What Bag-of-Words still gets wrong
| Problem | Example | Fix in next lesson |
|---|---|---|
| Common words dominate | "the" appears in every doc → high count, no signal | TF-IDF down-weights common words |
| All words weighted equally | "superb" and "product" count the same | TF-IDF boosts rare, distinctive words |
| Sparse, huge vocabulary | 50k-dim vectors for 200-word docs | Embeddings (Lesson 6) compress to dense vectors |
| No semantic similarity | "car" and "automobile" look unrelated | Embeddings (Lesson 6) put them near each other |
Bag-of-Words is the honest baseline. Try it first on any text problem. If you get 85% F1 with BoW + Logistic Regression, you may not need a Transformer.
Key takeaways
- Bag-of-Words = count the words, ignore the order.
- `CountVectorizer` turns documents into a sparse N × V matrix.
- Use n-grams to recover a bit of word order (`"not good"`).
- BoW is the baseline for any text-classification problem.
- It struggles with common-word dominance and no semantic similarity — TF-IDF and embeddings fix those.
Next: TF-IDF — weighing words by how informative they are.