Bag-of-Words

Why text needs to become numbers

Every ML model — from logistic regression to a Transformer — only knows how to multiply numbers. Text is not numbers. The whole job of NLP, before the model, is to turn a document into a vector.

The simplest answer: count the words.

The Bag-of-Words idea

Take three short documents:

D1: “The cat sat on the mat.”
D2: “The dog sat on the mat.”
D3: “Cats and dogs are pets.”

Build the vocabulary (every distinct token after cleaning: lowercased, punctuation stripped, and the plurals cats and dogs reduced to cat and dog):

[the, cat, sat, on, mat, dog, and, are, pets]

For each document, count how many times each vocabulary word appears:

Document	the	cat	sat	on	mat	dog	and	are	pets
D1	2	1	1	1	1	0	0	0	0
D2	2	0	1	1	1	1	0	0	0
D3	0	1	0	0	0	1	1	1	1

Three documents → three vectors of length 9. Now any classifier can eat them.
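The counting step can be sketched in plain Python. This is a minimal illustration rather than how a production vectoriser works: the `SINGULAR` map and the `tokenize` and `bag_of_words` helpers are hypothetical names, and the plural map is a hand-written stand-in for real lemmatisation.

```python
import re
from collections import Counter

# Hypothetical hand-written plural map standing in for a real lemmatiser.
SINGULAR = {"cats": "cat", "dogs": "dog"}

def tokenize(text):
    """Lowercase, keep alphabetic runs, and apply the tiny plural map."""
    words = re.findall(r"[a-z]+", text.lower())
    return [SINGULAR.get(w, w) for w in words]

def bag_of_words(docs):
    """Return (vocabulary, count vectors); vocabulary is in first-seen order."""
    tokenized = [tokenize(d) for d in docs]
    vocab = []
    for tokens in tokenized:
        for t in tokens:
            if t not in vocab:
                vocab.append(t)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

docs = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Cats and dogs are pets.",
]
vocab, vectors = bag_of_words(docs)

print(vocab)
# ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'and', 'are', 'pets']
for v in vectors:
    print(v)
# [2, 1, 1, 1, 1, 0, 0, 0, 0]
# [2, 0, 1, 1, 1, 1, 0, 0, 0]
# [0, 1, 0, 0, 0, 1, 1, 1, 1]
```

The three printed rows are the rows of the count table.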

flowchart LR
  C["Corpus<br/>of N documents"] --> V["Vocabulary<br/>(all distinct tokens)"]
  V --> M["N × V<br/>count matrix"]
  M --> Cl["Classifier<br/>(logreg, NB, SVM)"]
  classDef src fill:#fde68a,stroke:#c2410c
  classDef build fill:#dbeafe,stroke:#2563eb
  classDef out fill:#d1fae5,stroke:#047857
  C:::src
  V:::build
  M:::build
  Cl:::out

Bag-of-Words in one diagram — count tokens, drop them in a big matrix, train a model.

The name “bag” of words is literal: we throw away word order. “The cat ate the dog” and “The dog ate the cat” produce the same vector. That’s a limitation that n-grams, covered below, partly address.
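A quick way to see the order loss, as a sketch using only the standard library. The `bag` helper is a hypothetical name, and splitting on whitespace is a simplifying assumption.

```python
from collections import Counter

def bag(text):
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

a = bag("The cat ate the dog")
b = bag("The dog ate the cat")

print(a == b)  # True: both bags hold exactly the same counts
print(a["the"], a["cat"], a["dog"], a["ate"])  # 2 1 1 1
```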

Using these vectors to find similar documents — cosine similarity by hand

Once each document is a vector, you can compute how similar two documents are by measuring the cosine of the angle between them. Cosine similarity is (A · B) / (||A|| × ||B||). Because count vectors have no negative entries, it ranges from 0 (orthogonal, no shared words) to 1 (vectors pointing in the same direction).

Using the three vectors from the table above:

Pair	Dot product `A · B`	`‖A‖`	`‖B‖`	Cosine similarity	Interpretation
D1 vs D2	2·2 + 1·0 + 1·1 + 1·1 + 1·1 + 0·1 = 7	√8 ≈ 2.83	√8 ≈ 2.83	0.88	Very similar — they share `the, sat, on, mat`
D1 vs D3	2·0 + 1·1 + 0 + 0 + 0 + 0 = 1	√8 ≈ 2.83	√5 ≈ 2.24	0.16	Almost unrelated — only `cat` is shared
D2 vs D3	2·0 + 0 + 1·0 + 0 + 0 + 1·1 = 1	√8 ≈ 2.83	√5 ≈ 2.24	0.16	Same — only `dog` is shared
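The table's arithmetic can be checked with a few lines of standard-library Python. This is a sketch: the `cosine` helper is a hypothetical name, and the vectors are copied by hand from the count table.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rows of the count table; columns: the, cat, sat, on, mat, dog, and, are, pets
d1 = [2, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [2, 0, 1, 1, 1, 1, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 1, 1, 1, 1]

print(f"{cosine(d1, d2):.3f}")  # 0.875
print(f"{cosine(d1, d3):.3f}")  # 0.158
print(f"{cosine(d2, d3):.3f}")  # 0.158
```

Rounded to two decimals, these are the 0.88 and 0.16 shown in the table.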

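from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# D3 is hand-lemmatised ("Cats" -> "cat", "dogs" -> "dog") to match the table's
# vocabulary. CountVectorizer does not lemmatise, so the raw sentence would share
# no tokens with D1 or D2 and both of its similarities would be 0.
vec = CountVectorizer()
X = vec.fit_transform([
    "The cat sat on the mat",
    "The dog sat on the mat",
    "cat and dog are pets",
])
print(cosine_similarity(X).round(2))
# The diagonal is 1. D1 vs D2 is 7/8 = 0.875 (shown as 0.88 in the table);
# D1 vs D3 and D2 vs D3 are both about 0.16.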

Two takeaways:

Counting words turns text comparison into geometry. Many early search engines, duplicate detectors, and recommendation systems were built on this kind of vector comparison.
D1 and D2 score 0.88, yet they describe different animals (cat vs dog). BoW captures surface similarity, not semantic similarity. Closing that gap is a main motivation for embeddings (Lesson 6).

Bag-of-Words in scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Cats and dogs are pets.",
]

vec = CountVectorizer(lowercase=True, stop_words='english')
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())
# ['cat' 'cats' 'dog' 'dogs' 'mat' 'pets' 'sat']

print(X.toarray())
# [[1 0 0 0 1 0 1]
#  [0 0 1 0 1 0 1]
#  [0 1 0 1 0 1 0]]

Two things to notice:

cat and cats are different features. Run a lemmatiser (Lesson 3) before the vectoriser if you want them merged.
the, on, and, are disappeared thanks to stop_words='english'.

X is a sparse matrix: most cells are 0. Real corpora have vocabularies of 50k+ words, while a single document contains at most a few hundred of them, so storing every 0 explicitly would be wasteful.
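To see how sparse the matrix is, compare the number of stored entries with the total number of cells. This is a small sketch on the same three documents; `density` is just an illustrative variable name.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Cats and dogs are pets.",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

n_docs, n_features = X.shape
density = X.nnz / (n_docs * n_features)  # fraction of cells that are non-zero

print(X.shape)           # (3, 7)
print(X.nnz)             # 9 stored non-zero counts out of 21 cells
print(f"{density:.2f}")  # 0.43
```

Even this toy matrix is more than half zeros. On a real corpus the density is typically far lower, which is why only the non-zero entries are stored.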

Fixing the “order doesn’t matter” problem — n-grams

A small but powerful trick: don’t count only single words; also count adjacent pairs (bigrams) or triplets (trigrams).

vec = CountVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
X = vec.fit_transform(["not good at all"])
print(vec.get_feature_names_out())
# ['all' 'at' 'at all' 'good' 'good at' 'not' 'not good']

Now "not good" is a single feature. A classifier that has seen "not good" in negative reviews can learn a negative weight for it, something unigram-only BoW cannot do because it sees "not" and "good" as separate, unrelated counts.
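Under the hood, an n-gram is just a sliding window over the token list. A minimal sketch, with `ngrams` as a hypothetical helper name:

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not good at all".split()

print(ngrams(tokens, 1))  # ['not', 'good', 'at', 'all']
print(ngrams(tokens, 2))  # ['not good', 'good at', 'at all']
print(ngrams(tokens, 3))  # ['not good at', 'good at all']
```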

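Trade-off: adding bigrams inflates the vocabulary. V unigrams allow up to V² distinct bigrams in principle, although only a small fraction of those actually occur in real text. Use min_df=2 to drop features that appear in fewer than 2 documents.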

A complete classification example

End-to-end spam detection on a tiny toy dataset:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_train = [
    "Win a free iPhone now click here",
    "URGENT: claim your prize today",
    "Hi mom, see you at dinner",
    "Meeting moved to 3pm thanks",
]
y_train = [1, 1, 0, 0]   # 1 = spam

pipe = Pipeline([
    ('bow',   CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)

print(pipe.predict(["Free iPhone giveaway click", "see you tomorrow"]))
# [1 0]

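Four training examples are enough for the pipeline to label both test messages correctly, although a toy set this small says nothing about generalisation. With around 10k labelled examples, this approach is often competitive with much fancier models, and it trains in seconds.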

What Bag-of-Words still gets wrong

Problem	Example	Where it gets fixed
Common words dominate	`"the"` appears in every doc → high count, no signal	TF-IDF down-weights common words
All words weighted equally	`"superb"` and `"product"` count the same	TF-IDF boosts rare, distinctive words
Sparse, huge vocabulary	50k-dim vectors for 200-word docs	Embeddings (Lesson 6) compress to dense vectors
No semantic similarity	`"car"` and `"automobile"` look unrelated	Embeddings (Lesson 6) put them near each other

Bag-of-Words is the honest baseline. Try it first on any text problem. If you get 85% F1 with BoW + Logistic Regression, you may not need a Transformer.

Key takeaways

Bag-of-Words = count the words, ignore the order.
CountVectorizer turns documents into a sparse N × V matrix.
Use n-grams to recover a bit of word order ("not good").
BoW is the baseline for any text-classification problem.
It suffers from common-word dominance and captures no semantic similarity; TF-IDF and embeddings, respectively, address those.

Next: TF-IDF — weighing words by how informative they are.