Bag-of-Words

Every ML model — from logistic regression to a Transformer — only knows how to multiply numbers. Text is not numbers. The whole job of NLP, before the model, is to turn a document into a vector.

The simplest answer: count the words.

Take three short documents:

  • D1: “The cat sat on the mat.”
  • D2: “The dog sat on the mat.”
  • D3: “Cats and dogs are pets.”

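Build the vocabulary (every distinct token after cleaning; here cleaning means lowercasing, stripping punctuation, and reducing the plurals “cats” and “dogs” to “cat” and “dog”):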

[the, cat, sat, on, mat, dog, and, are, pets]

For each document, count how many times each vocabulary word appears:

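| Document | the | cat | sat | on | mat | dog | and | are | pets |
|----------|-----|-----|-----|----|-----|-----|-----|-----|------|
| D1       | 2   | 1   | 1   | 1  | 1   | 0   | 0   | 0   | 0    |
| D2       | 2   | 0   | 1   | 1  | 1   | 1   | 0   | 0   | 0    |
| D3       | 0   | 1   | 0   | 0  | 0   | 1   | 1   | 1   | 1    |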

Three documents → three vectors of length 9. Now any classifier can eat them.
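
The table can be rebuilt in a few lines of plain Python. This is a minimal sketch, not a production tokeniser: the `SINGULAR` map is a hypothetical stand-in for a real lemmatiser, and `tokenize` is an illustrative helper name.

```python
import re
from collections import Counter

docs = {
    "D1": "The cat sat on the mat.",
    "D2": "The dog sat on the mat.",
    "D3": "Cats and dogs are pets.",
}

# Hypothetical cleaning rule: merge the two plurals that have a singular
# form elsewhere in the corpus (a stand-in for a real lemmatiser).
SINGULAR = {"cats": "cat", "dogs": "dog"}

def tokenize(text):
    # Lowercase, keep runs of letters only, then apply the plural map.
    words = re.findall(r"[a-z]+", text.lower())
    return [SINGULAR.get(w, w) for w in words]

# Vocabulary in order of first appearance across the corpus.
vocab = []
for text in docs.values():
    for tok in tokenize(text):
        if tok not in vocab:
            vocab.append(tok)

# One count vector per document, aligned with the vocabulary order.
vectors = {}
for name, text in docs.items():
    counts = Counter(tokenize(text))
    vectors[name] = [counts[w] for w in vocab]

print(vocab)
for name, vec in vectors.items():
    print(name, vec)
```

The vocabulary order here is "first seen", which happens to match the list above; scikit-learn sorts its vocabulary alphabetically instead, so column order differs even when the counts agree.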

flowchart LR
  C["Corpus<br/>of N documents"] --> V["Vocabulary<br/>(all distinct tokens)"]
  V --> M["N × V<br/>count matrix"]
  M --> Cl["Classifier<br/>(logreg, NB, SVM)"]
  classDef src fill:#fde68a,stroke:#c2410c
  classDef build fill:#dbeafe,stroke:#2563eb
  classDef out fill:#d1fae5,stroke:#047857
  C:::src
  V:::build
  M:::build
  Cl:::out
Bag-of-Words in one diagram — count tokens, drop them in a big matrix, train a model.

The name “bag” of words is literal: we throw away word order. “The cat ate the dog” and “The dog ate the cat” produce the same vector. That’s a limitation we’ll partly address below with n-grams.
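
A quick way to see the lost word order, sketched with the standard library only (`bag` is an illustrative helper, not a library function):

```python
from collections import Counter

def bag(text):
    # Lowercase and split on whitespace; the Counter is the "bag".
    return Counter(text.lower().split())

a = bag("The cat ate the dog")
b = bag("The dog ate the cat")

print(a)
print(a == b)  # True: same words, same counts, order ignored
```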

Using these vectors to find similar documents — cosine similarity by hand

Once each document is a vector, you can measure how similar two documents are by the cosine of the angle between their vectors. Cosine similarity is (A · B) / (‖A‖ × ‖B‖). Because word counts are never negative, it ranges here from 0 (orthogonal, no shared words) to 1 (vectors pointing in the same direction).

Using the three vectors from the table above:

| Pair | Dot product A · B | ‖A‖ | ‖B‖ | Cosine similarity | Interpretation |
|------|-------------------|-----|-----|-------------------|----------------|
| D1 vs D2 | 2·2 + 1·0 + 1·1 + 1·1 + 1·1 + 0·1 = 7 | √8 ≈ 2.83 | √8 ≈ 2.83 | 0.88 | Very similar: they share the, sat, on, mat |
| D1 vs D3 | 2·0 + 1·1 + 0 + 0 + 0 + 0 = 1 | √8 ≈ 2.83 | √5 ≈ 2.24 | 0.16 | Almost unrelated: only cat is shared |
| D2 vs D3 | 2·0 + 0 + 1·0 + 0 + 0 + 1·1 = 1 | √8 ≈ 2.83 | √5 ≈ 2.24 | 0.16 | Same score: only dog is shared |

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vec = CountVectorizer()
X = vec.fit_transform([
    "The cat sat on the mat",
    "The dog sat on the mat",
    "Cats and dogs are pets",
])
print(cosine_similarity(X).round(2))
# D1 vs D2 is 7/8 = 0.875, matching the table.
# D1 vs D3 and D2 vs D3 come out as 0 here, not 0.16: CountVectorizer does
# not lemmatise, so "cats" and "dogs" never match "cat" and "dog".
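
The hand calculation in the table can also be checked directly from the formula. A minimal sketch, assuming the three 9-dimensional count vectors from the first table (plurals already merged); `cosine` is an illustrative helper:

```python
import math

# Count vectors copied from the 9-word table (plurals already merged).
D1 = [2, 1, 1, 1, 1, 0, 0, 0, 0]
D2 = [2, 0, 1, 1, 1, 1, 0, 0, 0]
D3 = [0, 1, 0, 0, 0, 1, 1, 1, 1]

def cosine(a, b):
    # (A . B) / (||A|| * ||B||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(f"D1 vs D2: {cosine(D1, D2):.3f}")  # 0.875
print(f"D1 vs D3: {cosine(D1, D3):.3f}")  # 0.158
print(f"D2 vs D3: {cosine(D2, D3):.3f}")  # 0.158
```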

Two takeaways:

  1. Counting words turns text comparison into geometry. Classic search engines, duplicate detectors, and recommendation systems have long relied on this kind of vector comparison.
  2. D1 and D2 score 0.88, yet they differ in meaning (cat vs dog). BoW captures surface similarity, not semantic similarity. That gap is a large part of the motivation for embeddings (Lesson 6).

The same counting in scikit-learn, this time with English stop words removed:

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Cats and dogs are pets.",
]
# lowercase=True is the default; stop_words='english' drops common function words.
vec = CountVectorizer(lowercase=True, stop_words='english')
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
# ['cat' 'cats' 'dog' 'dogs' 'mat' 'pets' 'sat']
print(X.toarray())
# [[1 0 0 0 1 0 1]
#  [0 0 1 0 1 0 1]
#  [0 1 0 1 0 1 0]]

Two things to notice:

  1. cat and cats are different features. Run a lemmatiser (Lesson 3) before the vectoriser if you want them merged.
  2. the, on, and, are disappeared thanks to stop_words='english'.

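X is a sparse matrix: most cells are 0. Real corpora have vocabularies of 50k+ words, while a single document typically contains at most a few hundred distinct words, so storing every zero explicitly would be wasteful. A sparse format stores only the non-zero counts.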
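
To see the sparsity concretely, the sketch below inspects the matrix produced for the three example sentences. It relies on the `shape` and `nnz` attributes of SciPy sparse matrices; the density figure is just non-zero cells divided by all cells.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "Cats and dogs are pets.",
]
X = CountVectorizer(stop_words="english").fit_transform(docs)

rows, cols = X.shape
density = X.nnz / (rows * cols)

print(issparse(X))   # True: fit_transform returns a SciPy sparse matrix
print(X.shape)       # (3, 7): 3 documents, 7 vocabulary entries
print(X.nnz)         # 9 stored non-zero counts
print(f"{density:.2f}")
```

On a toy corpus the density is still high; with a 50k-word vocabulary and documents of a few hundred distinct words it drops well below 1%, which is where the sparse format pays off.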

Fixing the “order doesn’t matter” problem — n-grams

A small but powerful trick: don’t count single words, count pairs (bigrams) or triplets (trigrams).

vec = CountVectorizer(ngram_range=(1, 2)) # unigrams + bigrams
X = vec.fit_transform(["not good at all"])
print(vec.get_feature_names_out())
# ['all' 'at' 'at all' 'good' 'good at' 'not' 'not good']

Now "not good" is a single feature. A classifier that has seen "not good" in negative reviews will catch it correctly — something pure single-word BoW cannot.

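Trade-off: n-grams inflate the vocabulary. In the worst case the number of possible bigrams is the square of the number of unigrams; real text contains far fewer, but the feature count still grows sharply. Use min_df=2 to drop features that appear in fewer than 2 documents.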
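
The mechanics can be sketched by hand. This is not scikit-learn's implementation, only an illustration of the idea: the three review snippets are made up, `ngrams` is an illustrative helper, and the `count >= 2` filter mimics what `min_df=2` does.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of n tokens across the list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up review snippets, for illustration only.
docs = [
    "not good at all",
    "not good value",
    "very good value",
]

# Document frequency: in how many documents does each feature occur?
df = Counter()
for text in docs:
    tokens = text.split()
    features = set(ngrams(tokens, 1) + ngrams(tokens, 2))
    df.update(features)

all_features = sorted(df)
kept = sorted(f for f, count in df.items() if count >= 2)  # like min_df=2

print(len(all_features))  # 11 features: 6 unigrams + 5 bigrams
print(kept)               # ['good', 'good value', 'not', 'not good', 'value']
```

Filtering by document frequency removes the one-off n-grams, which are usually the bulk of the added vocabulary, while keeping recurring phrases such as "not good".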

End-to-end spam-detection on a tiny toy dataset:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X_train = [
    "Win a free iPhone now click here",
    "URGENT: claim your prize today",
    "Hi mom, see you at dinner",
    "Meeting moved to 3pm thanks",
]
y_train = [1, 1, 0, 0]  # 1 = spam

pipe = Pipeline([
    ('bow', CountVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.predict(["Free iPhone giveaway click", "see you tomorrow"]))
# [1 0]

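Even with four training examples, the pipeline labels these two messages correctly, though a dataset this small proves nothing: the first message reuses spam vocabulary ("free", "iphone", "click"), while the second shares none of it. With around 10k labelled examples this approach is often competitive with much fancier models, and it trains in seconds.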
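
When a prediction looks surprisingly good, it helps to check which vocabulary entries a message actually activates. A small sketch, assuming the same training texts and vectoriser settings as above; `activated` is an illustrative variable name, and `inverse_transform` is the `CountVectorizer` method that maps a count row back to its non-zero features.

```python
from sklearn.feature_extraction.text import CountVectorizer

X_train = [
    "Win a free iPhone now click here",
    "URGENT: claim your prize today",
    "Hi mom, see you at dinner",
    "Meeting moved to 3pm thanks",
]

# Same vectoriser settings as the pipeline's 'bow' step.
bow = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit(X_train)

activated = {}
for msg in ["Free iPhone giveaway click", "see you tomorrow"]:
    row = bow.transform([msg])
    # inverse_transform lists the vocabulary entries with non-zero counts.
    activated[msg] = sorted(str(f) for f in bow.inverse_transform(row)[0])
    print(repr(msg), "->", activated[msg])
```

On this toy data the first message should activate click, free, iphone and the bigram "free iphone". The second should activate nothing, since "see" and "you" are stop words and "tomorrow" never appeared in training, so its label comes from the model's intercept alone.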

| Problem | Example | Fix in later lessons |
|---------|---------|----------------------|
| Common words dominate | "the" appears in every doc → high count, no signal | TF-IDF down-weights common words |
| All words weighted equally | "superb" and "product" count the same | TF-IDF boosts rare, distinctive words |
| Sparse, huge vocabulary | 50k-dim vectors for 200-word docs | Embeddings (Lesson 6) compress to dense vectors |
| No semantic similarity | "car" and "automobile" look unrelated | Embeddings (Lesson 6) put them near each other |

Bag-of-Words is the honest baseline. Try it first on any text problem. If you get 85% F1 with BoW + Logistic Regression, you may not need a Transformer.

  • Bag-of-Words = count the words, ignore the order.
  • CountVectorizer turns documents into a sparse N × V matrix.
  • Use n-grams to recover a bit of word order ("not good").
  • BoW is the baseline for any text-classification problem.
  • It lets common words dominate and has no notion of semantic similarity; TF-IDF and embeddings address those.

Next: TF-IDF — weighing words by how informative they are.