Stop words, stemming, lemmatisation

The vocabulary problem

After tokenising a corpus you typically end up with tens of thousands of distinct tokens. Many carry little signal: “the” appears in nearly every document, and “running”, “runs”, “ran” look like three unrelated words to the computer. Three classical tools address this:

flowchart TB
  V["Vocabulary too big<br/>+ noise"]
  V --> S["Stop words<br/>(remove very common words)"]
  V --> St["Stemming<br/>(crude root extraction)"]
  V --> L["Lemmatisation<br/>(grammatically-aware root)"]
  S --> R["Smaller, cleaner vocabulary"]
  St --> R
  L --> R
  classDef problem fill:#fee2e2,stroke:#dc2626
  classDef tool fill:#dbeafe,stroke:#2563eb
  classDef out fill:#d1fae5,stroke:#047857
  V:::problem
  S:::tool
  St:::tool
  L:::tool
  R:::out

Three classical tools that shrink the vocabulary and the noise.
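
To make the problem concrete, here is a minimal sketch that counts the vocabulary of a three-sentence toy corpus. The sentences are invented for illustration, and whitespace splitting stands in for a real tokeniser.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "the dog runs in the park",
    "the dog ran to the park",
    "the dog is running in the rain",
]

# Naive whitespace tokenisation; a real tokeniser would also handle punctuation.
tokens = [tok for doc in corpus for tok in doc.split()]
counts = Counter(tokens)

print(len(counts))            # 10 distinct tokens
print(counts.most_common(1))  # [('the', 6)]

# Inflections of "run" are stored as separate vocabulary entries.
variants = sorted(set(counts) & {"run", "runs", "ran", "running"})
print(variants)               # ['ran', 'running', 'runs']
```

Even in 19 tokens, one word accounts for almost a third of the counts and a single verb occupies three vocabulary slots.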

1. Stop words — kill the most common words

Stop words are extremely common words that rarely change the meaning of a document: the, a, of, and, in, to, is, that… Removing them shrinks your vocabulary and forces the model to focus on the words that matter.

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords', quiet=True)

stop = set(stopwords.words('english'))

text = "The cat is sitting on the mat".lower().split()
clean = [w for w in text if w not in stop]
print(clean)
# ['cat', 'sitting', 'mat']

Build your own list

The default NLTK list has roughly 180 to 200 English stop words, depending on the NLTK version. Often you want to add or remove some:

custom = stop | {"http", "https", "www", "rt"}  # tweets
custom -= {"not", "no"}                          # keep negations
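
The effect of keeping negations can be sketched without NLTK. The stop set below is a tiny hand-picked stand-in (an assumption for illustration, not the NLTK list), and `remove_stop_words` is a hypothetical helper name.

```python
# Tiny stand-in stop list; the real NLTK list is much longer.
base_stop = {"the", "is", "a", "this", "not", "no"}

# Same adjustments as above: add tweet noise, keep negations.
custom = (base_stop | {"http", "https", "www", "rt"}) - {"not", "no"}

def remove_stop_words(text: str, stop: set[str]) -> list[str]:
    """Lowercase, split on whitespace, and drop tokens found in `stop`."""
    return [w for w in text.lower().split() if w not in stop]

tweet = "RT this movie is not good"
print(remove_stop_words(tweet, base_stop))  # ['rt', 'movie', 'good']
print(remove_stop_words(tweet, custom))     # ['movie', 'not', 'good']
```

With the unmodified list the negation disappears and the tweet reads as positive; the custom list keeps it and drops the retweet marker instead.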

When to NOT remove stop words

Task	Stop words?
Topic classification (news, support tickets)	Remove — they add noise
TF-IDF search	Remove — they dominate counts
Sentiment analysis	Keep — “not good” ≠ “good”
Question answering	Keep — “who is…” matters
Machine translation / generation	Keep everything
Any Transformer / LLM workflow	Keep everything — the model handles it

The golden rule: stop-word removal was useful for Bag-of-Words and TF-IDF. It’s almost always wrong for neural models.

2. Stemming — chop the suffix

A stemmer mechanically chops the end of a word to (try to) reach its root. Fast, dumb, no dictionary required.

from nltk.stem import PorterStemmer
ps = PorterStemmer()

words = ["running", "runs", "ran", "runner", "happily", "studies"]
print([ps.stem(w) for w in words])
# ['run', 'run', 'ran', 'runner', 'happili', 'studi']

Three observations:

running and runs collapse to run — good, that’s the point.
ran stays ran — Porter doesn’t know irregular verbs.
happily → happili and studies → studi — stems are not real words. The computer doesn’t care; it just wants the same key for variants.
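
To see what mechanical chopping means, here is a deliberately naive suffix stripper. It is not the Porter algorithm: the suffix list, the `min_stem` threshold and the name `toy_stem` are all assumptions made for this sketch.

```python
from collections import defaultdict

# Suffixes tried in order; a real stemmer has dozens of ordered rules.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix if at least `min_stem` letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Group words by stem: the stem only has to be a shared key, not a real word.
groups = defaultdict(list)
for w in ["walking", "walked", "walks", "running", "runs", "ran"]:
    groups[toy_stem(w)].append(w)

print(dict(groups))
# {'walk': ['walking', 'walked', 'walks'], 'runn': ['running'], 'run': ['runs'], 'ran': ['ran']}
```

The toy version leaves running as runn because it has no rule for doubled consonants. Porter has one, which is why it returns run.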

Popular stemmers

Stemmer	Speed	Aggressiveness	Languages
Porter	Fast	Medium	English only
Snowball	Fast	Medium (improved Porter)	15 languages in NLTK, incl. French
Lancaster	Fast	Very aggressive	English

For French:

from nltk.stem.snowball import FrenchStemmer
fs = FrenchStemmer()
print(fs.stem("courons"))     # 'couron' (the -ons ending is not stripped)
print(fs.stem("mangeaient"))  # 'mang'

3. Lemmatization — the dictionary version

A lemmatiser uses a dictionary and the part-of-speech (POS) tag to return the actual canonical form (the lemma).

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("The runners ran happily while studying.")
for t in doc:
    print(f"{t.text:10s} -> {t.lemma_}")
# The        -> the
# runners    -> runner
# ran        -> run
# happily    -> happily
# while      -> while
# studying   -> study
# .          -> .

Compared to stemming: ran → run, studying → study, happily → happily (correctly kept as adverb). Lemmas are real words.
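
Why the POS tag matters can be shown with a toy lookup table. The entries and the `toy_lemma` helper are hypothetical, hand-written for this sketch; a real lemmatiser relies on a trained tagger and far larger resources.

```python
# Hand-written (word, POS) -> lemma table; purely illustrative.
LEMMAS = {
    ("ran", "VERB"): "run",
    ("studying", "VERB"): "study",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}

def toy_lemma(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())

print(toy_lemma("ran", "VERB"))     # run
print(toy_lemma("leaves", "NOUN"))  # leaf
print(toy_lemma("leaves", "VERB"))  # leave
print(toy_lemma("happily", "ADV"))  # happily (not in the table, returned unchanged)
```

The same surface form, leaves, maps to two different lemmas depending on its tag. A suffix-chopping stemmer cannot make that distinction.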

Stemming vs lemmatisation — pick one

	Stemming	Lemmatisation
Speed	Very fast	10–100× slower
Output	Often not a real word	Always a real word
Quality	Approximate	Linguistically correct
Dependencies	None	Dictionary + POS tagger
Languages	Many	Fewer, but the major ones are covered
Use case	Search, exploratory analysis	Production NLP pipelines

Default rule: use lemmatisation if you can afford it (spaCy is fast enough for most projects). Use stemming only if you need raw speed on huge data.

One sentence through the four steps, side by side

The same input sentence processed step by step. Steps 3 and 4 are alternatives, each applied to the output of step 2:

Input: "The runners were running faster and the studies showed they enjoyed it."

Step	Output
1. Tokenisation (split)	`[The, runners, were, running, faster, and, the, studies, showed, they, enjoyed, it, .]`
2. After stop-word removal	`[runners, running, faster, studies, showed, enjoyed, .]`
3. After stemming (Porter)	`[runner, run, faster, studi, show, enjoy, .]`
4. After lemmatisation (spaCy)	`[runner, run, fast, study, show, enjoy, .]`
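
Steps 1 and 2 of the table can be reproduced with the standard library alone. The regex tokeniser and the five-word stop set are simplifying assumptions chosen to match this one sentence; they are not what NLTK or spaCy do internally.

```python
import re

sentence = "The runners were running faster and the studies showed they enjoyed it."

# Step 1: words, or single punctuation marks, as separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Step 2: case-insensitive filtering against a minimal stop set.
stop = {"the", "were", "and", "they", "it"}
kept = [t for t in tokens if t.lower() not in stop]

print(len(tokens))  # 13
print(kept)
# ['runners', 'running', 'faster', 'studies', 'showed', 'enjoyed', '.']
```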

Token-by-token comparison of the two normalisations:

Original	Porter stem	spaCy lemma	Comment
`runners`	`runner`	`runner`	Same result
`running`	`run`	`run`	Same result
`faster`	`faster`	`fast`	Lemma understands “faster” is a degree of “fast” — stemmer just chops
`studies`	`studi`	`study`	Stem is not a real word — lemma is
`showed`	`show`	`show`	Same result
`enjoyed`	`enjoy`	`enjoy`	Same result

from nltk.stem import PorterStemmer
import spacy

ps = PorterStemmer()
nlp = spacy.load("en_core_web_sm")
tokens = ["runners", "running", "faster", "studies", "showed", "enjoyed"]

print("Stem:  ", [ps.stem(w) for w in tokens])
print("Lemma: ", [t.lemma_ for t in nlp(" ".join(tokens))])
# Stem:   ['runner', 'run', 'faster', 'studi', 'show', 'enjoy']
# Lemma:  ['runner', 'run', 'fast', 'study', 'show', 'enjoy']

Two takeaways:

On 4 of 6 tokens the two tools agree. Most of the time, either choice is fine.
Where they disagree, the stemmer gives no warning: faster is left unreduced and studies becomes the non-word studi. If your downstream consumer is a human (search results, word clouds), lemmatisation is worth the extra cost.
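
The "4 of 6" figure can be checked directly from the two output lists printed above. They are copied here as literals so the snippet runs without NLTK or spaCy.

```python
tokens = ["runners", "running", "faster", "studies", "showed", "enjoyed"]
stems  = ["runner", "run", "faster", "studi", "show", "enjoy"]  # Porter output from above
lemmas = ["runner", "run", "fast", "study", "show", "enjoy"]    # spaCy output from above

# Pair each token with its stem and lemma, then keep the disagreements.
differ = [(t, s, l) for t, s, l in zip(tokens, stems, lemmas) if s != l]
agree = len(tokens) - len(differ)

print(f"{agree} of {len(tokens)} agree")  # 4 of 6 agree
for t, s, l in differ:
    print(f"{t}: stem={s}, lemma={l}")
# faster: stem=faster, lemma=fast
# studies: stem=studi, lemma=study
```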

A complete cleaning pipeline

Putting everything from Lessons 2 and 3 together:

import re, unicodedata
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
STOP = set(stopwords.words('english'))

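def clean(text: str) -> list[str]:
    """Lowercase, strip accents and punctuation, lemmatise, drop stop words."""
    text = text.lower()
    # Decompose accented characters (NFKD), then drop the combining marks.
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    # Replace punctuation with spaces so it never reaches the tokeniser.
    text = re.sub(r"[^\w\s]", " ", text)
    doc  = nlp(text)
    # Keep lemmas that are not stop words and not whitespace-only tokens.
    return [t.lemma_ for t in doc
            if t.lemma_ not in STOP
            and not t.is_space
            and t.lemma_.strip()]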

print(clean("The runners ran happily while studying NLP."))
# ['runner', 'run', 'happily', 'study', 'nlp']
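
The normalisation lines inside `clean` are easier to follow in isolation. This sketch runs only the standard-library part (lowercasing, accent stripping, punctuation removal) on an invented sample string; `normalise` is a hypothetical helper name.

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """Lowercase, strip accents, and replace punctuation with spaces."""
    text = text.lower()
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything that is neither a word character nor whitespace becomes a space.
    return re.sub(r"[^\w\s]", " ", text)

print(normalise("Café déjà-vu, naïve!").split())
# ['cafe', 'deja', 'vu', 'naive']
```

One side effect worth knowing: the punctuation rule also splits contractions, so don't becomes the two tokens don and t.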

This is the chain feeding into Bag-of-Words and TF-IDF — coming up next.

When NOT to use any of this

If you are going to feed text to a Transformer / LLM (BERT, GPT, Llama, sentence-transformers…), skip all of it. Let the model see the raw text. These models were trained on text that contains stop words and inflections, so removing them throws away signal the model expects.

The whole linguistic-cleaning toolkit is for classical ML pipelines: TF-IDF + logistic regression, word clouds, simple search engines. Which is exactly what we’ll build in the next two lessons.

Key takeaways

Stop words are extremely common words you can usually drop for BoW/TF-IDF.
Stemming = fast, dumb chopping. Output may not be a real word.
Lemmatisation = uses a dictionary. Output is a real word. Slower but better.
Keep stop words for sentiment analysis and question answering. Skip all three steps for generation, translation, or any Transformer pipeline.
The cleaning pipeline (Lessons 2 + 3) is the input to the vectorisation pipeline (Lessons 4 + 5).

Next: Bag-of-Words — the first way to turn cleaned text into numbers.