
Stop words, stemming, lemmatization

After tokenising a corpus you typically end up with tens of thousands of distinct tokens. Most carry little signal — “the” appears in every document, and “running”, “runs”, “ran” look like three different words to the computer. Three classical tools to fix this:

flowchart TB
  V["Vocabulary too big<br/>+ noise"]
  V --> S["Stop words<br/>(remove very common words)"]
  V --> St["Stemming<br/>(crude root extraction)"]
  V --> L["Lemmatisation<br/>(grammatically-aware root)"]
  S --> R["Smaller, cleaner vocabulary"]
  St --> R
  L --> R
  classDef problem fill:#fee2e2,stroke:#dc2626
  classDef tool fill:#dbeafe,stroke:#2563eb
  classDef out fill:#d1fae5,stroke:#047857
  V:::problem
  S:::tool
  St:::tool
  L:::tool
  R:::out
Three classical tools that shrink the vocabulary and the noise.
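
To see the problem on a small scale, here is a minimal sketch that counts distinct tokens in a three-sentence toy corpus. The corpus and variable names are made up for illustration; a real corpus would be far larger.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased.
corpus = ["the cat runs", "the dog ran", "the cat is running"]

# Whitespace tokenisation, then one count per distinct token.
counts = Counter(tok for doc in corpus for tok in doc.split())

print(len(counts))            # 7 distinct tokens
print(counts.most_common(1))  # [('the', 3)]

# The three inflections of "run" are three unrelated keys.
variants = [w for w in ("runs", "ran", "running") if w in counts]
print(variants)               # ['runs', 'ran', 'running']
```

Even here the most frequent token carries no topic information, and one verb occupies three vocabulary slots.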

1. Stop words — kill the most common words


Stop words are extremely common words that rarely change the meaning of a document: the, a, of, and, in, to, is, that… Removing them shrinks your vocabulary and forces the model to focus on the words that matter.

from nltk.corpus import stopwords
import nltk
nltk.download('stopwords', quiet=True)
stop = set(stopwords.words('english'))
text = "The cat is sitting on the mat".lower().split()
clean = [w for w in text if w not in stop]
print(clean)
# ['cat', 'sitting', 'mat']

The default NLTK list has ~180 English stop words (the exact count depends on the NLTK version). Often you want to add or remove some:

custom = stop | {"http", "https", "www", "rt"} # tweets
custom -= {"not", "no"} # keep negations
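
One way to find corpus-specific candidates to add is to look for words that occur in almost every document. The sketch below is only an illustration: the function `frequent_words`, the `min_doc_ratio` threshold and the sample tweets are assumptions, not part of NLTK.

```python
from collections import Counter

def frequent_words(docs, min_doc_ratio=0.8):
    """Return words that occur in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so each word counts once per document
        doc_freq.update(set(doc.lower().split()))
    threshold = min_doc_ratio * len(docs)
    return sorted(w for w, n in doc_freq.items() if n >= threshold)

# Hypothetical sample tweets.
docs = [
    "rt the match was great",
    "rt the weather is awful",
    "the new phone is out rt",
    "rt loving the new album",
]
print(frequent_words(docs))  # ['rt', 'the']
```

The output is a list of candidates to review by hand, not a list to remove blindly.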

| Task | Stop words? |
| --- | --- |
| Topic classification (news, support tickets) | Remove: they add noise |
| TF-IDF search | Remove: they dominate counts |
| Sentiment analysis | Keep: “not good” ≠ “good” |
| Question answering | Keep: “who is…” matters |
| Machine translation / generation | Keep everything |
| Any Transformer / LLM workflow | Keep everything: the model handles it |
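
The sentiment row can be illustrated with a small sketch. The stop list `TOY_STOP` and the helper `remove_stop_words` are hypothetical and only stand in for the much longer NLTK list, which also contains "not" and "no".

```python
# Tiny illustrative stop list (an assumption; the real NLTK list is much longer).
TOY_STOP = {"the", "is", "was", "a", "not", "no"}

def remove_stop_words(text, stop, keep=frozenset()):
    """Lowercase, split on whitespace, drop stop words unless listed in keep."""
    effective = set(stop) - set(keep)
    return [w for w in text.lower().split() if w not in effective]

positive = "The film was good"
negative = "The film was not good"

# With negations in the stop list, both reviews reduce to the same tokens.
print(remove_stop_words(positive, TOY_STOP))  # ['film', 'good']
print(remove_stop_words(negative, TOY_STOP))  # ['film', 'good']

# Keeping negations preserves the difference.
print(remove_stop_words(negative, TOY_STOP, keep={"not", "no"}))
# ['film', 'not', 'good']
```

After naive removal, a Bag-of-Words model cannot tell the two reviews apart.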

The golden rule: stop-word removal was useful for Bag-of-Words and TF-IDF. It’s almost always wrong for neural models.

2. Stemming

A stemmer mechanically chops the end off a word to (try to) reach its root. It is fast and crude, and needs no dictionary.

from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ["running", "runs", "ran", "runner", "happily", "studies"]
print([ps.stem(w) for w in words])
# ['run', 'run', 'ran', 'runner', 'happili', 'studi']

Three observations:

  1. running and runs collapse to run: good, that’s the point.
  2. ran stays ran: Porter doesn’t know irregular verbs.
  3. happily → happili and studies → studi: stems are not real words. The computer doesn’t care; it just wants the same key for variants.

| Stemmer | Speed | Aggressiveness | Languages |
| --- | --- | --- | --- |
| Porter | Fast | Medium | English only |
| Snowball | Fast | Medium (improved Porter) | More than a dozen, incl. French |
| Lancaster | Fast | Very aggressive | English |
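
To make "mechanically chops the end of a word" concrete, here is a minimal sketch of a suffix stripper. The rule list `TOY_RULES`, the `MIN_STEM` length and the function `toy_stem` are hypothetical and far simpler than Porter's actual algorithm, so treat it as an illustration of the idea only.

```python
# Ordered (suffix, replacement) rules: a toy rule set, not Porter's.
TOY_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
MIN_STEM = 3  # never strip a suffix if fewer than 3 letters would remain

def toy_stem(word):
    """Apply the first matching suffix rule; no dictionary lookup at all."""
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            stem = word[:-len(suffix)] + replacement
            # Undouble a final consonant: "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word  # no rule matched, e.g. the irregular form "ran"

words = ["running", "runs", "ran", "studies", "showed"]
print([toy_stem(w) for w in words])
# ['run', 'run', 'ran', 'studi', 'show']
```

Like a real stemmer, the sketch knows nothing about irregular verbs and happily produces non-words.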

For French:

from nltk.stem.snowball import FrenchStemmer
fs = FrenchStemmer()
print(fs.stem("courons"))     # 'cour'
print(fs.stem("mangeaient"))  # 'mang'

3. Lemmatization — the dictionary version


A lemmatiser uses a dictionary and the part-of-speech (POS) tag to return the actual canonical form (the lemma).

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The runners ran happily while studying.")
for t in doc:
    print(f"{t.text:10s} -> {t.lemma_}")
# The        -> the
# runners    -> runner
# ran        -> run
# happily    -> happily
# while      -> while
# studying   -> study
# .          -> .

Compared to stemming: ran → run, studying → study, happily → happily (correctly kept as adverb). Lemmas are real words.
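
Why the POS tag matters can be shown with a toy lookup table. The dictionary `TOY_LEMMAS` and the function `toy_lemmatise` below are hypothetical; a real lemmatiser such as spaCy's uses much larger resources and predicts the tag itself.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("ran", "VERB"): "run",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemmatise(word, pos):
    """Look up (word, pos); fall back to the lowercased word if unknown."""
    key = (word.lower(), pos)
    return TOY_LEMMAS.get(key, word.lower())

print(toy_lemmatise("saw", "VERB"))      # see   ("I saw a film")
print(toy_lemmatise("saw", "NOUN"))      # saw   ("a rusty saw")
print(toy_lemmatise("meeting", "VERB"))  # meet
print(toy_lemmatise("meeting", "NOUN"))  # meeting
```

The same surface form maps to different lemmas depending on its tag, which a suffix-chopping stemmer cannot express.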

| | Stemming | Lemmatisation |
| --- | --- | --- |
| Speed | Very fast | 10–100× slower |
| Output | Often not a real word | Always a real word |
| Quality | Approximate | Linguistically correct |
| Dependencies | None | Dictionary + POS tagger |
| Languages | Many | Fewer, but the major ones are covered |
| Use case | Search, exploratory analysis | Production NLP pipelines |

Default rule: use lemmatisation if you can afford it (spaCy is fast enough for most projects). Use stemming only if you need raw speed on huge data.

One sentence through the four steps, side by side

The same input phrase processed through each tool, token by token:

Input: "The runners were running faster and the studies showed they enjoyed it."

| Step | Output |
| --- | --- |
| 1. Tokenisation (split) | [The, runners, were, running, faster, and, the, studies, showed, they, enjoyed, it, .] |
| 2. After stop-word removal | [runners, running, faster, studies, showed, enjoyed, .] |
| 3. After stemming (Porter) | [runner, run, faster, studi, show, enjoy, .] |
| 4. After lemmatisation (spaCy) | [runner, run, fast, study, show, enjoy, .] |
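
Steps 1 and 2 can be reproduced with the standard library alone. This is a rough sketch: the regex tokeniser and the five-word stop list `TOY_STOP` are assumptions chosen to cover this one sentence, not the NLTK or spaCy implementations.

```python
import re

sentence = "The runners were running faster and the studies showed they enjoyed it."

# Step 1: split into word tokens and single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
# ['The', 'runners', 'were', 'running', 'faster', 'and', 'the',
#  'studies', 'showed', 'they', 'enjoyed', 'it', '.']

# Step 2: drop stop words, comparing in lowercase.
TOY_STOP = {"the", "were", "and", "they", "it"}  # hypothetical mini list
kept = [t for t in tokens if t.lower() not in TOY_STOP]
print(kept)
# ['runners', 'running', 'faster', 'studies', 'showed', 'enjoyed', '.']
```

Comparing in lowercase is what removes the capitalised "The" at the start of the sentence.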

Token-by-token comparison of the two normalisations:

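| Original | Porter stem | spaCy lemma | Comment |
| --- | --- | --- | --- |
| runners | runner | runner | Same result |
| running | run | run | Same result |
| faster | faster | fast | The lemmatiser knows “faster” is the comparative of “fast”; the stemmer leaves it unchanged |
| studies | studi | study | The stem is not a real word; the lemma is |
| showed | show | show | Same result |
| enjoyed | enjoy | enjoy | Same result |
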
from nltk.stem import PorterStemmer
import spacy
ps = PorterStemmer()
nlp = spacy.load("en_core_web_sm")
tokens = ["runners", "running", "faster", "studies", "showed", "enjoyed"]
print("Stem: ", [ps.stem(w) for w in tokens])
print("Lemma: ", [t.lemma_ for t in nlp(" ".join(tokens))])
# Stem: ['runner', 'run', 'faster', 'studi', 'show', 'enjoy']
# Lemma: ['runner', 'run', 'fast', 'study', 'show', 'enjoy']

Two takeaways:

  1. On 4 of 6 tokens the two tools agree. Most of the time, either choice is fine.
  2. Stemming fails silently where its suffix rules fall short: faster is left unreduced and studies becomes the non-word studi. If your downstream consumer is a human (search results, word clouds), lemmatisation is worth the extra cost.
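
The "4 of 6" figure can be checked by comparing the two printed lists position by position. The lists below are copied from the output above; the variable names are only for illustration.

```python
originals = ["runners", "running", "faster", "studies", "showed", "enjoyed"]
stems = ["runner", "run", "faster", "studi", "show", "enjoy"]
lemmas = ["runner", "run", "fast", "study", "show", "enjoy"]

# Collect the tokens where the stem and the lemma differ.
disagreements = [
    (word, stem, lemma)
    for word, stem, lemma in zip(originals, stems, lemmas)
    if stem != lemma
]
agree = len(originals) - len(disagreements)

print(f"{agree} of {len(originals)} tokens agree")  # 4 of 6 tokens agree
for word, stem, lemma in disagreements:
    print(f"{word}: stem={stem}, lemma={lemma}")
# faster: stem=faster, lemma=fast
# studies: stem=studi, lemma=study
```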

Putting everything from Lessons 2 and 3 together:

import re, unicodedata
import spacy
from nltk.corpus import stopwords
nlp = spacy.load("en_core_web_sm")
STOP = set(stopwords.words('english'))
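def clean(text: str) -> list[str]:
    text = text.lower()
    # Strip accents: NFKD splits "é" into "e" plus a combining mark, which is dropped
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    # Replace punctuation with spaces
    text = re.sub(r"[^\w\s]", " ", text)
    doc = nlp(text)
    # Keep lemmas that are neither stop words nor whitespace
    return [t.lemma_ for t in doc
            if t.lemma_ not in STOP
            and not t.is_space
            and t.lemma_.strip()]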
print(clean("The runners ran happily while studying NLP."))
# ['runner', 'run', 'happily', 'study', 'nlp']
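
The accent-stripping and punctuation steps are the least obvious part of the function, so here they are in isolation. This is a standard-library sketch; the helper name `normalise` is made up, and it splits on whitespace instead of calling spaCy.

```python
import re
import unicodedata

def normalise(text):
    """Lowercase, strip accents, replace punctuation with spaces, then split."""
    text = text.lower()
    # NFKD decomposes "é" into "e" followed by a combining accent character.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining characters, keeping only the base letters.
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything that is neither a word character nor whitespace becomes a space.
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()

print(normalise("Café, déjà-vu!"))  # ['cafe', 'deja', 'vu']
```

Stripping accents merges spellings such as "café" and "cafe", which helps for English text but can conflate distinct words in languages where accents carry meaning.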

This is the chain feeding into Bag-of-Words and TF-IDF — coming up next.

If you are going to feed text to a Transformer / LLM (BERT, GPT, Llama, sentence-transformers…), skip all of it. Let the model see the raw text. These models were trained with stop words and inflections — removing them throws away signal.

The whole linguistic-cleaning toolkit is for classical ML pipelines: TF-IDF + logistic regression, word clouds, simple search engines. Which is exactly what we’ll build in the next two lessons.

  • Stop words are extremely common words you can usually drop for BoW/TF-IDF.
  • Stemming = fast, dumb chopping. Output may not be a real word.
  • Lemmatisation = uses a dictionary. Output is a real word. Slower but better.
  • Keep stop words for sentiment analysis and question answering; skip all three for generation, translation, or any Transformer pipeline.
  • The cleaning pipeline (Lessons 2 + 3) is the input to the vectorisation pipeline (Lessons 4 + 5).

Next: Bag-of-Words — the first way to turn cleaned text into numbers.