Stop words, stemming, lemmatization
The vocabulary problem
After tokenising a corpus you typically end up with tens of thousands of distinct tokens. Most carry little signal: “the” appears in every document, and “running”, “runs”, “ran” look like three different words to the computer. Three classical tools address this:
```mermaid
flowchart TB
    V["Vocabulary too big<br/>+ noise"]
    V --> S["Stop words<br/>(remove very common words)"]
    V --> St["Stemming<br/>(crude root extraction)"]
    V --> L["Lemmatisation<br/>(grammatically-aware root)"]
    S --> R["Smaller, cleaner vocabulary"]
    St --> R
    L --> R
    classDef problem fill:#fee2e2,stroke:#dc2626
    classDef tool fill:#dbeafe,stroke:#2563eb
    classDef out fill:#d1fae5,stroke:#047857
    V:::problem
    S:::tool
    St:::tool
    L:::tool
    R:::out
```
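To make the vocabulary problem concrete, here is a minimal sketch that counts tokens in a tiny corpus. The three sentences are made up for illustration; only the standard library is used.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only.
corpus = [
    "the cat runs in the garden",
    "the dog ran to the cat",
    "the children are running in the park",
]

# Naive whitespace tokenisation, then count every token.
tokens = [w for doc in corpus for w in doc.split()]
counts = Counter(tokens)

print(len(tokens), "tokens,", len(counts), "distinct")
# 19 tokens, 12 distinct

print(counts.most_common(1))
# [('the', 6)]

# The three forms of "run" are stored as three unrelated keys.
run_forms = sorted(w for w in counts if w in {"runs", "ran", "running"})
print(run_forms)
# ['ran', 'running', 'runs']
```

Even in three sentences, "the" accounts for almost a third of all tokens, and the verb "run" is split across three vocabulary entries.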
1. Stop words — kill the most common words
Stop words are extremely common words that rarely change the meaning of a document: the, a, of, and, in, to, is, that… Removing them shrinks your vocabulary and forces the model to focus on the words that matter.
```python
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords', quiet=True)

stop = set(stopwords.words('english'))

text = "The cat is sitting on the mat".lower().split()
clean = [w for w in text if w not in stop]
print(clean)
# ['cat', 'sitting', 'mat']
```

Build your own list
The default NLTK list has roughly 180 English stop words (the exact count depends on the NLTK data version). Often you want to add or remove some:
```python
custom = stop | {"http", "https", "www", "rt"}  # tweets
custom -= {"not", "no"}                          # keep negations
```

When to NOT remove stop words
| Task | Stop words? |
|---|---|
| Topic classification (news, support tickets) | Remove — they add noise |
| TF-IDF search | Remove — they dominate counts |
| Sentiment analysis | Keep — “not good” ≠ “good” |
| Question answering | Keep — “who is…” matters |
| Machine translation / generation | Keep everything |
| Any Transformer / LLM workflow | Keep everything — the model handles it |
The golden rule: stop-word removal was useful for Bag-of-Words and TF-IDF. It’s almost always wrong for neural models.
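The sentiment row in the table can be checked with a few lines of code. This is a minimal sketch: `STOP` is a hypothetical hand-written mini list rather than NLTK's real one (which is far longer but also contains "not" and "no"), and `remove_stop_words` is an illustrative helper.

```python
# Hypothetical mini stop list, for illustration only.
STOP = {"the", "a", "is", "was", "this", "not", "no"}

def remove_stop_words(text, stop):
    """Lowercase, split on whitespace, drop every token found in `stop`."""
    return [w for w in text.lower().split() if w not in stop]

positive = "The film was good"
negative = "The film was not good"

# With negations in the stop list, both reviews collapse to the same tokens.
print(remove_stop_words(positive, STOP))  # ['film', 'good']
print(remove_stop_words(negative, STOP))  # ['film', 'good']

# Taking the negations out of the stop list preserves the difference.
keep_negations = STOP - {"not", "no"}
print(remove_stop_words(negative, keep_negations))  # ['film', 'not', 'good']
```

Once the two reviews map to the same token list, no downstream classifier can tell them apart, which is why negations are usually kept for sentiment work.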
2. Stemming — chop the suffix
A stemmer mechanically chops the end of a word to (try to) reach its root. Fast, dumb, no dictionary required.
```python
from nltk.stem import PorterStemmer
ps = PorterStemmer()

words = ["running", "runs", "ran", "runner", "happily", "studies"]
print([ps.stem(w) for w in words])
# ['run', 'run', 'ran', 'runner', 'happili', 'studi']
```

Three observations:
- `running` and `runs` collapse to `run`: good, that’s the point.
- `ran` stays `ran`: Porter doesn’t know irregular verbs.
- `happily` → `happili` and `studies` → `studi`: stems are not real words. The computer doesn’t care; it just wants the same key for variants.
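To see what "mechanical chopping" means, here is a toy suffix stripper. It is a deliberately crude sketch and not the Porter algorithm: the `SUFFIXES` tuple and the `toy_stem` function are hypothetical, and the real Porter stemmer applies many more rules in several ordered steps.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ies", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling ("runn" -> "run"), except for l, s, z.
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

words = ["running", "runs", "ran", "runner", "happily", "studies"]
stems = [toy_stem(w) for w in words]
print(stems)
# ['run', 'run', 'ran', 'runner', 'happily', 'stud']
```

The toy version shows the same behaviour in miniature: regular variants collapse to one key, the irregular `ran` is untouched, and `studies` becomes a non-word. Which non-word you get depends entirely on the rule set, which is why different stemmers disagree.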
Popular stemmers
| Stemmer | Speed | Aggressiveness | Languages |
|---|---|---|---|
| Porter | Fast | Medium | English only |
| Snowball | Fast | Medium (improved Porter) | Over a dozen languages incl. French |
| Lancaster | Fast | Very aggressive | English |
For French:
```python
from nltk.stem.snowball import FrenchStemmer
fs = FrenchStemmer()
print(fs.stem("courons"))     # 'cour'
print(fs.stem("mangeaient"))  # 'mang'
```

3. Lemmatization — the dictionary version
A lemmatiser uses a dictionary and the part-of-speech (POS) tag to return the actual canonical form (the lemma).
```python
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("The runners ran happily while studying.")
for t in doc:
    print(f"{t.text:10s} -> {t.lemma_}")
# The        -> the
# runners    -> runner
# ran        -> run
# happily    -> happily
# while      -> while
# studying   -> study
# .          -> .
```

Compared to stemming: ran → run, studying → study, happily → happily (correctly kept as adverb). Lemmas are real words.
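Why the POS tag matters can be shown with a toy lookup table. This sketch is not how spaCy works internally: `LEXICON` and `toy_lemma` are hypothetical stand-ins for a real dictionary and lemmatiser, and the POS tags are supplied by hand rather than predicted by a tagger.

```python
# Hypothetical mini-lexicon: (surface form, POS tag) -> lemma.
LEXICON = {
    ("runners", "NOUN"): "runner",
    ("ran", "VERB"): "run",
    ("studying", "VERB"): "study",
    ("saw", "VERB"): "see",   # "I saw the film"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
}

def toy_lemma(word, pos):
    """Look up (word, pos); fall back to the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEXICON.get(key, word.lower())

print(toy_lemma("ran", "VERB"))      # run
print(toy_lemma("saw", "VERB"))      # see
print(toy_lemma("saw", "NOUN"))      # saw
print(toy_lemma("Happily", "ADV"))   # happily
```

The same surface form `saw` maps to two different lemmas depending on its tag, something a suffix-based stemmer has no way to express.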
Stemming vs lemmatisation — pick one
| | Stemming | Lemmatisation |
|---|---|---|
| Speed | Very fast | 10–100× slower |
| Output | Often not a real word | Always a real word |
| Quality | Approximate | Linguistically correct |
| Dependencies | None | Dictionary + POS tagger |
| Languages | Many | Fewer, but the major ones are covered |
| Use case | Search, exploratory analysis | Production NLP pipelines |
Default rule: use lemmatisation if you can afford it (spaCy is fast enough for most projects). Use stemming only if you need raw speed on huge data.
One sentence through the four steps, side by side
The same input phrase processed through each tool, token by token:
Input:
"The runners were running faster and the studies showed they enjoyed it."
| Step | Output |
|---|---|
| 1. Tokenisation (split) | [The, runners, were, running, faster, and, the, studies, showed, they, enjoyed, it, .] |
| 2. After stop-word removal | [runners, running, faster, studies, showed, enjoyed, .] |
| 3. After stemming (Porter) | [runner, run, faster, studi, show, enjoy, .] |
| 4. After lemmatisation (spaCy) | [runner, run, fast, study, show, enjoy, .] |
Token-by-token comparison of the two normalisations:
| Original | Porter stem | spaCy lemma | Comment |
|---|---|---|---|
| runners | runner | runner | Same result |
| running | run | run | Same result |
| faster | faster | fast | Lemma understands “faster” is a degree of “fast”; the stemmer just chops |
| studies | studi | study | Stem is not a real word; the lemma is |
| showed | show | show | Same result |
| enjoyed | enjoy | enjoy | Same result |
```python
from nltk.stem import PorterStemmer
import spacy

ps = PorterStemmer()
nlp = spacy.load("en_core_web_sm")
tokens = ["runners", "running", "faster", "studies", "showed", "enjoyed"]

print("Stem: ", [ps.stem(w) for w in tokens])
print("Lemma: ", [t.lemma_ for t in nlp(" ".join(tokens))])
# Stem:  ['runner', 'run', 'faster', 'studi', 'show', 'enjoy']
# Lemma:  ['runner', 'run', 'fast', 'study', 'show', 'enjoy']
```

Two takeaways:
- On 4 of 6 tokens the two tools agree. Most of the time, either choice is fine.
- Stemming fails silently where its suffix rules don’t fit (`faster → faster`, `studies → studi`). If your downstream consumer is a human (search results, word clouds), lemmatisation is worth the extra cost.
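The "4 of 6" figure can be recomputed directly from the comparison table. The sketch below hard-codes the table's three columns, so it needs neither NLTK nor spaCy; the variable names are illustrative.

```python
# (original, Porter stem, spaCy lemma), copied from the comparison table.
rows = [
    ("runners", "runner", "runner"),
    ("running", "run", "run"),
    ("faster", "faster", "fast"),
    ("studies", "studi", "study"),
    ("showed", "show", "show"),
    ("enjoyed", "enjoy", "enjoy"),
]

agree = [orig for orig, stem, lemma in rows if stem == lemma]
disagree = [orig for orig, stem, lemma in rows if stem != lemma]

print(f"{len(agree)} of {len(rows)} tokens agree")
# 4 of 6 tokens agree
print("Differences:", disagree)
# Differences: ['faster', 'studies']
```

Running the same comparison on a sample of your own corpus is a cheap way to judge whether lemmatisation would change enough tokens to justify its cost.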
A complete cleaning pipeline
Putting everything from Lessons 2 and 3 together:
```python
import re
import unicodedata
import spacy
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm")
STOP = set(stopwords.words('english'))

def clean(text: str) -> list[str]:
    text = text.lower()
    # Strip accents: decompose, then drop combining marks.
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    # Replace punctuation with spaces.
    text = re.sub(r"[^\w\s]", " ", text)
    doc = nlp(text)
    return [t.lemma_ for t in doc
            if t.lemma_ not in STOP and not t.is_space and t.lemma_.strip()]

print(clean("The runners ran happily while studying NLP."))
# ['runner', 'run', 'happily', 'study', 'nlp']
```

This is the chain feeding into Bag-of-Words and TF-IDF, coming up next.
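The normalisation steps inside `clean()` can be checked in isolation, without spaCy or NLTK. The sketch below reuses the same NFKD and regex logic; the helper names `strip_accents` and `strip_punctuation` and the sample string are illustrative additions.

```python
import re
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD), then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def strip_punctuation(text):
    """Replace anything that is not a word character or whitespace with a space."""
    return re.sub(r"[^\w\s]", " ", text)

sample = "Café déjà-vu, naïve résumé!"
lowered = sample.lower()
unaccented = strip_accents(lowered)
print(unaccented)
# cafe deja-vu, naive resume!

tokens = strip_punctuation(unaccented).split()
print(tokens)
# ['cafe', 'deja', 'vu', 'naive', 'resume']
```

Note that the hyphen in "déjà-vu" is replaced by a space, so the word becomes two tokens. Whether that is desirable depends on the corpus.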
When NOT to use any of this
If you are going to feed text to a Transformer / LLM (BERT, GPT, Llama, sentence-transformers…), skip all of it. Let the model see the raw text. These models were trained with stop words and inflections; removing them throws away signal.
The whole linguistic-cleaning toolkit is for classical ML pipelines: TF-IDF + logistic regression, word clouds, simple search engines. Which is exactly what we’ll build in the next two lessons.
Key takeaways
- Stop words are extremely common words you can usually drop for BoW/TF-IDF.
- Stemming = fast, dumb chopping. Output may not be a real word.
- Lemmatisation = uses a dictionary. Output is a real word. Slower but better.
- Keep stop words for sentiment analysis and question answering; skip all three for generation, translation, or any Transformer pipeline.
- The cleaning pipeline (Lessons 2 + 3) is the input to the vectorisation pipeline (Lessons 4 + 5).
Next: Bag-of-Words — the first way to turn cleaned text into numbers.