Tokens & text normalization
What is a token?
A token is the smallest unit your model works with. A computer doesn’t see words; it sees a sequence of tokens that you decided to extract. Same sentence, three different tokenisations:
flowchart TB
  S["Learning NLP is fun!"]
  S --> W["Word-level<br/>[Learning, NLP, is, fun, !]"]
  S --> C["Character-level<br/>[L, e, a, r, n, i, n, g, ..., !]"]
  S --> Sub["Sub-word (BPE)<br/>[Learn, ing, NLP, is, fun, !]"]
  classDef src fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef tok fill:#d1fae5,stroke:#047857,color:#064e3b
  S:::src
  W:::tok
  C:::tok
  Sub:::tok
Which tokenisation when?
| Granularity | Vocabulary size | Strengths | Weaknesses |
|---|---|---|---|
| Word | 50k–500k | Simple, interpretable | Misses unknown words, weak on compounds |
| Character | under 200 | Handles anything (typos, emojis) | Long sequences, weak signal per token |
| Sub-word (BPE, WordPiece, SentencePiece) | 30k–100k | Handles unknown words by splitting them | Less interpretable for humans |
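Today’s standard: sub-word tokenisation (used by GPT, BERT, Llama). When you ask ChatGPT a question, it doesn’t see words; it sees sub-word pieces, so “learning” might become something like ["learn", "ing"]. The exact split depends on the tokenizer’s learned vocabulary, and frequent words are often kept as a single token. We’ll come back to this in Lesson 6.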
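The three granularities can be sketched in a few lines of plain Python. This is only an illustration: the sub-word part uses a greedy longest-match lookup against a tiny hand-written vocabulary (`TOY_VOCAB` and `toy_subword` are hypothetical names), not a trained BPE model, whose vocabulary would be learned from a corpus.

```python
import re

sentence = "Learning NLP is fun!"

# Word-level: runs of word characters, or single punctuation marks.
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Character-level: every character except whitespace.
char_tokens = [ch for ch in sentence if not ch.isspace()]

# Toy sub-word split (assumption: a tiny hand-written vocabulary, not a trained one).
TOY_VOCAB = {"Learn", "ing", "NLP", "is", "fun", "!"}

def toy_subword(word, vocab=TOY_VOCAB):
    """Split a word greedily into the longest pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [piece for w in word_tokens for piece in toy_subword(w)]

print(word_tokens)     # ['Learning', 'NLP', 'is', 'fun', '!']
print(len(char_tokens))  # 17
print(subword_tokens)  # ['Learn', 'ing', 'NLP', 'is', 'fun', '!']
```

The character-level list is more than three times longer than the word-level one for the same sentence, which is the "long sequences" weakness from the table.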
Simple word tokenisation in Python
For most beginner projects, word-level tokenisation is enough.
text = "Dr. Smith arrived at 8:30 a.m. — exhausted but happy!"

# Naïve split on whitespace (often wrong)
print(text.split())
# ['Dr.', 'Smith', 'arrived', 'at', '8:30', 'a.m.', '—', 'exhausted', 'but', 'happy!']

# Better: a real tokenizer
import nltk
# If word_tokenize raises a LookupError mentioning 'punkt_tab',
# download that resource instead: nltk.download('punkt_tab')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
print(word_tokenize(text))
# ['Dr.', 'Smith', 'arrived', 'at', '8:30', 'a.m.', '—', 'exhausted', 'but', 'happy', '!']

Notice the differences:

- split() leaves happy! glued together.
- word_tokenize separates the ! punctuation from the word.
- Neither knows that 8:30 is a single time; both leave it whole, which is usually what you want.
For production-grade tokenisation, use spaCy (more accurate, handles dozens of languages):
import spacy

# Install the model once with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([t.text for t in doc])

Text normalization: making “Café”, “café”, “CAFE” look the same
Two documents containing the same idea must look identical to your model. Five common normalisation steps:
flowchart TB
  A["<b>Input</b><br/>'Café — €5.00 only!'"] --> B["<b>1. lowercase</b><br/>'café — €5.00 only!'"]
  B --> C["<b>2. strip accents</b><br/>'cafe — €5.00 only!'"]
  C --> D["<b>3. remove punctuation</b><br/>'cafe  500 only'"]
  D --> E["<b>4. collapse whitespace</b><br/>'cafe 500 only'"]
  E --> F["<b>5. (optional) digits → DIGIT</b><br/>'cafe DIGIT only'"]
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  A:::step
  B:::step
  C:::step
  D:::step
  E:::step
  F:::step
Lowercase
text = "Apple is buying U.K. startup".lower()
# "apple is buying u.k. startup"

Watch out: lowercasing destroys the signal that “Apple” (company) is different from “apple” (fruit). For Named Entity Recognition, keep the original casing.
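To make the trade-off visible, here is a small sketch that counts tokens before and after lowercasing. The example sentence is made up for illustration.

```python
from collections import Counter

tokens = "Apple sells apple juice and Apple pie".split()

cased = Counter(tokens)                            # 'Apple' and 'apple' stay distinct
lowered = Counter(tok.lower() for tok in tokens)   # both collapse into 'apple'

print(len(cased), len(lowered))                    # 6 5
print(cased["Apple"], cased["apple"], lowered["apple"])  # 2 1 3
```

The vocabulary shrinks by one entry, and the company/fruit distinction disappears with it.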
Strip accents (or not!)
import unicodedata

def strip_accents(s):
    # NFKD splits each accented character into a base character plus combining marks
    decomposed = unicodedata.normalize("NFKD", s)
    # Drop the combining marks, keep the base characters
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents("Café à Montréal"))  # 'Cafe a Montreal'

In French, removing accents merges "a" (verb form) with "à" (preposition). Sometimes helpful (less sparsity), sometimes harmful (loss of meaning). Test both.
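Accents cause trouble even before you decide to strip them, because the same visible character can be stored in two ways. The sketch below shows that with the standard `unicodedata` module. The `form` parameter on `strip_accents` is an addition made for this illustration so that NFD and NFKD can be compared.

```python
import unicodedata

def strip_accents(s: str, form: str = "NFKD") -> str:
    # `form` is an illustrative extra parameter, not part of the lesson's version.
    decomposed = unicodedata.normalize(form, s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

composed = "caf\u00e9"     # 'é' stored as one code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(strip_accents(composed) == strip_accents(decomposed))  # True

# NFKD also rewrites "compatibility" characters such as the 'fi' ligature (U+FB01).
ligature = "\ufb01n"
print(strip_accents(ligature, "NFD") == ligature)  # True: NFD leaves the ligature alone
print(strip_accents(ligature, "NFKD"))             # fin
```

If you only want to remove accents and leave ligatures, superscripts, and similar characters untouched, NFD is the narrower choice.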
Remove punctuation
import re

text = re.sub(r"[^\w\s]", "", "Hello, world! 5+3=8")
# "Hello world 538"

Replacing with an empty string glues 5+3=8 into 538; substitute a space instead if the pieces should stay separate. Useful for classification. Harmful for translation, generation, or anything where a ? matters.
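A common alternative is `str.translate` with `string.punctuation`. The sketch below compares the two approaches on the sample string from the normalisation diagram. `string.punctuation` only lists ASCII punctuation, so the dash and the euro sign survive it.

```python
import re
import string

sample = "Café — €5.00 only!"

# ASCII-only removal: string.punctuation contains ASCII punctuation and nothing else.
ascii_only = sample.translate(str.maketrans("", "", string.punctuation))

# Unicode-aware removal: for str patterns, \w and \s match Unicode letters and spaces.
unicode_aware = re.sub(r"[^\w\s]", "", sample)

print(repr(ascii_only))     # 'Café — €500 only'
print(repr(unicode_aware))  # 'Café  500 only'
```

The regex version leaves a double space where the dash used to be, which is why whitespace collapsing usually follows punctuation removal.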
Collapse whitespace
import re

text = re.sub(r"\s+", " ", "hello \t world\n").strip()
# "hello world"

Cheap, and safe for almost any task where layout carries no meaning.
Handle digits
If “I’m 30 years old” and “I’m 31 years old” should be treated as the same kind of sentence, replace digits:
import re

re.sub(r"\d+", "NUM", "I'm 30 years old")
# "I'm NUM years old"

Special tokens: the ones you keep around
Some tokens are useful even though they aren’t real words:
| Token | Purpose |
|---|---|
| [PAD] | Pad short sequences to a fixed length |
| [UNK] | Unknown word (out-of-vocabulary) |
| [CLS] | “Start of sentence” marker (BERT) |
| [SEP] | Separator between two sentences (BERT) |
| [MASK] | A word the model must guess (BERT training) |
| <bos> / <eos> | Beginning / end of sentence (GPT, Llama; the exact spelling varies by model) |
You won’t add these by hand for a TF-IDF model, but knowing they exist makes Lesson 6 much easier.
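As a rough sketch of what [PAD] and [UNK] do in practice, the snippet below maps tokens to integer ids and pads every sequence to the same length. The vocabulary, the id numbers, and the `encode` helper are all hypothetical; real models ship their own vocabulary and tokenizer.

```python
PAD, UNK = "[PAD]", "[UNK]"

# Hypothetical toy vocabulary; real vocabularies are learned from a corpus.
vocab = {PAD: 0, UNK: 1, "learning": 2, "nlp": 3, "is": 4, "fun": 5}

def encode(tokens, max_len):
    """Map tokens to ids, truncate to max_len, then pad on the right."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]  # unknown words -> [UNK]
    ids = ids[:max_len]
    ids += [vocab[PAD]] * (max_len - len(ids))            # short sequences -> [PAD]
    return ids

batch = [["learning", "nlp", "is", "fun"], ["nlp", "is", "hard"]]
encoded = [encode(tokens, max_len=5) for tokens in batch]

print(encoded)  # [[2, 3, 4, 5, 0], [3, 4, 1, 0, 0]]
```

Both rows now have length 5, and the out-of-vocabulary word "hard" shows up as id 1.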
A complete mini-pipeline
import re, unicodedata
import nltk
nltk.download('punkt', quiet=True)
from nltk.tokenize import word_tokenize

def normalise(text: str) -> list[str]:
    # 1. lowercase
    text = text.lower()
    # 2. strip accents
    text = "".join(c for c in unicodedata.normalize("NFKD", text) if not unicodedata.combining(c))
    # 3. replace punctuation with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # 4. collapse whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return word_tokenize(text)

print(normalise("Café à Montréal — €5.00!"))
# ['cafe', 'a', 'montreal', '5', '00']
# The decimal point counts as punctuation, so "5.00" is split into '5' and '00'.

You’ll plug a step like this into every classical NLP project.
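The pipeline above treats every punctuation character the same way, so a decimal such as 5.00 cannot come out as one token. If prices or measurements matter for your task, one possible variant is to extract tokens with a regular expression that recognises numbers first. This is a dependency-free sketch; `TOKEN_RE` and `normalise_keep_decimals` are names invented for the example.

```python
import re
import unicodedata

# Numbers with optional decimal or thousands groups first, then plain word runs.
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)*|\w+")

def normalise_keep_decimals(text: str) -> list[str]:
    text = text.lower()
    # Strip accents: decompose, then drop the combining marks.
    text = "".join(
        c for c in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(c)
    )
    # Anything the pattern does not match (punctuation, symbols, spaces) is skipped.
    return TOKEN_RE.findall(text)

print(normalise_keep_decimals("Café à Montréal — €5.00!"))
# ['cafe', 'a', 'montreal', '5.00']
```

Because `findall` only returns what matches, punctuation removal and whitespace collapsing happen implicitly. The cost is that every token shape you care about has to be spelled out in the pattern.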
Key takeaways
- A token is the unit your model sees: word, character, or sub-word.
- Sub-word tokenisation (BPE, WordPiece, SentencePiece) is the modern standard (GPT, BERT, Llama).
- Normalisation is a chain: lowercase, strip accents, remove punctuation, collapse whitespace (and sometimes digits).
- Every choice (lowercase? accents?) is a trade-off — test on your task.
- Modern models still use the same building blocks plus special tokens ([CLS], [SEP], <eos>).
Next: Stop words, stemming, lemmatization — three linguistic ways to make your vocabulary smaller.