Tokens & text normalization

A token is the smallest unit your model works with. A computer doesn’t see words — it sees a sequence of tokens that you decided to extract. Same sentence, three different tokenisations:

flowchart TB
  S["Learning NLP is fun!"]
  S --> W["Word-level<br/>[Learning, NLP, is, fun, !]"]
  S --> C["Character-level<br/>[L, e, a, r, n, i, n, g, ..., !]"]
  S --> Sub["Sub-word (BPE)<br/>[Learn, ing, NLP, is, fun, !]"]
  classDef src fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef tok fill:#d1fae5,stroke:#047857,color:#064e3b
  S:::src
  W:::tok
  C:::tok
  Sub:::tok
Three ways to tokenise the same sentence. Modern LLMs use sub-word tokens.

| Granularity | Vocabulary size | Strengths | Weaknesses |
|---|---|---|---|
| Word | 50k–500k | Simple, interpretable | Misses unknown words, weak on compounds |
| Character | under 200 | Handles anything (typos, emojis) | Long sequences, weak signal per token |
| Sub-word (BPE, WordPiece, SentencePiece) | 30k–100k | Handles unknown words by splitting them | Less interpretable for humans |
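
The trade-off between word-level and character-level tokens can be seen with nothing but the standard library. The sketch below is only an illustration: the "unseen" word `funnier` is a made-up example, and the vocabularies are built from a single sentence rather than a real corpus.

```python
sentence = "Learning NLP is fun!"

# Word-level (naive): split on whitespace.
word_tokens = sentence.split()
# Character-level: every character, spaces included, becomes a token.
char_tokens = list(sentence)

print(len(word_tokens), word_tokens)  # 4 ['Learning', 'NLP', 'is', 'fun!']
print(len(char_tokens))               # 20

# Build a vocabulary at each granularity from this one sentence.
word_vocab = set(word_tokens)
char_vocab = set(char_tokens)

# Hypothetical unseen word: unknown at word level, fully covered at character level.
new_word = "funnier"
print(new_word in word_vocab)                  # False
print(all(c in char_vocab for c in new_word))  # True
```

The character sequence is five times longer, but it never runs into an unknown word. Sub-word tokenisation sits between the two.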

Today’s standard: sub-word tokenisation (used by GPT, BERT, Llama). When you ask ChatGPT a question, it doesn’t see “learning” — it sees something like ["learn", "ing"]. We’ll come back to this in Lesson 6.
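
To make the idea of splitting a word into known pieces concrete, here is a toy sketch. It uses a greedy longest-match split over a tiny hand-written vocabulary, which is closer to how WordPiece applies its vocabulary than to real BPE merging, and it is not what any production tokenizer does exactly. The `VOCAB` set and the `subword_split` function are hypothetical names chosen for illustration.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of pieces.
VOCAB = {"learn", "ing", "fun", "is", "nlp", "!"}

def subword_split(word, vocab=VOCAB):
    """Greedily take the longest vocabulary piece at each position."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches here: give up on the whole word.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces

print(subword_split("learning"))  # ['learn', 'ing']
print(subword_split("xyz"))       # ['[UNK]']
```

A word the vocabulary has never seen as a whole (`learning`) is still representable, as long as its pieces are known.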

For most beginner projects, word-level tokenisation is enough.

text = "Dr. Smith arrived at 8:30 a.m. — exhausted but happy!"
# Naïve split (often wrong)
print(text.split())
# ['Dr.', 'Smith', 'arrived', 'at', '8:30', 'a.m.', '—', 'exhausted', 'but', 'happy!']
# Better: a real tokenizer
import nltk
nltk.download('punkt')  # on NLTK 3.9+, the resource is named 'punkt_tab'
from nltk.tokenize import word_tokenize
print(word_tokenize(text))
# ['Dr.', 'Smith', 'arrived', 'at', '8:30', 'a.m.', '—', 'exhausted', 'but', 'happy', '!']

Notice the differences:

  • split() leaves happy! glued together.
  • word_tokenize separates the ! punctuation from the word.
  • Neither tokenizer knows that 8:30 is a time, but both leave it whole, which is usually what you want.
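
It is tempting to fix `split()` with a one-line regular expression. The sketch below (standard library only, and a deliberately naive pattern) shows why that is not enough: it separates the `!`, but it also breaks abbreviations and the time apart.

```python
import re

text = "Dr. Smith arrived at 8:30 a.m. — exhausted but happy!"

# Naive rule: a token is either a run of word characters
# or a single character that is neither a word character nor whitespace.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Dr', '.', 'Smith', 'arrived', 'at', '8', ':', '30', 'a', '.', 'm', '.',
#  '—', 'exhausted', 'but', 'happy', '!']
```

Handling `Dr.`, `a.m.` and `8:30` correctly needs exception rules, which is what a dedicated tokenizer provides.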

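For production-grade tokenisation, use spaCy, whose rule-based tokenizer ships with language-specific exceptions and covers dozens of languages:

# One-time setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load("en_core_web_sm")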
doc = nlp(text)
print([t.text for t in doc])

Text normalization — making “Café”, “café”, “CAFE” look the same

Two documents that express the same idea should look as similar as possible to your model. Five common normalisation steps:

flowchart TB
  A["<b>Input</b><br/>'Café — €5.00 only!'"] --> B["<b>1. lowercase</b><br/>'café — €5.00 only!'"]
  B --> C["<b>2. strip accents</b><br/>'cafe — €5.00 only!'"]
  C --> D["<b>3. remove punctuation</b><br/>'cafe  500 only'"]
  D --> E["<b>4. collapse whitespace</b><br/>'cafe 500 only'"]
  E --> F["<b>5. (optional) digits → NUM</b><br/>'cafe NUM only'"]
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  A:::step
  B:::step
  C:::step
  D:::step
  E:::step
  F:::step
A standard normalisation chain. Each step is optional — pick what your task needs.
text = "Apple is buying U.K. startup".lower()
# "apple is buying u.k. startup"

Watch out: lowercasing destroys the signal that “Apple” (company) is different from “apple” (fruit). For Named Entity Recognition, you do not lowercase.
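
The loss of signal is easy to see by counting tokens before and after lowercasing. This is a minimal sketch with a made-up sentence:

```python
from collections import Counter

# Hypothetical sentence mixing the company and the fruit.
tokens = "Apple sells an apple a day".split()

cased = Counter(tokens)
lowered = Counter(t.lower() for t in tokens)

print(cased["Apple"], cased["apple"])  # 1 1
print(lowered["apple"])                # 2
```

After lowercasing, the two meanings share one vocabulary entry, and nothing downstream can tell them apart.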

import unicodedata
def strip_accents(s):
    # Decompose each character (NFKD), then drop the combining accent marks.
    return "".join(
        c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c)
    )
print(strip_accents("Café à Montréal")) # 'Cafe a Montreal'

In French, removing accents merges "a" (verb form) with "à" (preposition). Sometimes helpful (less sparsity), sometimes harmful (loss of meaning). Test both.
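
Two things are worth checking yourself. First, the same visible string can be stored in two Unicode forms that do not compare equal until they are normalised. Second, stripping accents does merge "a" and "à". The sketch below repeats `strip_accents` so it runs on its own:

```python
import unicodedata

def strip_accents(s):
    # Decompose each character (NFKD), then drop the combining accent marks.
    return "".join(
        c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c)
    )

composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

print(composed == decomposed)                                # False
print(strip_accents(composed) == strip_accents(decomposed))  # True

# The trade-off from the text: the verb form and the preposition collapse.
print(strip_accents("à") == "a")                             # True
```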

import re
text = re.sub(r"[^\w\s]", "", "Hello, world! 5+3=8")
# "Hello world 538"  (the + and = are removed too, so 5, 3 and 8 run together)

Useful for classification. Harmful for translation, generation, or anything where a ? matters.
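
When some punctuation carries meaning for your task, one option is to whitelist it. The helper below is a hypothetical sketch (`remove_punctuation` and its `keep` parameter are not part of any library) that strips everything except the characters you ask it to keep:

```python
import re

def remove_punctuation(text, keep=""):
    # Remove every character that is not a word character, whitespace,
    # or one of the characters listed in `keep`.
    pattern = rf"[^\w\s{re.escape(keep)}]"
    return re.sub(pattern, "", text)

print(remove_punctuation("Really? Yes, really!"))             # Really Yes really
print(remove_punctuation("Really? Yes, really!", keep="?!"))  # Really? Yes really!
```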

text = re.sub(r"\s+", " ", "hello \t world\n").strip()
# "hello world"

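Cheap, and safe for almost any task; the exception is text where line breaks or indentation carry meaning, such as source code.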

If “I’m 30 years old” and “I’m 31 years old” should be treated as the same kind of sentence, replace digits:

re.sub(r"\d+", "NUM", "I'm 30 years old")
# "I'm NUM years old"
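
Here are the five steps applied in order to the input from the diagram, with each intermediate result kept in its own variable so you can inspect it. It is a sketch using the same regular expressions as the snippets above, with `NUM` as the digit placeholder.

```python
import re
import unicodedata

text = "Café — €5.00 only!"

# 1. Lowercase.
step1 = text.lower()
# 2. Strip accents: decompose, then drop combining marks.
step2 = "".join(
    c for c in unicodedata.normalize("NFKD", step1) if not unicodedata.combining(c)
)
# 3. Remove punctuation (this also removes the dash, the currency sign and the dot).
step3 = re.sub(r"[^\w\s]", "", step2)
# 4. Collapse runs of whitespace.
step4 = re.sub(r"\s+", " ", step3).strip()
# 5. (Optional) replace digit runs with a placeholder.
step5 = re.sub(r"\d+", "NUM", step4)

for name, value in [("1", step1), ("2", step2), ("3", step3), ("4", step4), ("5", step5)]:
    print(name, repr(value))
# 1 'café — €5.00 only!'
# 2 'cafe — €5.00 only!'
# 3 'cafe  500 only'
# 4 'cafe 500 only'
# 5 'cafe NUM only'
```

Order matters: replacing digits before removing punctuation would turn `5.00` into `NUM.NUM` and then `NUMNUM`.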

Special tokens — the ones you keep around

Some tokens are useful even though they aren’t real words:

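| Token | Purpose |
|---|---|
| [PAD] | Pad short sequences to a fixed length |
| [UNK] | Unknown word (out-of-vocabulary) |
| [CLS] | Marker placed at the start of the input; its output is used for classification (BERT) |
| [SEP] | Separator between two sentences (BERT) |
| [MASK] | A word the model must guess (BERT training) |
| <bos> / <eos> | Beginning / end of sequence (GPT, Llama; the exact spelling varies by model) |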

You won’t add these by hand for a TF-IDF model, but knowing they exist makes Lesson 6 much easier.
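
To see what [PAD] and [UNK] do in practice, here is a minimal sketch that turns token lists into fixed-length lists of ids. The vocabulary, the id numbers and the `encode` function are all hypothetical; real tokenizers define their own.

```python
PAD, UNK = "[PAD]", "[UNK]"

# Hypothetical toy vocabulary mapping tokens to integer ids.
vocab = {PAD: 0, UNK: 1, "i": 2, "like": 3, "nlp": 4}

def encode(tokens, max_len):
    # Unknown tokens map to the [UNK] id; long inputs are truncated.
    ids = [vocab.get(t, vocab[UNK]) for t in tokens][:max_len]
    # Short inputs are padded on the right with the [PAD] id.
    return ids + [vocab[PAD]] * (max_len - len(ids))

print(encode(["i", "like", "nlp"], 5))    # [2, 3, 4, 0, 0]
print(encode(["i", "like", "pizza"], 5))  # [2, 3, 1, 0, 0]
```

Every sequence comes out with the same length, which is what lets a model process them as one batch.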

import re
import unicodedata
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)  # on NLTK 3.9+, the resource is named 'punkt_tab'
def normalise(text: str) -> list[str]:
    text = text.lower()                        # 1. lowercase
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))  # 2. strip accents
    text = re.sub(r"[^\w\s]", " ", text)       # 3. punctuation becomes a space
    text = re.sub(r"\s+", " ", text).strip()   # 4. collapse whitespace
    return word_tokenize(text)
print(normalise("Café à Montréal — €5.00!"))
# ['cafe', 'a', 'montreal', '5', '00']  (the "." in 5.00 counts as punctuation, so the price splits)

You’ll plug a step like this into every classical NLP project.
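
As a quick check that normalisation does what the section title promises, the sketch below feeds “Café”, “café” and “CAFE” through a standard-library-only variant of the same chain and counts the resulting tokens. The `normalise_simple` function and the three mini-documents are made up for illustration, and whitespace splitting stands in for a real tokenizer.

```python
import re
import unicodedata
from collections import Counter

def normalise_simple(text):
    text = text.lower()
    text = "".join(
        c for c in unicodedata.normalize("NFKD", text) if not unicodedata.combining(c)
    )
    text = re.sub(r"[^\w\s]", " ", text)
    # split() with no argument also collapses repeated whitespace.
    return text.split()

# Hypothetical mini-corpus with three spellings of the same word.
docs = ["Café open!", "café OPEN", "CAFE closed."]

counts = Counter(tok for doc in docs for tok in normalise_simple(doc))
print(counts.most_common())  # [('cafe', 3), ('open', 2), ('closed', 1)]
```

Without normalisation, the same corpus would produce three separate entries for the café alone.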

  • A token is the unit your model sees — word, character, or sub-word.
  • Sub-word tokenisation (BPE, WordPiece, SentencePiece) is the modern standard (GPT, BERT, Llama).
  • Normalisation is a chain: lowercase, strip accents, remove punctuation, collapse whitespace, and optionally replace digits with a placeholder.
  • Every choice (lowercase? accents?) is a trade-off — test on your task.
  • Modern models still use the same building blocks plus special tokens ([CLS], [SEP], <eos>).

Next: Stop words, stemming, lemmatization — three linguistic ways to make your vocabulary smaller.