Tokens & text normalization

What is a token?

A token is the smallest unit your model works with. A computer doesn’t see words — it sees a sequence of tokens that you decided to extract. Same sentence, three different tokenisations:

flowchart TB
  S["Learning NLP is fun!"]
  S --> W["Word-level<br/>[Learning, NLP, is, fun, !]"]
  S --> C["Character-level<br/>[L, e, a, r, n, i, n, g, ..., !]"]
  S --> Sub["Sub-word (BPE)<br/>[Learn, ing, NLP, is, fun, !]"]
  classDef src fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef tok fill:#d1fae5,stroke:#047857,color:#064e3b
  S:::src
  W:::tok
  C:::tok
  Sub:::tok

Three ways to tokenise the same sentence. Modern LLMs use sub-word tokens.
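To make the figure concrete, here is a minimal sketch of the three views in plain Python. The sub-word part is a toy: it does greedy longest-match against a hand-made vocabulary (`TOY_VOCAB` and `toy_subword` are hypothetical names, and this is not how BPE is trained), just enough to reproduce the split shown in the diagram.

```python
import re

sentence = "Learning NLP is fun!"

# Word-level: runs of word characters, or a single punctuation mark
word_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Character-level: every non-space character is its own token
char_tokens = [c for c in sentence if not c.isspace()]

# Toy sub-word vocabulary (an assumption made up for this example)
TOY_VOCAB = {"Learn", "ing", "NLP", "is", "fun", "!"}

def toy_subword(word, vocab):
    """Split `word` by repeatedly taking the longest prefix found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [p for w in word_tokens for p in toy_subword(w, TOY_VOCAB)]

print(word_tokens)     # ['Learning', 'NLP', 'is', 'fun', '!']
print(len(char_tokens))  # 17
print(subword_tokens)  # ['Learn', 'ing', 'NLP', 'is', 'fun', '!']
```

Real sub-word tokenizers learn their vocabulary from data, so their splits will differ from this hand-picked one.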

Which tokenisation when?

Granularity	Vocabulary size	Strengths	Weaknesses
Word	50k–500k	Simple, interpretable	Misses unknown words, weak on compounds
Character	under 200	Handles anything (typos, emojis)	Long sequences, weak signal per token
Sub-word (BPE, WordPiece, SentencePiece)	30k–100k	Handles unknown words by splitting them	Less interpretable for humans
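The trade-off in the table can be seen on a tiny example. The sketch below builds a word vocabulary and a character vocabulary from two made-up training sentences, then looks at a new sentence containing an unseen word. The data and variable names are assumptions for illustration only.

```python
# Two made-up training sentences
train = ["learning nlp is fun", "nlp is useful"]

word_vocab = {w for line in train for w in line.split()}
char_vocab = {c for line in train for c in line if c != " "}

new_sentence = "relearning nlp is fun"

# Word-level: the unseen word collapses to a single unknown marker
word_view = [w if w in word_vocab else "[UNK]" for w in new_sentence.split()]

# Character-level: nothing is unknown, but the sequence is much longer
char_view = [c for c in new_sentence if c != " "]

print(word_view)        # ['[UNK]', 'nlp', 'is', 'fun']
print(len(word_view))   # 4
print(len(char_view))   # 18
print(all(c in char_vocab for c in char_view))  # True
```

Sub-word tokenisation sits between the two: it would keep "nlp", "is" and "fun" whole and split only the unseen word into known pieces.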

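Today’s standard is sub-word tokenisation (used by GPT, BERT, Llama). When you ask ChatGPT a question, it sees sub-word pieces rather than words: a frequent word may stay whole, while a rarer one is split into something like ["learn", "ing"]. The exact split depends on the vocabulary the tokenizer learned. We’ll come back to this in Lesson 6.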

Simple word tokenisation in Python

For most beginner projects, word-level tokenisation is enough.

text = "Dr. Smith arrived at 8:30 a.m. — exhausted but happy!"

# Naïve split (often wrong)
print(text.split())
# ['Dr.', 'Smith', 'arrived', 'at', '8:30', 'a.m.', '—', 'exhausted', 'but', 'happy!']

# Better: a real tokenizer
import nltk
nltk.download('punkt')  # one-time download; recent NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize
print(word_tokenize(text))
# ['Dr.', 'Smith', 'arrived', 'at', '8:30', 'a.m.', '—', 'exhausted', 'but', 'happy', '!']

Notice the differences:

split() leaves happy! glued together.
word_tokenize separates the ! punctuation from the word.
Neither knows that 8:30 is a time, but both leave it in one piece, which is usually what you want.
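To see why a "real" tokenizer needs rules beyond splitting on spaces, here is a small regex-based sketch using only the standard library. The pattern `TOKEN_RE` is a hypothetical, hand-written rule set (it is not what NLTK does internally), and the sentence is a slight variation of the one above with a comma in place of the dash.

```python
import re

# Alternatives are tried left to right, so the special cases come first
TOKEN_RE = re.compile(r"""
    \d{1,2}:\d{2}        # clock times such as 8:30
  | (?:[a-z]\.){2,}      # dotted lowercase abbreviations such as a.m.
  | [A-Z][a-z]{1,2}\.    # short titles such as Dr. or Mrs.
  | \w+                  # ordinary words and numbers
  | [^\w\s]              # any single punctuation mark
""", re.VERBOSE)

sentence = "Dr. Smith arrived at 8:30 a.m., exhausted but happy!"
tokens = TOKEN_RE.findall(sentence)
print(tokens)
# ['Dr.', 'Smith', 'arrived', 'at', '8:30', 'a.m.', ',', 'exhausted', 'but', 'happy', '!']
```

Hand-written rules like these break quickly: the title rule also swallows the full stop in a short sentence-final word such as "Yes.", which is one reason to prefer a maintained tokenizer.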

For production-grade tokenisation, consider spaCy, which ships rule-based tokenizers for dozens of languages:

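import spacy

# Requires the model once: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # `text` is the sentence defined above
print([t.text for t in doc])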

Text normalization — making “Café”, “café”, “CAFE” look the same

Surface variants of the same word should look identical to your model, otherwise each spelling becomes a separate vocabulary entry. Five common normalisation steps:

flowchart TB
  A["<b>Input</b><br/>'Café — €5.00 only!'"] --> B["<b>1. lowercase</b><br/>'café — €5.00 only!'"]
  B --> C["<b>2. strip accents</b><br/>'cafe — €5.00 only!'"]
  C --> D["<b>3. remove punctuation</b><br/>'cafe  500 only'"]
  D --> E["<b>4. collapse whitespace</b><br/>'cafe 500 only'"]
  E --> F["<b>5. (optional) digits → DIGIT</b><br/>'cafe DIGIT only'"]
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  A:::step
  B:::step
  C:::step
  D:::step
  E:::step
  F:::step

A standard normalisation chain. Each step is optional — pick what your task needs.
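The chain in the figure can be traced step by step with the standard library. This is a sketch under the assumption that "remove punctuation" means deleting every character that is neither a word character nor whitespace; `normalisation_trace` is a hypothetical helper name.

```python
import re
import unicodedata

def normalisation_trace(text):
    """Apply the five steps in order and record the text after each one."""
    steps = [("input", text)]

    text = text.lower()
    steps.append(("lowercase", text))

    # Decompose accented letters, then drop the combining marks
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    steps.append(("strip accents", text))

    text = re.sub(r"[^\w\s]", "", text)
    steps.append(("remove punctuation", text))

    text = re.sub(r"\s+", " ", text).strip()
    steps.append(("collapse whitespace", text))

    text = re.sub(r"\d+", "DIGIT", text)
    steps.append(("digits to DIGIT", text))
    return steps

# \u2014 is the long dash used in the figure's input string
steps = normalisation_trace("Café \u2014 €5.00 only!")
for name, value in steps:
    print(f"{name:20} {value!r}")
```

The last recorded value is 'cafe DIGIT only', matching the final box of the diagram. Note that deleting the decimal point turns 5.00 into 500 before the digits are replaced.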

Lowercase

text = "Apple is buying U.K. startup".lower()
# "apple is buying u.k. startup"

Watch out: lowercasing destroys the signal that “Apple” (company) is different from “apple” (fruit). For Named Entity Recognition, keep the original casing.
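A quick way to see both sides of this trade-off is to count distinct word forms before and after lowercasing. The three sentences below are made up for illustration.

```python
docs = [
    "Apple is buying a startup",
    "I ate an apple",
    "APPLE shares rose",
]

# Vocabulary with and without lowercasing
cased = {w for d in docs for w in d.split()}
lowered = {w.lower() for d in docs for w in d.split()}

print(len(cased))    # 12
print(len(lowered))  # 10

# The three surface forms collapse into one entry...
print(sorted(w for w in cased if w.lower() == "apple"))  # ['APPLE', 'Apple', 'apple']
# ...which also erases the company/fruit distinction
print("apple" in lowered)  # True
```

The smaller vocabulary helps a bag-of-words classifier, while the lost casing is exactly the cue an entity recogniser would have used.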

Strip accents (or not!)

import unicodedata

def strip_accents(s):
    return "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))

print(strip_accents("Café à Montréal"))  # 'Cafe a Montreal'

In French, removing accents merges "a" (verb form) with "à" (preposition). Sometimes helpful (less sparsity), sometimes harmful (loss of meaning). Test both.
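The sketch below shows the merging effect on a few French words that differ only by accents, and a gentler alternative: Unicode NFC normalisation, which makes visually identical strings compare equal without removing any accent. The word list is chosen for illustration.

```python
import unicodedata

def strip_accents(s):
    return "".join(c for c in unicodedata.normalize("NFKD", s)
                   if not unicodedata.combining(c))

# Different French words that differ only by their accents
words = ["cote", "côte", "côté", "coté"]
stripped = {strip_accents(w) for w in words}
print(stripped)  # {'cote'}

# Middle ground: keep accents, but unify how they are encoded.
composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

If accents carry meaning in your language, NFC alone may be the safer default, with accent stripping tested as an optional extra.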

Remove punctuation

import re
text = re.sub(r"[^\w\s]", "", "Hello, world! 5+3=8")
# "Hello world 538"

Useful for classification. Harmful for translation, generation, or anything where a ? matters.
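When some punctuation does matter, one option is to delete everything except a chosen set of marks. The helper below is a sketch; `remove_punct` and its `keep` parameter are hypothetical names, not part of any library.

```python
import re

def remove_punct(text, keep=""):
    """Delete punctuation, except the characters listed in `keep`."""
    # re.escape makes characters such as '?' or ']' safe inside the class
    pattern = rf"[^\w\s{re.escape(keep)}]"
    return re.sub(pattern, "", text)

print(remove_punct("Is this spam? Yes, it is!"))
# Is this spam Yes it is
print(remove_punct("Is this spam? Yes, it is!", keep="?!"))
# Is this spam? Yes it is!

# \w includes the underscore, so it survives; the apostrophe does not
print(remove_punct("snake_case isn't split"))
# snake_case isnt split
```

The last line is worth remembering: stripping apostrophes turns "isn't" into "isnt", which may or may not be what your task needs.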

Collapse whitespace

text = re.sub(r"\s+", " ", "hello   \t world\n").strip()
# "hello world"

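Cheap and almost always safe. The exceptions are inputs where line breaks or indentation carry meaning, such as code or poetry.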
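If line breaks carry meaning in your data (one sentence per line, for example), a variant that collapses spaces and tabs but keeps the line structure may be preferable. This is a sketch; `collapse_keep_lines` is a hypothetical helper name.

```python
import re

def collapse_keep_lines(text):
    """Collapse spaces and tabs within each line, and drop empty lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

raw = "hello   \t world\n\n  second   line \n"
print(repr(collapse_keep_lines(raw)))
# 'hello world\nsecond line'

# For comparison, the one-line version flattens everything
print(repr(re.sub(r"\s+", " ", raw).strip()))
# 'hello world second line'
```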

Handle digits

If “I’m 30 years old” and “I’m 31 years old” should be treated as the same kind of sentence, replace each run of digits with a placeholder token. The name is arbitrary: NUM here, DIGIT in the diagram above.

re.sub(r"\d+", "NUM", "I'm 30 years old")
# "I'm NUM years old"
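One thing to watch with the simple `\d+` pattern is that decimals and thousands separators produce several placeholders. A slightly wider pattern, shown here as a sketch with made-up input, treats such a number as one unit.

```python
import re

text = "It costs 5.00 or 1,250"

# Simple pattern: each run of digits is replaced separately
print(re.sub(r"\d+", "NUM", text))
# It costs NUM.NUM or NUM,NUM

# Wider pattern: optionally absorb one '.' or ',' followed by more digits
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")
print(NUMBER_RE.sub("NUM", text))
# It costs NUM or NUM
```

The wider pattern is still a heuristic: it handles a single separator only, and it will also merge things like "3,4" in a list of numbers written without spaces.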

Special tokens — the ones you keep around

Some tokens are useful even though they aren’t real words:

Token	Purpose
`[PAD]`	Pad short sequences to a fixed length
`[UNK]`	Unknown word (out-of-vocabulary)
`[CLS]`	Marker placed at the start of the input; its output is used for classification (BERT)
`[SEP]`	Separator between two sentences (BERT)
`[MASK]`	A word the model must guess (BERT training)
`<bos>` / `<eos>`	Beginning / end of sequence (GPT, Llama; the exact spelling varies by model)

You won’t add these by hand for a TF-IDF model, but knowing they exist makes Lesson 6 much easier.
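To get a feel for how these markers are used, here is a toy encoder that maps tokens to integer ids, falls back to the unknown token, and pads to a fixed length. The vocabulary, the ids and the `encode` function are all made up for this sketch; real tokenizers define their own.

```python
PAD, UNK, BOS, EOS = "[PAD]", "[UNK]", "<bos>", "<eos>"

# Toy vocabulary: special tokens first, then three ordinary words
vocab = {PAD: 0, UNK: 1, BOS: 2, EOS: 3, "nlp": 4, "is": 5, "fun": 6}

def encode(tokens, vocab, max_len):
    """Wrap tokens in <bos>/<eos>, map unknown words to [UNK], pad to max_len."""
    ids = [vocab[BOS]] + [vocab.get(t, vocab[UNK]) for t in tokens] + [vocab[EOS]]
    ids = ids[:max_len]  # naive truncation; it may cut off <eos>
    return ids + [vocab[PAD]] * (max_len - len(ids))

print(encode(["nlp", "is", "great"], vocab, max_len=8))
# [2, 4, 5, 1, 3, 0, 0, 0]
print(encode(["nlp", "is", "fun"], vocab, max_len=4))
# [2, 4, 5, 6]
```

In the first call, "great" is not in the vocabulary and becomes id 1, and three padding ids fill the sequence up to length 8.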

A complete mini-pipeline

import re, unicodedata
import nltk
nltk.download('punkt', quiet=True)  # recent NLTK releases may ask for 'punkt_tab' instead
from nltk.tokenize import word_tokenize

def normalise(text: str) -> list[str]:
    text = text.lower()
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return word_tokenize(text)

print(normalise("Café à Montréal — €5.00!"))
# ['cafe', 'a', 'montreal', '5', '00']
# (the '.' in 5.00 is replaced by a space, so the price splits in two)

You’ll plug a step like this into every classical NLP project.
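As an illustration of "plugging it in", the sketch below feeds normalised tokens into a simple word count. To stay dependency-free it uses `str.split()` instead of `word_tokenize`, which is an assumption that works here only because punctuation has already been replaced by spaces; `normalise_simple` is a hypothetical name and the documents are made up.

```python
import re
import unicodedata
from collections import Counter

def normalise_simple(text):
    """Lowercase, strip accents, turn punctuation into spaces, split on whitespace."""
    text = text.lower()
    text = "".join(c for c in unicodedata.normalize("NFKD", text)
                   if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()  # split() also collapses repeated whitespace

docs = ["Café à Montréal!", "CAFE in Montreal?", "A café, a CAFÉ."]
counts = Counter(tok for d in docs for tok in normalise_simple(d))

print(counts["cafe"])      # 4
print(counts["montreal"])  # 2
print(counts["a"])         # 3
```

All four spellings of the word end up in the single entry "cafe". The count for "a" also shows the accent trade-off at work: it mixes the French preposition "à" with the English article.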

Key takeaways

A token is the unit your model sees — word, character, or sub-word.
Sub-word tokenisation (BPE, WordPiece, SentencePiece) is the modern standard (GPT, BERT, Llama).
Normalisation is a chain: lowercase, strip accents, remove punctuation, collapse whitespace (and sometimes digits).
Every choice (lowercase? accents?) is a trade-off — test on your task.
Modern models still use the same building blocks plus special tokens ([CLS], [SEP], <eos>).

Next: Stop words, stemming, lemmatization — three linguistic ways to make your vocabulary smaller.