What is NLP?

NLP, NLU, NLG — three letters that get confused

You’ll hear them everywhere. They’re not synonyms.

Acronym	Stands for	What it does	Example
NLP	Natural Language Processing	The whole field — anything a computer does with human language	All of the below
NLU	Natural Language Understanding	Extract meaning from text	Detect sentiment, intent, entities
NLG	Natural Language Generation	Produce text from data	“The weather tomorrow will be sunny.”

NLP is the umbrella. NLU and NLG are two pieces under it. A chatbot does both: it understands what you wrote (NLU) then generates a reply (NLG).
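To make the split concrete, here is a minimal sketch of a toy chatbot with one function per half. The keyword lists, the reply templates, and the names `understand`, `generate` and `reply` are all invented for this illustration; a real system would use trained models for both steps.

```python
import re

# Hypothetical keyword lists, invented for this illustration only.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "sunny", "forecast"},
    "greeting": {"hello", "hi", "hey"},
}

# Hypothetical reply templates, one per intent.
REPLY_TEMPLATES = {
    "weather": "The weather tomorrow will be {forecast}.",
    "greeting": "Hello! How can I help?",
    "unknown": "Sorry, I did not understand that.",
}


def understand(message):
    """NLU step: map free text to an intent label."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:  # at least one keyword appears in the message
            return intent
    return "unknown"


def generate(intent, forecast="sunny"):
    """NLG step: turn structured data (intent + forecast) into a sentence."""
    return REPLY_TEMPLATES[intent].format(forecast=forecast)


def reply(message):
    """Full chatbot turn: understand first, then generate."""
    return generate(understand(message))


print(reply("What's the weather tomorrow?"))  # The weather tomorrow will be sunny.
print(reply("hi there"))                      # Hello! How can I help?
```

The forecast value is hard-coded here; in practice it would come from a data source, which is exactly the "text from data" part of NLG.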

The classical NLP pipeline

Before Transformers ate everything, an NLP project looked like a small assembly line:

flowchart TB
  R["<b>Raw text</b><br/>'The cats are running...'"] --> N["<b>Normalise</b><br/>lowercase, accents, punctuation"]
  N --> T["<b>Tokenise</b><br/>['the', 'cats', 'are', 'running']"]
  T --> S["<b>Remove stop words</b><br/>['cats', 'running']"]
  S --> L["<b>Stem / Lemmatise</b><br/>['cat', 'run']"]
  L --> V["<b>Vectorise</b><br/>[0, 1, 0, ..., 2, 0]"]
  V --> M["<b>Model</b><br/>classifier, search, similarity..."]
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  R:::step
  N:::step
  T:::step
  S:::step
  L:::step
  V:::step
  M:::step

The pre-Transformer NLP pipeline. Each step is covered in a lesson of this Part.

The boxes map to lessons as follows:

Normalise + Tokenise → Lesson 2.
Stop words + Stemming + Lemmatisation → Lesson 3.
Vectorise (Bag-of-Words) → Lesson 4.
Weight terms (TF-IDF) → Lesson 5.
Modern alternative (embeddings + Transformers) → Lesson 6.
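The assembly line in the flowchart can be sketched in a few lines of standard-library Python. This is a rough illustration, not what the later lessons will use: the stop-word list is a tiny made-up sample, `crude_stem` is a deliberately naive stand-in for a real stemmer, and the four-word vocabulary is hypothetical.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "are", "is", "of", "and"}


def normalise(text):
    """Lowercase, strip accents, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9\s]", " ", text)


def tokenise(text):
    """Split on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Very rough suffix stripping, NOT a real stemmer such as Porter's."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Undo a doubled final consonant: "runn" -> "run".
        if len(token) > 2 and token[-1] == token[-2]:
            token = token[:-1]
    elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token


def preprocess(text):
    """Normalise -> tokenise -> remove stop words -> stem."""
    tokens = remove_stop_words(tokenise(normalise(text)))
    return [crude_stem(t) for t in tokens]


def vectorise(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


tokens = preprocess("The cats are running...")
vocabulary = ["cat", "dog", "run", "sleep"]  # hypothetical fixed vocabulary
print(tokens)                         # ['cat', 'run']
print(vectorise(tokens, vocabulary))  # [1, 0, 1, 0]
```

The final list of counts is what the "Model" box receives: the classifier never sees the original sentence, only these numbers.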

You may think “but LLMs do all of this in one shot!” — true. We’re learning the pipeline because:

It’s interpretable: you understand exactly what the machine sees.
It’s cheap: on many narrow production tasks (spam filtering, ticket routing), TF-IDF + logistic regression can match or beat a large LLM such as GPT-4, often at a small fraction of the cost.
The vocabulary (token, embedding, vector) is the same in modern LLMs — only the maths got fancier.

Why language is hard

Computers love structure. Human language is the opposite of structure. Five reasons it’s hard:

Difficulty	Example
Ambiguity	“I saw the man with the telescope.” Who has the telescope?
Context	“It’s freezing.” Praise or complaint? Depends on the conversation.
Synonyms / paraphrase	“buy”, “purchase”, “acquire”, “get”: same idea, four words.
Irony / sarcasm	“Great. Another Monday.” Not actually great.
Multilinguality	French, English, code-switching, emojis — all in one tweet.
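The synonym row is easy to demonstrate. The sketch below compares sentences by word overlap only (Jaccard similarity on token sets, chosen here purely for illustration; the example sentences are made up). A paraphrase with no shared words scores zero, while a sentence with the opposite meaning scores high.

```python
def token_set(text):
    """Lowercase the text and return its distinct whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of distinct tokens that two texts have in common (0.0 to 1.0)."""
    sa, sb = token_set(a), token_set(b)
    union = sa | sb
    if not union:  # both texts empty
        return 0.0
    return len(sa & sb) / len(union)


query = "i want to buy a laptop"

# Same idea, different words: no overlap at all.
paraphrase = jaccard(query, "where can one purchase notebooks")

# Opposite intent, nearly identical words: 5 shared tokens out of 7.
opposite = jaccard(query, "i want to sell a laptop")

print(paraphrase)          # 0.0
print(round(opposite, 2))  # 0.71
```

Counting words alone cannot tell that “buy” and “purchase” mean the same thing, which is the gap that learned embeddings were designed to close.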

Before deep learning, NLP papers were full of hand-crafted rules to handle each of these. They mostly didn’t work at scale. The embedding revolution of the 2010s, and then Transformers, made progress on many of them at once. Even so, the classical concepts (token, stop word, vectorisation) remain building blocks of modern systems.

What NLP enables — a quick tour

Application	What it does	Where you see it
Sentiment analysis	Positive / negative / neutral	Reviews, social listening
Named entity recognition (NER)	Find people, places, dates in text	News, contracts, CVs
Topic modelling	Group documents by theme	Knowledge bases, customer feedback
Machine translation	One language → another	Google Translate, DeepL
Question answering	Find the answer in a document	Customer support, search
Summarisation	Long text → short	News digests, meeting notes
Text classification	Tag emails, route tickets	Spam filter, support routing
Generation	Produce text	ChatGPT, autocomplete

Behind every one of these — old or new — sits a chain like the pipeline above. The chain just got shorter and smarter.

A vocabulary cheat sheet for the rest of this Part

You will meet these words in every lesson:

Corpus — your collection of documents. Plural: corpora.
Document — one piece of text: an email, a tweet, a chapter.
Token — the unit the model sees (often a word, sometimes a sub-word, sometimes a character).
Vocabulary — the set of all distinct tokens across the corpus.
Vector — a list of numbers that represents a token or a document.
Embedding — a learned vector that carries meaning.

Keep these six in mind. The whole field is built on them.
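Five of the six terms can be shown on a toy example. The three-document corpus below is made up, and splitting on whitespace is a simplifying assumption about what counts as a token.

```python
from collections import Counter

# Corpus: the whole collection. Each string is one document.
corpus = [
    "the cat sat",
    "the dog sat",
    "the cat chased the dog",
]

# Tokens: here simply whitespace-separated words (a simplifying assumption).
tokenised = [doc.split() for doc in corpus]

# Vocabulary: every distinct token in the corpus, in a fixed (sorted) order.
vocabulary = sorted({tok for doc in tokenised for tok in doc})


def count_vector(tokens):
    """Vector: one count per vocabulary entry for a single document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [count_vector(doc) for doc in tokenised]

print(vocabulary)  # ['cat', 'chased', 'dog', 'sat', 'the']
for doc, vec in zip(corpus, vectors):
    print(vec, doc)
# [1, 0, 0, 1, 1] the cat sat
# [0, 0, 1, 1, 1] the dog sat
# [1, 1, 1, 0, 2] the cat chased the dog
```

The sixth term, embedding, is missing on purpose: an embedding is learned from data rather than counted, so it cannot be produced by a few lines of bookkeeping like this.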

Key takeaways

NLP is the field; NLU = understand, NLG = generate.
Classical NLP is a pipeline: normalise → tokenise → clean → vectorise → model.
Transformers shortened the pipeline but use the same vocabulary.
Language is hard because of ambiguity, context, paraphrase, irony, multilinguality.
Six words to know: corpus, document, token, vocabulary, vector, embedding.

Next: Tokens & text normalisation, turning a raw string into countable units.