
What is NLP?

NLP, NLU, NLG: three acronyms that get confused


You’ll hear them everywhere. They’re not synonyms.

| Acronym | Stands for | What it does | Example |
| --- | --- | --- | --- |
| NLP | Natural Language Processing | The whole field: anything a computer does with human language | All of the below |
| NLU | Natural Language Understanding | Extract meaning from text | Detect sentiment, intent, entities |
| NLG | Natural Language Generation | Produce text from data | “The weather tomorrow will be sunny.” |

NLP is the umbrella. NLU and NLG are two pieces under it. A chatbot does both: it understands what you wrote (NLU) then generates a reply (NLG).
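
To make the split concrete, here is a minimal sketch of a toy chatbot with one NLU step and one NLG step. The keyword sets, the reply templates and the forecast data are all invented for illustration. A real system would use trained models for both halves.

```python
import re

# Hypothetical intent keywords; a real NLU component would be a trained model.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "sunny", "forecast"},
    "greeting": {"hello", "hi", "hey"},
}

# Hypothetical reply templates for the NLG step.
TEMPLATES = {
    "weather": "The weather {day} will be {condition}.",
    "greeting": "Hello! How can I help?",
    "unknown": "Sorry, I did not understand that.",
}


def understand(message):
    """NLU: turn free text into structured meaning (an intent plus a day)."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    intent = "unknown"
    for name, keywords in INTENT_KEYWORDS.items():
        if words & keywords:  # at least one keyword appears in the message
            intent = name
            break
    day = "tomorrow" if "tomorrow" in words else "today"
    return {"intent": intent, "day": day}


def generate(meaning, forecast):
    """NLG: turn structured data back into a sentence."""
    template = TEMPLATES[meaning["intent"]]
    condition = forecast.get(meaning["day"], "unknown")
    # Templates without placeholders simply ignore the extra arguments.
    return template.format(day=meaning["day"], condition=condition)


forecast = {"today": "cloudy", "tomorrow": "sunny"}  # made-up data
meaning = understand("What's the weather like tomorrow?")
reply = generate(meaning, forecast)
print(meaning)  # {'intent': 'weather', 'day': 'tomorrow'}
print(reply)    # The weather tomorrow will be sunny.
```

The dictionary returned by `understand` is the hand-off point: everything before it is NLU, everything after it is NLG.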

Before Transformers ate everything, an NLP project looked like a small assembly line:

```mermaid
flowchart TB
  R["<b>Raw text</b><br/>'The cats are running...'"] --> N["<b>Normalise</b><br/>lowercase, accents, punctuation"]
  N --> T["<b>Tokenise</b><br/>['the', 'cats', 'are', 'running']"]
  T --> S["<b>Remove stop words</b><br/>['cats', 'running']"]
  S --> L["<b>Stem / Lemmatise</b><br/>['cat', 'run']"]
  L --> V["<b>Vectorise</b><br/>[0, 1, 0, ..., 2, 0]"]
  V --> M["<b>Model</b><br/>classifier, search, similarity..."]
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  R:::step
  N:::step
  T:::step
  S:::step
  L:::step
  V:::step
  M:::step
```
The pre-Transformer NLP pipeline. Its steps are covered across the lessons of this Part.

The boxes map onto the lessons as follows:

  • Normalise + Tokenise → Lesson 2.
  • Stop words + Stemming + Lemmatisation → Lesson 3.
  • Vectorise (Bag-of-Words) → Lesson 4.
  • Weight terms (TF-IDF) → Lesson 5.
  • Modern alternative (embeddings + Transformers) → Lesson 6.
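
As a preview, the whole assembly line fits in a few lines of plain Python. This is a rough sketch, not the implementation the later lessons build: the stop-word list is deliberately tiny, `toy_stem` is a made-up suffix stripper (not a real stemmer such as Porter), and the four-word vocabulary is assumed for the example.

```python
import re
import unicodedata
from collections import Counter

# Tiny illustrative stop-word list; real lists contain far more words.
STOP_WORDS = {"the", "are", "a", "an", "is", "and", "of", "to", "in"}


def normalise(text):
    """Lowercase, strip accents, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9\s]", " ", text)


def tokenise(text):
    """Split on whitespace, the simplest possible tokeniser."""
    return text.split()


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Crude suffix stripping for illustration only."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) > 2 and token[-1] == token[-2]:  # "runn" -> "run"
            token = token[:-1]
    elif token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token


def vectorise(tokens, vocabulary):
    """Bag-of-Words: count how often each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


raw = "The cats are running..."
tokens = tokenise(normalise(raw))
kept = remove_stop_words(tokens)
stems = [toy_stem(t) for t in kept]
vocabulary = ["cat", "dog", "run", "sleep"]  # assumed, normally built from the corpus
vector = vectorise(stems, vocabulary)

print(tokens)  # ['the', 'cats', 'are', 'running']
print(kept)    # ['cats', 'running']
print(stems)   # ['cat', 'run']
print(vector)  # [1, 0, 1, 0]
```

Each function corresponds to one box in the diagram, so you can print the intermediate value after any step and see exactly what changed.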

You may think “but LLMs do all of this in one shot!” — true. We’re learning the pipeline because:

  1. It’s interpretable: you understand exactly what the machine sees.
  2. It’s cheap: on many narrow production tasks (spam filtering, ticket routing), TF-IDF + logistic regression can match or beat a large LLM such as GPT-4, at a small fraction of the cost.
  3. The vocabulary (token, embedding, vector) is the same in modern LLMs — only the maths got fancier.
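
The interpretability point can be shown with a small TF-IDF sketch. It uses one common textbook variant, term frequency × log(N / document frequency), on a three-document corpus invented for this example. Libraries such as scikit-learn apply smoothed variants, so their numbers will differ. The classifier step is left out here.

```python
import math
from collections import Counter

# Made-up mini corpus for illustration.
corpus = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap tickets now",
]

docs = [doc.split() for doc in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc):
    """Weight each term by its frequency in the document and its rarity in the corpus."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = tf_idf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Every number is attached to a word you can read. Here “pills” gets the highest weight in the first document because it appears in no other document, which is exactly the kind of explanation a product owner can follow.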

Computers love structure. Human language is the opposite of structure. Five reasons it’s hard:

| Difficulty | Example |
| --- | --- |
| Ambiguity | “I saw the man with the telescope.” Who has the telescope? |
| Context | “It’s freezing.” Praise or complaint? Depends on the conversation. |
| Synonyms / paraphrase | “buy”, “purchase”, “acquire”, “get”: same idea, four words. |
| Irony / sarcasm | “Great. Another Monday.” Not actually great. |
| Multilinguality | French, English, code-switching, emojis, all in one tweet. |
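
The paraphrase problem is easy to reproduce. The sketch below compares sentences by word overlap (Jaccard similarity on lowercase words). The sentences are invented for the example, and word overlap is only one of many possible similarity measures.

```python
def jaccard(a, b):
    """Share of distinct lowercase words that two sentences have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


query = "I want to buy a phone"
same_meaning = jaccard(query, "I would like to purchase a mobile")
different_meaning = jaccard(query, "I want to sell a phone")

print(f"paraphrase:       {same_meaning:.2f}")       # 0.30
print(f"opposite meaning: {different_meaning:.2f}")  # 0.71
```

A system that only counts shared words rates the sentence with the opposite meaning as the closer match. Surface form and meaning are different things, and that gap is the problem embeddings were invented to address.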

Before deep learning, NLP systems leaned heavily on hand-crafted rules to handle each of these, and the rules rarely held up at scale. The embedding revolution of the 2010s, followed by Transformers, made progress on many of them at once. The classical concepts (token, stop word, vectorisation) remain the building blocks that modern systems are described in.
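
To see why hand-written rules break down, consider this sketch of a lexicon-based sentiment rule. The word list and scores are hypothetical and written for this example only.

```python
import re

# Hypothetical hand-written sentiment lexicon, invented for this example.
LEXICON = {"great": 1, "love": 1, "good": 1, "terrible": -1, "hate": -1, "bad": -1}


def rule_based_sentiment(text):
    """Sum the scores of known words and map the total to a label."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(rule_based_sentiment("I hate waiting in line."))  # negative
print(rule_based_sentiment("Great. Another Monday."))   # positive
```

The rule gets the first sentence right and labels the sarcastic one “positive”. You can patch it with a special case for Mondays, but every patch invites the next exception, which is why rule sets grew large without becoming reliable.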

| Application | What it does | Where you see it |
| --- | --- | --- |
| Sentiment analysis | Positive / negative / neutral | Reviews, social listening |
| Named entity recognition (NER) | Find people, places, dates in text | News, contracts, CVs |
| Topic modelling | Group documents by theme | Knowledge bases, customer feedback |
| Machine translation | One language → another | Google Translate, DeepL |
| Question answering | Find the answer in a document | Customer support, search |
| Summarisation | Long text → short | News digests, meeting notes |
| Text classification | Tag emails, route tickets | Spam filter, support routing |
| Generation | Produce text | ChatGPT, autocomplete |

Behind every one of these — old or new — sits a chain like the pipeline above. The chain just got shorter and smarter.

A vocabulary cheat sheet for the rest of this Part


You will meet these words in every lesson:

  • Corpus — your collection of documents. Plural: corpora.
  • Document — one piece of text: an email, a tweet, a chapter.
  • Token — the unit the model sees (often a word, sometimes a sub-word, sometimes a character).
  • Vocabulary — the set of all distinct tokens across the corpus.
  • Vector — a list of numbers that represents a token or a document.
  • Embedding — a learned vector that carries meaning.

Keep these six in mind. The whole field is built on them.
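
Here is how the first five terms look on a two-sentence example. The corpus is made up, and whitespace splitting stands in for a real tokeniser.

```python
# Corpus: the whole collection. Document: one item in it.
corpus = ["The cat sat", "The dog sat down"]

# Tokens: the units each document is split into.
tokenised = [doc.lower().split() for doc in corpus]

# Vocabulary: every distinct token across the corpus, in a fixed order.
vocabulary = sorted({token for doc in tokenised for token in doc})

# Vector: one count per vocabulary entry, for each document.
vectors = [[doc.count(term) for term in vocabulary] for doc in tokenised]

print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

An embedding is also a vector, but its numbers are learned from data rather than counted, so it cannot be shown honestly in five lines.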

  • NLP is the field; NLU = understand, NLG = generate.
  • Classical NLP is a pipeline: normalise → tokenise → clean → vectorise → model.
  • Transformers shortened the pipeline but use the same vocabulary.
  • Language is hard because of ambiguity, context, paraphrase, irony, multilinguality.
  • Six words to know: corpus, document, token, vocabulary, vector, embedding.

Next: Tokens & text normalization — turning a raw string into countable units.