What is NLP?
NLP, NLU, NLG — three letters that get confused
You’ll hear them everywhere. They’re not synonyms.
| Acronym | Stands for | What it does | Example |
|---|---|---|---|
| NLP | Natural Language Processing | The whole field — anything a computer does with human language | All of the below |
| NLU | Natural Language Understanding | Extract meaning from text | Detect sentiment, intent, entities |
| NLG | Natural Language Generation | Produce text from data | “The weather tomorrow will be sunny.” |
NLP is the umbrella. NLU and NLG are two pieces under it. A chatbot does both: it understands what you wrote (NLU) then generates a reply (NLG).
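To make the split concrete, here is a minimal sketch of a toy chatbot with one function per half. The keyword sets, the reply templates, and the function names `understand`, `generate`, and `reply` are all hypothetical and chosen for illustration; a real system would use trained models for both steps.

```python
import re

# Hypothetical intent keywords; a real NLU component would be a trained model.
INTENT_KEYWORDS = {
    "weather": {"weather", "rain", "sunny", "forecast"},
    "greeting": {"hello", "hi", "hey"},
}

# Hypothetical reply templates; a real NLG component would generate freer text.
REPLY_TEMPLATES = {
    "weather": "The weather tomorrow will be {forecast}.",
    "greeting": "Hello! How can I help?",
    "unknown": "Sorry, I did not understand that.",
}


def understand(message: str) -> str:
    """NLU step: map raw text to an intent label."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    for intent, keywords in INTENT_KEYWORDS.items():
        if words & keywords:  # at least one keyword appears in the message
            return intent
    return "unknown"


def generate(intent: str, forecast: str = "sunny") -> str:
    """NLG step: turn structured data (intent + forecast) into a sentence."""
    return REPLY_TEMPLATES[intent].format(forecast=forecast)


def reply(message: str) -> str:
    """Chain the two halves: understand first, then generate."""
    return generate(understand(message))


print(reply("What is the weather like tomorrow?"))
# The weather tomorrow will be sunny.
```

The point of the sketch is the boundary: `understand` goes from text to structure, `generate` goes from structure back to text.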
The classical NLP pipeline
Before Transformers ate everything, an NLP project looked like a small assembly line:
flowchart TB
  R["<b>Raw text</b><br/>'The cats are running...'"] --> N["<b>Normalise</b><br/>lowercase, accents, punctuation"]
  N --> T["<b>Tokenise</b><br/>['the', 'cats', 'are', 'running']"]
  T --> S["<b>Remove stop words</b><br/>['cats', 'running']"]
  S --> L["<b>Stem / Lemmatise</b><br/>['cat', 'run']"]
  L --> V["<b>Vectorise</b><br/>[0, 1, 0, ..., 2, 0]"]
  V --> M["<b>Model</b><br/>classifier, search, similarity..."]
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  R:::step
  N:::step
  T:::step
  S:::step
  L:::step
  V:::step
  M:::step
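The assembly line above can be sketched in plain Python, one function per box (the final model step is left out). This is a rough illustration under stated assumptions: the stop-word list, the fixed three-term vocabulary, and the `crude_stem` helper are made up for this example, and `crude_stem` is a toy suffix stripper, not a real stemming algorithm.

```python
import re
import unicodedata
from collections import Counter

STOP_WORDS = {"the", "are", "a", "an", "is", "of"}  # tiny illustrative list


def normalise(text: str) -> str:
    """Lowercase, strip accents, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9\s]", " ", text)


def tokenise(text: str) -> list[str]:
    """Split on whitespace; real tokenisers handle many more cases."""
    return text.split()


def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token: str) -> str:
    """Toy suffix stripper for this example only (hypothetical helper)."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if token[-1] == token[-2]:  # undo consonant doubling: "runn" -> "run"
            token = token[:-1]
    elif token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token


def vectorise(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Count how often each vocabulary term occurs in the token list."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


raw = "The cats are running..."
tokens = tokenise(normalise(raw))           # ['the', 'cats', 'are', 'running']
content = remove_stop_words(tokens)         # ['cats', 'running']
stems = [crude_stem(t) for t in content]    # ['cat', 'run']
vocabulary = ["cat", "dog", "run"]          # assumed fixed vocabulary
vector = vectorise(stems, vocabulary)       # [1, 0, 1]
print(tokens, content, stems, vector)
```

Each intermediate value can be printed and inspected, which is exactly the property the later lessons rely on.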
Each box becomes a lesson:
- Normalise + Tokenise → Lesson 2.
- Stop words + Stemming + Lemmatisation → Lesson 3.
- Vectorise (Bag-of-Words) → Lesson 4.
- Weight terms (TF-IDF) → Lesson 5.
- Modern alternative (embeddings + Transformers) → Lesson 6.
You may think “but LLMs do all of this in one shot!” — true. We’re learning the pipeline because:
- It’s interpretable: you understand exactly what the machine sees.
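- It’s cheap: on many narrow production tasks such as spam filtering or ticket routing, TF-IDF + logistic regression can match or beat a large LLM like GPT-4 while costing orders of magnitude less to run.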
- The vocabulary (token, embedding, vector) is the same in modern LLMs — only the maths got fancier.
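As a rough illustration of the interpretability point, the sketch below computes TF-IDF weights by hand for a three-document toy corpus. The corpus is invented, and the formula used (term frequency times `log(N / df)`) is only one common variant; libraries such as scikit-learn apply smoothed versions, so their numbers will differ.

```python
import math
from collections import Counter

# Invented toy corpus of support messages.
corpus = [
    "refund my order",
    "cancel my order",
    "love this product",
]
docs = [doc.split() for doc in corpus]

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))


def tf_idf(doc: list[str]) -> dict[str, float]:
    """Weight each term by its frequency in the doc and its rarity in the corpus."""
    counts = Counter(doc)
    n_docs = len(docs)
    return {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = tf_idf(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>8}  {weight:.3f}")
```

In the first document, "refund" receives the largest weight because it appears in no other document, while "my" and "order" are shared with the second one. Every number traces back to a word you can read, which is what "interpretable" means here.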
Why language is hard
Computers love structure. Human language is the opposite of structure. Five reasons it’s hard:
| Difficulty | Example |
|---|---|
| Ambiguity | “I saw the man with the telescope.” Who has the telescope? |
| Context | “It’s freezing.” Praise or complaint? Depends on the conversation. |
| Synonyms / paraphrase | “buy”, “purchase”, “acquire”, “get”: same idea, four words. |
| Irony / sarcasm | “Great. Another Monday.” Not actually great. |
| Multilinguality | French, English, code-switching, emojis — all in one tweet. |
Before deep learning, NLP papers were full of hand-crafted rules to handle each of these. They mostly didn’t work at scale. The embedding revolution of the 2010s, and then Transformers, handled many of these difficulties far better and all at once, but the classical concepts (token, stop word, vectorisation) are still the building blocks every modern system uses.
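The paraphrase problem is easy to reproduce. The sketch below compares sentences by exact token overlap (Jaccard similarity); the example sentences and the `jaccard` helper are invented for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Share of distinct tokens that the two sentences have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same idea, different words.
same_meaning = jaccard("buy a laptop", "purchase a notebook")

# Opposite idea, almost the same words.
different_meaning = jaccard("buy a laptop", "sell a laptop")

print(same_meaning, different_meaning)
# 0.2 0.5
```

Exact matching scores the opposite sentence as more similar than the paraphrase, because it sees strings, not meaning. That gap is what embeddings were designed to close.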
What NLP enables — a quick tour
| Application | What it does | Where you see it |
|---|---|---|
| Sentiment analysis | Positive / negative / neutral | Reviews, social listening |
| Named entity recognition (NER) | Find people, places, dates in text | News, contracts, CVs |
| Topic modelling | Group documents by theme | Knowledge bases, customer feedback |
| Machine translation | One language → another | Google Translate, DeepL |
| Question answering | Find the answer in a document | Customer support, search |
| Summarisation | Long text → short | News digests, meeting notes |
| Text classification | Tag emails, route tickets | Spam filter, support routing |
| Generation | Produce text | ChatGPT, autocomplete |
Behind every one of these — old or new — sits a chain like the pipeline above. The chain just got shorter and smarter.
A vocabulary cheat sheet for the rest of this Part
You will meet these words in every lesson:
- Corpus — your collection of documents. Plural: corpora.
- Document — one piece of text: an email, a tweet, a chapter.
- Token — the unit the model sees (often a word, sometimes a sub-word, sometimes a character).
- Vocabulary — the set of all distinct tokens across the corpus.
- Vector — a list of numbers that represents a token or a document.
- Embedding — a learned vector that carries meaning.
Keep these six in mind. The whole field is built on them.
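Five of the six terms can be shown on a two-document toy corpus, as in the sketch below. The sentences are invented, and whitespace splitting stands in for a real tokeniser. An embedding is left out on purpose: it has to be learned from data, so it cannot be written down by hand in a few lines.

```python
# Corpus: the whole collection. Each string in it is one document.
corpus = ["the cat sat", "the dog sat down"]

# Tokens: here simply the whitespace-separated words of each document.
tokens_per_doc = [doc.split() for doc in corpus]

# Vocabulary: every distinct token across the corpus, in a fixed order.
vocabulary = sorted({t for tokens in tokens_per_doc for t in tokens})
# ['cat', 'dog', 'down', 'sat', 'the']

# Vector: one count per vocabulary term, for each document.
vectors = [
    [tokens.count(term) for term in vocabulary]
    for tokens in tokens_per_doc
]
# [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]

print(vocabulary)
print(vectors)
```

Note that each vector has exactly as many entries as the vocabulary has terms, whatever the length of the document.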
Key takeaways
- NLP is the field; NLU = understand, NLG = generate.
- Classical NLP is a pipeline: normalise → tokenise → clean → vectorise → model.
- Transformers shortened the pipeline but use the same vocabulary.
- Language is hard because of ambiguity, context, paraphrase, irony, multilinguality.
- Six words to know: corpus, document, token, vocabulary, vector, embedding.
Next: Tokens & text normalization — turning a raw string into countable units.