From GenAI to LLMs
Discriminative vs generative — the fundamental split
Section titled “Discriminative vs generative — the fundamental split”Until around 2018, almost all ML models were discriminative: given an input, predict a label.
A generative model does the opposite — it learns the distribution of the data so well that it can produce new examples that look like the training set.
flowchart LR
subgraph Disc["Discriminative model"]
direction LR
I1["Image of a cat"] --> M1["Model"] --> O1["Label: 'cat'"]
end
subgraph Gen["Generative model"]
direction LR
I2["Prompt: 'a cat'"] --> M2["Model"] --> O2["New image of a cat"]
end
GenAI now covers four big modalities:
| Modality | Famous models |
|---|---|
| Text | GPT-4, Claude, Llama, Mistral |
| Image | DALL·E, Midjourney, Stable Diffusion |
| Audio | Whisper (input), ElevenLabs (output) |
| Video | Sora, Runway, Veo |
What is a Large Language Model?
Section titled “What is a Large Language Model?”An LLM is a Transformer (cf. previous lesson) scaled up massively and trained on a huge slice of the internet to do one and only one task:
Given a sequence of tokens, predict the next token.
That’s it. The entire behaviour of ChatGPT comes from doing this one billion times in a row, very quickly, on a Transformer trained on hundreds of billions of tokens.
flowchart LR P["Prompt:<br/>'The capital of France is'"] --> M["LLM<br/>(Transformer)"] M --> T1["Paris"] T1 -.->|"feed back in"| M M --> T2["."]
Tokens — the LLM’s unit of work
Section titled “Tokens — the LLM’s unit of work”LLMs don’t see characters or words; they see tokens, which are roughly 3–4 characters or a fragment of a word.
| Text | Tokens (approximate) |
|---|---|
"Hello" | ["Hello"] (1 token) |
"unbelievable" | ["un", "believ", "able"] (3 tokens) |
"こんにちは" | ["こん", "にち", "は"] (3 tokens) |
Useful mental rule: 1,000 tokens ≈ 750 English words. When a provider says “context window: 128k tokens”, that’s about 96,000 words — a short book.
How an LLM is trained
Section titled “How an LLM is trained”LLM training happens in (at least) two stages:
flowchart LR A["Stage 1<br/>Pre-training<br/>on raw internet text"] --> B["Stage 2<br/>Fine-tuning<br/>on curated examples"] B --> C["Stage 3 (optional)<br/>RLHF<br/>human preference signal"] C --> D["Chat-ready LLM"]
- Pre-training — self-supervised on terabytes of text. Cost: millions of dollars, weeks on thousands of GPUs. Output: a model that knows the world but is hard to talk to.
- Fine-tuning (SFT) — supervised on tens of thousands of high-quality examples of “good answers”. Output: a model that responds in a useful format.
- RLHF (reinforcement learning from human feedback) — humans rank pairs of answers, and the model is nudged to prefer the better one. Output: helpful, harmless, honest tone.
Some open-source models stop at stage 1 (“base models”) and let you fine-tune them yourself. We’ll do exactly that in Course 2.
What an LLM really is, in plain words
Section titled “What an LLM really is, in plain words”An LLM is a compressed snapshot of the internet, queried one token at a time. It is:
- Stateless between calls (no memory unless you give it one).
- Probabilistic — same prompt can give different answers.
- Frozen in time — the cutoff date is whenever pre-training stopped.
- Confidently wrong when it doesn’t know — this is called hallucination.
These four properties drive every limitation we’ll discuss in the next lesson.
Key takeaways
Section titled “Key takeaways”- Discriminative models classify; generative models create.
- An LLM = a giant Transformer trained to predict the next token.
- Training has up to three stages: pre-training, fine-tuning, RLHF.
- The unit of work is the token (~4 chars), not the word or character.
- An LLM is a frozen, stateless, probabilistic snapshot of the internet.
Next: From LLMs to agents — what LLMs can’t do, and the stack we build to fix it (prompting → RAG → agents → agentic AI).