From GenAI to LLMs

Discriminative vs generative — the fundamental split

Until around 2018, almost all ML models were discriminative: given an input, predict a label.

A generative model does the opposite — it learns the distribution of the data so well that it can produce new examples that look like the training set.

flowchart LR
  subgraph Disc["Discriminative model"]
    direction LR
    I1["Image of a cat"] --> M1["Model"] --> O1["Label: 'cat'"]
  end
  subgraph Gen["Generative model"]
    direction LR
    I2["Prompt: 'a cat'"] --> M2["Model"] --> O2["New image of a cat"]
  end

Same data, opposite directions. Generative AI is the engine behind ChatGPT, Midjourney, and code-completion.

GenAI now covers four big modalities:

Modality	Famous models
Text	GPT-4, Claude, Llama, Mistral
Image	DALL·E, Midjourney, Stable Diffusion
Audio	Whisper (input), ElevenLabs (output)
Video	Sora, Runway, Veo

What is a Large Language Model?

An LLM is a Transformer (cf. previous lesson) scaled up massively and trained on a huge slice of the internet to do one and only one task:

Given a sequence of tokens, predict the next token.

That’s it. The entire behaviour of ChatGPT comes from doing this one billion times in a row, very quickly, on a Transformer trained on hundreds of billions of tokens.

flowchart LR
  P["Prompt:<br/>'The capital of France is'"] --> M["LLM<br/>(Transformer)"]
  M --> T1["Paris"]
  T1 -.->|"feed back in"| M
  M --> T2["."]

Autoregressive generation — the model predicts one token, appends it, predicts again, and so on.

Tokens — the LLM’s unit of work

LLMs don’t see characters or words; they see tokens, which are roughly 3–4 characters or a fragment of a word.

Text	Tokens (approximate)
`"Hello"`	`["Hello"]` (1 token)
`"unbelievable"`	`["un", "believ", "able"]` (3 tokens)
`"こんにちは"`	`["こん", "にち", "は"]` (3 tokens)

Useful mental rule: 1,000 tokens ≈ 750 English words. When a provider says “context window: 128k tokens”, that’s about 96,000 words — a short book.

How an LLM is trained

LLM training happens in (at least) two stages:

flowchart LR
  A["Stage 1<br/>Pre-training<br/>on raw internet text"] --> B["Stage 2<br/>Fine-tuning<br/>on curated examples"]
  B --> C["Stage 3 (optional)<br/>RLHF<br/>human preference signal"]
  C --> D["Chat-ready LLM"]

From a raw next-token predictor to a useful chatbot — three stages, three different objectives.

Pre-training — self-supervised on terabytes of text. Cost: millions of dollars, weeks on thousands of GPUs. Output: a model that knows the world but is hard to talk to.
Fine-tuning (SFT) — supervised on tens of thousands of high-quality examples of “good answers”. Output: a model that responds in a useful format.
RLHF (reinforcement learning from human feedback) — humans rank pairs of answers, and the model is nudged to prefer the better one. Output: helpful, harmless, honest tone.

Some open-source models stop at stage 1 (“base models”) and let you fine-tune them yourself. We’ll do exactly that in Course 2.

What an LLM really is, in plain words

An LLM is a compressed snapshot of the internet, queried one token at a time. It is:

Stateless between calls (no memory unless you give it one).
Probabilistic — same prompt can give different answers.
Frozen in time — the cutoff date is whenever pre-training stopped.
Confidently wrong when it doesn’t know — this is called hallucination.

These four properties drive every limitation we’ll discuss in the next lesson.

Key takeaways

Discriminative models classify; generative models create.
An LLM = a giant Transformer trained to predict the next token.
Training has up to three stages: pre-training, fine-tuning, RLHF.
The unit of work is the token (~4 chars), not the word or character.
An LLM is a frozen, stateless, probabilistic snapshot of the internet.

Next: From LLMs to agents — what LLMs can’t do, and the stack we build to fix it (prompting → RAG → agents → agentic AI).