From GenAI to LLMs

Discriminative vs generative — the fundamental split

Until around 2018, most ML models used in practice were discriminative: given an input, predict a label.

A generative model does the opposite — it learns the distribution of the data so well that it can produce new examples that look like the training set.

flowchart LR
  subgraph Disc["Discriminative model"]
    direction LR
    I1["Image of a cat"] --> M1["Model"] --> O1["Label: 'cat'"]
  end
  subgraph Gen["Generative model"]
    direction LR
    I2["Prompt: 'a cat'"] --> M2["Model"] --> O2["New image of a cat"]
  end
Same data, opposite directions.

Generative AI is the engine behind ChatGPT, Midjourney, and code-completion tools.
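To make the two directions concrete, here is a minimal sketch on a made-up one-dimensional dataset. The data, the feature, and the function names (`fit_threshold`, `fit_gaussians`, `classify`, `generate`) are all assumptions for illustration. The discriminative side learns only a decision boundary, while the generative side learns how each class is distributed and can therefore sample new examples.

```python
import random
import statistics

# Made-up 1-D dataset: one numeric feature per animal (illustrative only).
samples = [(4.1, "cat"), (3.8, "cat"), (4.5, "cat"), (4.0, "cat"), (4.3, "cat"),
           (9.5, "dog"), (10.2, "dog"), (8.9, "dog"), (9.8, "dog"), (10.5, "dog")]

def fit_threshold(samples):
    """Discriminative: learn only the boundary that separates the labels."""
    xs = sorted(x for x, _ in samples)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        # Count samples that the rule "dog if x > t" gets wrong.
        return sum((x > t) != (label == "dog") for x, label in samples)

    return min(candidates, key=errors)

def fit_gaussians(samples):
    """Generative: learn how each class's data is distributed (mean, std)."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: (statistics.mean(xs), statistics.stdev(xs))
            for label, xs in by_label.items()}

threshold = fit_threshold(samples)
gaussians = fit_gaussians(samples)

def classify(x):
    """Input -> label."""
    return "dog" if x > threshold else "cat"

def generate(label, rng):
    """Label -> new example drawn from the learned distribution."""
    mean, std = gaussians[label]
    return rng.gauss(mean, std)

rng = random.Random(0)
print(classify(4.2))         # cat
print(generate("cat", rng))  # a new, made-up "cat" measurement
```

Real generative models learn far richer distributions (over images or text rather than a single number), but the split between "predict the label" and "sample a new example" is the same.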

GenAI now covers four big modalities:

| Modality | Famous models |
| --- | --- |
| Text | GPT-4, Claude, Llama, Mistral |
| Image | DALL·E, Midjourney, Stable Diffusion |
| Audio | Whisper (input), ElevenLabs (output) |
| Video | Sora, Runway, Veo |

An LLM is a Transformer (cf. previous lesson) scaled up massively and trained on a huge slice of the internet to do one and only one task:

Given a sequence of tokens, predict the next token.

That’s it. The entire behaviour of ChatGPT comes from repeating this step once per generated token, very quickly, on a Transformer trained on hundreds of billions to trillions of tokens.

flowchart LR
  P["Prompt:<br/>'The capital of France is'"] --> M["LLM<br/>(Transformer)"]
  M --> T1["Paris"]
  T1 -.->|"feed back in"| M
  M --> T2["."]
Autoregressive generation — the model predicts one token, appends it, predicts again, and so on.
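The loop in the diagram can be sketched in a few lines. This is a toy under stated assumptions: `NEXT_TOKEN_PROBS` is a made-up lookup table that only looks at the last token, whereas a real LLM conditions on the whole sequence with a Transformer. The names `predict_next` and `generate`, and the `<eos>` end marker, are illustrative choices.

```python
# Hypothetical toy "model": probability of each next token given the last token.
# A real LLM conditions on the full sequence; this table is made up.
NEXT_TOKEN_PROBS = {
    "The": {"capital": 0.9, "city": 0.1},
    "capital": {"of": 1.0},
    "of": {"France": 0.8, "Spain": 0.2},
    "France": {"is": 1.0},
    "is": {"Paris": 0.95, "Lyon": 0.05},
    "Paris": {".": 1.0},
    ".": {"<eos>": 1.0},
}

def predict_next(tokens):
    """Return the most likely next token (greedy decoding)."""
    probs = NEXT_TOKEN_PROBS[tokens[-1]]
    return max(probs, key=probs.get)

def generate(prompt, max_new_tokens=10):
    """Predict one token, append it, and predict again until <eos>."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<eos>":
            break
        tokens.append(next_token)  # feed the new token back in
    return tokens

print(generate(["The", "capital", "of", "France", "is"]))
# ['The', 'capital', 'of', 'France', 'is', 'Paris', '.']
```

Greedy decoding always picks the top token. Chat products usually sample from the distribution instead, which is why the same prompt can yield different answers.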

LLMs don’t see characters or words; they see tokens: fragments of text that, in English, average roughly 3–4 characters, often a short word or a piece of a longer one.

| Text | Tokens (approximate) |
| --- | --- |
| "Hello" | ["Hello"] (1 token) |
| "unbelievable" | ["un", "believ", "able"] (3 tokens) |
| "こんにちは" | ["こん", "にち", "は"] (3 tokens) |

Useful mental rule: 1,000 tokens ≈ 750 English words. When a provider says “context window: 128k tokens”, that’s about 96,000 words — a short book.
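Both ideas above, splitting text into tokens and converting token counts into word counts, can be sketched as follows. The vocabulary `VOCAB` is hypothetical and chosen only so the splits match the examples above. Production tokenizers learn their vocabulary from data (for example with byte-pair encoding) and will not necessarily split these words the same way. The greedy longest-match rule and the function names are assumptions for illustration.

```python
# Hypothetical vocabulary, chosen only so the splits match the examples above.
VOCAB = {"Hello", "un", "believ", "able"}

def tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character becomes its own token
            i += 1
    return tokens

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text: 1,000 tokens ≈ 750 English words

def tokens_to_words(n_tokens):
    """Rough word-count estimate for a given number of tokens."""
    return round(n_tokens * WORDS_PER_TOKEN)

print(tokenize("Hello", VOCAB))         # ['Hello']
print(tokenize("unbelievable", VOCAB))  # ['un', 'believ', 'able']
print(tokens_to_words(128_000))         # 96000
```

The 0.75 ratio is an average for English prose. Code and other languages often need more tokens for the same amount of text.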

LLM training happens in two or three stages:

flowchart LR
  A["Stage 1<br/>Pre-training<br/>on raw internet text"] --> B["Stage 2<br/>Fine-tuning<br/>on curated examples"]
  B --> C["Stage 3 (optional)<br/>RLHF<br/>human preference signal"]
  C --> D["Chat-ready LLM"]
From a raw next-token predictor to a useful chatbot: three stages, three different objectives.

  1. Pre-training: self-supervised next-token prediction on terabytes of text. Cost: millions of dollars, weeks on thousands of GPUs. Output: a model that knows the world but is hard to talk to.
  2. Fine-tuning (SFT): supervised on tens of thousands of high-quality examples of “good answers”. Output: a model that responds in a useful format.
  3. RLHF (reinforcement learning from human feedback): humans rank pairs of answers, and the model is nudged to prefer the better one. Output: a model with a helpful, harmless, honest tone.
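The different objectives can be sketched numerically. This is an illustration, not a training loop. Stages 1 and 2 are shown as the next-token cross-entropy loss. The preference step is shown as a pairwise logistic loss, one common way to turn "answer A was ranked above answer B" into a training signal, typically for a reward model. The lesson does not specify a formula, so treat this choice, the function names, and the numbers as assumptions.

```python
import math

def next_token_loss(probs, target):
    """Cross-entropy at one position: -log p(target token)."""
    return -math.log(probs[target])

def preference_loss(score_better, score_worse):
    """Pairwise loss: -log sigmoid(score_better - score_worse)."""
    return math.log1p(math.exp(-(score_better - score_worse)))

# Stages 1 and 2: the loss is small when the correct next token gets high probability.
confident = {"Paris": 0.9, "Lyon": 0.1}
unsure = {"Paris": 0.5, "Lyon": 0.5}
print(next_token_loss(confident, "Paris") < next_token_loss(unsure, "Paris"))  # True

# Stage 3: the loss is small when the human-preferred answer already scores higher.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

Pre-training and SFT share the same loss and differ mainly in the data: raw internet text versus curated examples of good answers.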

Some open-source models stop at stage 1 (“base models”) and let you fine-tune them yourself. We’ll do exactly that in Course 2.

An LLM is a compressed snapshot of the internet, queried one token at a time. It is:

  • Stateless between calls (no memory unless you give it one).
  • Probabilistic — same prompt can give different answers.
  • Frozen in time — the cutoff date is whenever pre-training stopped.
  • Confidently wrong when it doesn’t know — this is called hallucination.

These four properties drive every limitation we’ll discuss in the next lesson.
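Two of these properties, probabilistic and stateless, are easy to see in a toy sketch. The logits are made up, and `sample_token` and `toy_llm` are hypothetical names, not any provider's API. Temperature is a common sampling knob that this lesson does not cover; it is used here only to show how randomness enters.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Draw one token from softmax(logits / temperature)."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]

# Made-up scores for the token after "The capital of France is".
logits = {"Paris": 5.0, "Lyon": 3.0, "Nice": 1.0}
rng = random.Random(42)

# Probabilistic: the same prompt can produce different tokens.
draws = [sample_token(logits, 1.0, rng) for _ in range(200)]
print(sorted(set(draws)))

# A very low temperature makes the choice close to deterministic.
print({sample_token(logits, 0.01, rng) for _ in range(200)})  # {'Paris'}

def toy_llm(messages):
    """Hypothetical model call: it can only use what is in `messages`."""
    known = any("Ada" in m for m in messages)
    return "Your name is Ada." if known else "I don't know your name."

# Stateless: a call knows nothing about earlier calls unless the caller resends the history.
print(toy_llm(["What is my name?"]))                     # I don't know your name.
print(toy_llm(["My name is Ada.", "What is my name?"]))  # Your name is Ada.
```

This is why chat applications resend the conversation history on every turn: the "memory" lives in the caller, not in the model.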

  • Discriminative models classify; generative models create.
  • An LLM = a giant Transformer trained to predict the next token.
  • Training has up to three stages: pre-training, fine-tuning, RLHF.
  • The unit of work is the token (roughly 3–4 characters of English), not the word or character.
  • An LLM is a frozen, stateless, probabilistic snapshot of the internet.

Next: From LLMs to agents — what LLMs can’t do, and the stack we build to fix it (prompting → RAG → agents → agentic AI).