The deep learning revolution
The moment everything shifted
If we had to name one year that changed AI forever, it would be 2012. A team from the University of Toronto (Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton) entered the ImageNet image-recognition challenge with a model called AlexNet and beat the runner-up by more than 10 percentage points of top-5 error.
The model? A convolutional neural network (CNN) trained on two GPUs. Three ingredients came together at the same time, and AI hasn’t been the same since.
flowchart LR
A["Big data<br/>(ImageNet, 14M images)"] --> X
B["GPUs<br/>(parallel math)"] --> X
C["Neural net depth<br/>(many layers)"] --> X
X(("Deep<br/>Learning"))
X --> Y["State-of-the-art<br/>on many benchmarks"]
What is a neural network anyway?
A neural network is a collection of simple computational units (neurons) arranged in layers. Each layer transforms its input a little; stack enough layers and the network can approximate almost any function.
flowchart LR
I1["Input"] --> L1["Layer 1<br/>(detect edges)"]
L1 --> L2["Layer 2<br/>(detect shapes)"]
L2 --> L3["Layer 3<br/>(detect objects)"]
L3 --> O1["Output<br/>(cat / dog / car)"]
The word deep simply means many layers: AlexNet had 8, GPT-3 has 96, and the largest modern LLMs go past a hundred.
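The layer stacking described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained model: the weights in `LAYERS`, the two-number input, and the helper names `linear`, `relu`, `softmax` and `forward` are all made up for the example.

```python
import math

def linear(inputs, weights, biases):
    """Weighted sum per neuron: one row of weights and one bias for each output."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Non-linearity applied between layers: negative values become zero."""
    return [max(0.0, v) for v in values]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(inputs, layers):
    """Pass the input through every layer in order, then score the classes."""
    activation = inputs
    for weights, biases in layers[:-1]:
        activation = relu(linear(activation, weights, biases))
    final_weights, final_biases = layers[-1]
    return softmax(linear(activation, final_weights, final_biases))

# Hypothetical hand-picked weights: 2 inputs -> 3 hidden -> 3 hidden -> 3 outputs.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, 0.0]),
    ([[0.2, 0.7, -0.1], [0.6, -0.5, 0.3], [0.1, 0.1, 0.9]], [0.0, 0.0, 0.1]),
    ([[1.0, -1.0, 0.5], [0.3, 0.8, -0.2], [-0.4, 0.2, 0.7]], [0.0, 0.0, 0.0]),
]
LABELS = ["cat", "dog", "car"]

probs = forward([1.0, 0.5], LAYERS)
prediction = LABELS[probs.index(max(probs))]
print(dict(zip(LABELS, probs)), "->", prediction)
```

Adding depth here means appending another `(weights, biases)` pair to `LAYERS`; training, which this sketch leaves out, is the process of finding weights that make the predictions correct.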
Why deep learning works
- Universal approximator: in theory, a large enough neural net can approximate any continuous function as closely as you like; in practice, enough layers and data are what make a good approximation learnable.
- Feature learning: no more manual feature engineering. The network discovers what’s important on its own.
- GPU-friendly: most of the math is matrix multiplication, which GPUs run in parallel, often one to two orders of magnitude faster than CPUs.
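The universal-approximator point can be made concrete with XOR, a pattern that no single linear layer can represent but that one hidden layer with a non-linearity handles easily. The sketch below uses hand-picked weights (an assumption for illustration; a real network would learn them), and the function name `xor_net` is hypothetical.

```python
def relu(value):
    """Non-linearity: pass positive values through, clamp negatives to zero."""
    return max(0.0, value)

def xor_net(x1, x2):
    """Two hidden neurons plus a linear output, with hand-picked weights."""
    h1 = relu(x1 + x2)          # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are on
    return h1 - 2.0 * h2        # cancel the "both on" case

table = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
for inputs, output in table.items():
    print(inputs, "->", output)
```

Without the `relu` calls, the whole function would collapse into one linear expression in `x1` and `x2`, and no choice of weights would reproduce XOR.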
How features get sharper, layer after layer
Point #2 above deserves a picture. The genius of a deep network is that each layer composes the previous layer’s features into something more abstract. Show a network thousands of cat photos, and without any human guidance:
- The first layer learns to detect tiny edges — horizontal, vertical, diagonal lines.
- The second layer combines those edges into simple shapes — curves, circles, corners.
- The third layer combines those shapes into object parts — eyes, ears, noses, whiskers.
- The final layer asks: “does this picture contain an eye AND ears AND whiskers? → it’s a cat.”
This is why it’s called deep learning — the abstraction ladder only works because there are many layers. With one or two, the network can only learn shallow features and you’re back to where classical ML stalls.
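To make the first rung of that ladder concrete, here is a minimal sketch of what an edge-detecting filter in a convolutional layer computes. The 3x3 filter below is written by hand for illustration (a trained CNN learns its own filters), and the toy image and the helper name `convolve2d` are assumptions of this example.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h) for j in range(kernel_w))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Toy 4x6 image: dark pixels (0) on the left, bright pixels (1) on the right.
IMAGE = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-written vertical-edge filter: responds where brightness rises left to right.
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(IMAGE, VERTICAL_EDGE)
print(feature_map)  # [[0, 3, 3, 0], [0, 3, 3, 0]]
```

The feature map is zero over the flat regions and large where the dark-to-bright boundary sits. Later layers take maps like this one as input, which is how edges get combined into shapes and shapes into object parts.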
The architectures that made the wave
| Year | Architecture | What it unlocked |
|---|---|---|
| 2012 | CNN (AlexNet) | Image classification, vision |
| 2014 | GAN | Synthetic images, deepfakes |
| 2015 | RNN / LSTM | Sequence data, early translation |
| 2017 | Transformer | Long-range attention, the door to LLMs |
| 2020 | GPT-3 | First generally-capable language model |
| 2022 | Diffusion | Stable Diffusion, DALL·E 2, image generation |
2017 — the Transformer
The paper “Attention Is All You Need” (Vaswani et al.) introduced a new architecture that throws away recurrence and replaces it with self-attention: every token can look at every other token at once.
flowchart LR
T1["The"] -->|attends to| T2["cat"]
T1 -->|attends to| T3["sat"]
T2 -->|attends to| T1
T2 -->|attends to| T3
T3 -->|attends to| T1
T3 -->|attends to| T2
Why this mattered:
- Parallelisable on GPUs — training scales beautifully.
- Long context — captures dependencies across thousands of tokens.
- General-purpose — the same architecture handles text, code, images, audio.
Every modern LLM (GPT-4, Claude, Llama, Mistral, DeepSeek) is a Transformer under the hood.
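The self-attention step can be sketched in plain Python as scaled dot-product attention. This is a simplified illustration under stated assumptions: the three-dimensional embeddings are made up, and the queries, keys and values are the embeddings themselves, whereas a real Transformer first passes them through learned projection matrices and uses several attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Scaled dot-product attention with queries = keys = values = embeddings."""
    dim = len(embeddings[0])
    outputs, all_weights = [], []
    for query in embeddings:
        # Score this token against every token, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        weights = softmax(scores)
        # New representation: a weighted mix of all token vectors.
        mixed = [sum(w * value[d] for w, value in zip(weights, embeddings))
                 for d in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

TOKENS = ["The", "cat", "sat"]
EMBEDDINGS = [[1.0, 0.0, 0.5], [0.2, 1.0, 0.1], [0.0, 0.8, 0.9]]  # made-up vectors

outputs, weights = self_attention(EMBEDDINGS)
for token, row in zip(TOKENS, weights):
    print(token, [round(w, 2) for w in row])
```

Each row of `weights` says how much one token draws from every other token. No row depends on another row's result, so all of them can be computed at once, which is the property that makes the architecture parallelisable on GPUs.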
Why this isn’t the end of classical ML
Deep learning is not always the right tool. For tabular data (rows and columns of numbers), small datasets, or problems where you need interpretability, classical ML usually wins, and it is often orders of magnitude cheaper to train.
Use deep learning when:
- Your data is unstructured (text, images, audio, video).
- You have lots of it (hundreds of thousands of examples or more).
- You can afford GPUs.
Use classical ML when:
- Your data fits in a CSV.
- You have hundreds to a few thousand examples.
- You need to explain every prediction to a regulator.
Key takeaways
- 2012 = AlexNet: deep CNNs crush ImageNet, the deep learning wave begins.
- The trifecta: big data + GPUs + depth.
- Each layer learns more abstract features — no manual feature engineering.
- 2017 = Transformer: parallel self-attention; the architecture every modern LLM is built on.
- Deep learning is not a silver bullet — classical ML still rules tabular data.
Next: From GenAI to LLMs — what makes a generative model different.