The deep learning revolution

The moment everything shifted

If we had to name one year that changed AI forever, it would be 2012. A team from the University of Toronto (Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton) entered the ImageNet image-recognition challenge with a model called AlexNet and beat the runner-up by more than 10 percentage points of top-5 error (15.3% versus 26.2%).

The model? A convolutional neural network (CNN) trained on two GPUs. Three ingredients came together at the same time, and AI hasn’t been the same since.

flowchart LR
  A["Big data<br/>(ImageNet, 14M images)"] --> X
  B["GPUs<br/>(parallel math)"] --> X
  C["Neural net depth<br/>(many layers)"] --> X
  X(("Deep<br/>Learning"))
  X --> Y["State-of-the-art<br/>on many benchmarks"]

The 2012 trifecta: data + compute + depth. None of the three was enough without the other two.

What is a neural network anyway?

A neural network is a bunch of simple computational units (neurons) stacked into layers. Each layer transforms the input a little bit; stack enough layers and you can approximate almost any function.

flowchart LR
  I1["Input"] --> L1["Layer 1<br/>(detect edges)"]
  L1 --> L2["Layer 2<br/>(detect shapes)"]
  L2 --> L3["Layer 3<br/>(detect objects)"]
  L3 --> O1["Output<br/>(cat / dog / car)"]

Each layer learns increasingly abstract features. This is the core idea of deep learning.
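
To make the idea of "units stacked into layers" concrete, here is a minimal sketch of a forward pass in plain Python. The layer sizes, the weights, and the names `dense`, `forward`, `LAYERS` and `CLASSES` are all made up for illustration; a real network would learn its weights from data and would be far larger.

```python
import math

def relu(values):
    """Element-wise ReLU: negative activations are clipped to zero."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each neuron is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(inputs, layers):
    """Pass the input through every layer: ReLU between layers, softmax at the end."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = relu(activations)
    return softmax(activations)

# Hypothetical, hand-picked weights (assumption: a trained network would learn these).
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, 0.0]),    # 2 inputs -> 3 units
    ([[1.0, -1.0, 0.5], [0.2, 0.7, -0.4]], [0.0, 0.0]),            # 3 units -> 2 units
    ([[-0.7, 0.9], [1.2, -0.5], [0.1, 0.1]], [0.0, 0.0, 0.0]),     # 2 units -> 3 classes
]
CLASSES = ["cat", "dog", "car"]

probabilities = forward([1.0, 2.0], LAYERS)
prediction = CLASSES[probabilities.index(max(probabilities))]
print(prediction, [round(p, 3) for p in probabilities])
```

Each call to `dense` is one layer from the diagram: it only ever sees the previous layer's output, which is what lets later layers build on earlier ones.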



The word deep simply means many layers: AlexNet had 8 learned layers, GPT-3 has 96 Transformer layers, and some of the largest modern LLMs go past 100.

Why deep learning works

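Universal approximator: in theory, a neural net with enough hidden units can approximate any continuous function as closely as you like. In practice you also need enough data and training to find that fit.
Feature learning: no more manual feature engineering. The network discovers which features matter on its own.
GPU-friendly: most of the math is matrix multiplication, which GPUs run far faster than CPUs because they perform many multiply-adds in parallel.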
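
The universal-approximator point can be sketched with a hand-built example. The code below constructs a one-hidden-layer ReLU network that approximates f(x) = x² on [0, 1], and shows the error shrinking as hidden units are added. Note the assumptions: the weights are computed directly rather than learned by training, and the helper names (`build_relu_approximator`, `run_network`, `max_error`) are hypothetical. It illustrates the idea; it is not how networks are fitted in practice.

```python
def relu(x):
    """ReLU activation for a single number."""
    return max(0.0, x)

def build_relu_approximator(f, lo, hi, hidden_units):
    """Build a one-hidden-layer ReLU network that matches f at evenly spaced knots.

    The network computes: bias + sum(w * relu(x - k) for each knot k and weight w).
    """
    step = (hi - lo) / hidden_units
    knots = [lo + i * step for i in range(hidden_units)]
    # Slope of the straight line joining f at consecutive knots.
    slopes = [(f(k + step) - f(k)) / step for k in knots]
    # Each hidden unit contributes the change in slope at its knot.
    weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, hidden_units)]
    return f(lo), knots, weights

def run_network(x, bias, knots, weights):
    """Evaluate the network at a single input x."""
    return bias + sum(w * relu(x - k) for k, w in zip(knots, weights))

def max_error(f, hidden_units, lo=0.0, hi=1.0, samples=1001):
    """Largest gap between f and the network on an evenly spaced grid."""
    bias, knots, weights = build_relu_approximator(f, lo, hi, hidden_units)
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - run_network(x, bias, knots, weights)) for x in xs)

def square(x):
    return x * x

error_with_2_units = max_error(square, 2)
error_with_10_units = max_error(square, 10)
error_with_50_units = max_error(square, 50)
print(error_with_2_units, error_with_10_units, error_with_50_units)
```

More hidden units mean more straight-line segments, so the piecewise-linear fit hugs the curve more tightly. That is the intuition behind the approximation guarantee.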

How features get sharper, layer after layer

The second point above, feature learning, deserves a picture. The genius of a deep network is that each layer composes the previous layer’s features into something more abstract. Show a network thousands of labelled cat photos, and without anyone hand-writing a single filter:

The first layer learns to detect tiny edges — horizontal, vertical, diagonal lines.
The second layer combines those edges into simple shapes — curves, circles, corners.
The third layer combines those shapes into object parts — eyes, ears, noses, whiskers.
The final layer asks: “does this picture contain an eye AND ears AND whiskers? → it’s a cat.”

[Figure: Input (raw pixels) → Layer 1 (edges) → Layer 2 (textures & curves) → Layer 3 (object parts: eye, ear, nose) → Output (classification: cat 96%, dog, car)]

The same image, decoded into progressively higher-level features. Each column shows what that layer 'detects' — and no human had to write those filters.
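
For a feel of what a first-layer "edge detector" computes, here is a small sketch that slides a hand-written vertical-edge filter over a toy image. The filter values (a Prewitt-style kernel), the image, and the function name `convolve2d` are assumptions chosen for illustration. In a trained CNN nobody writes these numbers; the network learns its own filters from the data.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like CNN layers, this is technically cross-correlation: the kernel is not flipped.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(k_rows) for j in range(k_cols))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# Toy 5x6 grayscale image: dark (0) on the left half, bright (1) on the right half.
IMAGE = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written Prewitt-style filters (assumption: stand-ins for learned weights).
VERTICAL_EDGE = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
HORIZONTAL_EDGE = [
    [-1, -1, -1],
    [0, 0, 0],
    [1, 1, 1],
]

vertical_map = convolve2d(IMAGE, VERTICAL_EDGE)
horizontal_map = convolve2d(IMAGE, HORIZONTAL_EDGE)
for row in vertical_map:
    print(row)
```

The vertical-edge map is non-zero only where dark meets bright, while the horizontal-edge map stays at zero because this image has no horizontal edges. Deeper layers apply the same operation to feature maps like these instead of to raw pixels.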

This is why it’s called deep learning — the abstraction ladder only works because there are many layers. With one or two, the network can only learn shallow features and you’re back to where classical ML stalls.

The architectures that made the wave

Year	Architecture / model	What it unlocked
2012	CNN (AlexNet)	Image classification, vision
2014	GAN	Synthetic images, deepfakes
2015	RNN / LSTM	Sequence data, early neural translation (the LSTM itself dates back to 1997)
2017	Transformer	Long-range attention, the door to LLMs
2020	GPT-3	Few-shot prompting with a 175-billion-parameter language model
2022	Diffusion	Stable Diffusion, DALL·E 2, text-to-image generation

2017 — the Transformer

The paper “Attention Is All You Need” (Vaswani et al.) introduces a new architecture that throws away recurrence and replaces it with self-attention: every token can look at every other token at once.

flowchart LR
  T1["The"] -->|attends to| T2["cat"]
  T1 -->|attends to| T3["sat"]
  T2 -->|attends to| T1
  T2 -->|attends to| T3
  T3 -->|attends to| T1
  T3 -->|attends to| T2

Every token sees every other token: the Transformer's superpower.
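
The "every token attends to every other token" idea can be sketched as scaled dot-product self-attention in plain Python. Two assumptions to flag: the 2-dimensional embeddings are invented for the example, and the embeddings are used directly as queries, keys and values. A real Transformer first multiplies them by learned projection matrices and runs several attention heads in parallel.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V, one query at a time."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # New representation: a weighted mix of every token's value vector.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

TOKENS = ["The", "cat", "sat"]
# Hypothetical 2-d embeddings (assumption: real models learn vectors with many more dimensions).
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Simplification: embeddings stand in for Q, K and V (no learned projections).
outputs, attention = self_attention(EMBEDDINGS, EMBEDDINGS, EMBEDDINGS)
for token, weights in zip(TOKENS, attention):
    print(token, [round(w, 3) for w in weights])
```

Each row of `attention` is one arrow bundle from the diagram: how much that token draws from every token, itself included. No row depends on another, so all rows can be computed at once, which is why this maps so well onto GPUs. GPT-style decoders add a mask so a token cannot attend to later positions.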

Why this mattered:

Parallelisable on GPUs — training scales beautifully.
Long context — captures dependencies across thousands of tokens.
General-purpose — the same architecture handles text, code, images, audio.

Nearly every modern LLM (GPT-4, Claude, Llama, Mistral, DeepSeek) is a Transformer under the hood.

Why this isn’t the end of classical ML

Deep learning is not always the right tool. For tabular data (rows and columns of numbers and categories), small datasets, or problems where you need interpretability, classical ML usually wins, and it is often orders of magnitude cheaper to train.

Use deep learning when:

Your data is unstructured (text, images, audio, video).
You have lots of it (hundreds of thousands of examples or more).
You can afford GPUs.

Use classical ML when:

Your data fits in a CSV.
You have hundreds to a few thousand examples.
You need to explain every prediction to a regulator.

Key takeaways

2012 = AlexNet: deep CNNs crush ImageNet, the deep learning wave begins.
The trifecta: big data + GPUs + depth.
Each layer learns more abstract features — no manual feature engineering.
2017 = Transformer: parallel self-attention; the architecture nearly every modern LLM is built on.
Deep learning is not a silver bullet — classical ML still rules tabular data.

Next: From GenAI to LLMs — what makes a generative model different.