The deep learning revolution

If we had to name one year that changed AI forever, it would be 2012. A team from the University of Toronto (Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton) entered the ImageNet image-recognition challenge with a model called AlexNet and beat the runner-up by more than 10 percentage points in top-5 error (15.3% versus 26.2%).

The model? A convolutional neural network (CNN) trained on two GPUs. Three ingredients came together at the same time, and AI hasn’t been the same since.

flowchart LR
  A["Big data<br/>(ImageNet, 14M images)"] --> X
  B["GPUs<br/>(parallel math)"] --> X
  C["Neural net depth<br/>(many layers)"] --> X
  X(("Deep<br/>Learning"))
  X --> Y["State-of-the-art<br/>on many benchmarks"]
The 2012 trifecta: data + compute + depth. None was enough without the other two.

A neural network is a collection of simple computational units (neurons) arranged in layers. Each layer transforms its input a little; stack enough layers, with enough neurons in each, and the network can approximate almost any function.

flowchart LR
  I1["Input"] --> L1["Layer 1<br/>(detect edges)"]
  L1 --> L2["Layer 2<br/>(detect shapes)"]
  L2 --> L3["Layer 3<br/>(detect objects)"]
  L3 --> O1["Output<br/>(cat / dog / car)"]
Each layer learns increasingly abstract features. This is the core idea of deep learning.
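
To make the stacked-layers idea concrete, here is a minimal sketch of a forward pass in plain Python. It is an illustration under stated assumptions, not a trained model: the layer sizes, the random weights, the ReLU activation, the softmax output and the helper names (`dense`, `random_layer`, `forward`) are all choices made for this example.

```python
import math
import random

def relu(values):
    """Element-wise ReLU: negative activations are clipped to zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One layer: every neuron takes a weighted sum of all inputs plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def random_layer(n_in, n_out, rng):
    """Hypothetical helper: untrained random weights and zero biases."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(inputs, hidden_layers, output_layer):
    """Pass the input through each hidden layer, then score the classes."""
    activations = inputs
    for weights, biases in hidden_layers:
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = output_layer
    return softmax(dense(activations, out_weights, out_biases))

rng = random.Random(0)
sizes = [4, 5, 5, 5]  # 4 input values, then three hidden layers of 5 neurons
hidden_layers = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output_layer = random_layer(sizes[-1], 3, rng)  # 3 classes: cat / dog / car

probabilities = forward([0.2, 0.7, 0.1, 0.9], hidden_layers, output_layer)
for label, p in zip(["cat", "dog", "car"], probabilities):
    print(f"{label}: {p:.3f}")
```

Because the weights are random, the printed probabilities are meaningless; training is what turns this same structure into edge, shape and object detectors.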

The word deep simply means many layers: AlexNet had 8, GPT-3 has 96, and the largest modern LLMs have more than a hundred.

  1. Universal approximator: given enough neurons and enough data to fit them, a neural net can approximate almost any pattern.
  2. Feature learning: no more manual feature engineering. The network discovers what’s important on its own.
  3. GPU-friendly: most of the math is matrix multiplication, which GPUs run in parallel, often orders of magnitude faster than CPUs.
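
The first point can be illustrated with a small sketch. It builds a one-hidden-layer ReLU network whose weights are set by hand (not learned) so that it linearly interpolates a target function, then measures how the worst-case error shrinks as hidden units are added. The target `x * x`, the interval `[0, 1]` and the helper names are assumptions chosen for the example. Note that it adds units to a single layer rather than adding layers, which is the setting of the classic approximation results.

```python
def relu(v):
    return max(0.0, v)

def build_relu_net(f, n_units, lo=0.0, hi=1.0):
    """Hand-set weights so the network linearly interpolates f at evenly spaced knots."""
    step = (hi - lo) / n_units
    knots = [lo + i * step for i in range(n_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(n_units)]
    # Each hidden unit switches on at its knot and contributes the change in slope.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    bias = f(lo)

    def net(x):
        # zip stops at the shorter list, so the last knot (hi) gets no unit.
        return bias + sum(c * relu(x - k) for c, k in zip(coeffs, knots))

    return net

def max_error(f, net, samples=1000, lo=0.0, hi=1.0):
    """Largest absolute gap between f and the network on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - net(x)) for x in xs)

def target(x):
    return x * x

errors = {n: max_error(target, build_relu_net(target, n)) for n in (2, 8, 32)}
for n, err in errors.items():
    print(f"{n:>2} hidden units -> max error {err:.5f}")
```

In practice the weights come from training on data rather than from a formula, but the mechanism is the same: more units give the network more pieces to bend into shape.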

How features get sharper, layer after layer

Point #2 above deserves a picture. The genius of a deep network is that each layer composes the previous layer’s features into something more abstract. Show a network thousands of cat photos, and without any human guidance:

  • The first layer learns to detect tiny edges — horizontal, vertical, diagonal lines.
  • The second layer combines those edges into simple shapes — curves, circles, corners.
  • The third layer combines those shapes into object parts — eyes, ears, noses, whiskers.
  • The final layer asks: “does this picture contain an eye AND ears AND whiskers? → it’s a cat.”

[Figure: Input (raw pixels) → Layer 1 (edges) → Layer 2 (textures and curves) → Layer 3 (object parts: eye, ear, nose) → Output (classification: cat 96%, dog 3%, car 1%)]
The same image, decoded into progressively higher-level features. Each column shows what that layer 'detects' — and no human had to write those filters.
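
The kind of edge detector a first layer tends to learn can be sketched by hand. The example below slides a small filter over a toy image, which is the operation a convolutional layer performs (strictly a cross-correlation, as in most deep learning libraries). The image, the Sobel-style filter values and the `convolve2d` helper are assumptions for illustration; a trained network would learn its own filter values from data.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# Toy 5x6 grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written vertical-edge filter (Sobel-style), standing in for a learned one.
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row prints [0, 4, 4, 0]
```

The response is zero over flat regions and peaks where the dark-to-bright boundary falls inside the filter window. Later layers take feature maps like this one as their input, which is how edges get composed into shapes and parts.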

This is why it’s called deep learning — the abstraction ladder only works because there are many layers. With one or two, the network can only learn shallow features and you’re back to where classical ML stalls.

| Year | Architecture | What it unlocked |
| --- | --- | --- |
| 2012 | CNN (AlexNet) | Image classification, vision |
| 2014 | GAN | Synthetic images, deepfakes |
| 2015 | RNN / LSTM | Sequence data, early translation |
| 2017 | Transformer | Long-range attention, the door to LLMs |
| 2020 | GPT-3 | Few-shot prompting: one model for many language tasks |
| 2022 | Diffusion | Stable Diffusion, DALL·E 2, image generation |

In 2017, the paper “Attention Is All You Need” (Vaswani et al.) introduced a new architecture that throws away recurrence and replaces it with self-attention: every token can look at every other token at once.

flowchart LR
  T1["The"] -->|attends to| T2["cat"]
  T1 -->|attends to| T3["sat"]
  T2 -->|attends to| T1
  T2 -->|attends to| T3
  T3 -->|attends to| T1
  T3 -->|attends to| T2
Every token sees every other token: the Transformer's superpower.
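
A minimal sketch of the computation behind that diagram is scaled dot-product self-attention. To keep it short, the example uses the token embeddings directly as queries, keys and values, whereas a real Transformer first applies learned projection matrices and runs several attention heads. The embedding values are made up for illustration, and the `self_attention` helper is hypothetical. Unlike the diagram, each token here also attends to itself, as it does in the standard formulation.

```python
import math

def softmax(values):
    """Turn raw scores into weights that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with queries = keys = values = tokens."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dimension).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of all token vectors.
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs

# Made-up 4-dimensional embeddings for the three tokens in the diagram.
embeddings = {
    "The": [1.0, 0.0, 1.0, 0.0],
    "cat": [0.0, 2.0, 0.0, 1.0],
    "sat": [1.0, 1.0, 0.0, 1.0],
}
words = list(embeddings)
weights, outputs = self_attention([embeddings[w] for w in words])

for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Every row of `weights` is computed independently of the others, which is why the whole operation can be expressed as a few matrix multiplications and run in parallel on a GPU.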

Why this mattered:

  • Parallelisable on GPUs — training scales beautifully.
  • Long context — captures dependencies across thousands of tokens.
  • General-purpose — the same architecture handles text, code, images, audio.

Every modern LLM (GPT-4, Claude, Llama, Mistral, DeepSeek) is a Transformer under the hood.

Deep learning is not always the right tool. For tabular data (rows and columns of numbers and categories), small datasets, or problems where you need interpretability, classical ML usually wins, and it is typically far cheaper to train.

Use deep learning when:

  • Your data is unstructured (text, images, audio, video).
  • You have lots of it (hundreds of thousands of examples or more).
  • You can afford GPUs.

Use classical ML when:

  • Your data fits in a CSV.
  • You have hundreds to a few thousand examples.
  • You need to explain every prediction to a regulator.
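
The two checklists above can be written down as a small helper. This is only a sketch of those rules of thumb: the function name, its parameters and the 100,000-example threshold (standing in for "hundreds of thousands of examples or more") are assumptions for illustration, not a definitive decision procedure.

```python
def suggest_approach(unstructured, n_examples, has_gpu_budget, needs_explanations):
    """Hypothetical helper encoding the checklists above; the threshold is illustrative."""
    # Interpretability requirements favor classical ML regardless of data size.
    if needs_explanations:
        return "classical ML"
    # Deep learning needs all three: unstructured data, lots of it, and GPUs.
    if unstructured and n_examples >= 100_000 and has_gpu_budget:
        return "deep learning"
    return "classical ML"

print(suggest_approach(unstructured=True, n_examples=500_000,
                       has_gpu_budget=True, needs_explanations=False))
print(suggest_approach(unstructured=False, n_examples=2_000,
                       has_gpu_budget=False, needs_explanations=True))
```
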

Key takeaways:

  • 2012 = AlexNet: deep CNNs crush ImageNet, the deep learning wave begins.
  • The trifecta: big data + GPUs + depth.
  • Each layer learns more abstract features — no manual feature engineering.
  • 2017 = Transformer: parallel self-attention; the architecture every modern LLM is built on.
  • Deep learning is not a silver bullet — classical ML still rules tabular data.

Next: From GenAI to LLMs — what makes a generative model different.