Annex — CPU, GPU, RAM, VRAM: the simple version

Duration: 20 min
Prerequisite: none. This chapter is the foundation that 05c — Local LLM catalogue and 05d — Quantization explained build on.

Why this annex exists

Most local-LLM questions on the workshop floor reduce to four words: CPU, GPU, RAM, VRAM. They look like jargon. They are actually four very simple ideas that, once seen clearly, explain almost every “why doesn’t this fit?” question.

This annex uses plain language and a single running analogy. The plan:

Two kinds of “brain” in a computer — CPU vs GPU.
Two kinds of “desk” in a computer — RAM vs VRAM.
The copy dance — why moving data between them matters.
Unified memory — the special case where there is only one desk.
What happens when the model does not fit — offload, and the speed cliff.
How to check what you have on Windows, macOS and Linux.
How Ollama uses CPU and GPU.
A decision tree by hardware.

1. Two kinds of “brain” — CPU vs GPU

A computer has two very different processors. They are good at very different things.

1.1 The CPU — the generalist

The CPU (Central Processing Unit) is the main brain. Modern CPUs have between 4 and 32 cores. Each core is highly capable — it can run any kind of computation, branch on conditions, manage files, handle the operating system, schedule programs, do the small bits of math that an application needs every microsecond.

Plain-words analogy. Picture an office with 8 highly trained engineers. Each engineer can do anything: read a contract, write a report, take a phone call, do a tricky calculation. They are excellent at jumping between tasks. They are very expensive to hire.

That is a CPU.

1.2 The GPU — the specialist

The GPU (Graphics Processing Unit) was originally designed to draw images on screens. Drawing a screen means doing the same simple math millions of times in parallel (one calculation per pixel). To do that, GPU manufacturers made a very different chip: instead of 8 strong cores, the GPU has thousands of small, simple cores that all do the same operation at the same time.

Plain-words analogy. Picture a warehouse with 3 000 interns. Each intern can only do one specific task: multiply two numbers together. They cannot handle a phone call, they cannot read a contract. But if you give them all the same kind of small math problem at the same time, they finish the whole pile in a flash.

That is a GPU.

1.3 Why LLMs love GPUs

A large language model generating one token of text does billions of multiply-and-add operations between numbers (the weights from chapter 05d). Those operations are all the same — different numbers, same shape of math. This is exactly the kind of work a GPU was built for.

Concrete order of magnitude on the same task:

Hardware	Tokens per second on `llama3.1:8b` (Q4)
Modern laptop CPU (8 cores, no GPU)	5 – 15 tokens/s
Desktop GPU 12 GB VRAM (RTX 3060)	30 – 60 tokens/s
High-end desktop GPU 24 GB VRAM (RTX 4090)	80 – 120 tokens/s
GB10 unified memory (128 GB)	60 – 100 tokens/s on the same 8B; runs much bigger models comfortably

A typical sentence is around 20 to 40 tokens. At the speeds above, a CPU-only setup takes roughly 1 to 8 seconds per sentence; a desktop GPU takes about a second or less. The difference is not subtle.
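The per-sentence figures follow directly from the table: divide the sentence length by the generation speed. A minimal sketch of that arithmetic, using the speed ranges from the table (the function name and dictionary labels are illustrative only):

```python
def seconds_per_sentence(tokens: int, tokens_per_second: float) -> float:
    """Time to generate one sentence at a steady generation speed."""
    return tokens / tokens_per_second

# Speed ranges copied from the table above (tokens/s on llama3.1:8b Q4).
speeds = {
    "laptop CPU": (5, 15),
    "RTX 3060": (30, 60),
    "RTX 4090": (80, 120),
}

for name, (slow, fast) in speeds.items():
    best = seconds_per_sentence(20, fast)   # short sentence, fast end of the range
    worst = seconds_per_sentence(40, slow)  # long sentence, slow end of the range
    print(f"{name}: {best:.1f} to {worst:.1f} s per sentence")
```

These are steady-state numbers; they ignore the time to load the model and to process the prompt.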

2. Two kinds of “desk” — RAM vs VRAM

Each brain needs a desk to put its work on. They do not share desks.

2.1 RAM — the CPU’s desk

RAM (Random Access Memory) is the system memory of the computer. It lives on sticks plugged into the motherboard. When you open an application, the operating system loads it from disk into RAM so the CPU can work on it. Typical sizes today: 8 GB on a basic laptop, 16 – 32 GB on a workstation laptop, 64 – 128 GB on a workstation desktop.

Plain-words analogy. RAM is the big filing cabinet next to the engineers (the CPU). They reach into it constantly. Reading from it is fast, but not instant.

2.2 VRAM — the GPU’s desk

VRAM (Video RAM) is the memory that lives on the graphics card itself, soldered next to the GPU chip. It is much faster than RAM (3 to 10 times the bandwidth) because it sits right next to the GPU. It is smaller and much more expensive than RAM. Typical sizes: 4 GB on a basic discrete GPU, 8 – 12 GB on a mid-range gaming card, 24 GB on a high-end gaming card (RTX 4090), 48 GB on a workstation card (RTX 6000 Ada), 80 GB on a data-centre card (H100).

Plain-words analogy. VRAM is the interns’ open work tables. Small, but everything on them is immediately within reach. The interns (the GPU cores) can grab any number from the tables in a single arm motion.

2.3 The shape of the constraint

For an LLM:

The weights of the model need to live in memory somewhere.
If they live in RAM, the CPU does the math: slow but possible.
If they live in VRAM, the GPU does the math: 5 – 20× faster.
If they do not fit in either, the model cannot run (or runs from disk, which is unusable for interactive use).

That single fact explains 90 % of “can I run this model?” questions.
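The four rules above can be written as a tiny function. This is a minimal sketch, not something Ollama exposes: the function name and return strings are made up for illustration, and it deliberately ignores context size, headroom and partial offload, which section 5 covers.

```python
def where_does_it_run(model_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Apply the rules above to one model and one machine (all sizes in GB)."""
    if vram_gb > 0 and model_gb <= vram_gb:
        return "VRAM: the GPU does the math (fast)"
    if model_gb <= ram_gb:
        return "RAM: the CPU does the math (slow but possible)"
    return "does not fit: not usable interactively"

print(where_does_it_run(5, 12, 32))   # ~5 GB model, 12 GB card
print(where_does_it_run(9, 0, 16))    # ~9 GB model, no discrete GPU
print(where_does_it_run(43, 8, 32))   # ~43 GB model, 8 GB card, 32 GB RAM
```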

3. The copy dance — why moving data between RAM and VRAM matters

RAM and VRAM are separate physical memories. They are not the same chip. They are not the same desk. To use the GPU, the CPU must first copy the data from RAM to VRAM over a bus called PCIe.

Plain-words analogy. Before the warehouse interns can work, someone (the operating system) has to carry every file from the filing cabinet to the work tables. The carrying is fast on PCIe 4 (~25 GB/s) and faster on PCIe 5 (~50 GB/s), but it is not instant. For a 4.7 GB model, the copy itself takes around 0.2 to 0.5 seconds; reading the model file from disk into RAM beforehand usually takes longer.

After the initial copy, the GPU works on the data without going back to RAM for the duration of the generation. The CPU only steps in to receive the generated tokens and stream them back to the user.

The practical consequence: starting a model is slower than generating with it. The first answer feels delayed; subsequent answers are at the full GPU speed.
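The copy time in the analogy above is simply size divided by bus throughput. The sketch below computes the ideal figure for the 4.7 GB model; it assumes the bus runs at the quoted rate for the whole transfer, so real loads land somewhat above these values.

```python
def copy_seconds(size_gb: float, bus_gb_per_s: float) -> float:
    """Ideal time to move size_gb across a bus with the given throughput."""
    return size_gb / bus_gb_per_s

model_gb = 4.7  # model size used in the text
for bus, rate in {"PCIe 4": 25.0, "PCIe 5": 50.0}.items():
    print(f"{bus}: {copy_seconds(model_gb, rate):.2f} s")
```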

4. Unified memory — the special case where there is only one desk

Some recent platforms break the RAM/VRAM separation entirely. Apple Silicon (M1, M2, M3, M4 family) and the NVIDIA Grace Blackwell GB10 (used in the DGX Spark and the Dell Pro Max with GB10) put the CPU and the GPU on the same chip, and give them one shared memory pool — usually called unified memory.

Plain-words analogy. Instead of “engineers with their filing cabinet” plus “interns with their work tables”, picture one open-floor office where the engineers and the interns share the same big filing system. Nobody has to carry files around. Whoever needs something just reaches.

The consequences:

No copy step between RAM and VRAM. The GPU sees the model directly.
Capacity is one big number, not two small ones. A 128 GB unified machine can hold a 70B model in Q4 (~43 GB) plus the context, plus everything else, with room to spare.
A PC with 24 GB VRAM and 32 GB RAM tops out at around 32B in Q4 at full speed. A 128 GB unified machine can run a 70B model, which is qualitatively a different class.

This is the reason a GB10-class machine is interesting for a workshop: it does not just have “more memory”, it has all its memory available to the GPU at once.

5. What happens when the model does not fit

There are three scenarios, in increasing order of pain.

5.1 Model fits entirely in VRAM (or unified memory)

The GPU does everything. Full speed. This is the situation you want for the workshop.

Example	Model	Hardware
Comfortable	`llama3.1:8b` Q4 (~5 GB)	RTX 3060 (12 GB VRAM) — fits with room to spare
Comfortable	`qwen2.5-coder:14b` Q4 (~9 GB)	RTX 4070 (12 GB VRAM) — fits
Comfortable	`llama3.1:70b` Q4 (~43 GB)	GB10 (128 GB unified) — fits with room for context and a second model
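A rough way to check rows like these before downloading a model: add the quantized size and an estimate of the context buffer, leave some headroom, and compare with the VRAM (or unified pool). The sketch below is an assumption-laden helper, not an Ollama feature: `fits_fully` is a hypothetical name, and the 10 % default headroom and the 1 GB context figure are illustrative values rather than measurements.

```python
def fits_fully(model_gb: float, context_gb: float, memory_gb: float,
               headroom: float = 0.10) -> bool:
    """True if model + context, padded by `headroom`, fits in GPU-visible memory."""
    needed_gb = (model_gb + context_gb) * (1.0 + headroom)
    return needed_gb <= memory_gb

CONTEXT_GB = 1.0  # assumed context buffer size, for illustration only

print(fits_fully(5, CONTEXT_GB, 12))    # llama3.1:8b Q4 on a 12 GB card
print(fits_fully(9, CONTEXT_GB, 12))    # qwen2.5-coder:14b Q4 on a 12 GB card
print(fits_fully(9, CONTEXT_GB, 8))     # the same 14B on an 8 GB card
print(fits_fully(43, CONTEXT_GB, 128))  # llama3.1:70b Q4 on 128 GB unified memory
```

Only the 14B on the 8 GB card fails the check, which is the partial-offload case described in section 5.3.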

5.2 Model fits in RAM but not in VRAM — CPU mode

The CPU does everything. Slow but workable. This is the situation on most laptops without a discrete GPU. Speeds of 5 – 15 tokens/s on an 8B model are typical, slowing to 1 – 3 tokens/s on a 14B and below 1 token/s on anything bigger.

5.3 Partial offload — mixed mode (the speed cliff)

If a model is bigger than VRAM but smaller than RAM, Ollama can place some layers on the GPU and the rest on the CPU. This is called offload. It works, but it is far slower than full GPU mode because the layers on CPU become the bottleneck, and data has to bounce between RAM and VRAM during generation.

Example: qwen2.5-coder:14b Q4 (~9 GB) on an 8 GB VRAM card. About 80 % of the layers fit on the GPU; the last 20 % run on the CPU. Result: roughly 5 – 10 tokens/s instead of the 30 – 40 you would see if the whole model fit.
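One way to see why the drop is so steep: every token has to pass through all layers, so the time spent in the GPU layers and in the CPU layers adds up. The sketch below is a simplified model under that assumption (it ignores the RAM/VRAM transfers, so it is optimistic); the function name is illustrative and the speeds are the ranges quoted in this chapter.

```python
def blended_tokens_per_second(gpu_share: float, gpu_tps: float, cpu_tps: float) -> float:
    """Estimate speed when gpu_share of the layers run on the GPU, the rest on the CPU.

    gpu_tps and cpu_tps are the speeds the whole model would reach on that
    processor alone; per-token times of the two parts are summed.
    """
    seconds_per_token = gpu_share / gpu_tps + (1.0 - gpu_share) / cpu_tps
    return 1.0 / seconds_per_token

# 14B example from the text: 80 % of the layers on the GPU (~35 tokens/s if it all fit),
# 20 % on the CPU (1 to 3 tokens/s for a 14B in pure CPU mode, per section 5.2).
for cpu_tps in (1.0, 2.0, 3.0):
    print(cpu_tps, round(blended_tokens_per_second(0.8, 35.0, cpu_tps), 1))
```

Even with only a fifth of the layers on the CPU, the CPU term dominates the sum, which is why the estimate falls into the single-digit range quoted above.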

Practical lesson. Pick a model that fits entirely in one tier. A 7B-Q4 fully on the GPU beats a 14B-Q4 with partial offload, every time, for live demos.

6. How to check what you have

You cannot pick the right model if you do not know what is in the machine. The commands below take 30 seconds.

Windows (PowerShell), CPU and RAM:

# CPU model and core count
Get-CimInstance Win32_Processor | Select-Object Name, NumberOfCores, NumberOfLogicalProcessors

# Total RAM
Get-CimInstance Win32_PhysicalMemory | Measure-Object -Property Capacity -Sum |
  ForEach-Object { "{0:N1} GB" -f ($_.Sum / 1GB) }

Windows, GPU and VRAM:

# Simple view — Task Manager → Performance → GPU shows "Dedicated GPU memory"

# NVIDIA cards (recommended)
nvidia-smi

nvidia-smi lists every NVIDIA GPU, its driver version, and the dedicated VRAM in MiB.

macOS:

# Chip, CPU cores, unified memory (SPHardwareDataType) and GPU cores (SPDisplaysDataType)
system_profiler SPHardwareDataType SPDisplaysDataType

On Apple Silicon, RAM is unified memory. The number reported is the total budget shared between CPU and GPU.

Linux:

# CPU
lscpu | head -n 20

# RAM
free -h

# GPU + VRAM (NVIDIA)
nvidia-smi

# GPU + VRAM (AMD)
rocm-smi --showmeminfo vram

What to write down for the workshop:

Field	Example value
CPU model	Intel Core i7-13700H
CPU cores / threads	14 / 20
RAM total	32 GB
GPU model	NVIDIA RTX 3060 Laptop
VRAM total	6 GB
Unified memory?	No (Windows) / Yes (Mac M-series, GB10)

With those six values, chapter 05c tells you exactly which models will run.
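On machines with an NVIDIA card, the GPU model and VRAM rows can be filled in by a script instead of by reading the nvidia-smi table by eye. The sketch below relies on nvidia-smi's `--query-gpu` and `--format` options; the function names are illustrative and the sample line in the docstring is a made-up example of the expected shape, not captured output.

```python
import shutil
import subprocess

def parse_gpu_csv(text: str) -> list[tuple[str, int]]:
    """Parse lines like 'NVIDIA GeForce RTX 3060 Laptop GPU, 6144' into (name, MiB)."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, mib = line.rsplit(",", 1)
        gpus.append((name.strip(), int(mib)))
    return gpus

def nvidia_gpus() -> list[tuple[str, int]]:
    """Return (name, VRAM in MiB) per NVIDIA GPU, or [] if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []
```

Calling `nvidia_gpus()` on a machine without an NVIDIA driver returns an empty list, which matches the first branch of the decision tree in section 8.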

7. How Ollama uses CPU and GPU

Ollama does the hard work for you. It detects the available GPU (NVIDIA, AMD, or Apple Silicon), and when it loads a model it checks the free VRAM and decides how many layers of the model to place on the GPU. The rest goes to RAM and runs on the CPU.

No GPU detected: full CPU mode.
GPU big enough: full GPU mode. Maximum speed.
GPU too small for the whole model: partial offload. Slower than full GPU, faster than full CPU.

You can inspect what Ollama actually did for a running model with:

ollama ps

The PROCESSOR column shows 100% GPU, 100% CPU or a mixed value like 27%/73% CPU/GPU. If you see anything other than 100% GPU on a machine that has a usable GPU, you are paying the offload cost from section 5.3. Consider switching to a smaller model or a more aggressive quantization.
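If you want to check the PROCESSOR value in a script, for example before a demo, it can be reduced to a single GPU percentage. The sketch below assumes the values look like `100% GPU`, `100% CPU` or `48%/52% CPU/GPU`; that format is an assumption about the current Ollama output and may change between versions. `gpu_percent` is a hypothetical helper name.

```python
def gpu_percent(processor: str) -> int:
    """Return the GPU share (0 to 100) from an `ollama ps` PROCESSOR value."""
    shares, labels = processor.split()          # e.g. "48%/52%" and "CPU/GPU"
    values = [int(s.rstrip("%")) for s in shares.split("/")]
    by_label = dict(zip(labels.split("/"), values))
    return by_label.get("GPU", 0)

for value in ("100% GPU", "100% CPU", "48%/52% CPU/GPU"):
    print(value, "->", gpu_percent(value))
```

Anything below 100 on a machine with a usable GPU is the signal to pick a smaller model.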

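Two settings let you override the defaults if you ever need to (rarely useful at the workshop level). The layer count is a model parameter, num_gpu, set inside an ollama run session; the GPU selection is an environment variable read when the Ollama server starts:

# Force a specific number of layers on the GPU (0 = pure CPU); type this at the ollama run prompt
/set parameter num_gpu 20

# Force a specific GPU when several are present; set before starting the Ollama server
$Env:CUDA_VISIBLE_DEVICES = "0"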

8. A decision tree by hardware

The same logic as the workshop tiers, expressed in the language of this annex.

Symptom: Task Manager shows only “Intel UHD” / “AMD Radeon (integrated)”, or nvidia-smi fails.

Run CPU mode only. Stay on 3B – 8B models in Q4.
Tested for the demos: llama3.1:8b Q4 — expect 5 – 15 tokens/s.
Anything above 14B in CPU mode is too slow for a live exercise.

Symptom: nvidia-smi shows 6 GB or 8 GB on an RTX 3050 / 3060 / 4060 laptop, or similar.

3B – 8B models in Q4 fit fully on the GPU. Full speed (30 – 60 tokens/s).
14B models cause partial offload; expect 5 – 10 tokens/s.
Stick with 8B for live demos; experiment with 14B on the side.

Symptom: nvidia-smi shows 12 GB (RTX 4070 / 3080) up to 24 GB (RTX 3090 / 4090).

8B – 14B models in Q4 fully on the GPU. Full speed.
32B in Q4 (~20 GB) fits on a 24 GB card with little headroom: it works, but the context buffer eats into the margin.
70B does not fit; partial offload is too slow.
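The three branches above can be condensed into a lookup on dedicated VRAM. This is a sketch of the tree as written in this annex, with two assumptions of its own: cards below 6 GB are treated like the CPU-only branch, and unified-memory machines are left out because the branches above do not cover them. The function name and wording of the returned advice are illustrative.

```python
def recommend(vram_gb: float) -> str:
    """Map dedicated VRAM in GB (0 = no usable GPU) to the advice in the branches above."""
    if vram_gb < 6:
        return "CPU mode: 3B-8B in Q4, expect 5-15 tokens/s"
    if vram_gb < 12:
        return "3B-8B in Q4 fully on the GPU; 14B only with partial offload"
    if vram_gb < 24:
        return "8B-14B in Q4 fully on the GPU"
    return "8B-14B comfortably; 32B in Q4 fits with little headroom; 70B does not fit"

for vram in (0, 8, 12, 24):
    print(f"{vram} GB: {recommend(vram)}")
```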

9. A unifying mental model

Two sentences sum up everything in this annex:

CPU is for control, GPU is for parallel math. RAM is for the CPU, VRAM is for the GPU — unless the machine has unified memory, in which case there is one big shared desk.

If you remember this, you can read any “hardware spec” page on the internet and translate it into “what local LLM can I run on this machine?” without needing anyone’s help.

Key takeaways

A CPU has a few strong cores (~8). A GPU has thousands of small cores. LLMs love GPUs because their math is the same simple operation repeated billions of times.
RAM is the CPU’s memory. VRAM is the GPU’s memory. They are separate. Data has to be copied from RAM to VRAM for the GPU to use it.
Unified memory (Apple Silicon, GB10) abolishes the copy step and gives the GPU access to the whole memory pool.
An LLM must fit in some memory to run. Full VRAM = full speed. Full RAM, no GPU = slow but workable. Partial offload = the speed cliff; usually worth avoiding for live demos.
Pick a model whose quantized size + context + 10 – 30 % headroom fits entirely in your VRAM (or unified pool). A 7B at full GPU speed beats a 14B with partial offload, every time.
nvidia-smi on Windows / Linux, system_profiler SPHardwareDataType on Apple Silicon, ollama ps to see what Ollama actually loaded — the three commands that answer 90 % of workshop questions.
The GB10 is interesting not because it has “more memory” but because its 128 GB are unified — that single fact moves the workshop’s reachable model class from 14B to 70B.
This annex is the foundation for chapter 05c (which models exist) and chapter 05d (how they shrink to fit your memory).