Annex — Quantization, from scratch

Duration: 20 min
Pre-requisite: chapter 05c (vocabulary on parameters, RAM, VRAM)

Why this annex exists

Quantization is the single most misunderstood part of running local LLMs. It explains why llama3.1:8b fits on a laptop with 8 GB of RAM despite having 8 billion parameters, why qwen2.5-coder:32b is around 20 GB and not 64 GB on disk, and why Ollama’s default Q4_K_M is production-grade, not a degraded mode.

This annex takes 20 minutes to read and gives every member of the workshop the same mental model. It is organised in five sections, followed by a short unifying mental model and the key takeaways:

What a parameter is, and why precision matters.
The footprint problem — why FP16 makes 70B models unusable on consumer hardware.
Quantization itself — the mechanics, the naming convention, a worked numerical example.
The quality cost — what gets lost at Q4 vs Q8 vs FP16, on which kinds of tasks.
A decision rule — which quantization to pick for which machine, with the Ollama commands.

1. What a parameter is — and why “precision” matters

1.1 A parameter is just a number

A modern LLM is a network of billions of small numerical weights. Each weight is a real number between roughly −10 and +10, learned during training. When the model generates text, it multiplies and adds these numbers billions of times per token.

A few examples of what a weight might actually be:

weight_42       =  0.7384
weight_2_310    = -1.2055
weight_8_991_007 =  0.0001

That is the whole content of a model. Billions of numbers like these. A 7-billion-parameter model holds 7 × 10⁹ of them. A 70B model holds 70 × 10⁹.

1.2 The precision question

A number like 0.7384 can be stored in many different ways inside the computer:

Encoding	Bits used	Approximate value stored
FP16 (half precision)	16 bits	0.7383 (essentially exact)
FP8	8 bits	~0.75 (nearest value FP8 can represent)
INT4 / Q4	4 bits	~0.75
INT2 / Q2	2 bits	~0.67 or ~0.83 (very coarse)

The fewer bits used to represent each weight, the less precise the value, but the less memory the model needs.

That is the entire idea of quantization.
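The FP16 row of the table can be checked directly. The sketch below round-trips 0.7384 through IEEE half precision using the `e` format code of Python's standard `struct` module; the helper name `to_fp16` is ours, chosen for illustration.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

original = 0.7384
stored = to_fp16(original)

# Between 0.5 and 1.0, half precision can only represent multiples of 2**-11.
print(f"stored={stored:.8f} error={abs(original - stored):.6f}")
```

The stored value is the closest multiple of 2⁻¹¹, which is why the table shows 0.7383 rather than 0.7384.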

2. The footprint problem

2.1 The naive cost — FP16

The reference precision used during training is FP16 (16 bits = 2 bytes per weight) or BF16 (Brain-Float 16, also 16 bits, with a layout better suited to deep learning).

For a 7B model:

7 × 10⁹ weights × 2 bytes = 14 GB just for the weights

For a 70B model:

70 × 10⁹ weights × 2 bytes = 140 GB just for the weights

These numbers are before any context buffer, attention cache, runtime overhead or operating-system memory. With overhead, an FP16 70B model needs roughly 160 – 180 GB of memory to run comfortably.

That is well beyond the memory of any consumer laptop. Even a high-end workstation with two RTX A6000 cards (48 GB each = 96 GB total VRAM) cannot host it without partial CPU offload.

2.2 The footprint table for FP16

Model size	FP16 weights	Practical FP16 memory (with overhead)
3B	~6 GB	~8 – 10 GB
7B	~14 GB	~18 – 20 GB
8B	~16 GB	~20 – 24 GB
14B	~28 GB	~36 – 40 GB
32B	~64 GB	~80 – 96 GB
70B	~140 GB	~160 – 180 GB
405B	~810 GB	well above 1 TB
671B	~1.34 TB	research infrastructure only
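The "FP16 weights" column is plain arithmetic: two bytes per weight. A minimal sketch, assuming 1 GB = 10⁹ bytes as the table does (the function name `fp16_weights_gb` is illustrative):

```python
def fp16_weights_gb(parameters_billions: float) -> float:
    """Weights-only footprint in GB: 2 bytes per weight, 1 GB = 10**9 bytes."""
    bytes_per_weight = 2
    return parameters_billions * 1e9 * bytes_per_weight / 1e9

for size in (3, 7, 8, 14, 32, 70, 405, 671):
    print(f"{size}B -> ~{fp16_weights_gb(size):.0f} GB of FP16 weights")
```

The "practical" column adds context buffers and runtime overhead on top, which this sketch deliberately leaves out.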

In FP16, even a 14B model exceeds what a standard workstation can host. Without quantization, local LLMs would be a research-only practice.

3. Quantization itself

3.1 The mechanics in one paragraph

Quantization replaces each FP16 weight with a smaller integer or a smaller floating-point value. The weights are grouped into small blocks (typically 32 or 64 at a time). For each block, two numbers are stored: a scale (a multiplier) and a zero point (an offset); some formats, such as Q4_0 and Q8_0, keep only the scale. Each individual weight is then replaced by a low-bit integer that, multiplied by the scale and shifted by the zero point, approximately reconstructs the original value at inference time.

That is the entire idea. The model is not retrained. Only the way the weights are stored is changed.
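The mechanics fit in a few lines of Python. This is a simplified sketch of one block with a scale and an offset, not the actual GGUF layout: the function names are ours, the block minimum is used as the zero point, and the weights are random values generated for the example.

```python
import random

def quantize_block(weights, bits=4):
    """Quantize one block: returns (integer codes, scale, zero point)."""
    levels = (1 << bits) - 1              # 15 steps between min and max for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    if scale == 0:                        # constant block: any scale works
        scale = 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize_block(codes, scale, zero_point):
    """Reconstruct approximate weights at inference time."""
    return [c * scale + zero_point for c in codes]

random.seed(0)
block = [random.gauss(0.0, 0.5) for _ in range(32)]   # one block of 32 weights

codes, scale, zero_point = quantize_block(block, bits=4)
restored = dequantize_block(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(f"scale={scale:.4f} zero_point={zero_point:.4f} max_error={max_error:.4f}")
```

Each weight is stored as a 4-bit code between 0 and 15, and the reconstruction error of any single weight is bounded by half the scale.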

3.2 The naming convention

Ollama uses the GGUF file format and the Q<n>_<type> naming convention:

Tag	Bits per weight	Notes
`Q2_K`	~2 bits	Extreme compression; visible quality loss
`Q3_K_S` / `Q3_K_M` / `Q3_K_L`	~3 bits	Small (S), Medium (M), Large (L) K-quants
`Q4_0` / `Q4_1`	~4 bits	Older 4-bit formats
`Q4_K_M`	~4 bits	Ollama’s default — production-grade for most tasks
`Q5_K_S` / `Q5_K_M`	~5 bits	Sweet spot when 4 bits feels too aggressive
`Q6_K`	~6 bits	Near-imperceptible quality loss
`Q8_0`	~8 bits	Almost identical to FP16 on most benchmarks
`F16`	16 bits	The reference (no quantization)

The K in Q4_K_M, Q5_K_S etc. refers to K-quants — a family of methods that distribute precision better across the model (giving more bits to the layers that need them most and fewer bits to the layers that tolerate compression well). The suffix (_S, _M, _L) is just Small / Medium / Large and controls how aggressive the compression is.
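To make the naming convention concrete, here is a small parser for the tags listed in the table above. It is a sketch that covers only those tags; the function name and the returned field names (`bits`, `k_quant`, `mix`) are hypothetical, not part of Ollama or GGUF.

```python
import re

# Q<bits>_<K or 0/1>, optionally followed by _S / _M / _L
TAG_PATTERN = re.compile(r"^Q(?P<bits>\d)_(?P<kind>K|0|1)(?:_(?P<mix>[SML]))?$")

def parse_quant_tag(tag: str) -> dict:
    """Split a quantization tag such as 'Q4_K_M' into its parts."""
    tag = tag.upper()
    if tag in ("F16", "FP16"):
        return {"bits": 16, "k_quant": False, "mix": None}
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised quantization tag: {tag}")
    return {
        "bits": int(match["bits"]),          # nominal bits per weight
        "k_quant": match["kind"] == "K",     # True for the K-quant family
        "mix": match["mix"],                 # 'S', 'M', 'L' or None
    }

for tag in ("Q4_K_M", "Q8_0", "Q6_K", "q5_K_S", "F16"):
    print(tag, parse_quant_tag(tag))
```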

For 99 % of practical use, the only tag you need to know by name is Q4_K_M, and the only knob you adjust is the model's size class (3B, 8B, 14B and so on).

3.3 A worked numerical example

Take a single weight: weight_42 = 0.7384. Now look at what is actually stored at each precision.

Precision	Storage	Value reconstructed at inference	Absolute error
FP16	16 bits, sign + 5-bit exponent + 10-bit mantissa	0.7383	~0.0001
Q8	8-bit integer per weight + block scale	0.7373	~0.0011
Q6	6-bit integer per weight + block scale	0.7345	~0.0039
Q5	5-bit integer per weight + block scale	0.7188	~0.0196
Q4	4-bit integer per weight + block scale	0.7500	~0.0116
Q3	3-bit integer per weight + block scale	0.6875	~0.0509
Q2	2-bit integer per weight + block scale	0.6667	~0.0717
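The table follows a single weight. The sketch below repeats the exercise over many weights at once, which gives a steadier picture of how the error grows as bits are removed. It is a simplified model: random Gaussian weights, one scale and offset for the whole set, and a helper name (`round_trip`) of our own choosing.

```python
import random

def round_trip(weights, bits):
    """Quantize to `bits` bits with one scale and offset, then reconstruct."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    restored = [round((w - lo) / scale) * scale + lo for w in weights]
    return restored, scale

random.seed(42)
weights = [random.gauss(0.0, 0.5) for _ in range(4096)]

mean_error = {}
for bits in (8, 6, 5, 4, 3, 2):
    restored, scale = round_trip(weights, bits)
    errors = [abs(w - r) for w, r in zip(weights, restored)]
    mean_error[bits] = sum(errors) / len(errors)
    print(f"Q{bits}: step={scale:.4f} mean abs error={mean_error[bits]:.4f}")
```

Each bit removed roughly doubles the step between representable values, and the average error follows.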

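A single weight loses a tiny amount of precision. But the model has billions of weights, and the arithmetic happens billions of times per token, so the errors do compound. That accumulated error is the cost side of the trade-off. (In this table Q4 happens to land closer to the original than Q5; for one weight that is an accident of where the grid points fall, and averaged over a block the error grows as the bit count shrinks.)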

What rescues this in practice is that:

Errors are not all systematic in the same direction. Many cancel out across a layer.
The model is highly redundant. Many weights play similar roles. Losing precision on a few does not destroy the global behaviour.
K-quants spend more bits on the weights that matter most (attention layers, embedding tables) and fewer bits on the rest.
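The first point, that errors partly cancel, can be illustrated with a single dot product, the basic operation inside a layer. This is a sketch under simplifying assumptions: random weights and inputs, and one 4-bit scale and offset for the whole vector rather than per block.

```python
import random

random.seed(7)
n = 4096
weights = [random.gauss(0.0, 0.5) for _ in range(n)]
inputs = [random.gauss(0.0, 1.0) for _ in range(n)]

# 4-bit round trip with a single scale and offset (a simplification).
levels = (1 << 4) - 1
lo, hi = min(weights), max(weights)
scale = (hi - lo) / levels
quantized = [round((w - lo) / scale) * scale + lo for w in weights]

# Error that each quantized weight contributes to the dot product.
term_errors = [(q - w) * x for w, q, x in zip(weights, quantized, inputs)]
net_error = abs(sum(term_errors))                # what the layer output sees
worst_case = sum(abs(e) for e in term_errors)    # if every error pushed the same way

print(f"net error={net_error:.3f} worst case={worst_case:.3f}")
```

Because the individual errors carry both signs, the net error on the output is typically far below the worst case in which they all add up.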

The end result, on Q4_K_M, is a model that:

Takes about 30 % of the FP16 file size (roughly 4.6 GB instead of 16 GB for an 8B model).
Scores within 1 – 3 percentage points of FP16 on standard benchmarks.
Is indistinguishable for most users on chat, code completion, summarisation.

3.4 The footprint table — same model, different quantizations

This is the table the workshop needs on the screen. Same model parameter count, different quantization choices, very different memory budgets.

Parameters	FP16	Q8	Q6	Q5	Q4 (default)	Q3	Q2
1B	~2.0 GB	~1.0 GB	~0.8 GB	~0.7 GB	~0.6 GB	~0.5 GB	~0.4 GB
3B	~6.0 GB	~3.0 GB	~2.4 GB	~2.0 GB	~1.7 GB	~1.4 GB	~1.1 GB
7B	~14 GB	~7.0 GB	~5.5 GB	~4.8 GB	~4.0 GB	~3.3 GB	~2.6 GB
8B	~16 GB	~8.0 GB	~6.3 GB	~5.5 GB	~4.6 GB	~3.8 GB	~3.0 GB
14B	~28 GB	~14 GB	~11 GB	~9.5 GB	~8.0 GB	~6.5 GB	~5.2 GB
32B	~64 GB	~32 GB	~26 GB	~22 GB	~19 GB	~15 GB	~12 GB
70B	~140 GB	~70 GB	~56 GB	~48 GB	~40 GB	~33 GB	~26 GB
405B	~810 GB	~405 GB	~324 GB	~280 GB	~230 GB	~190 GB	~150 GB
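The table can be approximated with one formula: parameters × effective bits per weight ÷ 8. In the sketch below, the bits-per-weight values are read off the 8B row of the table above (size in GB × 8 ÷ 8 billion weights); they are this annex's estimates, not official GGUF figures, and the names are illustrative.

```python
# Effective bits per weight implied by the 8B row of the table (an estimate).
BITS_PER_WEIGHT = {
    "FP16": 16.0, "Q8": 8.0, "Q6": 6.3, "Q5": 5.5, "Q4": 4.6, "Q3": 3.8, "Q2": 3.0,
}

def weights_gb(parameters_billions: float, quant: str) -> float:
    """Approximate weights-only size in GB (1 GB = 10**9 bytes)."""
    return parameters_billions * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B at {quant}: ~{weights_gb(70, quant):.0f} GB")
```

The effective figure for Q4 is above 4 bits because each block also stores its scale and offset, and K-quants keep some tensors at higher precision.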

4. The quality cost — what we actually observe

4.1 The big picture

Quantization is not free, but on most workshop-grade tasks the cost is small. The table below summarises what the open-source community measures on standard benchmarks (MMLU, HumanEval, GSM8K, etc.) compared to the FP16 reference.

Quantization	Typical benchmark drop vs FP16	Practical perception
F16	0 (reference)	The training-time quality
Q8	< 0.5 percentage points	Indistinguishable from FP16 on chat, code, summarisation
Q6	~0.5 – 1 pp	Indistinguishable on most tasks; tiny edge-case differences
Q5	~1 – 2 pp	Indistinguishable to most users; visible on hard reasoning prompts
Q4 (default)	~1 – 3 pp	Production-grade for chat, code, summarisation, RAG. Visible on long-chain reasoning or competitive math.
Q3	~3 – 6 pp	Visible quality drop; acceptable only when memory is severely constrained
Q2	~6 – 12 pp	Visible drop on almost everything; emergency setting

4.2 What gets affected first

When precision drops, the failures appear in a predictable order:

Long multi-step reasoning (chains of 5+ deductive steps) — degrades from Q5 down.
Competitive mathematics and exact arithmetic — degrades from Q5 down.
Code generation with strict formatting (JSON tool calls, for example — see chapter 05b on qwen2.5-coder:7b’s tool-call escaping) — degrades from Q4 down.
Long-context recall (asking about something said 20 K tokens earlier) — degrades from Q4 down.
General chat, summarisation, simple translation, basic code completion — robust down to Q3, sometimes Q2.

For the workshop demos in this course (chat, simple Java agents, comparator, prompt engineering), Q4_K_M is fine for everything.

4.3 A useful mental shortcut

A larger model in a lower precision usually beats a smaller model in a higher precision.

Concretely, on the same memory budget:

Memory budget	Option A	Option B	Usually better
~5 GB	`llama3.1:8b` Q4	`llama3.2:3b` Q8	A (the 8B at Q4)
~10 GB	`qwen2.5-coder:14b` Q4	`qwen2.5-coder:7b` Q8	A (the 14B at Q4)
~25 GB	`qwen2.5-coder:32b` Q4	`qwen2.5-coder:14b` Q8	A (the 32B at Q4)
~50 GB	`llama3.1:70b` Q4	`qwen2.5-coder:32b` Q8	A (the 70B at Q4)

This is why Ollama defaults to Q4_K_M: it lets users run the largest possible model on their available hardware, at a quality cost that is almost always smaller than the gain from going up one parameter tier.
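The shortcut can be turned into a quick lookup: given a memory budget, find the largest parameter tier whose Q4 weights fit. This is a sketch that counts weights only and ignores context and runtime overhead; the 4.6 bits per weight is the estimate implied by the Q4 column of the footprint table, and the function name is hypothetical.

```python
MODEL_SIZES_B = (1, 3, 7, 8, 14, 32, 70, 405)   # parameter tiers, in billions
Q4_BITS = 4.6                                   # estimated effective bits per weight at Q4

def largest_q4_model(budget_gb: float):
    """Return the largest tier (in billions) whose Q4 weights fit the budget, or None."""
    fitting = [p for p in MODEL_SIZES_B if p * Q4_BITS / 8 <= budget_gb]
    return max(fitting) if fitting else None

for budget in (5, 10, 25, 50):
    print(f"{budget} GB budget -> {largest_q4_model(budget)}B at Q4")
```

For the four budgets in the table, this picks the same parameter tier as Option A in each row.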

5. A decision rule — how to choose

5.1 By machine tier

Laptops

Default: Q4_K_M. That is what ollama pull <model> gives you. Trust it.
Do not go below Q4 on this tier: the quality drop becomes visible.
Do not go above Q4 unless you have specific evidence that the task fails on Q4 (typically hard reasoning or competitive math).
For tool-calling agents (chapter 09), Q4_K_M of llama3.1:8b is the validated configuration.

Mid-range GPUs

Default: Q4_K_M for any model the GPU can hold entirely. You will get full GPU speed.
Q5_K_M or Q6_K are reasonable on a 12 – 16 GB VRAM card if you have headroom: a small quality bump for a slight speed cost.
Avoid mixing Q8 of one model and Q4 of another in a comparison: the precision difference confounds the model-quality comparison you are trying to make.

Workstations and GB10

For benchmarks that aim to compare model quality, prefer Q8_0 or F16 when memory allows. This isolates the model from the quantization noise.
For production use (someone querying the model in a chat), Q4_K_M still wins on speed and on cost-per-token, with a quality cost that is rarely visible.
For 70B and above, Q4_K_M is the practical choice even on a GB10: F16 of 70B (~140 GB) does not fit in 128 GB of unified memory.
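The rules for the three tiers (laptops, mid-range GPUs, workstation / GB10) can be condensed into a small helper. It is only a sketch of the rules as stated in this annex: the tier names and flags are illustrative labels, not Ollama options, and the function does not check that the chosen quantization actually fits in memory.

```python
def recommend_quant(tier: str, benchmarking: bool = False, has_headroom: bool = False) -> str:
    """Return the quantization tag this annex suggests for a machine tier."""
    if tier == "laptop":
        return "Q4_K_M"                                  # the default; do not go below
    if tier == "mid_gpu":
        return "Q5_K_M" if has_headroom else "Q4_K_M"    # small bump if VRAM allows
    if tier == "workstation":
        return "Q8_0" if benchmarking else "Q4_K_M"      # isolate model quality in benchmarks
    raise ValueError(f"unknown tier: {tier}")

print(recommend_quant("laptop"))
print(recommend_quant("mid_gpu", has_headroom=True))
print(recommend_quant("workstation", benchmarking=True))
```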

5.2 Inspecting and pulling a specific quantization in Ollama

Ollama tags include the quantization. Default llama3.1:8b resolves to llama3.1:8b-instruct-q4_K_M. To pick something else explicitly:

# Default — Q4_K_M
ollama pull llama3.1:8b

# Explicit Q8_0 (better quality, double the size)
ollama pull llama3.1:8b-instruct-q8_0

# Explicit Q5_K_M
ollama pull llama3.1:8b-instruct-q5_K_M

# F16 (the reference; ~16 GB on disk for the 8B)
ollama pull llama3.1:8b-instruct-fp16

To inspect what you have actually pulled:

ollama show llama3.1:8b

The output reports the architecture, the parameter count, the quantization, the context length, and the embedding length. The quantization line tells you exactly which precision is on disk.

5.3 Pitfalls and clichés to avoid

Cliché you may hear	Reality
“Q4 is a degraded mode for laptops.”	No. Q4_K_M is Ollama’s production default and the configuration used in most public deployments of local LLMs.
“More bits is always better.”	Above Q6 / Q8, the gain is statistically real but not perceptible on most workshop tasks.
“Quantizing makes the model dumber.”	For chat, code completion and summarisation, no. For long-chain reasoning and competitive math, slightly.
“Quantizing changes the parameter count.”	No. The number of weights is unchanged. Only the way each weight is encoded is different.
“A 14B-Q8 must be better than a 32B-Q4.”	Usually false. On the same memory budget, the larger model at lower precision wins.
“I can run a 70B model on my laptop, I’ll just use Q2.”	The math says you can fit it. The quality at Q2 makes it unusable for anything past basic chat. Pick a smaller model at Q4 instead.

6. A unifying mental model

A useful way to picture it, in two sentences:

The parameter count measures how much the model knows. The quantization measures how precisely it remembers what it knows.

Both knobs matter. Cutting parameters by half is a much bigger loss than dropping from Q8 to Q4. That is why the community standardised on Q4_K_M as the default: it preserves the model’s knowledge at a small precision cost, and that trade-off is almost always the right one.

Key takeaways

A model is billions of small numbers (parameters). Each number can be stored more or less precisely.
FP16 (or BF16) is the reference precision. It is the precision the model was trained in.
Without quantization, local LLMs would be a research-only practice — a 70B model in FP16 needs around 160 GB to run.
Quantization replaces each FP16 weight by a smaller integer (Q4, Q5, Q6, Q8). The model is not retrained; only the storage changes.
Q4_K_M is Ollama’s production default. Its quality is within 1 – 3 percentage points of FP16 on standard benchmarks.
The failure order when precision drops: long-chain reasoning → competitive math → strict tool-call JSON → long-context recall → general chat (very robust).
Rule of thumb: a larger model at lower precision usually beats a smaller model at higher precision for the same memory budget.
Three-tier choice: laptops use Q4 (default); mid-range GPUs can go Q5 / Q6; workstation / GB10 benchmarks should use Q8 or F16 to isolate model quality from quantization noise.
Inspect with ollama show <model>; pick a quantization explicitly with the tag (:8b-instruct-q8_0, etc.).
The mental model: parameter count = how much it knows; quantization = how precisely it remembers what it knows.