Annex — Quantization, from scratch
Duration: 20 min
Prerequisite: chapter 05c (vocabulary on parameters, RAM, VRAM)
Why this annex exists
Quantization is the single most misunderstood part of running local LLMs. It explains why llama3.1:8b fits on a laptop with 8 GB of RAM despite having 8 billion parameters, why qwen2.5-coder:32b is around 20 GB and not 64 GB on disk, and why Ollama’s default Q4_K_M is production-grade, not a degraded mode.
This annex takes 20 minutes to read and gives every member of the workshop the same mental model. It is organised in five sections:
- What a parameter is, and why precision matters.
- The footprint problem — why FP16 makes 70B models unusable on consumer hardware.
- Quantization itself — the mechanics, the naming convention, a worked numerical example.
- The quality cost — what gets lost at Q4 vs Q8 vs FP16, on which kinds of tasks.
- A decision rule — which quantization to pick for which machine, with the Ollama commands.
1. What a parameter is — and why “precision” matters
1.1 A parameter is just a number
A modern LLM is a network of billions of small numerical weights. Each weight is a real number between roughly −10 and +10, learned during training. When the model generates text, it multiplies and adds these numbers billions of times per token.
A few examples of what a weight might actually be:
```
weight_42 = 0.7384
weight_2_310 = -1.2055
weight_8_991_007 = 0.0001
```
That is the whole content of a model. Billions of numbers like these. A 7-billion-parameter model holds 7 × 10⁹ of them. A 70B model holds 70 × 10⁹.
1.2 The precision question
A number like 0.7384 can be stored in many different ways inside the computer:
| Encoding | Bits used | Approximate value stored |
|---|---|---|
| FP16 (half precision) | 16 bits | 0.7383 (essentially exact) |
| FP8 | 8 bits | 0.74 |
| INT4 / Q4 | 4 bits | ~0.75 |
| INT2 / Q2 | 2 bits | ~0.67 or ~0.83 (very coarse) |
The fewer bits used to represent each weight, the less precise the value, but the less memory the model needs.
That is the entire idea of quantization.
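To make the table concrete, here is a minimal Python sketch using only the standard library. It round-trips 0.7384 through IEEE 754 half precision and counts how many distinct codes an n-bit field can represent. The helper names `to_fp16` and `distinct_levels` are illustrative and not part of any workshop tool.

```python
import struct

def to_fp16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (16 bits)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def distinct_levels(bits: int) -> int:
    """Number of different codes an n-bit field can represent."""
    return 2 ** bits

weight = 0.7384
stored = to_fp16(weight)
print(f"FP16 stores {weight} as {stored}")  # 0.73828125

for bits in (16, 8, 4, 2):
    print(f"{bits:>2} bits -> {distinct_levels(bits)} possible codes")
```

With 4 bits there are only 16 codes to cover every value in a block, which is why the stored value becomes visibly coarser as the bit count drops.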
2. The footprint problem
2.1 The naive cost: FP16
The reference precision used during training is FP16 (16 bits = 2 bytes per weight) or BF16 (Brain-Float 16, also 16 bits, with a layout better suited to deep learning).
For a 7B model:
```
7 × 10⁹ weights × 2 bytes = 14 GB just for the weights
```
For a 70B model:
```
70 × 10⁹ weights × 2 bytes = 140 GB just for the weights
```
These numbers are before any context buffer, attention cache, runtime overhead or operating-system memory. With overhead, an FP16 70B model needs roughly 160 – 180 GB of memory to run comfortably.
That is well beyond the memory of any consumer laptop. Even a high-end workstation with two RTX A6000 cards (48 GB each = 96 GB total VRAM) cannot host it without partial CPU offload.
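The arithmetic above can be sketched in a few lines of Python. This is a rough illustration that counts the weights only and deliberately ignores context buffers and runtime overhead. The function names are hypothetical.

```python
def fp16_weight_gb(params_billions: float, bytes_per_weight: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_billions * 1e9 * bytes_per_weight / 1e9

def fits_in(params_billions: float, memory_gb: float) -> bool:
    """True if the FP16 weights alone fit; overhead is deliberately ignored."""
    return fp16_weight_gb(params_billions) <= memory_gb

for size in (7, 70):
    print(f"{size}B in FP16: {fp16_weight_gb(size):.0f} GB of weights")

# Two 48 GB cards, as in the workstation example above.
print("70B fits in 96 GB of VRAM:", fits_in(70, 96))  # False
```

Because overhead is left out, a `True` from `fits_in` is a necessary condition rather than a guarantee that the model will run comfortably.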
2.2 The footprint table for FP16
| Model size | FP16 weights | Practical FP16 memory (with overhead) |
|---|---|---|
| 3B | ~6 GB | ~8 – 10 GB |
| 7B | ~14 GB | ~18 – 20 GB |
| 8B | ~16 GB | ~20 – 24 GB |
| 14B | ~28 GB | ~36 – 40 GB |
| 32B | ~64 GB | ~80 – 96 GB |
| 70B | ~140 GB | ~160 – 180 GB |
| 405B | ~810 GB | well above 1 TB |
| 671B | ~1.34 TB | research infrastructure only |
In FP16, even a 14B model exceeds what a standard workstation can host. Without quantization, local LLMs would be a research-only practice.
3. Quantization itself
3.1 The mechanics in one paragraph
Quantization replaces each FP16 weight with a smaller integer or a smaller floating-point value. The weights are grouped into small blocks (typically 32 or 64 at a time). For each block, two numbers are stored: a scale (a multiplier) and a zero point (an offset); some formats store only the scale. Each individual weight is then replaced by a low-bit integer that, multiplied by the scale and shifted by the zero point, approximately reconstructs the original value at inference time.
That is the entire idea. The model is not retrained. Only the way the weights are stored is changed.
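The mechanics can be sketched as follows. This is a simplified illustration of block quantization with a scale and a zero point, not the exact scheme used by any GGUF format. The block is only 8 weights long for readability, and apart from the three example weights from section 1.1 the values are made up.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to unsigned integers plus a scale and a zero point."""
    levels = 2 ** bits - 1             # highest integer code, e.g. 15 for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for a constant block
    zero_point = lo
    codes = [round((w - zero_point) / scale) for w in weights]
    return codes, scale, zero_point

def dequantize_block(codes, scale, zero_point):
    """Reconstruct approximate weights at inference time."""
    return [c * scale + zero_point for c in codes]

block = [0.7384, -1.2055, 0.0001, 0.31, -0.48, 1.02, -0.07, 0.55]
codes, scale, zero_point = quantize_block(block, bits=4)
restored = dequantize_block(codes, scale, zero_point)

for original, code, approx in zip(block, codes, restored):
    print(f"{original:+.4f} -> code {code:2d} -> {approx:+.4f}")
```

Each weight now costs 4 bits plus a share of the per-block scale and zero point, and the reconstruction error is at most half a quantization step.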
3.2 The naming convention
Ollama uses the GGUF file format and the Q<n>_<type> naming convention:
| Tag | Bits per weight | Notes |
|---|---|---|
| Q2_K | ~2 bits | Extreme compression; visible quality loss |
| Q3_K_S / Q3_K_M / Q3_K_L | ~3 bits | Small (S), Medium (M), Large (L) K-quants |
| Q4_0 / Q4_1 | ~4 bits | Older 4-bit formats |
| Q4_K_M | ~4 bits | Ollama’s default; production-grade for most tasks |
| Q5_K_S / Q5_K_M | ~5 bits | Sweet spot when 4 bits feels too aggressive |
| Q6_K | ~6 bits | Near-imperceptible quality loss |
| Q8_0 | ~8 bits | Almost identical to FP16 on most benchmarks |
| F16 | 16 bits | The reference (no quantization) |
The K in Q4_K_M, Q5_K_S etc. refers to K-quants — a family of methods that distribute precision better across the model (giving more bits to the layers that need them most and fewer bits to the layers that tolerate compression well). The suffix (_S, _M, _L) is just Small / Medium / Large and controls how aggressive the compression is.
For 99 % of practical use, the only tag you need to know by name is Q4_K_M, and the only knob you adjust is the size class.
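As a reading aid, the naming convention can be turned into a small parser. The pattern below is inferred only from the tags listed in the table above and is an assumption, not an official grammar. `describe_tag` is a hypothetical helper.

```python
import re

# Pattern inferred from the tags in the table above (assumption, not a spec).
TAG_PATTERN = re.compile(
    r"^Q(?P<bits>\d)_(?:(?P<k>K)(?:_(?P<size>[SML]))?|(?P<plain>[01]))$"
)
SIZE_NAMES = {"S": "small", "M": "medium", "L": "large"}

def describe_tag(tag: str) -> dict:
    """Split a tag such as Q4_K_M into nominal bits, family and size class."""
    tag = tag.upper()
    if tag == "F16":
        return {"bits": 16, "family": "unquantized", "size": None}
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised tag: {tag}")
    return {
        "bits": int(match["bits"]),
        "family": "k-quant" if match["k"] else "non-k",
        "size": SIZE_NAMES.get(match["size"]),
    }

for tag in ("Q4_K_M", "Q8_0", "Q6_K", "F16"):
    print(tag, describe_tag(tag))
```

The bit count returned here is the nominal one from the tag name. The space actually used per weight is somewhat higher because of the per-block metadata.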
3.3 A worked numerical example
Take a single weight: weight_42 = 0.7384. Now look at what is actually stored at each precision.
| Precision | Storage | Value reconstructed at inference | Absolute error |
|---|---|---|---|
| FP16 | 16 bits, sign + 5-bit exponent + 10-bit mantissa | 0.7383 | ~0.0001 |
| Q8 | 8-bit integer per weight + block scale | 0.7373 | ~0.0011 |
| Q6 | 6-bit integer per weight + block scale | 0.7345 | ~0.0039 |
| Q5 | 5-bit integer per weight + block scale | 0.7188 | ~0.0196 |
| Q4 | 4-bit integer per weight + block scale | 0.7500 | ~0.0116 |
| Q3 | 3-bit integer per weight + block scale | 0.6875 | ~0.0509 |
| Q2 | 2-bit integer per weight + block scale | 0.6667 | ~0.0717 |
A single weight loses a tiny amount of precision. But the model has billions of weights, and the arithmetic happens billions of times per token. The errors do compound — that is the whole point of the trade-off.
What rescues this in practice is that:
- Errors are not all systematic in the same direction. Many cancel out across a layer.
- The model is highly redundant. Many weights play similar roles. Losing precision on a few does not destroy the global behaviour.
- K-quants spend more bits on the weights that matter most (attention layers, embedding tables) and fewer bits on the rest.
The end result, on Q4_K_M, is a model that:
- Takes roughly 25 – 30 % of the FP16 file size (about 4.6 GB versus 16 GB for an 8B model).
- Scores within 1 – 3 percentage points of FP16 on standard benchmarks.
- Is indistinguishable for most users on chat, code completion, summarisation.
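The single-weight table shows one draw, and a single weight can buck the trend (its Q5 error happens to be larger than its Q4 error). Averaged over many weights the picture is smoother. The sketch below quantizes synthetic Gaussian blocks, which stand in for real weights, and reports the mean absolute error per bit width. It uses a plain scale and zero point scheme, not a real K-quant.

```python
import random

def mean_abs_error(weights, bits):
    """Quantize one block with a scale and zero point, return the mean absolute error."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels
    restored = [round((w - lo) / scale) * scale + lo for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)

random.seed(0)  # fixed seed so the run is repeatable
# Synthetic stand-in for real weights: 128 blocks of 32 values each.
blocks = [[random.gauss(0.0, 1.0) for _ in range(32)] for _ in range(128)]

errors = {}
for bits in (2, 3, 4, 5, 6, 8):
    per_block = [mean_abs_error(block, bits) for block in blocks]
    errors[bits] = sum(per_block) / len(per_block)
    print(f"Q{bits}: mean absolute error ~ {errors[bits]:.4f}")
```

Each extra bit roughly halves the quantization step, so the average error shrinks steadily from Q2 to Q8 even though individual weights may not follow that order.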
3.4 The footprint table — same model, different quantizations
This is the table the workshop needs on the screen. Same model parameter count, different quantization choices, very different memory budgets.
| Parameters | FP16 | Q8 | Q6 | Q5 | Q4 (default) | Q3 | Q2 |
|---|---|---|---|---|---|---|---|
| 1B | ~2.0 GB | ~1.0 GB | ~0.8 GB | ~0.7 GB | ~0.6 GB | ~0.5 GB | ~0.4 GB |
| 3B | ~6.0 GB | ~3.0 GB | ~2.4 GB | ~2.0 GB | ~1.7 GB | ~1.4 GB | ~1.1 GB |
| 7B | ~14 GB | ~7.0 GB | ~5.5 GB | ~4.8 GB | ~4.0 GB | ~3.3 GB | ~2.6 GB |
| 8B | ~16 GB | ~8.0 GB | ~6.3 GB | ~5.5 GB | ~4.6 GB | ~3.8 GB | ~3.0 GB |
| 14B | ~28 GB | ~14 GB | ~11 GB | ~9.5 GB | ~8.0 GB | ~6.5 GB | ~5.2 GB |
| 32B | ~64 GB | ~32 GB | ~26 GB | ~22 GB | ~19 GB | ~15 GB | ~12 GB |
| 70B | ~140 GB | ~70 GB | ~56 GB | ~48 GB | ~40 GB | ~33 GB | ~26 GB |
| 405B | ~810 GB | ~405 GB | ~324 GB | ~280 GB | ~230 GB | ~190 GB | ~150 GB |
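The table can be approximated with one multiplication. The effective bits per weight below are read off the 8B row of the table (size in GB × 8 ÷ 8 billion parameters), so they are approximate by construction. They sit slightly above the nominal bit count, which is consistent with the per-block scales being stored alongside the weights.

```python
# Effective bits per weight, derived from the 8B row of the table above.
# These are approximations for illustration, not format specifications.
EFFECTIVE_BITS = {
    "FP16": 16.0, "Q8": 8.0, "Q6": 6.3, "Q5": 5.5,
    "Q4": 4.6, "Q3": 3.8, "Q2": 3.0,
}

def estimate_gb(params_billions: float, quant: str) -> float:
    """Approximate weight size in GB for a parameter count and quantization."""
    return params_billions * EFFECTIVE_BITS[quant] / 8

for size in (8, 32, 70):
    row = ", ".join(f"{q} ~{estimate_gb(size, q):.1f} GB" for q in EFFECTIVE_BITS)
    print(f"{size}B: {row}")
```

For a parameter count that is not in the table, `estimate_gb` gives a first guess at whether a given quantization will fit before anything is downloaded.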
4. The quality cost — what we actually observe
4.1 The big picture
Quantization is not free, but on most workshop-grade tasks the cost is small. The table below summarises what the open-source community measures on standard benchmarks (MMLU, HumanEval, GSM8K, etc.) compared to the FP16 reference.
| Quantization | Typical benchmark drop vs FP16 | Practical perception |
|---|---|---|
| F16 | 0 (reference) | The training-time quality |
| Q8 | < 0.5 percentage points | Indistinguishable from FP16 on chat, code, summarisation |
| Q6 | ~0.5 – 1 pp | Indistinguishable on most tasks; tiny edge-case differences |
| Q5 | ~1 – 2 pp | Indistinguishable to most users; visible on hard reasoning prompts |
| Q4 (default) | ~1 – 3 pp | Production-grade for chat, code, summarisation, RAG. Visible on long-chain reasoning or competitive math. |
| Q3 | ~3 – 6 pp | Visible quality drop; acceptable only when memory is severely constrained |
| Q2 | ~6 – 12 pp | Visible drop on almost everything; emergency setting |
4.2 What gets affected first
When precision drops, the failures appear in a predictable order:
- Long multi-step reasoning (chains of 5+ deductive steps): degrades from Q5 down.
- Competitive mathematics and exact arithmetic: degrades from Q5 down.
- Code generation with strict formatting (JSON tool calls, for example; see chapter 05b on `qwen2.5-coder:7b`’s tool-call escaping): degrades from Q4 down.
- Long-context recall (asking about something said 20 K tokens earlier): degrades from Q4 down.
- General chat, summarisation, simple translation, basic code completion: robust down to Q3, sometimes Q2.
For the workshop demos in this course (chat, simple Java agents, comparator, prompt engineering), Q4_K_M is fine for everything.
4.3 A useful mental shortcut
A larger model in a lower precision usually beats a smaller model in a higher precision.
Concretely, on the same memory budget:
| Memory budget | Option A | Option B | Usually better |
|---|---|---|---|
| ~5 GB | llama3.1:8b Q4 | llama3.2:3b Q8 | A (the 8B at Q4) |
| ~10 GB | qwen2.5-coder:14b Q4 | qwen2.5-coder:7b Q8 | A (the 14B at Q4) |
| ~25 GB | qwen2.5-coder:32b Q4 | qwen2.5-coder:14b Q8 | A (the 32B at Q4) |
| ~50 GB | llama3.1:70b Q4 | qwen2.5-coder:32b Q8 | A (the 70B at Q4) |
This is why Ollama defaults to Q4_K_M: it lets users run the largest possible model on their available hardware, at a quality cost that is almost always smaller than the gain from going up one parameter tier.
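The comparison can be reproduced with a short sketch. It assumes the parameter counts used in the tables above and approximate effective bits per weight (4.6 for Q4, 8.0 for Q8, read off the 8B row of section 3.4). It counts weights only, so real headroom for context and runtime overhead still has to be added.

```python
MODEL_SIZES_B = [1, 3, 7, 8, 14, 32, 70]  # parameter counts, in billions
BITS = {"Q4": 4.6, "Q8": 8.0}             # approximate effective bits per weight

def largest_model_that_fits(budget_gb: float, quant: str):
    """Return the largest parameter count whose weights fit in the budget, or None."""
    fitting = [p for p in MODEL_SIZES_B if p * BITS[quant] / 8 <= budget_gb]
    return max(fitting) if fitting else None

for budget in (5, 10, 25, 50):
    q4 = largest_model_that_fits(budget, "Q4")
    q8 = largest_model_that_fits(budget, "Q8")
    print(f"{budget:>2} GB budget: {q4}B at Q4 versus {q8}B at Q8")
```

At every budget in the loop, Q4 reaches a larger parameter tier than Q8, which is the trade-off the table describes.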
5. A decision rule — how to choose
5.1 By machine tier
Laptop:
- Default: `Q4_K_M`. That is what `ollama pull <model>` gives you. Trust it.
- Do not go below Q4 on this tier: the quality drop becomes visible.
- Do not go above Q4 unless you have specific evidence that the task fails on Q4 (typically hard reasoning or competitive math).
- For tool-calling agents (chapter 09), Q4_K_M of `llama3.1:8b` is the validated configuration.

Mid-range GPU:
- Default: `Q4_K_M` for any model the GPU can hold entirely. You will get full GPU speed.
- Q5_K_M or Q6_K are reasonable on a 12 – 16 GB VRAM card if you have headroom: small quality bump, slight speed cost.
- Avoid mixing Q8 of one model and Q4 of another in a comparison: the precision difference confounds the model-quality comparison you are trying to make.

Workstation / GB10:
- For benchmarks that aim to compare model quality, prefer Q8_0 or F16 when memory allows. This isolates the model from the quantization noise.
- For production use (someone querying the model in a chat), `Q4_K_M` still wins on speed and on cost-per-token, with a quality cost that is rarely visible.
- For 70B and above, Q4_K_M is the practical choice even on a GB10: F16 of 70B (~140 GB) does not fit in 128 GB of unified memory.
5.2 Inspecting and pulling a specific quantization in Ollama
Ollama tags include the quantization. Default llama3.1:8b resolves to llama3.1:8b-instruct-q4_K_M. To pick something else explicitly:
```bash
# Default: Q4_K_M
ollama pull llama3.1:8b

# Explicit Q8_0 (better quality, roughly double the size)
ollama pull llama3.1:8b-instruct-q8_0

# Explicit Q5_K_M
ollama pull llama3.1:8b-instruct-q5_K_M

# F16 (the reference; ~16 GB on disk for the 8B)
ollama pull llama3.1:8b-instruct-fp16
```
To inspect what you have actually pulled:
```bash
ollama show llama3.1:8b
```
The output reports the architecture, the parameter count, the quantization, the context length, and the embedding length. The quantization line tells you exactly which precision is on disk.
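The same check can be scripted against a running Ollama server. The sketch below makes several assumptions: that the server listens on the default local address, that the `/api/show` endpoint accepts a JSON body with a `model` field (older releases may expect `name` instead), and that the response carries the quantization under `details.quantization_level`. Verify these against the API documentation for your installed version. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default local address

def quantization_from_show(payload: dict) -> str:
    """Extract the quantization label from a decoded /api/show response."""
    return payload.get("details", {}).get("quantization_level", "unknown")

def show_model(model: str, base_url: str = OLLAMA_URL) -> dict:
    """Ask a running Ollama server to describe a model (network call)."""
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

# Offline illustration with a hand-written payload shaped like the response.
sample = {"details": {"parameter_size": "8.0B", "quantization_level": "Q4_K_M"}}
print(quantization_from_show(sample))  # Q4_K_M
```

With a server running, `quantization_from_show(show_model("llama3.1:8b"))` would return the label for the model that is actually on disk.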
5.3 Pitfalls and clichés to avoid
| Cliché you may hear | Reality |
|---|---|
| “Q4 is a degraded mode for laptops.” | No. Q4_K_M is Ollama’s production default and a common configuration in public deployments of local LLMs. |
| “More bits is always better.” | Above Q6 / Q8, the gain is statistically real but not perceptible on most workshop tasks. |
| “Quantizing makes the model dumber.” | For chat, code completion and summarisation, no. For long-chain reasoning and competitive math, slightly. |
| “Quantizing changes the parameter count.” | No. The number of weights is unchanged. Only the way each weight is encoded is different. |
| “A 14B-Q8 must be better than a 32B-Q4.” | Usually false. On the same memory budget, the larger model at lower precision wins. |
| “I can run a 70B model on my laptop, I’ll just use Q2.” | The math says you can fit it. The quality at Q2 makes it unusable for anything past basic chat. Pick a smaller model at Q4 instead. |
6. A unifying mental model
A useful way to picture it, in two sentences:
The parameter count measures how much the model knows. The quantization measures how precisely it remembers what it knows.
Both knobs matter. Cutting parameters by half is a much bigger loss than dropping from Q8 to Q4. That is why the community standardised on Q4_K_M as the default: it preserves the model’s knowledge at a small precision cost, and that trade-off is almost always the right one.
Key takeaways
- A model is billions of small numbers (parameters). Each number can be stored more or less precisely.
- FP16 is the reference precision. It is the way the model was trained.
- Without quantization, local LLMs would be a research-only practice — a 70B model in FP16 needs around 160 GB to run.
- Quantization replaces each FP16 weight by a smaller integer (Q4, Q5, Q6, Q8). The model is not retrained; only the storage changes.
- Q4_K_M is Ollama’s production default. Its quality is within 1 – 3 percentage points of FP16 on standard benchmarks.
- The failure order when precision drops: long-chain reasoning → competitive math → strict tool-call JSON → long-context recall → general chat (very robust).
- Rule of thumb: a larger model at lower precision usually beats a smaller model at higher precision for the same memory budget.
- Three-tier choice: laptops use Q4 (default); mid-range GPUs can go Q5 / Q6; workstation / GB10 benchmarks should use Q8 or F16 to isolate model quality from quantization noise.
- Inspect with `ollama show <model>`; pick a quantization explicitly with the tag (`:8b-instruct-q8_0`, etc.).
- The mental model: parameter count = how much it knows; quantization = how precisely it remembers what it knows.