
Annex — Quantization, from scratch

Duration: 20 min
Prerequisite: chapter 05c (vocabulary on parameters, RAM, VRAM)

Quantization is the single most misunderstood part of running local LLMs. It explains why llama3.1:8b fits on a laptop with 8 GB of RAM despite having 8 billion parameters, why qwen2.5-coder:32b is around 20 GB and not 64 GB on disk, and why Ollama’s default Q4_K_M is production-grade, not a degraded mode.

This annex takes 20 minutes to read and gives every member of the workshop the same mental model. It is organised in five sections:

  1. What a parameter is, and why precision matters.
  2. The footprint problem — why FP16 makes 70B models unusable on consumer hardware.
  3. Quantization itself — the mechanics, the naming convention, a worked numerical example.
  4. The quality cost — what gets lost at Q4 vs Q8 vs FP16, on which kinds of tasks.
  5. A decision rule — which quantization to pick for which machine, with the Ollama commands.

1. What a parameter is — and why “precision” matters


A modern LLM is a network of billions of small numerical weights. Each weight is a real number between roughly −10 and +10, learned during training. When the model generates text, it multiplies and adds these numbers billions of times per token.

A few examples of what a weight might actually be:

weight_42 = 0.7384
weight_2_310 = -1.2055
weight_8_991_007 = 0.0001

That is the whole content of a model. Billions of numbers like these. A 7-billion-parameter model holds 7 × 10⁹ of them. A 70B model holds 70 × 10⁹.

A number like 0.7384 can be stored in many different ways inside the computer:

| Encoding | Bits used | Approximate value stored |
|---|---|---|
| FP16 (half precision) | 16 bits | 0.7383 (essentially exact) |
| FP8 | 8 bits | 0.74 |
| INT4 / Q4 | 4 bits | ~0.75 |
| INT2 / Q2 | 2 bits | ~0.67 or ~0.83 (very coarse) |

The fewer bits used to represent each weight, the less precise the value, but the less memory the model needs.

That is the entire idea of quantization.
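The FP16 row can be checked directly. The sketch below relies on the `'e'` format code of Python's standard `struct` module, which is IEEE 754 half precision. The helper name `to_fp16` is made up for this illustration; the code only shows what rounding to 16 bits does to one number, not how an inference runtime stores weights.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

weight = 0.7384
stored = to_fp16(weight)
# The stored value is the closest number that 16 bits can represent.
print(f"original={weight}  fp16={stored}  error={abs(weight - stored):.6f}")
```

The stored value is 0.73828125, which the table rounds to 0.7383.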


The reference precision used during training is FP16 (16 bits = 2 bytes per weight) or BF16 (Brain-Float 16, also 16 bits, with a layout better suited to deep learning).

For a 7B model:

7 × 10⁹ weights × 2 bytes = 14 GB just for the weights

For a 70B model:

70 × 10⁹ weights × 2 bytes = 140 GB just for the weights

These numbers are before any context buffer, attention cache, runtime overhead or operating-system memory. With overhead, an FP16 70B model needs roughly 160 – 180 GB of memory to run comfortably.

That is well beyond the memory of any consumer laptop. Even a high-end workstation with two RTX A6000 cards (48 GB each = 96 GB total VRAM) cannot host it without partial CPU offload.

| Model size | FP16 weights | Practical FP16 memory (with overhead) |
|---|---|---|
| 3B | ~6 GB | ~8 – 10 GB |
| 7B | ~14 GB | ~18 – 20 GB |
| 8B | ~16 GB | ~20 – 24 GB |
| 14B | ~28 GB | ~36 – 40 GB |
| 32B | ~64 GB | ~80 – 96 GB |
| 70B | ~140 GB | ~160 – 180 GB |
| 405B | ~810 GB | well above 1 TB |
| 671B | ~1.34 TB | research infrastructure only |
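The "FP16 weights" column is plain arithmetic: parameter count times 2 bytes. A minimal sketch, assuming 1 GB = 10⁹ bytes (the function name is illustrative, and the overhead column is not modelled):

```python
BYTES_PER_FP16_WEIGHT = 2

def fp16_weights_gb(params_billions: float) -> float:
    """Weights-only footprint in GB (1 GB = 10**9 bytes), before any runtime overhead."""
    return params_billions * 1e9 * BYTES_PER_FP16_WEIGHT / 1e9

for size in (3, 7, 8, 14, 32, 70, 405, 671):
    print(f"{size}B -> ~{fp16_weights_gb(size):,.0f} GB")
```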

In FP16, even a 14B model exceeds what a standard workstation can host. Without quantization, local LLMs would be a research-only practice.


Quantization replaces each FP16 weight with a smaller integer or a smaller floating-point value. The weights are grouped into small blocks (typically 32 or 64 at a time). For each block, two numbers are stored: a scale (a multiplier) and a zero point (an offset). Each individual weight is then replaced by a low-bit integer that, multiplied by the scale and shifted by the zero point, approximately reconstructs the original value at inference time.

That is the entire idea. The model is not retrained. Only the way the weights are stored is changed.
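The scale / zero-point mechanism described above can be sketched in a few lines. This is a simplified min-max scheme written for illustration, under the assumption that the zero point is the block minimum. It is not the actual GGUF encoding, and the function names are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Map one block of floats to low-bit integer codes plus a scale and a zero point."""
    levels = (1 << bits) - 1                 # 15 distinct steps for 4 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    zero_point = lo                          # offset: code 0 maps back to the block minimum
    codes = [round((w - zero_point) / scale) for w in weights]
    return codes, scale, zero_point

def dequantize_block(codes, scale, zero_point):
    """Approximate reconstruction performed at inference time."""
    return [c * scale + zero_point for c in codes]

block = [0.7384, -1.2055, 0.0001, 0.31]
codes, scale, zero_point = quantize_block(block, bits=4)
restored = dequantize_block(codes, scale, zero_point)
for original, code, approx in zip(block, codes, restored):
    print(f"{original:+.4f} -> code {code:2d} -> {approx:+.4f}")
```

Each reconstructed value is off by at most half a step (`scale / 2`), which is why smaller blocks and more bits both reduce the error.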

Ollama uses the GGUF file format and the Q<n>_<type> naming convention:

| Tag | Bits per weight | Notes |
|---|---|---|
| Q2_K | ~2 bits | Extreme compression; visible quality loss |
| Q3_K_S / Q3_K_M / Q3_K_L | ~3 bits | Small (S), Medium (M), Large (L) K-quants |
| Q4_0 / Q4_1 | ~4 bits | Older 4-bit formats |
| Q4_K_M | ~4 bits | Ollama’s default; production-grade for most tasks |
| Q5_K_S / Q5_K_M | ~5 bits | Sweet spot when 4 bits feels too aggressive |
| Q6_K | ~6 bits | Near-imperceptible quality loss |
| Q8_0 | ~8 bits | Almost identical to FP16 on most benchmarks |
| F16 | 16 bits | The reference (no quantization) |

The K in Q4_K_M, Q5_K_S etc. refers to K-quants — a family of methods that distribute precision better across the model (giving more bits to the layers that need them most and fewer bits to the layers that tolerate compression well). The suffix (_S, _M, _L) is just Small / Medium / Large and controls how aggressive the compression is.

For 99 % of practical use, the only tag you need to know by name is Q4_K_M, and the only knob you adjust is the size class.
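To make the naming convention concrete, here is a small parser for the tags in the table above. It is a sketch covering only the `Q<n>_…` shapes listed there (`F16` is deliberately out of scope), and `QuantTag` and `parse_quant_tag` are names invented for this example.

```python
import re
from typing import NamedTuple, Optional

class QuantTag(NamedTuple):
    bits: int                   # nominal bits per weight
    k_quant: bool               # True for the K-quant family
    size_class: Optional[str]   # "S", "M", "L", or None when the tag has no suffix

_TAG = re.compile(r"^Q(\d)_(?:K(?:_([SML]))?|[01])$", re.IGNORECASE)

def parse_quant_tag(tag: str) -> QuantTag:
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognised quantization tag: {tag!r}")
    size = match.group(2)
    return QuantTag(
        bits=int(match.group(1)),
        k_quant="_K" in tag.upper(),
        size_class=size.upper() if size else None,
    )

for tag in ("Q4_K_M", "Q5_K_S", "Q6_K", "Q8_0"):
    print(tag, parse_quant_tag(tag))
```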

Take a single weight: weight_42 = 0.7384. Now look at what is actually stored at each precision.

| Precision | Storage | Value reconstructed at inference | Absolute error |
|---|---|---|---|
| FP16 | 16 bits, sign + 5-bit exponent + 10-bit mantissa | 0.7383 | ~0.0001 |
| Q8 | 8-bit integer per weight + block scale | 0.7373 | ~0.0011 |
| Q6 | 6-bit integer per weight + block scale | 0.7345 | ~0.0039 |
| Q5 | 5-bit integer per weight + block scale | 0.7188 | ~0.0196 |
| Q4 | 4-bit integer per weight + block scale | 0.7500 | ~0.0116 |
| Q3 | 3-bit integer per weight + block scale | 0.6875 | ~0.0509 |
| Q2 | 2-bit integer per weight + block scale | 0.6667 | ~0.0717 |
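The last column is simply the distance between the original weight and the reconstructed value. A short check, with the reconstructed values copied from the table (the helper name is illustrative):

```python
ORIGINAL = 0.7384

# Reconstructed values copied from the worked example above.
RECONSTRUCTED = {
    "FP16": 0.7383, "Q8": 0.7373, "Q6": 0.7345, "Q5": 0.7188,
    "Q4": 0.7500, "Q3": 0.6875, "Q2": 0.6667,
}

def absolute_errors(original, reconstructed):
    """Absolute reconstruction error per precision, rounded to 4 decimals."""
    return {name: round(abs(original - value), 4) for name, value in reconstructed.items()}

for name, error in absolute_errors(ORIGINAL, RECONSTRUCTED).items():
    print(f"{name:>4}: {error:.4f}")
```

For this particular weight, Q4 happens to land closer than Q5. A single value can fall near a grid point at any precision; the downward trend in accuracy only holds on average over many weights.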

A single weight loses a tiny amount of precision. But the model has billions of weights, and the arithmetic happens billions of times per token. The errors do compound — that is the whole point of the trade-off.

What rescues this in practice is that:

  • Errors are not all systematic in the same direction. Many cancel out across a layer.
  • The model is highly redundant. Many weights play similar roles. Losing precision on a few does not destroy the global behaviour.
  • K-quants spend more bits on the weights that matter most (attention layers, embedding tables) and fewer bits on the rest.
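The first point can be illustrated with a toy experiment. The sketch below snaps uniformly random values to a 16-level grid and compares the sum of the absolute errors with the absolute value of their signed sum. The uniform distribution and the fixed grid are assumptions made for simplicity; real weight distributions and real quantizers differ.

```python
import random

def rounding_errors(n: int, step: float, seed: int = 0) -> list[float]:
    """Signed error made when snapping n random weights to a grid of spacing `step`."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n):
        w = rng.uniform(-1.0, 1.0)
        q = round(w / step) * step   # nearest grid point
        errors.append(q - w)
    return errors

step = 2.0 / 15   # 16 levels across [-1, 1], i.e. a 4-bit grid
errors = rounding_errors(10_000, step)
net = abs(sum(errors))                 # what survives after cancellation
total = sum(abs(e) for e in errors)    # what there would be with no cancellation
print(f"sum of |errors| = {total:.2f}   |sum of errors| = {net:.2f}")
```

Because positive and negative errors offset each other, the signed sum can never exceed the sum of magnitudes and is usually far smaller.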

The end result, on Q4_K_M, is a model that:

  • Takes roughly 25 – 30 % of the FP16 file size (about 4 GB against 14 GB for a 7B model).
  • Scores within 1 – 3 percentage points of FP16 on standard benchmarks.
  • Is indistinguishable for most users on chat, code completion, summarisation.

3.4 The footprint table — same model, different quantizations


This is the table the workshop needs on the screen. Same model parameter count, different quantization choices, very different memory budgets.

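| Parameters | FP16 | Q8 | Q6 | Q5 | Q4 (default) | Q3 | Q2 |
|---|---|---|---|---|---|---|---|
| 1B | ~2.0 GB | ~1.0 GB | ~0.8 GB | ~0.7 GB | ~0.6 GB | ~0.5 GB | ~0.4 GB |
| 3B | ~6.0 GB | ~3.0 GB | ~2.4 GB | ~2.0 GB | ~1.7 GB | ~1.4 GB | ~1.1 GB |
| 7B | ~14 GB | ~7.0 GB | ~5.5 GB | ~4.8 GB | ~4.0 GB | ~3.3 GB | ~2.6 GB |
| 8B | ~16 GB | ~8.0 GB | ~6.3 GB | ~5.5 GB | ~4.6 GB | ~3.8 GB | ~3.0 GB |
| 14B | ~28 GB | ~14 GB | ~11 GB | ~9.5 GB | ~8.0 GB | ~6.5 GB | ~5.2 GB |
| 32B | ~64 GB | ~32 GB | ~26 GB | ~22 GB | ~19 GB | ~15 GB | ~12 GB |
| 70B | ~140 GB | ~70 GB | ~56 GB | ~48 GB | ~40 GB | ~33 GB | ~26 GB |
| 405B | ~810 GB | ~405 GB | ~324 GB | ~280 GB | ~230 GB | ~190 GB | ~150 GB |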
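One way to read this table is to convert a size back into effective bits per weight: size in bytes times 8, divided by the parameter count. The sketch below does this for the 7B row, assuming 1 GB = 10⁹ bytes; the function name is illustrative.

```python
def effective_bits_per_weight(params_billions: float, size_gb: float) -> float:
    """Average number of bits spent per weight for a given on-disk size."""
    return size_gb * 8 / params_billions

# 7B row of the footprint table: column -> approximate size in GB.
ROW_7B = {"FP16": 14.0, "Q8": 7.0, "Q6": 5.5, "Q5": 4.8, "Q4": 4.0, "Q3": 3.3, "Q2": 2.6}

for name, size_gb in ROW_7B.items():
    print(f"{name:>4}: ~{effective_bits_per_weight(7, size_gb):.1f} bits per weight")
```

Several columns come out a little above their nominal bit count (about 4.6 for Q4, for instance), which is what one would expect once the per-block scale and offset are counted alongside the low-bit integers.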

4. The quality cost — what we actually observe


Quantization is not free, but on most workshop-grade tasks the cost is small. The table below summarises what the open-source community measures on standard benchmarks (MMLU, HumanEval, GSM8K, etc.) compared to the FP16 reference.

| Quantization | Typical benchmark drop vs FP16 | Practical perception |
|---|---|---|
| F16 | 0 (reference) | The training-time quality |
| Q8 | < 0.5 percentage points | Indistinguishable from FP16 on chat, code, summarisation |
| Q6 | ~0.5 – 1 pp | Indistinguishable on most tasks; tiny edge-case differences |
| Q5 | ~1 – 2 pp | Indistinguishable to most users; visible on hard reasoning prompts |
| Q4 (default) | ~1 – 3 pp | Production-grade for chat, code, summarisation, RAG. Visible on long-chain reasoning or competitive math. |
| Q3 | ~3 – 6 pp | Visible quality drop; acceptable only when memory is severely constrained |
| Q2 | ~6 – 12 pp | Visible drop on almost everything; emergency setting |

When precision drops, the failures appear in a predictable order:

  1. Long multi-step reasoning (chains of 5+ deductive steps) — degrades from Q5 down.
  2. Competitive mathematics and exact arithmetic — degrades from Q5 down.
  3. Code generation with strict formatting (JSON tool calls, for example — see chapter 05b on qwen2.5-coder:7b’s tool-call escaping) — degrades from Q4 down.
  4. Long-context recall (asking about something said 20 K tokens earlier) — degrades from Q4 down.
  5. General chat, summarisation, simple translation, basic code completion — robust down to Q3, sometimes Q2.

For the workshop demos in this course (chat, simple Java agents, comparator, prompt engineering), Q4_K_M is fine for everything.

A larger model in a lower precision usually beats a smaller model in a higher precision.

Concretely, on the same memory budget:

| Memory budget | Option A | Option B | Usually better |
|---|---|---|---|
| ~5 GB | llama3.1:8b Q4 | llama3.2:3b Q8 | A (the 8B at Q4) |
| ~10 GB | qwen2.5-coder:14b Q4 | qwen2.5-coder:7b Q8 | A (the 14B at Q4) |
| ~25 GB | qwen2.5-coder:32b Q4 | qwen2.5-coder:14b Q8 | A (the 32B at Q4) |
| ~50 GB | llama3.1:70b Q4 | qwen2.5-coder:32b Q8 | A (the 70B at Q4) |

This is why Ollama defaults to Q4_K_M: it lets users run the largest possible model on their available hardware, at a quality cost that is almost always smaller than the gain from going up one parameter tier.
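The comparison can be reproduced from the footprint table in section 3.4: for a given budget, find the largest parameter count that fits at Q4 and at Q8. This is a weights-only sketch (no context or runtime overhead), and `SIZES_GB` and `largest_model_that_fits` are names chosen for the example.

```python
# Approximate sizes in GB, copied from the footprint table in section 3.4.
# Keys are parameter counts in billions.
SIZES_GB = {
    "Q4": {1: 0.6, 3: 1.7, 7: 4.0, 8: 4.6, 14: 8.0, 32: 19, 70: 40, 405: 230},
    "Q8": {1: 1.0, 3: 3.0, 7: 7.0, 8: 8.0, 14: 14, 32: 32, 70: 70, 405: 405},
}

def largest_model_that_fits(budget_gb, quant):
    """Largest parameter count (in billions) whose weights fit in the budget, or None."""
    fitting = [params for params, size in SIZES_GB[quant].items() if size <= budget_gb]
    return max(fitting) if fitting else None

for budget in (5, 10, 25, 50):
    q4 = largest_model_that_fits(budget, "Q4")
    q8 = largest_model_that_fits(budget, "Q8")
    print(f"~{budget} GB: {q4}B at Q4 vs {q8}B at Q8")
```

At every budget, Q4 reaches a larger parameter tier than Q8, which is the trade the rule of thumb recommends taking.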


  • Default: Q4_K_M. That is what ollama pull <model> gives you. Trust it.
  • Do not go below Q4 on this tier — the quality drop becomes visible.
  • Do not go above Q4 unless you have specific evidence that the task fails on Q4 (typically hard reasoning or competitive math).
  • For tool-calling agents (chapter 09), Q4_K_M of llama3.1:8b is the validated configuration.

5.2 Inspecting and pulling a specific quantization in Ollama


Ollama tags include the quantization. Default llama3.1:8b resolves to llama3.1:8b-instruct-q4_K_M. To pick something else explicitly:

# Default — Q4_K_M
ollama pull llama3.1:8b
# Explicit Q8_0 (better quality, double the size)
ollama pull llama3.1:8b-instruct-q8_0
# Explicit Q5_K_M
ollama pull llama3.1:8b-instruct-q5_K_M
# F16 (the reference; about 16 GB on disk for the 8B)
ollama pull llama3.1:8b-instruct-fp16

To inspect what you have actually pulled:

ollama show llama3.1:8b

The output reports the architecture, the parameter count, the quantization, the context length, and the embedding length. The quantization line tells you exactly which precision is on disk.
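The same information can be read programmatically. The sketch below assumes that a local Ollama server is listening on its default port 11434, that its REST API exposes `POST /api/show`, and that the response carries the quantization under `details.quantization_level`. Treat the endpoint and field names as assumptions to verify against the API documentation for your installed version; the two function names are invented for this example.

```python
import json
import urllib.request

def extract_quantization(show_response: dict) -> str:
    """Pull the quantization label out of a parsed /api/show response (assumed layout)."""
    return show_response.get("details", {}).get("quantization_level", "unknown")

def show_model(model: str, host: str = "http://localhost:11434") -> dict:
    """Ask a running local Ollama server to describe a model. Needs the server to be up."""
    request = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode("utf-8"),   # a body makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Usage, with the server running:
#   print(extract_quantization(show_model("llama3.1:8b")))
```

Keeping the parsing separate from the network call makes it easy to check against a saved response without a server.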

| Cliché you may hear | Reality |
|---|---|
| “Q4 is a degraded mode for laptops.” | No. Q4_K_M is Ollama’s production default and the configuration used in most public deployments of local LLMs. |
| “More bits is always better.” | Above Q6 / Q8, the gain is statistically real but not perceptible on most workshop tasks. |
| “Quantizing makes the model dumber.” | For chat, code completion and summarisation, no. For long-chain reasoning and competitive math, slightly. |
| “Quantizing changes the parameter count.” | No. The number of weights is unchanged. Only the way each weight is encoded is different. |
| “A 14B-Q8 must be better than a 32B-Q4.” | Usually false. On the same memory budget, the larger model at lower precision wins. |
| “I can run a 70B model on my laptop, I’ll just use Q2.” | The math says you can fit it. The quality at Q2 makes it unusable for anything past basic chat. Pick a smaller model at Q4 instead. |

A useful way to picture it, in two sentences:

The parameter count measures how much the model knows. The quantization measures how precisely it remembers what it knows.

Both knobs matter. Cutting parameters by half is a much bigger loss than dropping from Q8 to Q4. That is why the community standardised on Q4_K_M as the default: it preserves the model’s knowledge at a small precision cost, and that trade-off is almost always the right one.


  • A model is billions of small numbers (parameters). Each number can be stored more or less precisely.
  • FP16 (or BF16) is the reference precision: 16 bits per weight, as used during training.
  • Without quantization, local LLMs would be a research-only practice — a 70B model in FP16 needs around 160 GB to run.
  • Quantization replaces each FP16 weight by a smaller integer (Q4, Q5, Q6, Q8). The model is not retrained; only the storage changes.
  • Q4_K_M is Ollama’s production default. Its quality is within 1 – 3 percentage points of FP16 on standard benchmarks.
  • The failure order when precision drops: long-chain reasoning → competitive math → strict tool-call JSON → long-context recall → general chat (very robust).
  • Rule of thumb: a larger model at lower precision usually beats a smaller model at higher precision for the same memory budget.
  • Three-tier choice: laptops use Q4 (default); mid-range GPUs can go Q5 / Q6; workstation / GB10 benchmarks should use Q8 or F16 to isolate model quality from quantization noise.
  • Inspect with ollama show <model>; pick a quantization explicitly with the tag (:8b-instruct-q8_0, etc.).
  • The mental model: parameter count = how much it knows; quantization = how precisely it remembers what it knows.