Annex — Local LLM catalogue

Duration: 20 min Pre-requisite: chapter 05b

Why this annex exists

Chapter 05b is deliberately focused: it lists the three models validated on the course demos and tells the reader which one to pick by hardware class. This annex zooms out. It is meant as a reference catalogue for anyone who wants to understand the rest of the local LLM landscape — what “32B” really means, how qwen2.5-coder compares to granite-code or starcoder2, and what becomes possible on a workstation-class machine like a Dell Pro Max with GB10 (a Grace Blackwell GB10 platform, hardware-equivalent to an NVIDIA DGX Spark).

The annex is organised in four blocks:

Vocabulary — what “B”, “Q4”, “RAM”, “VRAM”, “unified memory” actually mean.
The catalogue — family by family, with parameter count, disk size, suggested RAM and VRAM.
Three workshop tiers — what runs on a standard laptop, a stronger desktop, and a GB10-class machine.
What a model is not — a clarification that a model alone is not an agent: tool use depends on the surrounding framework.

1. Vocabulary: what the model name actually tells you

1.1 “B” means billion parameters — not gigabytes

A common confusion in classrooms is to read qwen2.5-coder:32b and assume “32 GB on disk”. That is not what the name means. The B stands for billion parameters.

Name suffix	Approximate parameter count
`0.5b`	0.5 billion parameters (5 × 10⁸)
`1.5b`	1.5 billion parameters
`3b`	3 billion parameters
`7b`	7 billion parameters
`8b`	8 billion parameters
`14b`	14 billion parameters
`20b`	20 billion parameters
`32b`	32 billion parameters
`70b`	70 billion parameters
`405b`	405 billion parameters
`671b`	671 billion parameters

The parameters are the internal learned weights of the model. More parameters generally mean better reasoning, better code, more stable instruction-following — at the cost of more memory, more storage and more compute time.

1.2 Four memory concepts to distinguish

When a model card says “needs 8 GB”, that figure can refer to four different things. Mixing them up leads to wrong purchasing decisions.

Concept	What it is	When it matters
File size on disk	The size of the `.gguf` file Ollama downloads. Around 4.7 GB for `llama3.1:8b` (Q4).	When you `ollama pull` — disk space and download time.
RAM	System memory used when the model runs on the CPU. Roughly file size + 1 – 2 GB for context and runtime.	On a laptop with no usable GPU.
VRAM	GPU memory used when the model runs on a discrete graphics card. Same order of magnitude as RAM, but read/write is much faster.	When a CUDA/ROCm GPU is present.
Unified memory	A single memory pool shared by CPU and GPU without copy. Found on Apple Silicon, DGX Spark, Dell Pro Max GB10.	When the model is too big to fit in any single VRAM card but still fits in the shared pool.

A 70B model that needs about 48 GB simply does not fit in a 24 GB RTX 4090 — but fits comfortably in 128 GB of unified memory on a GB10 or M3 Max. Above roughly 30 B parameters, unified memory becomes the dominant constraint.

1.3 Quantization — Q4, Q8, FP16, BF16 (summary)

The full precision of a model is FP16 (16 bits per weight) or BF16. Quantization compresses each weight to fewer bits, which shrinks the file and reduces RAM/VRAM use at a measurable cost in quality.

Quantization	Bits per weight	Footprint (relative to FP16)	Quality cost	Typical use
Q4 (Q4_K_M is Ollama’s default)	~4 bits	~25 %	Small, often imperceptible	Laptops, classroom
Q5 / Q6	~5 – 6 bits	~35 – 45 %	Very small	Mid-range desktops
Q8	~8 bits	~50 %	Almost none on most tasks	Workstation, GPU >= 16 GB
FP16 / BF16	16 bits	100 %	None (reference)	Research, fine-tuning

A qwen2.5-coder:32b in Q4 (around 20 GB on disk) is reachable on a workstation with 48 GB of RAM, while the same model in FP16 (about 64 GB) is not.

2. The catalogue, family by family

All disk sizes refer to Ollama’s default tag (Q4_K_M unless otherwise stated). Suggested RAM and VRAM values are practical floors for comfortable live use — they include context, runtime overhead and a small safety margin.

2.1 Qwen2.5-Coder — code-specialized family

A code-specialized line from Alibaba. Available on Ollama in 0.5B, 1.5B, 3B, 7B, 14B, 32B. The 7B and 14B variants advertise the tools capability on the Ollama library page but emit their tool calls inside message.content rather than the structured field (see chapter 05b for the consequence on the agent demos).

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`qwen2.5-coder:0.5b`	0.5 B	~400 MB	2 – 4 GB	1 – 2 GB	Demonstrating that a local LLM can run almost anywhere	Too small for serious Java work or for the agent demos
`qwen2.5-coder:1.5b`	1.5 B	~1.0 GB	4 GB	2 GB	Tiny code-completion demos	Loses coherence on multi-file projects
`qwen2.5-coder:3b`	3 B	~1.9 GB	6 – 8 GB	3 – 4 GB	Standard laptop baseline; basic Java examples, error explanation	Drifts on larger tasks
`qwen2.5-coder:7b`	7 B	~4.7 GB	12 – 16 GB	6 – 8 GB	Documented alternative to `llama3.1:8b` for the agent demos (chapter 05b) — best practical balance for participants with decent machines	Tool calls land in `message.content`; needs the fallback parser of the demo code
`qwen2.5-coder:14b`	14 B	~9.0 GB	24 – 32 GB	12 – 16 GB	Stronger Java; visible quality jump over the 7B	Slow on CPU-only, same tool-call format issue
`qwen2.5-coder:32b`	32 B	~20 GB	48 – 64 GB	24 – 32 GB	Advanced coding benchmark, comparison with cloud tooling	Out of reach for laptops; realistic on a GB10-class machine

2.2 Llama 3.1 — general-purpose, reliable tool calling

Meta’s general-purpose family. Available on Ollama in 8B, 70B, 405B. Default 8B tag around 4.9 GB with a 128 K context window.

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`llama3.1:8b`	8 B	~4.9 GB	12 – 16 GB	6 – 8 GB	Course default for the agent demos — clean structured `tool_calls`, balanced quality on chat, reasoning, summarization	Less specialized for code than Qwen-Coder; good comparison point against a coding model
`llama3.1:70b`	70 B	~43 GB	96 – 128 GB	48 – 80 GB	Cited as “excellent” in the demo 4 README for workstation users; stronger planning and long answers	Not a laptop model
`llama3.1:405b`	405 B	~243 GB (standard format)	300 GB+ in a classical configuration	Very high; needs research-grade infrastructure	Showing the gap between local and frontier scales	Even a 128 GB GB10 cannot run the standard Ollama tag without aggressive quantization or special setup. Mention it conceptually rather than promise it.

2.3 DeepSeek-R1 — reasoning-focused family

A reasoning-oriented family. Ollama lists 1.5B, 7B, 8B, 14B, 32B, 70B, 671B.

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`deepseek-r1:1.5b`	1.5 B	~1.1 GB	4 GB	2 GB	Showing step-by-step reasoning on a weak machine	Too limited for production coding
`deepseek-r1:7b` / `:8b`	7 – 8 B	~4.7 GB	12 – 16 GB	6 – 8 GB	Comparing a reasoning-focused 7B against a coding-focused 7B	Less precise than Qwen-Coder for pure Java generation
`deepseek-r1:14b`	14 B	~9.0 GB	24 – 32 GB	12 – 16 GB	Planning, architecture discussion, debugging logic	Heavier; slower without GPU
`deepseek-r1:32b`	32 B	~20 GB	48 – 64 GB	24 – 32 GB	Advanced reasoning benchmark on a GB10-class machine	Out of reach for laptops
`deepseek-r1:70b`	70 B	~43 GB	96 – 128 GB	48 – 80 GB	Comparing advanced reasoning vs. specialised coding models	Workstation-only
`deepseek-r1:671b`	671 B	~404 GB	Far above 128 GB unified memory	Research-grade infrastructure	Conceptual reference for the absolute top of the family	Not a realistic Ollama target, even on a GB10

2.4 CodeGemma — Google’s code line

A coding-oriented line from Google’s Gemma family. Listed by Ollama in 2B and 7B, supporting fill-in-the-middle completion, code generation, instruction following.

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`codegemma:2b`	2 B	~1.6 GB	4 – 6 GB	2 – 4 GB	Code completion on weak machines, small examples	Not strong enough for complex agent workflows
`codegemma:7b`	7 B	~5.0 GB	12 – 16 GB	6 – 8 GB	Code completion, generation, instruction-following; useful comparison against `qwen2.5-coder:7b`	Smaller context than the most recent families

2.5 StarCoder2 — open code family

A code-focused open family. Listed by Ollama in 3B, 7B, 15B, with a 16 K context window.

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`starcoder2:3b`	3 B	~1.7 GB	6 – 8 GB	3 – 4 GB	Demonstrating a code-only family on a small machine	Not ideal for big Java projects
`starcoder2:7b`	7 B	~4.0 GB	12 – 16 GB	6 – 8 GB	Comparison across code-LLM families	Less conversational than modern instruct-tuned chat models
`starcoder2:15b`	15 B	~9.1 GB	24 – 32 GB	12 – 16 GB	Advanced code generation benchmark	Heavy; not for weak laptops

2.6 Granite-Code — IBM’s professional code line

IBM’s code-intelligence family. Listed by Ollama in 3B, 8B, 20B, 34B, with the 3B and 8B variants advertising a 128 K context window.

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`granite-code:3b`	3 B	~2.0 GB	6 – 8 GB	3 – 4 GB	Code generation, code explanation, code fixing; second baseline next to `qwen2.5-coder:3b`	Limited reasoning depth
`granite-code:8b`	8 B	~4.6 GB	12 – 16 GB	6 – 8 GB	Professional code-intelligence scenarios, long context	Comparison only; not the recommended agent demo model
`granite-code:20b`	20 B	~12 GB	32 – 48 GB	16 – 24 GB	Advanced code-generation benchmark	Out of reach for laptops
`granite-code:34b`	34 B	~19 GB	48 – 64 GB	24 – 32 GB	Strong local coding on a workstation or GB10	Hardware-bound

2.7 Gemma 3 — general / multimodal

A Google general-purpose family. Listed by Ollama in 270M, 1B, 4B, 12B, 27B. The 4B, 12B and 27B variants accept text and image input.

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`gemma3:1b`	1 B	~815 MB	4 GB	1 – 2 GB	Lightweight general AI demonstration	Not for coding
`gemma3:4b`	4 B	~3.3 GB	8 – 12 GB	4 – 6 GB	Chat, summarization, multimodal text + image examples	Not a coding-specialized model
`gemma3:12b`	12 B	~8.1 GB	20 – 24 GB	10 – 16 GB	Stronger general reasoning, multimodal demonstrations	Not the first choice for pure Java
`gemma3:27b`	27 B	~17 GB	40 – 64 GB	20 – 32 GB	Advanced multimodal demonstrations on a GB10	Workstation-bound

2.8 Mistral / Mixtral

A French open family. Ollama lists mistral as a 7B model around 4.4 GB with a 32 K context window; mixtral is a Mixture-of-Experts family with 8x7B and 8x22B variants.

Tag	Parameters	Disk	Suggested RAM	Suggested VRAM	Best for	Limitation
`mistral:7b`	7 B	~4.4 GB	12 – 16 GB	6 – 8 GB	Fast general chat, summarization, basic coding	Not as specialised for code as Qwen-Coder or StarCoder2
`mixtral:8x7b`	Mixture-of-Experts	~26 GB	64 – 96 GB	32 – 48 GB	Advanced benchmark on a workstation or GB10	”8x7B” does not mean 56 B active simultaneously, but the file footprint is much heavier than a plain 7B
`mixtral:8x22b`	Mixture-of-Experts	~80 GB	128 GB+	80 GB+ or multi-GPU	Highest end of the family	For specialised infrastructure only

2.9 CodeLlama — historical context

Meta’s earlier code-focused family. Listed by Ollama in 7B, 13B, 34B, 70B. It can generate and discuss code, but newer specialised families (Qwen-Coder, CodeGemma, StarCoder2, Granite-Code) usually outperform it on Ollama benchmarks.

Tag	Parameters	Use in 2026
`codellama:7b`	7 B	Historical reference; useful to show the progress of code LLMs
`codellama:13b`	13 B	Historical reference; not the recommended default
`codellama:34b` / `:70b`	34 B / 70 B	Benchmark or historical comparison only

3. Workshop recommendations by hardware tier

The same idea as in chapter 05b, expanded across the full catalogue.

For a live exercise where every participant must run the model locally:

qwen2.5-coder:3b — fast, small, runs on most laptops; good baseline for code completion.
qwen2.5-coder:7b — best practical compromise on 16 GB RAM; documented alternative for the agent demos.
mistral:7b — fast general chat and summarization.
starcoder2:3b — second code-family baseline for comparison.
gemma3:4b — adds a multimodal angle (text + image) at low cost.

Comparable on this tier: qwen2.5-coder:3b vs starcoder2:3b vs granite-code:3b. Same task, three families.

For participants with a workstation laptop or a desktop with a discrete GPU:

qwen2.5-coder:14b — visibly stronger code quality than the 7B.
codegemma:7b — useful side-by-side against qwen2.5-coder:7b.
starcoder2:7b — third code family for the comparison.
granite-code:8b — IBM professional code-intelligence variant with 128 K context.
deepseek-r1:14b — reasoning-focused, useful to show planning and debugging logic.

Comparable on this tier: qwen2.5-coder:7b vs codegemma:7b vs starcoder2:7b vs granite-code:8b. Four code families at similar parameter counts.

For a department machine built around the Grace Blackwell GB10 superchip (128 GB unified memory, around 1 PFLOP in FP4):

qwen2.5-coder:32b — advanced coding benchmark on the same task as the 7B and 14B; visible quality jump.
deepseek-r1:32b — reasoning-first benchmark; compare side-by-side with qwen2.5-coder:32b.
granite-code:20b and granite-code:34b — IBM at workstation scale.
llama3.1:70b — explicitly cited as “excellent” in the demo 4 README; drop-in upgrade of llama3.1:8b.
mixtral:8x7b — Mixture-of-Experts comparison.

Be careful with the very largest tags. llama3.1:405b (about 243 GB) and deepseek-r1:671b (about 404 GB) do not fit in 128 GB of unified memory without aggressive quantization or special infrastructure. They belong in the conceptual section of the workshop, not in the guaranteed setup.

4. A model is not an agent by itself

A point worth making explicit in any workshop. The model provides language and reasoning capacity. It does not by itself read files, run commands, call APIs, or compile Java code. Those actions come from the agent framework wrapped around the model.

Layer	What it does	Examples
Local runtime	Loads the model into memory and exposes a chat / completion API on `127.0.0.1`.	Ollama, llama.cpp, vLLM, LM Studio
Tool-calling protocol	Defines how the model declares “I want to call `read_file` with these arguments”.	Ollama’s native `tool_calls` field, OpenAI-style function calling
Agent framework	Implements the loop (call model → execute tool → feed result back → repeat) and the sandbox in which tools run.	LangChain, LangGraph, OpenWebUI, Continue, OpenCode, the custom Python loop in `ollama-demo-3-agent-java/agent_java.py`
Integration / UI	Exposes the agent to a user, often as a chat panel, editor extension, or web UI.	Continue (VS Code), OpenWebUI, Streamlit (used in demos 1, 2, 4)

The course demos use Ollama for the runtime, Ollama’s native tool calling for the protocol, and a custom Python loop for the agent framework. There is no LangChain, no LangGraph, no Continue. Reading agent_java.py is sufficient to understand the whole stack.

5. A reproducible benchmark protocol for a GB10-class machine

If a department has access to a Dell Pro Max with GB10 (or an equivalent DGX Spark class machine), a useful contribution to the workshop is a side-by-side benchmark on the same Java agent task as demos 3 and 4. The protocol below is reproducible and yields directly comparable numbers.

5.1 Inputs

The identical prompt of demo 3 (creating Product.java, ProductManager.java, Main.java, then compiling).
The identical tool set (list_files, read_file, write_file, compile_java).
The identical loop (MAX_STEPS = 10, fallback parser enabled).
A list of candidate models spanning size and family — for example: qwen2.5-coder:3b, :7b, :14b, :32b; llama3.1:8b, :70b; granite-code:8b, :20b, :34b; deepseek-r1:14b, :32b; mixtral:8x7b.

5.2 Metrics to capture, per model

Metric	How to measure
Time to first token	Wall-clock between request and first response chunk
Tokens per second	Total output tokens / generation time
Number of agent steps	How many model turns before the task is solved or `MAX_STEPS` is reached
Structured `tool_calls` ratio	Calls in the structured field / total parsed calls (the rest comes from the fallback parser)
Files created on first try	0, 1, 2 or 3 of the expected `.java` files
Compilation success	`javac` returns 0
Peak RAM and VRAM	`nvidia-smi` and `top` / `htop` samples during the run
Disk footprint	Output of `ollama show <model>`

5.3 Reporting

A single CSV file with one row per model, plus a short observation note per row (typical failure mode, code style, anything that is not a number). Three plots cover the rest:

Tokens per second vs. parameter count.
Files-created ratio vs. parameter count, split by family.
Compilation success rate vs. parameter count.

5.4 Pedagogical use

The same Java prompt running on qwen2.5-coder:7b, qwen2.5-coder:32b and llama3.1:70b produces three observably different outputs. Showing the three outputs side by side — with the timing and the file-creation ratio — illustrates the trade-off between size, specialization, and speed more clearly than any abstract explanation.

Key takeaways

B = billion parameters, not gigabytes. Disk size depends on quantization.
Four memory concepts to keep separate: file size on disk, RAM (CPU mode), VRAM (GPU mode), unified memory (GB10, Apple Silicon).
Quantization (Q4 / Q8 / FP16 / BF16) controls the trade-off between footprint and quality.
No single model fits every use. Code-specialised models (qwen2.5-coder, codegemma, starcoder2, granite-code) shine on Java; general models (llama3.1, mistral, gemma3) shine on chat, summarisation and reasoning; reasoning-focused models (deepseek-r1) shine on multi-step logic.
Three workshop tiers — standard laptops, stronger participant machines, GB10-class workstations — each have their own short list. Pick by hardware first, by task second.
A model is not an agent by itself. Tool use, file operations and command execution come from the agent framework (Ollama-native, LangChain, LangGraph, OpenWebUI, custom Python). The model only provides language and reasoning.
The 405B and 671B tags are conceptual. Their standard Ollama formats exceed a 128 GB unified-memory machine; mention them, do not promise to run them.