Annex — Local LLM catalogue
Duration: 20 min Pre-requisite: chapter 05b
Why this annex exists
Section titled “Why this annex exists”Chapter 05b is deliberately focused: it lists the three models validated on the course demos and tells the reader which one to pick by hardware class. This annex zooms out. It is meant as a reference catalogue for anyone who wants to understand the rest of the local LLM landscape — what “32B” really means, how qwen2.5-coder compares to granite-code or starcoder2, and what becomes possible on a workstation-class machine like a Dell Pro Max with GB10 (a Grace Blackwell GB10 platform, hardware-equivalent to an NVIDIA DGX Spark).
The annex is organised in four blocks:
- Vocabulary — what “B”, “Q4”, “RAM”, “VRAM”, “unified memory” actually mean.
- The catalogue — family by family, with parameter count, disk size, suggested RAM and VRAM.
- Three workshop tiers — what runs on a standard laptop, a stronger desktop, and a GB10-class machine.
- What a model is not — a clarification that a model alone is not an agent: tool use depends on the surrounding framework.
1. Vocabulary: what the model name actually tells you
Section titled “1. Vocabulary: what the model name actually tells you”1.1 “B” means billion parameters — not gigabytes
Section titled “1.1 “B” means billion parameters — not gigabytes”A common confusion in classrooms is to read qwen2.5-coder:32b and assume “32 GB on disk”. That is not what the name means. The B stands for billion parameters.
| Name suffix | Approximate parameter count |
|---|---|
0.5b | 0.5 billion parameters (5 × 10⁸) |
1.5b | 1.5 billion parameters |
3b | 3 billion parameters |
7b | 7 billion parameters |
8b | 8 billion parameters |
14b | 14 billion parameters |
20b | 20 billion parameters |
32b | 32 billion parameters |
70b | 70 billion parameters |
405b | 405 billion parameters |
671b | 671 billion parameters |
The parameters are the internal learned weights of the model. More parameters generally mean better reasoning, better code, more stable instruction-following — at the cost of more memory, more storage and more compute time.
1.2 Four memory concepts to distinguish
Section titled “1.2 Four memory concepts to distinguish”When a model card says “needs 8 GB”, that figure can refer to four different things. Mixing them up leads to wrong purchasing decisions.
| Concept | What it is | When it matters |
|---|---|---|
| File size on disk | The size of the .gguf file Ollama downloads. Around 4.7 GB for llama3.1:8b (Q4). | When you ollama pull — disk space and download time. |
| RAM | System memory used when the model runs on the CPU. Roughly file size + 1 – 2 GB for context and runtime. | On a laptop with no usable GPU. |
| VRAM | GPU memory used when the model runs on a discrete graphics card. Same order of magnitude as RAM, but read/write is much faster. | When a CUDA/ROCm GPU is present. |
| Unified memory | A single memory pool shared by CPU and GPU without copy. Found on Apple Silicon, DGX Spark, Dell Pro Max GB10. | When the model is too big to fit in any single VRAM card but still fits in the shared pool. |
A 70B model that needs about 48 GB simply does not fit in a 24 GB RTX 4090 — but fits comfortably in 128 GB of unified memory on a GB10 or M3 Max. Above roughly 30 B parameters, unified memory becomes the dominant constraint.
1.3 Quantization — Q4, Q8, FP16, BF16 (summary)
Section titled “1.3 Quantization — Q4, Q8, FP16, BF16 (summary)”The full precision of a model is FP16 (16 bits per weight) or BF16. Quantization compresses each weight to fewer bits, which shrinks the file and reduces RAM/VRAM use at a measurable cost in quality.
| Quantization | Bits per weight | Footprint (relative to FP16) | Quality cost | Typical use |
|---|---|---|---|---|
| Q4 (Q4_K_M is Ollama’s default) | ~4 bits | ~25 % | Small, often imperceptible | Laptops, classroom |
| Q5 / Q6 | ~5 – 6 bits | ~35 – 45 % | Very small | Mid-range desktops |
| Q8 | ~8 bits | ~50 % | Almost none on most tasks | Workstation, GPU >= 16 GB |
| FP16 / BF16 | 16 bits | 100 % | None (reference) | Research, fine-tuning |
A qwen2.5-coder:32b in Q4 (around 20 GB on disk) is reachable on a workstation with 48 GB of RAM, while the same model in FP16 (about 64 GB) is not.
2. The catalogue, family by family
Section titled “2. The catalogue, family by family”All disk sizes refer to Ollama’s default tag (Q4_K_M unless otherwise stated). Suggested RAM and VRAM values are practical floors for comfortable live use — they include context, runtime overhead and a small safety margin.
2.1 Qwen2.5-Coder — code-specialized family
Section titled “2.1 Qwen2.5-Coder — code-specialized family”A code-specialized line from Alibaba. Available on Ollama in 0.5B, 1.5B, 3B, 7B, 14B, 32B. The 7B and 14B variants advertise the tools capability on the Ollama library page but emit their tool calls inside message.content rather than the structured field (see chapter 05b for the consequence on the agent demos).
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
qwen2.5-coder:0.5b | 0.5 B | ~400 MB | 2 – 4 GB | 1 – 2 GB | Demonstrating that a local LLM can run almost anywhere | Too small for serious Java work or for the agent demos |
qwen2.5-coder:1.5b | 1.5 B | ~1.0 GB | 4 GB | 2 GB | Tiny code-completion demos | Loses coherence on multi-file projects |
qwen2.5-coder:3b | 3 B | ~1.9 GB | 6 – 8 GB | 3 – 4 GB | Standard laptop baseline; basic Java examples, error explanation | Drifts on larger tasks |
qwen2.5-coder:7b | 7 B | ~4.7 GB | 12 – 16 GB | 6 – 8 GB | Documented alternative to llama3.1:8b for the agent demos (chapter 05b) — best practical balance for participants with decent machines | Tool calls land in message.content; needs the fallback parser of the demo code |
qwen2.5-coder:14b | 14 B | ~9.0 GB | 24 – 32 GB | 12 – 16 GB | Stronger Java; visible quality jump over the 7B | Slow on CPU-only, same tool-call format issue |
qwen2.5-coder:32b | 32 B | ~20 GB | 48 – 64 GB | 24 – 32 GB | Advanced coding benchmark, comparison with cloud tooling | Out of reach for laptops; realistic on a GB10-class machine |
2.2 Llama 3.1 — general-purpose, reliable tool calling
Section titled “2.2 Llama 3.1 — general-purpose, reliable tool calling”Meta’s general-purpose family. Available on Ollama in 8B, 70B, 405B. Default 8B tag around 4.9 GB with a 128 K context window.
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
llama3.1:8b | 8 B | ~4.9 GB | 12 – 16 GB | 6 – 8 GB | Course default for the agent demos — clean structured tool_calls, balanced quality on chat, reasoning, summarization | Less specialized for code than Qwen-Coder; good comparison point against a coding model |
llama3.1:70b | 70 B | ~43 GB | 96 – 128 GB | 48 – 80 GB | Cited as “excellent” in the demo 4 README for workstation users; stronger planning and long answers | Not a laptop model |
llama3.1:405b | 405 B | ~243 GB (standard format) | 300 GB+ in a classical configuration | Very high; needs research-grade infrastructure | Showing the gap between local and frontier scales | Even a 128 GB GB10 cannot run the standard Ollama tag without aggressive quantization or special setup. Mention it conceptually rather than promise it. |
2.3 DeepSeek-R1 — reasoning-focused family
Section titled “2.3 DeepSeek-R1 — reasoning-focused family”A reasoning-oriented family. Ollama lists 1.5B, 7B, 8B, 14B, 32B, 70B, 671B.
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
deepseek-r1:1.5b | 1.5 B | ~1.1 GB | 4 GB | 2 GB | Showing step-by-step reasoning on a weak machine | Too limited for production coding |
deepseek-r1:7b / :8b | 7 – 8 B | ~4.7 GB | 12 – 16 GB | 6 – 8 GB | Comparing a reasoning-focused 7B against a coding-focused 7B | Less precise than Qwen-Coder for pure Java generation |
deepseek-r1:14b | 14 B | ~9.0 GB | 24 – 32 GB | 12 – 16 GB | Planning, architecture discussion, debugging logic | Heavier; slower without GPU |
deepseek-r1:32b | 32 B | ~20 GB | 48 – 64 GB | 24 – 32 GB | Advanced reasoning benchmark on a GB10-class machine | Out of reach for laptops |
deepseek-r1:70b | 70 B | ~43 GB | 96 – 128 GB | 48 – 80 GB | Comparing advanced reasoning vs. specialised coding models | Workstation-only |
deepseek-r1:671b | 671 B | ~404 GB | Far above 128 GB unified memory | Research-grade infrastructure | Conceptual reference for the absolute top of the family | Not a realistic Ollama target, even on a GB10 |
2.4 CodeGemma — Google’s code line
Section titled “2.4 CodeGemma — Google’s code line”A coding-oriented line from Google’s Gemma family. Listed by Ollama in 2B and 7B, supporting fill-in-the-middle completion, code generation, instruction following.
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
codegemma:2b | 2 B | ~1.6 GB | 4 – 6 GB | 2 – 4 GB | Code completion on weak machines, small examples | Not strong enough for complex agent workflows |
codegemma:7b | 7 B | ~5.0 GB | 12 – 16 GB | 6 – 8 GB | Code completion, generation, instruction-following; useful comparison against qwen2.5-coder:7b | Smaller context than the most recent families |
2.5 StarCoder2 — open code family
Section titled “2.5 StarCoder2 — open code family”A code-focused open family. Listed by Ollama in 3B, 7B, 15B, with a 16 K context window.
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
starcoder2:3b | 3 B | ~1.7 GB | 6 – 8 GB | 3 – 4 GB | Demonstrating a code-only family on a small machine | Not ideal for big Java projects |
starcoder2:7b | 7 B | ~4.0 GB | 12 – 16 GB | 6 – 8 GB | Comparison across code-LLM families | Less conversational than modern instruct-tuned chat models |
starcoder2:15b | 15 B | ~9.1 GB | 24 – 32 GB | 12 – 16 GB | Advanced code generation benchmark | Heavy; not for weak laptops |
2.6 Granite-Code — IBM’s professional code line
Section titled “2.6 Granite-Code — IBM’s professional code line”IBM’s code-intelligence family. Listed by Ollama in 3B, 8B, 20B, 34B, with the 3B and 8B variants advertising a 128 K context window.
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
granite-code:3b | 3 B | ~2.0 GB | 6 – 8 GB | 3 – 4 GB | Code generation, code explanation, code fixing; second baseline next to qwen2.5-coder:3b | Limited reasoning depth |
granite-code:8b | 8 B | ~4.6 GB | 12 – 16 GB | 6 – 8 GB | Professional code-intelligence scenarios, long context | Comparison only; not the recommended agent demo model |
granite-code:20b | 20 B | ~12 GB | 32 – 48 GB | 16 – 24 GB | Advanced code-generation benchmark | Out of reach for laptops |
granite-code:34b | 34 B | ~19 GB | 48 – 64 GB | 24 – 32 GB | Strong local coding on a workstation or GB10 | Hardware-bound |
2.7 Gemma 3 — general / multimodal
Section titled “2.7 Gemma 3 — general / multimodal”A Google general-purpose family. Listed by Ollama in 270M, 1B, 4B, 12B, 27B. The 4B, 12B and 27B variants accept text and image input.
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
gemma3:1b | 1 B | ~815 MB | 4 GB | 1 – 2 GB | Lightweight general AI demonstration | Not for coding |
gemma3:4b | 4 B | ~3.3 GB | 8 – 12 GB | 4 – 6 GB | Chat, summarization, multimodal text + image examples | Not a coding-specialized model |
gemma3:12b | 12 B | ~8.1 GB | 20 – 24 GB | 10 – 16 GB | Stronger general reasoning, multimodal demonstrations | Not the first choice for pure Java |
gemma3:27b | 27 B | ~17 GB | 40 – 64 GB | 20 – 32 GB | Advanced multimodal demonstrations on a GB10 | Workstation-bound |
2.8 Mistral / Mixtral
Section titled “2.8 Mistral / Mixtral”A French open family. Ollama lists mistral as a 7B model around 4.4 GB with a 32 K context window; mixtral is a Mixture-of-Experts family with 8x7B and 8x22B variants.
| Tag | Parameters | Disk | Suggested RAM | Suggested VRAM | Best for | Limitation |
|---|---|---|---|---|---|---|
mistral:7b | 7 B | ~4.4 GB | 12 – 16 GB | 6 – 8 GB | Fast general chat, summarization, basic coding | Not as specialised for code as Qwen-Coder or StarCoder2 |
mixtral:8x7b | Mixture-of-Experts | ~26 GB | 64 – 96 GB | 32 – 48 GB | Advanced benchmark on a workstation or GB10 | ”8x7B” does not mean 56 B active simultaneously, but the file footprint is much heavier than a plain 7B |
mixtral:8x22b | Mixture-of-Experts | ~80 GB | 128 GB+ | 80 GB+ or multi-GPU | Highest end of the family | For specialised infrastructure only |
2.9 CodeLlama — historical context
Section titled “2.9 CodeLlama — historical context”Meta’s earlier code-focused family. Listed by Ollama in 7B, 13B, 34B, 70B. It can generate and discuss code, but newer specialised families (Qwen-Coder, CodeGemma, StarCoder2, Granite-Code) usually outperform it on Ollama benchmarks.
| Tag | Parameters | Use in 2026 |
|---|---|---|
codellama:7b | 7 B | Historical reference; useful to show the progress of code LLMs |
codellama:13b | 13 B | Historical reference; not the recommended default |
codellama:34b / :70b | 34 B / 70 B | Benchmark or historical comparison only |
3. Workshop recommendations by hardware tier
Section titled “3. Workshop recommendations by hardware tier”The same idea as in chapter 05b, expanded across the full catalogue.
For a live exercise where every participant must run the model locally:
qwen2.5-coder:3b— fast, small, runs on most laptops; good baseline for code completion.qwen2.5-coder:7b— best practical compromise on 16 GB RAM; documented alternative for the agent demos.mistral:7b— fast general chat and summarization.starcoder2:3b— second code-family baseline for comparison.gemma3:4b— adds a multimodal angle (text + image) at low cost.
Comparable on this tier: qwen2.5-coder:3b vs starcoder2:3b vs granite-code:3b. Same task, three families.
For participants with a workstation laptop or a desktop with a discrete GPU:
qwen2.5-coder:14b— visibly stronger code quality than the 7B.codegemma:7b— useful side-by-side againstqwen2.5-coder:7b.starcoder2:7b— third code family for the comparison.granite-code:8b— IBM professional code-intelligence variant with 128 K context.deepseek-r1:14b— reasoning-focused, useful to show planning and debugging logic.
Comparable on this tier: qwen2.5-coder:7b vs codegemma:7b vs starcoder2:7b vs granite-code:8b. Four code families at similar parameter counts.
For a department machine built around the Grace Blackwell GB10 superchip (128 GB unified memory, around 1 PFLOP in FP4):
qwen2.5-coder:32b— advanced coding benchmark on the same task as the 7B and 14B; visible quality jump.deepseek-r1:32b— reasoning-first benchmark; compare side-by-side withqwen2.5-coder:32b.granite-code:20bandgranite-code:34b— IBM at workstation scale.llama3.1:70b— explicitly cited as “excellent” in the demo 4 README; drop-in upgrade ofllama3.1:8b.mixtral:8x7b— Mixture-of-Experts comparison.
Be careful with the very largest tags. llama3.1:405b (about 243 GB) and deepseek-r1:671b (about 404 GB) do not fit in 128 GB of unified memory without aggressive quantization or special infrastructure. They belong in the conceptual section of the workshop, not in the guaranteed setup.
4. A model is not an agent by itself
Section titled “4. A model is not an agent by itself”A point worth making explicit in any workshop. The model provides language and reasoning capacity. It does not by itself read files, run commands, call APIs, or compile Java code. Those actions come from the agent framework wrapped around the model.
| Layer | What it does | Examples |
|---|---|---|
| Local runtime | Loads the model into memory and exposes a chat / completion API on 127.0.0.1. | Ollama, llama.cpp, vLLM, LM Studio |
| Tool-calling protocol | Defines how the model declares “I want to call read_file with these arguments”. | Ollama’s native tool_calls field, OpenAI-style function calling |
| Agent framework | Implements the loop (call model → execute tool → feed result back → repeat) and the sandbox in which tools run. | LangChain, LangGraph, OpenWebUI, Continue, OpenCode, the custom Python loop in ollama-demo-3-agent-java/agent_java.py |
| Integration / UI | Exposes the agent to a user, often as a chat panel, editor extension, or web UI. | Continue (VS Code), OpenWebUI, Streamlit (used in demos 1, 2, 4) |
The course demos use Ollama for the runtime, Ollama’s native tool calling for the protocol, and a custom Python loop for the agent framework. There is no LangChain, no LangGraph, no Continue. Reading agent_java.py is sufficient to understand the whole stack.
5. A reproducible benchmark protocol for a GB10-class machine
Section titled “5. A reproducible benchmark protocol for a GB10-class machine”If a department has access to a Dell Pro Max with GB10 (or an equivalent DGX Spark class machine), a useful contribution to the workshop is a side-by-side benchmark on the same Java agent task as demos 3 and 4. The protocol below is reproducible and yields directly comparable numbers.
5.1 Inputs
Section titled “5.1 Inputs”- The identical prompt of demo 3 (creating
Product.java,ProductManager.java,Main.java, then compiling). - The identical tool set (
list_files,read_file,write_file,compile_java). - The identical loop (
MAX_STEPS = 10, fallback parser enabled). - A list of candidate models spanning size and family — for example:
qwen2.5-coder:3b,:7b,:14b,:32b;llama3.1:8b,:70b;granite-code:8b,:20b,:34b;deepseek-r1:14b,:32b;mixtral:8x7b.
5.2 Metrics to capture, per model
Section titled “5.2 Metrics to capture, per model”| Metric | How to measure |
|---|---|
| Time to first token | Wall-clock between request and first response chunk |
| Tokens per second | Total output tokens / generation time |
| Number of agent steps | How many model turns before the task is solved or MAX_STEPS is reached |
Structured tool_calls ratio | Calls in the structured field / total parsed calls (the rest comes from the fallback parser) |
| Files created on first try | 0, 1, 2 or 3 of the expected .java files |
| Compilation success | javac returns 0 |
| Peak RAM and VRAM | nvidia-smi and top / htop samples during the run |
| Disk footprint | Output of ollama show <model> |
5.3 Reporting
Section titled “5.3 Reporting”A single CSV file with one row per model, plus a short observation note per row (typical failure mode, code style, anything that is not a number). Three plots cover the rest:
- Tokens per second vs. parameter count.
- Files-created ratio vs. parameter count, split by family.
- Compilation success rate vs. parameter count.
5.4 Pedagogical use
Section titled “5.4 Pedagogical use”The same Java prompt running on qwen2.5-coder:7b, qwen2.5-coder:32b and llama3.1:70b produces three observably different outputs. Showing the three outputs side by side — with the timing and the file-creation ratio — illustrates the trade-off between size, specialization, and speed more clearly than any abstract explanation.
Key takeaways
Section titled “Key takeaways”- B = billion parameters, not gigabytes. Disk size depends on quantization.
- Four memory concepts to keep separate: file size on disk, RAM (CPU mode), VRAM (GPU mode), unified memory (GB10, Apple Silicon).
- Quantization (Q4 / Q8 / FP16 / BF16) controls the trade-off between footprint and quality.
- No single model fits every use. Code-specialised models (
qwen2.5-coder,codegemma,starcoder2,granite-code) shine on Java; general models (llama3.1,mistral,gemma3) shine on chat, summarisation and reasoning; reasoning-focused models (deepseek-r1) shine on multi-step logic. - Three workshop tiers — standard laptops, stronger participant machines, GB10-class workstations — each have their own short list. Pick by hardware first, by task second.
- A model is not an agent by itself. Tool use, file operations and command execution come from the agent framework (Ollama-native, LangChain, LangGraph, OpenWebUI, custom Python). The model only provides language and reasoning.
- The 405B and 671B tags are conceptual. Their standard Ollama formats exceed a 128 GB unified-memory machine; mention them, do not promise to run them.