Skip to content

Annex — Local LLM catalogue

Duration: 20 min Pre-requisite: chapter 05b

Chapter 05b is deliberately focused: it lists the three models validated on the course demos and tells the reader which one to pick by hardware class. This annex zooms out. It is meant as a reference catalogue for anyone who wants to understand the rest of the local LLM landscape — what “32B” really means, how qwen2.5-coder compares to granite-code or starcoder2, and what becomes possible on a workstation-class machine like a Dell Pro Max with GB10 (a Grace Blackwell GB10 platform, hardware-equivalent to an NVIDIA DGX Spark).

The annex is organised in four blocks:

  1. Vocabulary — what “B”, “Q4”, “RAM”, “VRAM”, “unified memory” actually mean.
  2. The catalogue — family by family, with parameter count, disk size, suggested RAM and VRAM.
  3. Three workshop tiers — what runs on a standard laptop, a stronger desktop, and a GB10-class machine.
  4. What a model is not — a clarification that a model alone is not an agent: tool use depends on the surrounding framework.

1. Vocabulary: what the model name actually tells you

Section titled “1. Vocabulary: what the model name actually tells you”

1.1 “B” means billion parameters — not gigabytes

Section titled “1.1 “B” means billion parameters — not gigabytes”

A common confusion in classrooms is to read qwen2.5-coder:32b and assume “32 GB on disk”. That is not what the name means. The B stands for billion parameters.

Name suffixApproximate parameter count
0.5b0.5 billion parameters (5 × 10⁸)
1.5b1.5 billion parameters
3b3 billion parameters
7b7 billion parameters
8b8 billion parameters
14b14 billion parameters
20b20 billion parameters
32b32 billion parameters
70b70 billion parameters
405b405 billion parameters
671b671 billion parameters

The parameters are the internal learned weights of the model. More parameters generally mean better reasoning, better code, more stable instruction-following — at the cost of more memory, more storage and more compute time.

When a model card says “needs 8 GB”, that figure can refer to four different things. Mixing them up leads to wrong purchasing decisions.

ConceptWhat it isWhen it matters
File size on diskThe size of the .gguf file Ollama downloads. Around 4.7 GB for llama3.1:8b (Q4).When you ollama pull — disk space and download time.
RAMSystem memory used when the model runs on the CPU. Roughly file size + 1 – 2 GB for context and runtime.On a laptop with no usable GPU.
VRAMGPU memory used when the model runs on a discrete graphics card. Same order of magnitude as RAM, but read/write is much faster.When a CUDA/ROCm GPU is present.
Unified memoryA single memory pool shared by CPU and GPU without copy. Found on Apple Silicon, DGX Spark, Dell Pro Max GB10.When the model is too big to fit in any single VRAM card but still fits in the shared pool.

A 70B model that needs about 48 GB simply does not fit in a 24 GB RTX 4090 — but fits comfortably in 128 GB of unified memory on a GB10 or M3 Max. Above roughly 30 B parameters, unified memory becomes the dominant constraint.

1.3 Quantization — Q4, Q8, FP16, BF16 (summary)

Section titled “1.3 Quantization — Q4, Q8, FP16, BF16 (summary)”

The full precision of a model is FP16 (16 bits per weight) or BF16. Quantization compresses each weight to fewer bits, which shrinks the file and reduces RAM/VRAM use at a measurable cost in quality.

QuantizationBits per weightFootprint (relative to FP16)Quality costTypical use
Q4 (Q4_K_M is Ollama’s default)~4 bits~25 %Small, often imperceptibleLaptops, classroom
Q5 / Q6~5 – 6 bits~35 – 45 %Very smallMid-range desktops
Q8~8 bits~50 %Almost none on most tasksWorkstation, GPU >= 16 GB
FP16 / BF1616 bits100 %None (reference)Research, fine-tuning

A qwen2.5-coder:32b in Q4 (around 20 GB on disk) is reachable on a workstation with 48 GB of RAM, while the same model in FP16 (about 64 GB) is not.


All disk sizes refer to Ollama’s default tag (Q4_K_M unless otherwise stated). Suggested RAM and VRAM values are practical floors for comfortable live use — they include context, runtime overhead and a small safety margin.

2.1 Qwen2.5-Coder — code-specialized family

Section titled “2.1 Qwen2.5-Coder — code-specialized family”

A code-specialized line from Alibaba. Available on Ollama in 0.5B, 1.5B, 3B, 7B, 14B, 32B. The 7B and 14B variants advertise the tools capability on the Ollama library page but emit their tool calls inside message.content rather than the structured field (see chapter 05b for the consequence on the agent demos).

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
qwen2.5-coder:0.5b0.5 B~400 MB2 – 4 GB1 – 2 GBDemonstrating that a local LLM can run almost anywhereToo small for serious Java work or for the agent demos
qwen2.5-coder:1.5b1.5 B~1.0 GB4 GB2 GBTiny code-completion demosLoses coherence on multi-file projects
qwen2.5-coder:3b3 B~1.9 GB6 – 8 GB3 – 4 GBStandard laptop baseline; basic Java examples, error explanationDrifts on larger tasks
qwen2.5-coder:7b7 B~4.7 GB12 – 16 GB6 – 8 GBDocumented alternative to llama3.1:8b for the agent demos (chapter 05b) — best practical balance for participants with decent machinesTool calls land in message.content; needs the fallback parser of the demo code
qwen2.5-coder:14b14 B~9.0 GB24 – 32 GB12 – 16 GBStronger Java; visible quality jump over the 7BSlow on CPU-only, same tool-call format issue
qwen2.5-coder:32b32 B~20 GB48 – 64 GB24 – 32 GBAdvanced coding benchmark, comparison with cloud toolingOut of reach for laptops; realistic on a GB10-class machine

2.2 Llama 3.1 — general-purpose, reliable tool calling

Section titled “2.2 Llama 3.1 — general-purpose, reliable tool calling”

Meta’s general-purpose family. Available on Ollama in 8B, 70B, 405B. Default 8B tag around 4.9 GB with a 128 K context window.

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
llama3.1:8b8 B~4.9 GB12 – 16 GB6 – 8 GBCourse default for the agent demos — clean structured tool_calls, balanced quality on chat, reasoning, summarizationLess specialized for code than Qwen-Coder; good comparison point against a coding model
llama3.1:70b70 B~43 GB96 – 128 GB48 – 80 GBCited as “excellent” in the demo 4 README for workstation users; stronger planning and long answersNot a laptop model
llama3.1:405b405 B~243 GB (standard format)300 GB+ in a classical configurationVery high; needs research-grade infrastructureShowing the gap between local and frontier scalesEven a 128 GB GB10 cannot run the standard Ollama tag without aggressive quantization or special setup. Mention it conceptually rather than promise it.

2.3 DeepSeek-R1 — reasoning-focused family

Section titled “2.3 DeepSeek-R1 — reasoning-focused family”

A reasoning-oriented family. Ollama lists 1.5B, 7B, 8B, 14B, 32B, 70B, 671B.

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
deepseek-r1:1.5b1.5 B~1.1 GB4 GB2 GBShowing step-by-step reasoning on a weak machineToo limited for production coding
deepseek-r1:7b / :8b7 – 8 B~4.7 GB12 – 16 GB6 – 8 GBComparing a reasoning-focused 7B against a coding-focused 7BLess precise than Qwen-Coder for pure Java generation
deepseek-r1:14b14 B~9.0 GB24 – 32 GB12 – 16 GBPlanning, architecture discussion, debugging logicHeavier; slower without GPU
deepseek-r1:32b32 B~20 GB48 – 64 GB24 – 32 GBAdvanced reasoning benchmark on a GB10-class machineOut of reach for laptops
deepseek-r1:70b70 B~43 GB96 – 128 GB48 – 80 GBComparing advanced reasoning vs. specialised coding modelsWorkstation-only
deepseek-r1:671b671 B~404 GBFar above 128 GB unified memoryResearch-grade infrastructureConceptual reference for the absolute top of the familyNot a realistic Ollama target, even on a GB10

A coding-oriented line from Google’s Gemma family. Listed by Ollama in 2B and 7B, supporting fill-in-the-middle completion, code generation, instruction following.

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
codegemma:2b2 B~1.6 GB4 – 6 GB2 – 4 GBCode completion on weak machines, small examplesNot strong enough for complex agent workflows
codegemma:7b7 B~5.0 GB12 – 16 GB6 – 8 GBCode completion, generation, instruction-following; useful comparison against qwen2.5-coder:7bSmaller context than the most recent families

A code-focused open family. Listed by Ollama in 3B, 7B, 15B, with a 16 K context window.

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
starcoder2:3b3 B~1.7 GB6 – 8 GB3 – 4 GBDemonstrating a code-only family on a small machineNot ideal for big Java projects
starcoder2:7b7 B~4.0 GB12 – 16 GB6 – 8 GBComparison across code-LLM familiesLess conversational than modern instruct-tuned chat models
starcoder2:15b15 B~9.1 GB24 – 32 GB12 – 16 GBAdvanced code generation benchmarkHeavy; not for weak laptops

2.6 Granite-Code — IBM’s professional code line

Section titled “2.6 Granite-Code — IBM’s professional code line”

IBM’s code-intelligence family. Listed by Ollama in 3B, 8B, 20B, 34B, with the 3B and 8B variants advertising a 128 K context window.

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
granite-code:3b3 B~2.0 GB6 – 8 GB3 – 4 GBCode generation, code explanation, code fixing; second baseline next to qwen2.5-coder:3bLimited reasoning depth
granite-code:8b8 B~4.6 GB12 – 16 GB6 – 8 GBProfessional code-intelligence scenarios, long contextComparison only; not the recommended agent demo model
granite-code:20b20 B~12 GB32 – 48 GB16 – 24 GBAdvanced code-generation benchmarkOut of reach for laptops
granite-code:34b34 B~19 GB48 – 64 GB24 – 32 GBStrong local coding on a workstation or GB10Hardware-bound

A Google general-purpose family. Listed by Ollama in 270M, 1B, 4B, 12B, 27B. The 4B, 12B and 27B variants accept text and image input.

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
gemma3:1b1 B~815 MB4 GB1 – 2 GBLightweight general AI demonstrationNot for coding
gemma3:4b4 B~3.3 GB8 – 12 GB4 – 6 GBChat, summarization, multimodal text + image examplesNot a coding-specialized model
gemma3:12b12 B~8.1 GB20 – 24 GB10 – 16 GBStronger general reasoning, multimodal demonstrationsNot the first choice for pure Java
gemma3:27b27 B~17 GB40 – 64 GB20 – 32 GBAdvanced multimodal demonstrations on a GB10Workstation-bound

A French open family. Ollama lists mistral as a 7B model around 4.4 GB with a 32 K context window; mixtral is a Mixture-of-Experts family with 8x7B and 8x22B variants.

TagParametersDiskSuggested RAMSuggested VRAMBest forLimitation
mistral:7b7 B~4.4 GB12 – 16 GB6 – 8 GBFast general chat, summarization, basic codingNot as specialised for code as Qwen-Coder or StarCoder2
mixtral:8x7bMixture-of-Experts~26 GB64 – 96 GB32 – 48 GBAdvanced benchmark on a workstation or GB10”8x7B” does not mean 56 B active simultaneously, but the file footprint is much heavier than a plain 7B
mixtral:8x22bMixture-of-Experts~80 GB128 GB+80 GB+ or multi-GPUHighest end of the familyFor specialised infrastructure only

Meta’s earlier code-focused family. Listed by Ollama in 7B, 13B, 34B, 70B. It can generate and discuss code, but newer specialised families (Qwen-Coder, CodeGemma, StarCoder2, Granite-Code) usually outperform it on Ollama benchmarks.

TagParametersUse in 2026
codellama:7b7 BHistorical reference; useful to show the progress of code LLMs
codellama:13b13 BHistorical reference; not the recommended default
codellama:34b / :70b34 B / 70 BBenchmark or historical comparison only

3. Workshop recommendations by hardware tier

Section titled “3. Workshop recommendations by hardware tier”

The same idea as in chapter 05b, expanded across the full catalogue.

For a live exercise where every participant must run the model locally:

  • qwen2.5-coder:3b — fast, small, runs on most laptops; good baseline for code completion.
  • qwen2.5-coder:7b — best practical compromise on 16 GB RAM; documented alternative for the agent demos.
  • mistral:7b — fast general chat and summarization.
  • starcoder2:3b — second code-family baseline for comparison.
  • gemma3:4b — adds a multimodal angle (text + image) at low cost.

Comparable on this tier: qwen2.5-coder:3b vs starcoder2:3b vs granite-code:3b. Same task, three families.


A point worth making explicit in any workshop. The model provides language and reasoning capacity. It does not by itself read files, run commands, call APIs, or compile Java code. Those actions come from the agent framework wrapped around the model.

LayerWhat it doesExamples
Local runtimeLoads the model into memory and exposes a chat / completion API on 127.0.0.1.Ollama, llama.cpp, vLLM, LM Studio
Tool-calling protocolDefines how the model declares “I want to call read_file with these arguments”.Ollama’s native tool_calls field, OpenAI-style function calling
Agent frameworkImplements the loop (call model → execute tool → feed result back → repeat) and the sandbox in which tools run.LangChain, LangGraph, OpenWebUI, Continue, OpenCode, the custom Python loop in ollama-demo-3-agent-java/agent_java.py
Integration / UIExposes the agent to a user, often as a chat panel, editor extension, or web UI.Continue (VS Code), OpenWebUI, Streamlit (used in demos 1, 2, 4)

The course demos use Ollama for the runtime, Ollama’s native tool calling for the protocol, and a custom Python loop for the agent framework. There is no LangChain, no LangGraph, no Continue. Reading agent_java.py is sufficient to understand the whole stack.


5. A reproducible benchmark protocol for a GB10-class machine

Section titled “5. A reproducible benchmark protocol for a GB10-class machine”

If a department has access to a Dell Pro Max with GB10 (or an equivalent DGX Spark class machine), a useful contribution to the workshop is a side-by-side benchmark on the same Java agent task as demos 3 and 4. The protocol below is reproducible and yields directly comparable numbers.

  • The identical prompt of demo 3 (creating Product.java, ProductManager.java, Main.java, then compiling).
  • The identical tool set (list_files, read_file, write_file, compile_java).
  • The identical loop (MAX_STEPS = 10, fallback parser enabled).
  • A list of candidate models spanning size and family — for example: qwen2.5-coder:3b, :7b, :14b, :32b; llama3.1:8b, :70b; granite-code:8b, :20b, :34b; deepseek-r1:14b, :32b; mixtral:8x7b.
MetricHow to measure
Time to first tokenWall-clock between request and first response chunk
Tokens per secondTotal output tokens / generation time
Number of agent stepsHow many model turns before the task is solved or MAX_STEPS is reached
Structured tool_calls ratioCalls in the structured field / total parsed calls (the rest comes from the fallback parser)
Files created on first try0, 1, 2 or 3 of the expected .java files
Compilation successjavac returns 0
Peak RAM and VRAMnvidia-smi and top / htop samples during the run
Disk footprintOutput of ollama show <model>

A single CSV file with one row per model, plus a short observation note per row (typical failure mode, code style, anything that is not a number). Three plots cover the rest:

  • Tokens per second vs. parameter count.
  • Files-created ratio vs. parameter count, split by family.
  • Compilation success rate vs. parameter count.

The same Java prompt running on qwen2.5-coder:7b, qwen2.5-coder:32b and llama3.1:70b produces three observably different outputs. Showing the three outputs side by side — with the timing and the file-creation ratio — illustrates the trade-off between size, specialization, and speed more clearly than any abstract explanation.


  • B = billion parameters, not gigabytes. Disk size depends on quantization.
  • Four memory concepts to keep separate: file size on disk, RAM (CPU mode), VRAM (GPU mode), unified memory (GB10, Apple Silicon).
  • Quantization (Q4 / Q8 / FP16 / BF16) controls the trade-off between footprint and quality.
  • No single model fits every use. Code-specialised models (qwen2.5-coder, codegemma, starcoder2, granite-code) shine on Java; general models (llama3.1, mistral, gemma3) shine on chat, summarisation and reasoning; reasoning-focused models (deepseek-r1) shine on multi-step logic.
  • Three workshop tiers — standard laptops, stronger participant machines, GB10-class workstations — each have their own short list. Pick by hardware first, by task second.
  • A model is not an agent by itself. Tool use, file operations and command execution come from the agent framework (Ollama-native, LangChain, LangGraph, OpenWebUI, custom Python). The model only provides language and reasoning.
  • The 405B and 671B tags are conceptual. Their standard Ollama formats exceed a 128 GB unified-memory machine; mention them, do not promise to run them.