Choosing a local model

Duration: 8 min Prerequisites: chapter 05a

Three models were validated during development of the agent demos (3 and 4):

llama3.1:8b — the course default, recommended for first runs. Best tool-calling reliability measured (4/4 structured tool_calls in one turn, 3 of 3 files created — see the demo README journal of attempts).
llama3.1:70b — excellent on a workstation (cited in the demo 4 README). Same tool-calling behaviour as the 8B, with stronger code quality. Needs 48+ GB VRAM or unified memory.
qwen2.5-coder:7b — the code-specialized alternative. Produces higher-quality Java, but emits its tool calls inside message.content instead of the structured tool_calls field. The demo code includes a fallback JSON parser (parse_pseudo_tool_calls in agent_java.py) specifically so this model remains a viable swap — at the cost of slower runs and 2/3 file reliability on the canonical test.

Other models named in this chapter and in chapter 05a are mentioned for orders of magnitude and pedagogical context but were not validated on the agent demos. The journal of evaluated-then-rejected configurations (qwen2.5-coder:3b, llama3.2:3b, etc.) is documented in the README of ollama-demo-3-agent-java.

Key idea

Now that we’re running locally (chapter 05a), the real question is: which model? For an agent that uses tools reliably, you need a model fine-tuned for tool calling. Size is not what matters most: fine-tuning is. The course default is llama3.1:8b — but qwen2.5-coder:7b is a fully supported alternative (via the fallback parser), and llama3.1:70b is the recommended uplift when a workstation is available.

What each model can and cannot do — explicit summary

The table below is the single source of truth for “should I use this model on the course demos?”. Three rows are highlighted: those are the models the demo code (agent_java.py) was tested with.

Model	Tested on the demos?	What it can do	What it cannot do	Use it on the course demos?
`llama3.1:8b`	YES — course default	Emit clean structured `tool_calls` (4/4 in one turn on the canonical task), follow the rules of a system prompt, generate small Java files, run on a laptop with 8 GB RAM	Rival a 70B on complex reasoning, produce flawless production-grade code on the first try, manage large multi-file projects	YES — the course reference. The safest pick.
`qwen2.5-coder:7b`	YES — documented code-specialized alternative	Write higher-quality Java code than `llama3.1:8b` on the same prompt; tool calling via the fallback parser `parse_pseudo_tool_calls` (included in the demo code precisely for this model)	Emit structured `tool_calls` (the JSON lands in `message.content` instead of the structured field, and Java quote-escaping is sometimes broken — recap from the demo README: 2 of 3 files created reliably on the canonical task)	YES, with caveats. Pick it if Java code quality matters more than tool-call cleanliness. Slower (fallback parser path).
`llama3.1:70b`	YES — on workstation (cited in demo 4 README as “excellent aussi”)	Strong uplift in instruction-following, near-perfect Java, robust structured tool calling	Run on a laptop — needs 48+ GB VRAM or unified memory	YES if you have the hardware. Drop-in upgrade of `llama3.1:8b` for showing qualitative jumps.
`qwen2.5-coder:14b`	Not tested (evaluated only)	Better Java than the 7b sibling, code-specialized	Run on standard hardware (needs 12+ GB VRAM); tool-calling format same issue as the 7b (no structured `tool_calls`)	NO unless you have the hardware AND accept the fallback parser caveat
`qwen2.5-coder:3b` / `:0.5b`	Evaluated and rejected (attempts 1 and 2 of the journal)	Run on very low RAM	Reliable tool calling — pure text output, no JSON at all on the agent task	NO — too small for the structured tool-calling protocol
`llama3.2:3b` / `:latest` (3B)	Evaluated and rejected (attempt 4)	Run on 4 GB RAM	Tool calling — produces malformed JSON (pseudo-tag syntax) that even the fallback parser cannot recover	NO — fails both the structured and the fallback path
`llama3.1:405b`, `llama4`, `grok-2` (200 B+)	Not tested (Pro workstation hardware required)	Approach commercial-grade quality 100% locally	Run anywhere except a DGX Spark / Dell Pro Max GB10 class machine (128 GB unified)	NO for standard course delivery — out of hardware scope
`mistral:7b`, `command-r:7b`, `firefunction-v2`	Not tested	Documented tool-calling support on the Ollama library	Unknown behaviour on the canonical task — not benchmarked	NO — substitution untested
`gemma:2b`, `phi3`, all “base” models	Not tested	General chat	Reliable structured `tool_calls`	NO — not fine-tuned for tools

One-line summary: for the agent demos, the validated set is llama3.1:8b (default), qwen2.5-coder:7b (code-specialized alternative via the fallback parser), and llama3.1:70b (workstation uplift). Anything else, you are pioneering.

The models evaluated for the repo

Model	Size	Disk	RAM/VRAM	Tool calling	Java code quality	Verdict
`qwen2.5-coder:0.5b`	0.5 B	~400 MB	2 GB	non-existent	basic	toy
`qwen2.5-coder:3b`	3 B	~1.9 GB	4 GB	very unstable	good	personal demo
`qwen2.5-coder:7b`	7 B	~4.7 GB	8 GB	unstable* — works via fallback parser	excellent	documented code-specialized alternative
`llama3.1:8b`	8 B	~4.9 GB	8 GB	reliable (structured)	good	course default — our recommendation
`qwen2.5-coder:14b`	14 B	~9 GB	12 GB	unstable	very good	too heavy without GPU
`llama3.2:3b`	3 B	~2 GB	4 GB	acceptable	average	if low on RAM

* “Unstable on tool calling” means: Qwen produces JSON for tool calls but with poorly-escaped Java strings, which breaks json.loads. The comment at the top of ollama-demo-3-agent-java/agent_java.py explains the choice precisely:

Why llama3.1:8b and not qwen2.5-coder? Qwen2.5-Coder writes great Java but does NOT reliably populate the structured tool_calls field of Ollama’s response: it emits JSON inside message.content with often-malformed escapes for embedded Java strings, which breaks json.loads.

The rule that surprises beginners

It’s not “bigger is better”. For tool calling:

qwen2.5-coder:14b    Perfect Java, broken tool calling      NO (without fallback parser)
llama3.1:8b          OK Java, clean tool calling            YES — course default
qwen2.5-coder:7b     Excellent Java, broken tool calling    YES via fallback parser (alternative)
qwen2.5-coder:3b     Decent Java, broken tool calling       NO
llama3.2:3b          Average Java, acceptable tool calling  MAYBE

Counter-intuitive, but consistent. Llama 3.1 was specifically trained by Meta to emit structured tool_calls in JSON, on a separate channel from the text reply. Qwen2.5-Coder was trained to write code; it intellectually knows there’s a tool-call format to respect, but it slips in unescaped quotes and its output stops being parsable. See also the fallback parser (parse_pseudo_tool_calls) we had to add in demo 3 to catch those broken outputs.

How to tell if a model “can” do tool calling

The https://ollama.com/library page tags each model with its capabilities. Look for the tools mention on a model’s card. Known to do reliable tool calling:

llama3.1:8b, llama3.1:70b
llama3.2:3b
mistral:7b (recent versions)
qwen2.5:7b / qwen2.5:14b (the generalist version, not Coder)
firefunction-v2
command-r:7b

To avoid for tool calling:

qwen2.5-coder:* (any size) — excellent for chat code generation, bad at the tool-call format;
gemma:2b, phi3 — not (or poorly) fine-tuned for tools;
all “base” models (no instruct/chat suffix).

How it’s wired in the code

In ollama-demo-4-trio-agents-java/agent.py:

MODEL_NAME = "llama3.1:8b"

A single constant. To test another model, change the line, run ollama pull <new_model>, and rerun. The agent logic depends on nothing model-specific, except that the model must produce clean tool_calls.

In demo 4, the Streamlit UI doesn’t (yet) expose this choice in the sidebar — that’s something we could add as an exercise (chapter 13). The model is shown read-only in the configuration panel.

Machine recommendations by model size

Two complementary views of the same question — given my model, what hardware do I need? and given my hardware, what model can I run?

View 1 — pivot table: model → hardware tier

All values assume Ollama’s default Q4_K_M quantization (chapter 05a). 8b = 8 billion parameters; 70b = 70 billion; 405b = 405 billion.

Model	Disk	RAM at runtime	VRAM	Hardware tier	Example machine
`llama3.2:3b`	~2 GB	~3 GB	optional	Entry-level laptop	Any laptop with 8 GB RAM
`llama3.1:8b` (course default)	~5 GB	~6 GB	optional (×3 speed-up with 8 GB VRAM)	Standard laptop / desktop	16 GB RAM, recent i5 / i7 / Ryzen 5
`qwen2.5:7b`, `qwen2.5:14b`	~5 – 9 GB	~6 – 10 GB	recommended 8 – 12 GB VRAM	Standard or gaming desktop	32 GB RAM + RTX 3060 / 4060
`qwen2.5-coder:14b`	~9 GB	~10 GB	required 12 GB VRAM (otherwise CPU-only, very slow)	Workstation / gaming PC	32 GB RAM + RTX 4070 12 GB
`llama3.3:70b`, `llama3.1:70b`	~40 GB	~48 GB	required 48 GB VRAM or 64 GB+ unified RAM	High-end workstation	64 GB RAM + RTX A6000 / 6000 Ada, or Mac M3 Max 64 – 128 GB
`llama3.1:405b`	~230 GB	~250 GB (needs unified memory)	not feasible on consumer hardware	Pro workstation / lab	DGX Spark / Dell Pro Max GB10 (128 GB unified) and equivalent classes
`llama4`, `grok-2` (200 B+ open weights)	~120 – 300 GB	~150 GB+ unified	not feasible	Pro workstation / lab	DGX Spark / Dell Pro Max GB10 and equivalent classes

Unified memory matters. On a DGX Spark, a Dell Pro Max with GB10, or an Apple Silicon Mac, CPU and GPU share the same memory pool without copy. A 70 B model that needs 48 GB simply does not fit in a 24 GB RTX 4090 — but fits comfortably in a 128 GB unified GB10 or M3 Max. Above ~30 B parameters, unified memory becomes the dominant constraint.

View 2 — by machine class

What runs comfortably:

llama3.2:3b (Q4) — ~3 GB RAM, fast.
llama3.1:8b (Q4) — ~6 GB RAM, the course default. Around 5 – 15 tokens/s on a recent CPU.
qwen2.5-coder:7b (Q4) — ~6 GB RAM, the code-specialized tested alternative. Same speed range as llama3.1:8b. Pick it when Java output quality matters; the demo’s fallback parser handles its non-structured tool calls.

What is too heavy:

14 B models in CPU-only mode — usable but slow (1 – 3 tokens/s), painful for a live demo.
All 70 B+ models — they will swap and grind for several seconds per token.

Recommendation for the course: start with llama3.1:8b (default) and, if Java code quality is the focus, switch the demo-3/4 MODEL_NAME to qwen2.5-coder:7b. Both were validated on this hardware tier.

What runs comfortably:

All 3 – 8 B models — fully on GPU, 30 – 60 tokens/s. Includes llama3.1:8b (default) and qwen2.5-coder:7b (code-specialized alternative).
qwen2.5:14b and qwen2.5-coder:14b (Q4) — fits in 12 GB VRAM, ~20 – 30 tokens/s. Good for code-generation experiments outside of strict tool calling.

What is too heavy:

70 B models — VRAM too tight even in Q4; CPU offload pulls speed down to 1 – 5 tokens/s.

Recommendation for the course: llama3.1:8b for the canonical run; qwen2.5-coder:7b for the same demos when you want nicer Java (fallback parser kicks in). Free to experiment with 14 B models for side projects, but expect the same tool-call format issue scaled up.

What runs comfortably:

All 3 – 14 B models — instantaneous (50 – 100+ tokens/s).
llama3.1:70b (Q4) — fits, ~10 – 25 tokens/s depending on hardware. Big jump in instruction-following and reasoning quality.

What is still out of reach:

200 B+ models (Llama 4, Grok-2, Llama 3.1 405B) — even in Q4 they exceed 100 GB of memory.

Recommendation for the course: ideal for showing in class the qualitative gap between llama3.1:8b and llama3.1:70b on the same agent task — same prompt, same tools, very different code quality.

Reference machines: NVIDIA DGX Spark, Dell Pro Max with GB10, and equivalent platforms built around the Grace Blackwell GB10 superchip — 128 GB unified memory, around 1 PFLOP in FP4.

What becomes possible:

All 70 B models in Q8 or FP16 — near-research quality at the desk.
llama3.1:405b (Q4) — runs locally for the first time on a desktop-class machine.
llama4 and grok-2 open-weights releases (200 B+) — runnable locally in Q4.
Multi-agent setups where two or three 14 B agents live in memory simultaneously without swap.

Recommendation for the course: if such a machine is available in a lab or department, the most pedagogically valuable demonstration is to run exactly the same agent prompt of demo 3 on llama3.1:8b, llama3.1:70b, and llama3.1:405b side by side, and let students see — with their own eyes — the qualitative jump (and the diminishing returns at the top).

Classroom recommendation

Configuration	Recommended model	Alternative tested	Why
Standard PC 8 GB RAM, no GPU	`llama3.1:8b`	`qwen2.5-coder:7b` (with fallback parser)	Course default, balanced RAM / quality. `qwen2.5-coder:7b` is the code-specialized alternative when you want better Java output and accept slower runs.
PC 16 GB RAM, no GPU	`llama3.1:8b`	`qwen2.5-coder:7b`	Same as above, smoother.
PC with NVIDIA 8 GB VRAM GPU	`llama3.1:8b`	`qwen2.5-coder:7b` or `qwen2.5:14b`	You can go bigger; `llama3.1:8b` keeps tool calling clean, the qwen-coder line wins on Java quality.
High-end workstation (48+ GB VRAM or unified)	`llama3.1:70b`	`llama3.1:8b` for the canonical run	`llama3.1:70b` is explicitly cited as “excellent” in the demo 4 README. Drop-in upgrade.
Light laptop 4 GB RAM	`llama3.2:3b`	—	Only viable choice; quality is visibly degraded; tool calling is “acceptable” but not “reliable”.

For a demo 4 shown to a teacher, stick with llama3.1:8b. That’s the configuration tested end-to-end with the 8 menu prompts, the Verify agent and the JUnit Tests agent. If you swap it for qwen2.5-coder:7b to showcase nicer Java code, expect slower runs (the fallback parser does extra work) and occasional missing files (recap from the demo README: 2/3 file reliability on the canonical task).

Key takeaways

The course default for the agent demos is llama3.1:8b — the best compromise on 8 GB RAM and the only model with consistently structured tool_calls in our measurements.
The documented code-specialized alternative is qwen2.5-coder:7b. It writes higher-quality Java, but emits its tool calls inside message.content, which is why the demo code includes the parse_pseudo_tool_calls fallback parser — explicitly so this model remains a viable swap.
The workstation upgrade is llama3.1:70b — explicitly cited as “excellent” in the demo 4 README.
For tool calling, size matters less than fine-tuning. qwen2.5-coder:14b is bigger than llama3.1:8b but its tool-call output is worse.
Check the tools mention on the model’s page in the Ollama library before assuming any model can do tool calling — and read the journal of attempts in the demo 3 README to see how that capability label was not enough to guarantee structured output for qwen-coder.
To switch model in our demos: one line in agent.py, one ollama pull, done.
Unified memory (DGX Spark, Dell Pro Max GB10, Apple Silicon) is the gating factor above ~30 B parameters — separate RAM and VRAM cannot scale to 200 B+ models, no matter how much you stack.