Choosing a local model
Duration: 8 min
Prerequisites: chapter 05a
Key idea
Now that we’re running locally (chapter 05a), the real question is: which model? For an agent that uses tools reliably, you need a model fine-tuned for tool calling. Size is not what matters most: fine-tuning is. The course default is llama3.1:8b, but qwen2.5-coder:7b is a fully supported alternative (via the fallback parser), and llama3.1:70b is the recommended upgrade when a workstation is available.
What each model can and cannot do — explicit summary
The table below is the single source of truth for “should I use this model on the course demos?”. Three rows are highlighted: those are the models the demo code (agent_java.py) was tested with.
| Model | Tested on the demos? | What it can do | What it cannot do | Use it on the course demos? |
|---|---|---|---|---|
| **llama3.1:8b** | YES (course default) | Emit clean structured tool_calls (4/4 in one turn on the canonical task), follow the rules of a system prompt, generate small Java files, run on a laptop with 8 GB RAM | Rival a 70B on complex reasoning, produce flawless production-grade code on the first try, manage large multi-file projects | YES. The course reference and the safest pick. |
| **qwen2.5-coder:7b** | YES (documented code-specialized alternative) | Write higher-quality Java code than llama3.1:8b on the same prompt; tool calling via the fallback parser parse_pseudo_tool_calls (included in the demo code precisely for this model) | Emit structured tool_calls (the JSON lands in message.content instead of the structured field, and Java quote-escaping is sometimes broken; per the demo README, 2 of 3 files are created reliably on the canonical task) | YES, with caveats. Pick it if Java code quality matters more than tool-call cleanliness. Slower (fallback parser path). |
| **llama3.1:70b** | YES, on a workstation (cited in the demo 4 README as “excellent aussi”, i.e. “also excellent”) | Strong uplift in instruction-following, near-perfect Java, robust structured tool calling | Run on a laptop: it needs 48+ GB VRAM or unified memory | YES if you have the hardware. Drop-in upgrade of llama3.1:8b for showing qualitative jumps. |
| qwen2.5-coder:14b | Not tested (evaluated only) | Better Java than the 7b sibling, code-specialized | Run on standard hardware (needs 12+ GB VRAM); same tool-calling format issue as the 7b (no structured tool_calls) | NO unless you have the hardware AND accept the fallback parser caveat |
| qwen2.5-coder:3b / :0.5b | Evaluated and rejected (attempts 1 and 2 of the journal) | Run on very low RAM | Reliable tool calling: pure text output, no JSON at all on the agent task | NO. Too small for the structured tool-calling protocol |
| llama3.2:3b / :latest (3B) | Evaluated and rejected (attempt 4) | Run on 4 GB RAM | Tool calling: produces malformed JSON (pseudo-tag syntax) that even the fallback parser cannot recover | NO. Fails both the structured and the fallback path |
| llama3.1:405b, llama4, grok-2 (200 B+) | Not tested (Pro workstation hardware required) | Approach commercial-grade quality 100% locally | Run on anything below DGX Spark / Dell Pro Max GB10 class hardware (128 GB unified per machine; the largest models need more than one unit) | NO for standard course delivery. Out of hardware scope |
| mistral:7b, command-r:7b, firefunction-v2 | Not tested | Documented tool-calling support on the Ollama library | Unknown behaviour on the canonical task (not benchmarked) | NO. Substitution untested |
| gemma:2b, phi3, all “base” models | Not tested | General chat | Reliable structured tool_calls | NO. Not fine-tuned for tools |
One-line summary: for the agent demos, the validated set is llama3.1:8b (default), qwen2.5-coder:7b (code-specialized alternative via the fallback parser), and llama3.1:70b (workstation uplift). Anything else, you are pioneering.
The models evaluated for the repo
| Model | Size | Disk | RAM/VRAM | Tool calling | Java code quality | Verdict |
|---|---|---|---|---|---|---|
| qwen2.5-coder:0.5b | 0.5 B | ~400 MB | 2 GB | non-existent | basic | toy |
| qwen2.5-coder:3b | 3 B | ~1.9 GB | 4 GB | very unstable | good | personal demo |
| qwen2.5-coder:7b | 7 B | ~4.7 GB | 8 GB | unstable*; works via fallback parser | excellent | documented code-specialized alternative |
| llama3.1:8b | 8 B | ~4.9 GB | 8 GB | reliable (structured) | good | course default (our recommendation) |
| qwen2.5-coder:14b | 14 B | ~9 GB | 12 GB | unstable | very good | too heavy without GPU |
| llama3.2:3b | 3 B | ~2 GB | 4 GB | acceptable | average | if low on RAM |
* “Unstable on tool calling” means: Qwen produces JSON for tool calls but with poorly-escaped Java strings, which breaks json.loads. The comment at the top of ollama-demo-3-agent-java/agent_java.py explains the choice precisely:
Why llama3.1:8b and not qwen2.5-coder? Qwen2.5-Coder writes great Java but does NOT reliably populate the structured tool_calls field of Ollama’s response: it emits JSON inside message.content with often-malformed escapes for embedded Java strings, which breaks json.loads.
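To make that failure mode concrete, here is a minimal sketch of how json.loads reacts to a correctly escaped payload versus one with raw quotes inside the Java string. The tool name write_file and both payloads are illustrative assumptions, not output captured from the demo.

```python
import json

# Assumption: "write_file" is a hypothetical tool name used for illustration.
# Well-formed: the quotes inside the Java source are escaped as \" in the JSON.
good = '{"name": "write_file", "arguments": {"content": "System.out.println(\\"Hi\\");"}}'

# Malformed: the inner quotes are left raw, so the JSON string ends too early.
bad = '{"name": "write_file", "arguments": {"content": "System.out.println("Hi");"}}'


def try_parse(raw):
    """Return the decoded object, or None when the text is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


print(try_parse(good))  # a dict whose content is valid Java source
print(try_parse(bad))   # None: the unescaped quotes break the parser
```

A single unescaped quote is enough to lose the whole tool call, which is why escaping quality matters more here than the quality of the Java itself.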
The rule that surprises beginners
It’s not “bigger is better”. For tool calling:

| Model | Java | Tool calling | Use on the agent demos? |
|---|---|---|---|
| qwen2.5-coder:14b | Perfect | broken | NO (without fallback parser) |
| llama3.1:8b | OK | clean | YES (course default) |
| qwen2.5-coder:7b | Excellent | broken | YES via fallback parser (alternative) |
| qwen2.5-coder:3b | Decent | broken | NO |
| llama3.2:3b | Average | acceptable | MAYBE |

Counter-intuitive, but consistent. Llama 3.1 was specifically trained by Meta to emit structured tool_calls in JSON, on a separate channel from the text reply. Qwen2.5-Coder was trained to write code; it recognizes that there is a tool-call format to respect, but it slips in unescaped quotes and its output stops being parsable. See also the fallback parser (parse_pseudo_tool_calls) we had to add in demo 3 to catch those broken outputs.
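The two channels can be illustrated with a short sketch: read the structured tool_calls field first, and only then look for JSON inside the text reply. This is not the demo’s parse_pseudo_tool_calls; the function name extract_tool_calls, the write_file tool, and the sample messages are hypothetical, and the message shape follows the chat response format Ollama documents (message.tool_calls[i].function).

```python
import json
import re


def extract_tool_calls(message):
    """Return (source, calls); source is 'structured', 'content' or 'none'."""
    # 1. Preferred path: the structured field, filled in by llama3.1.
    structured = message.get("tool_calls") or []
    if structured:
        return "structured", [call["function"] for call in structured]

    # 2. Fallback path: look for one JSON object embedded in the text reply.
    content = message.get("content") or ""
    match = re.search(r"\{.*\}", content, re.DOTALL)
    candidate = None
    if match:
        try:
            candidate = json.loads(match.group(0))
        except json.JSONDecodeError:
            candidate = None  # e.g. unescaped quotes in a Java string
    if isinstance(candidate, dict) and "name" in candidate:
        call = {"name": candidate["name"],
                "arguments": candidate.get("arguments", {})}
        return "content", [call]
    return "none", []


# Hypothetical messages, shaped like the "message" part of a chat response.
llama_msg = {"content": "", "tool_calls": [
    {"function": {"name": "write_file", "arguments": {"path": "A.java"}}}]}
qwen_msg = {"content": 'Sure:\n{"name": "write_file", "arguments": {"path": "A.java"}}'}
broken_msg = {"content": '{"name": "write_file", "arguments": {"content": "println("Hi");"}}'}

for msg in (llama_msg, qwen_msg, broken_msg):
    print(extract_tool_calls(msg)[0])
```

The third message shows the limit of any fallback: when the embedded JSON itself is malformed, a simple parser has nothing to recover.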
How to tell if a model “can” do tool calling
The https://ollama.com/library page tags each model with its capabilities. Look for the tools mention on a model’s card. Known to do reliable tool calling:
- llama3.1:8b, llama3.1:70b
- llama3.2:3b
- mistral:7b (recent versions)
- qwen2.5:7b / qwen2.5:14b (the generalist version, not Coder)
- firefunction-v2
- command-r:7b
To avoid for tool calling:
- qwen2.5-coder:* (any size): excellent for chat code generation, bad at the tool-call format;
- gemma:2b, phi3: not (or poorly) fine-tuned for tools;
- all “base” models (no instruct/chat suffix).
How it’s wired in the code
In ollama-demo-4-trio-agents-java/agent.py:

    MODEL_NAME = "llama3.1:8b"

A single constant. To test another model, change the line, run ollama pull <new_model>, and rerun. The agent logic depends on nothing model-specific, except that the model must produce clean tool_calls.
In demo 4, the Streamlit UI doesn’t (yet) expose this choice in the sidebar — that’s something we could add as an exercise (chapter 13). The model is shown read-only in the configuration panel.
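If editing the file between runs becomes tedious, one option is to let an environment variable override the constant. This is only a sketch and not how the demo is written: the AGENT_MODEL variable and the resolve_model_name helper are hypothetical names.

```python
import os

DEFAULT_MODEL = "llama3.1:8b"  # the course default


def resolve_model_name(environ=None):
    """Return the model named in AGENT_MODEL (hypothetical), else the default."""
    env = os.environ if environ is None else environ
    name = env.get("AGENT_MODEL", "").strip()
    return name or DEFAULT_MODEL


MODEL_NAME = resolve_model_name()
print(MODEL_NAME)
```

With this in place, a run such as `AGENT_MODEL=qwen2.5-coder:7b python agent.py` would pick the alternative model, provided it has been pulled first with ollama pull.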
Machine recommendations by model size
Two complementary views of the same question: given my model, what hardware do I need? And given my hardware, what model can I run?
View 1 — pivot table: model → hardware tier
All values assume Ollama’s default Q4_K_M quantization (chapter 05a). 8b = 8 billion parameters; 70b = 70 billion; 405b = 405 billion.
| Model | Disk | RAM at runtime | VRAM | Hardware tier | Example machine |
|---|---|---|---|---|---|
| llama3.2:3b | ~2 GB | ~3 GB | optional | Entry-level laptop | Any laptop with 8 GB RAM |
| llama3.1:8b (course default) | ~5 GB | ~6 GB | optional (×3 speed-up with 8 GB VRAM) | Standard laptop / desktop | 16 GB RAM, recent i5 / i7 / Ryzen 5 |
| qwen2.5:7b, qwen2.5:14b | ~5 – 9 GB | ~6 – 10 GB | recommended 8 – 12 GB VRAM | Standard or gaming desktop | 32 GB RAM + RTX 3060 / 4060 |
| qwen2.5-coder:14b | ~9 GB | ~10 GB | required 12 GB VRAM (otherwise CPU-only, very slow) | Workstation / gaming PC | 32 GB RAM + RTX 4070 12 GB |
| llama3.3:70b, llama3.1:70b | ~40 GB | ~48 GB | required 48 GB VRAM or 64 GB+ unified RAM | High-end workstation | 64 GB RAM + RTX A6000 / 6000 Ada, or Mac M3 Max 64 – 128 GB |
| llama3.1:405b | ~230 GB | ~250 GB (needs unified memory) | not feasible on consumer hardware | Pro workstation / lab | More than one DGX Spark / Dell Pro Max GB10 (128 GB unified each), linked, or an equivalent larger system |
| llama4, grok-2 (200 B+ open weights) | ~120 – 300 GB | ~150 GB+ unified | not feasible | Pro workstation / lab | DGX Spark / Dell Pro Max GB10 class, linked where the footprint exceeds 128 GB, and equivalent classes |
Unified memory matters. On a DGX Spark, a Dell Pro Max with GB10, or an Apple Silicon Mac, CPU and GPU share the same memory pool without copy. A 70 B model that needs 48 GB simply does not fit in a 24 GB RTX 4090 — but fits comfortably in a 128 GB unified GB10 or M3 Max. Above ~30 B parameters, unified memory becomes the dominant constraint.
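The 24 GB versus 128 GB comparison is simple arithmetic, sketched below. Two numbers are assumptions rather than course data: roughly 4.8 bits per weight as an average for Q4_K_M, and a 15% overhead for the KV cache and runtime buffers. The function names are also illustrative.

```python
BITS_PER_WEIGHT_Q4 = 4.8  # assumption: rough average for Q4_K_M quantization
RUNTIME_OVERHEAD = 1.15   # assumption: KV cache and buffers add about 15%


def weights_gb(params_billion, bits_per_weight=BITS_PER_WEIGHT_Q4):
    """Approximate size of the weights in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, memory_gb):
    """True when weights plus overhead fit in a single memory pool."""
    return weights_gb(params_billion) * RUNTIME_OVERHEAD <= memory_gb


for pool_gb in (24, 128):
    print(f"70B in {pool_gb} GB:", fits(70, pool_gb))
```

Under these assumptions a 70 B model comes out at about 42 GB of weights and about 48 GB at runtime, which matches the table and explains why a 24 GB card is not enough.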
View 2 — by machine class
Standard laptop / desktop (CPU only)
What runs comfortably:
- llama3.2:3b (Q4): ~3 GB RAM, fast.
- llama3.1:8b (Q4): ~6 GB RAM, the course default. Around 5 – 15 tokens/s on a recent CPU.
- qwen2.5-coder:7b (Q4): ~6 GB RAM, the tested code-specialized alternative. Same speed range as llama3.1:8b. Pick it when Java output quality matters; the demo’s fallback parser handles its non-structured tool calls.
What is too heavy:
- 14 B models in CPU-only mode — usable but slow (1 – 3 tokens/s), painful for a live demo.
- All 70 B+ models — they will swap and grind for several seconds per token.
Recommendation for the course: start with llama3.1:8b (default) and, if Java code quality is the focus, switch the demo-3/4 MODEL_NAME to qwen2.5-coder:7b. Both were validated on this hardware tier.
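To see what these token rates mean in front of a class, the sketch below converts them into waiting time. The 300-token reply length is an assumed figure for a small Java file, not a measurement from the demos.

```python
def generation_seconds(tokens, tokens_per_second):
    """Time needed to generate a reply of the given length."""
    return tokens / tokens_per_second


def wait_range(tokens, slow_tps, fast_tps):
    """Return (best case, worst case) in seconds for a range of token rates."""
    return (generation_seconds(tokens, fast_tps),
            generation_seconds(tokens, slow_tps))


REPLY_TOKENS = 300  # assumption: rough size of one small Java file

print("8B on CPU :", wait_range(REPLY_TOKENS, 5, 15))  # 5 - 15 tokens/s
print("14B on CPU:", wait_range(REPLY_TOKENS, 1, 3))   # 1 - 3 tokens/s
```

At the rates quoted above, an 8 B model answers in 20 to 60 seconds per file, while a 14 B model in CPU-only mode takes 100 to 300 seconds, which is what makes it painful for a live demo.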
Workstation / gaming PC (GPU with 12 GB VRAM)
What runs comfortably:
- All 3 – 8 B models: fully on GPU, 30 – 60 tokens/s. Includes llama3.1:8b (default) and qwen2.5-coder:7b (code-specialized alternative).
- qwen2.5:14b and qwen2.5-coder:14b (Q4): fit in 12 GB VRAM, ~20 – 30 tokens/s. Good for code-generation experiments outside of strict tool calling.
What is too heavy:
- 70 B models — VRAM too tight even in Q4; CPU offload pulls speed down to 1 – 5 tokens/s.
Recommendation for the course: llama3.1:8b for the canonical run; qwen2.5-coder:7b for the same demos when you want nicer Java (fallback parser kicks in). Free to experiment with 14 B models for side projects, but expect the same tool-call format issue scaled up.
High-end workstation (48 GB+ VRAM or 64 GB+ unified memory)
What runs comfortably:
- All 3 – 14 B models: near-instantaneous (50 – 100+ tokens/s).
- llama3.1:70b (Q4): fits, ~10 – 25 tokens/s depending on hardware. Big jump in instruction-following and reasoning quality.
What is still out of reach:
- 200 B+ models (Llama 4, Grok-2, Llama 3.1 405B) — even in Q4 they exceed 100 GB of memory.
Recommendation for the course: ideal for showing in class the qualitative gap between llama3.1:8b and llama3.1:70b on the same agent task — same prompt, same tools, very different code quality.
Pro workstation / lab
Reference machines: NVIDIA DGX Spark, Dell Pro Max with GB10, and equivalent platforms built around the Grace Blackwell GB10 superchip, with 128 GB unified memory and around 1 PFLOP in FP4.
What becomes possible:
- All 70 B models in Q8 (about 70 GB of weights): near-research quality at the desk. FP16, at roughly 140 GB, exceeds a single 128 GB unit.
- llama3.1:405b (Q4): at ~230 GB it does not fit in one 128 GB unit; NVIDIA positions two linked GB10 machines for models of this size.
- llama4 and grok-2 open-weights releases (200 B+): runnable locally in Q4 only where the combined unified memory covers the ~150 GB+ footprint listed in View 1.
- Multi-agent setups where two or three 14 B agents live in memory simultaneously without swap.
Recommendation for the course: if such a machine is available in a lab or department, the most pedagogically valuable demonstration is to run exactly the same demo 3 agent prompt on llama3.1:8b, llama3.1:70b, and (memory permitting) llama3.1:405b side by side, and let students see with their own eyes the qualitative jump (and the diminishing returns at the top).
Classroom recommendation
| Configuration | Recommended model | Alternative tested | Why |
|---|---|---|---|
| Standard PC 8 GB RAM, no GPU | llama3.1:8b | qwen2.5-coder:7b (with fallback parser) | Course default, balanced RAM / quality. qwen2.5-coder:7b is the code-specialized alternative when you want better Java output and accept slower runs. |
| PC 16 GB RAM, no GPU | llama3.1:8b | qwen2.5-coder:7b | Same as above, smoother. |
| PC with NVIDIA 8 GB VRAM GPU | llama3.1:8b | qwen2.5-coder:7b or qwen2.5:14b | You can go bigger; llama3.1:8b keeps tool calling clean, the qwen-coder line wins on Java quality. |
| High-end workstation (48+ GB VRAM or unified) | llama3.1:70b | llama3.1:8b for the canonical run | llama3.1:70b is explicitly cited as “excellent” in the demo 4 README. Drop-in upgrade. |
| Light laptop 4 GB RAM | llama3.2:3b | — | Only viable choice; quality is visibly degraded; tool calling is “acceptable” but not “reliable”. |
For a demo 4 shown to a teacher, stick with llama3.1:8b. That’s the configuration tested end-to-end with the 8 menu prompts, the Verify agent and the JUnit Tests agent. If you swap it for qwen2.5-coder:7b to showcase nicer Java code, expect slower runs (the fallback parser does extra work) and occasional missing files (recap from the demo README: 2/3 file reliability on the canonical task).
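The classroom table can also be read as a small decision rule. The sketch below encodes it under simplifying assumptions: the function name recommend_model is hypothetical, the thresholds are the ones printed in the table, and the qwen2.5:14b option of the GPU row is left out for brevity.

```python
def recommend_model(ram_gb, vram_gb=0, unified_gb=0):
    """Return (recommended model, tested alternative) for a classroom machine."""
    # High-end workstation row: 48+ GB of VRAM or unified memory.
    if vram_gb >= 48 or unified_gb >= 48:
        return ("llama3.1:70b", "llama3.1:8b")
    # Standard PC rows: 8 GB RAM or more, with or without a GPU.
    if ram_gb >= 8:
        return ("llama3.1:8b", "qwen2.5-coder:7b")
    # Light laptop row: the only viable choice, with degraded quality.
    return ("llama3.2:3b", None)


print(recommend_model(ram_gb=8))               # standard PC, no GPU
print(recommend_model(ram_gb=4))               # light laptop
print(recommend_model(ram_gb=64, vram_gb=48))  # high-end workstation
```

The rule deliberately never returns a qwen2.5-coder model as the first choice, which mirrors the advice to keep llama3.1:8b for any run shown to a teacher.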
Key takeaways
- The course default for the agent demos is llama3.1:8b: the best compromise on 8 GB RAM and the only model with consistently structured tool_calls in our measurements.
- The documented code-specialized alternative is qwen2.5-coder:7b. It writes higher-quality Java, but emits its tool calls inside message.content, which is why the demo code includes the parse_pseudo_tool_calls fallback parser, explicitly so this model remains a viable swap.
- The workstation upgrade is llama3.1:70b, explicitly cited as “excellent” in the demo 4 README.
- For tool calling, size matters less than fine-tuning. qwen2.5-coder:14b is bigger than llama3.1:8b but its tool-call output is worse.
- Check the tools mention on the model’s page in the Ollama library before assuming any model can do tool calling, and read the journal of attempts in the demo 3 README to see how that capability label was not enough to guarantee structured output for qwen-coder.
- To switch model in our demos: one line in agent.py, one ollama pull, done.
- Unified memory (DGX Spark, Dell Pro Max GB10, Apple Silicon) is the gating factor above ~30 B parameters: separate RAM and VRAM cannot scale to 200 B+ models, no matter how much you stack.