Skip to content

Choosing a local model

Duration: 8 min Prerequisites: chapter 05a

Now that we’re running locally (chapter 05a), the real question is: which model? For an agent that uses tools reliably, you need a model fine-tuned for tool calling. Size is not what matters most: fine-tuning is. The course default is llama3.1:8b — but qwen2.5-coder:7b is a fully supported alternative (via the fallback parser), and llama3.1:70b is the recommended uplift when a workstation is available.


What each model can and cannot do — explicit summary

Section titled “What each model can and cannot do — explicit summary”

The table below is the single source of truth for “should I use this model on the course demos?”. Three rows are highlighted: those are the models the demo code (agent_java.py) was tested with.

ModelTested on the demos?What it can doWhat it cannot doUse it on the course demos?
llama3.1:8bYES — course defaultEmit clean structured tool_calls (4/4 in one turn on the canonical task), follow the rules of a system prompt, generate small Java files, run on a laptop with 8 GB RAMRival a 70B on complex reasoning, produce flawless production-grade code on the first try, manage large multi-file projectsYES — the course reference. The safest pick.
qwen2.5-coder:7bYES — documented code-specialized alternativeWrite higher-quality Java code than llama3.1:8b on the same prompt; tool calling via the fallback parser parse_pseudo_tool_calls (included in the demo code precisely for this model)Emit structured tool_calls (the JSON lands in message.content instead of the structured field, and Java quote-escaping is sometimes broken — recap from the demo README: 2 of 3 files created reliably on the canonical task)YES, with caveats. Pick it if Java code quality matters more than tool-call cleanliness. Slower (fallback parser path).
llama3.1:70bYES — on workstation (cited in demo 4 README as “excellent aussi”)Strong uplift in instruction-following, near-perfect Java, robust structured tool callingRun on a laptop — needs 48+ GB VRAM or unified memoryYES if you have the hardware. Drop-in upgrade of llama3.1:8b for showing qualitative jumps.
qwen2.5-coder:14bNot tested (evaluated only)Better Java than the 7b sibling, code-specializedRun on standard hardware (needs 12+ GB VRAM); tool-calling format same issue as the 7b (no structured tool_calls)NO unless you have the hardware AND accept the fallback parser caveat
qwen2.5-coder:3b / :0.5bEvaluated and rejected (attempts 1 and 2 of the journal)Run on very low RAMReliable tool calling — pure text output, no JSON at all on the agent taskNO — too small for the structured tool-calling protocol
llama3.2:3b / :latest (3B)Evaluated and rejected (attempt 4)Run on 4 GB RAMTool calling — produces malformed JSON (pseudo-tag syntax) that even the fallback parser cannot recoverNO — fails both the structured and the fallback path
llama3.1:405b, llama4, grok-2 (200 B+)Not tested (Pro workstation hardware required)Approach commercial-grade quality 100% locallyRun anywhere except a DGX Spark / Dell Pro Max GB10 class machine (128 GB unified)NO for standard course delivery — out of hardware scope
mistral:7b, command-r:7b, firefunction-v2Not testedDocumented tool-calling support on the Ollama libraryUnknown behaviour on the canonical task — not benchmarkedNO — substitution untested
gemma:2b, phi3, all “base” modelsNot testedGeneral chatReliable structured tool_callsNO — not fine-tuned for tools

One-line summary: for the agent demos, the validated set is llama3.1:8b (default), qwen2.5-coder:7b (code-specialized alternative via the fallback parser), and llama3.1:70b (workstation uplift). Anything else, you are pioneering.


ModelSizeDiskRAM/VRAMTool callingJava code qualityVerdict
qwen2.5-coder:0.5b0.5 B~400 MB2 GBnon-existentbasictoy
qwen2.5-coder:3b3 B~1.9 GB4 GBvery unstablegoodpersonal demo
qwen2.5-coder:7b7 B~4.7 GB8 GBunstable* — works via fallback parserexcellentdocumented code-specialized alternative
llama3.1:8b8 B~4.9 GB8 GBreliable (structured)goodcourse default — our recommendation
qwen2.5-coder:14b14 B~9 GB12 GBunstablevery goodtoo heavy without GPU
llama3.2:3b3 B~2 GB4 GBacceptableaverageif low on RAM

* “Unstable on tool calling” means: Qwen produces JSON for tool calls but with poorly-escaped Java strings, which breaks json.loads. The comment at the top of ollama-demo-3-agent-java/agent_java.py explains the choice precisely:

Why llama3.1:8b and not qwen2.5-coder? Qwen2.5-Coder writes great Java but does NOT reliably populate the structured tool_calls field of Ollama’s response: it emits JSON inside message.content with often-malformed escapes for embedded Java strings, which breaks json.loads.


It’s not “bigger is better”. For tool calling:

qwen2.5-coder:14b Perfect Java, broken tool calling NO (without fallback parser)
llama3.1:8b OK Java, clean tool calling YES — course default
qwen2.5-coder:7b Excellent Java, broken tool calling YES via fallback parser (alternative)
qwen2.5-coder:3b Decent Java, broken tool calling NO
llama3.2:3b Average Java, acceptable tool calling MAYBE

Counter-intuitive, but consistent. Llama 3.1 was specifically trained by Meta to emit structured tool_calls in JSON, on a separate channel from the text reply. Qwen2.5-Coder was trained to write code; it intellectually knows there’s a tool-call format to respect, but it slips in unescaped quotes and its output stops being parsable. See also the fallback parser (parse_pseudo_tool_calls) we had to add in demo 3 to catch those broken outputs.


How to tell if a model “can” do tool calling

Section titled “How to tell if a model “can” do tool calling”

The https://ollama.com/library page tags each model with its capabilities. Look for the tools mention on a model’s card. Known to do reliable tool calling:

  • llama3.1:8b, llama3.1:70b
  • llama3.2:3b
  • mistral:7b (recent versions)
  • qwen2.5:7b / qwen2.5:14b (the generalist version, not Coder)
  • firefunction-v2
  • command-r:7b

To avoid for tool calling:

  • qwen2.5-coder:* (any size) — excellent for chat code generation, bad at the tool-call format;
  • gemma:2b, phi3 — not (or poorly) fine-tuned for tools;
  • all “base” models (no instruct/chat suffix).

In ollama-demo-4-trio-agents-java/agent.py:

MODEL_NAME = "llama3.1:8b"

A single constant. To test another model, change the line, run ollama pull <new_model>, and rerun. The agent logic depends on nothing model-specific, except that the model must produce clean tool_calls.

In demo 4, the Streamlit UI doesn’t (yet) expose this choice in the sidebar — that’s something we could add as an exercise (chapter 13). The model is shown read-only in the configuration panel.


Two complementary views of the same question — given my model, what hardware do I need? and given my hardware, what model can I run?

View 1 — pivot table: model → hardware tier

Section titled “View 1 — pivot table: model → hardware tier”

All values assume Ollama’s default Q4_K_M quantization (chapter 05a). 8b = 8 billion parameters; 70b = 70 billion; 405b = 405 billion.

ModelDiskRAM at runtimeVRAMHardware tierExample machine
llama3.2:3b~2 GB~3 GBoptionalEntry-level laptopAny laptop with 8 GB RAM
llama3.1:8b (course default)~5 GB~6 GBoptional (×3 speed-up with 8 GB VRAM)Standard laptop / desktop16 GB RAM, recent i5 / i7 / Ryzen 5
qwen2.5:7b, qwen2.5:14b~5 – 9 GB~6 – 10 GBrecommended 8 – 12 GB VRAMStandard or gaming desktop32 GB RAM + RTX 3060 / 4060
qwen2.5-coder:14b~9 GB~10 GBrequired 12 GB VRAM (otherwise CPU-only, very slow)Workstation / gaming PC32 GB RAM + RTX 4070 12 GB
llama3.3:70b, llama3.1:70b~40 GB~48 GBrequired 48 GB VRAM or 64 GB+ unified RAMHigh-end workstation64 GB RAM + RTX A6000 / 6000 Ada, or Mac M3 Max 64 – 128 GB
llama3.1:405b~230 GB~250 GB (needs unified memory)not feasible on consumer hardwarePro workstation / labDGX Spark / Dell Pro Max GB10 (128 GB unified) and equivalent classes
llama4, grok-2 (200 B+ open weights)~120 – 300 GB~150 GB+ unifiednot feasiblePro workstation / labDGX Spark / Dell Pro Max GB10 and equivalent classes

Unified memory matters. On a DGX Spark, a Dell Pro Max with GB10, or an Apple Silicon Mac, CPU and GPU share the same memory pool without copy. A 70 B model that needs 48 GB simply does not fit in a 24 GB RTX 4090 — but fits comfortably in a 128 GB unified GB10 or M3 Max. Above ~30 B parameters, unified memory becomes the dominant constraint.

What runs comfortably:

  • llama3.2:3b (Q4) — ~3 GB RAM, fast.
  • llama3.1:8b (Q4) — ~6 GB RAM, the course default. Around 5 – 15 tokens/s on a recent CPU.
  • qwen2.5-coder:7b (Q4) — ~6 GB RAM, the code-specialized tested alternative. Same speed range as llama3.1:8b. Pick it when Java output quality matters; the demo’s fallback parser handles its non-structured tool calls.

What is too heavy:

  • 14 B models in CPU-only mode — usable but slow (1 – 3 tokens/s), painful for a live demo.
  • All 70 B+ models — they will swap and grind for several seconds per token.

Recommendation for the course: start with llama3.1:8b (default) and, if Java code quality is the focus, switch the demo-3/4 MODEL_NAME to qwen2.5-coder:7b. Both were validated on this hardware tier.


ConfigurationRecommended modelAlternative testedWhy
Standard PC 8 GB RAM, no GPUllama3.1:8bqwen2.5-coder:7b (with fallback parser)Course default, balanced RAM / quality. qwen2.5-coder:7b is the code-specialized alternative when you want better Java output and accept slower runs.
PC 16 GB RAM, no GPUllama3.1:8bqwen2.5-coder:7bSame as above, smoother.
PC with NVIDIA 8 GB VRAM GPUllama3.1:8bqwen2.5-coder:7b or qwen2.5:14bYou can go bigger; llama3.1:8b keeps tool calling clean, the qwen-coder line wins on Java quality.
High-end workstation (48+ GB VRAM or unified)llama3.1:70bllama3.1:8b for the canonical runllama3.1:70b is explicitly cited as “excellent” in the demo 4 README. Drop-in upgrade.
Light laptop 4 GB RAMllama3.2:3bOnly viable choice; quality is visibly degraded; tool calling is “acceptable” but not “reliable”.

For a demo 4 shown to a teacher, stick with llama3.1:8b. That’s the configuration tested end-to-end with the 8 menu prompts, the Verify agent and the JUnit Tests agent. If you swap it for qwen2.5-coder:7b to showcase nicer Java code, expect slower runs (the fallback parser does extra work) and occasional missing files (recap from the demo README: 2/3 file reliability on the canonical task).


  • The course default for the agent demos is llama3.1:8b — the best compromise on 8 GB RAM and the only model with consistently structured tool_calls in our measurements.
  • The documented code-specialized alternative is qwen2.5-coder:7b. It writes higher-quality Java, but emits its tool calls inside message.content, which is why the demo code includes the parse_pseudo_tool_calls fallback parser — explicitly so this model remains a viable swap.
  • The workstation upgrade is llama3.1:70b — explicitly cited as “excellent” in the demo 4 README.
  • For tool calling, size matters less than fine-tuning. qwen2.5-coder:14b is bigger than llama3.1:8b but its tool-call output is worse.
  • Check the tools mention on the model’s page in the Ollama library before assuming any model can do tool calling — and read the journal of attempts in the demo 3 README to see how that capability label was not enough to guarantee structured output for qwen-coder.
  • To switch model in our demos: one line in agent.py, one ollama pull, done.
  • Unified memory (DGX Spark, Dell Pro Max GB10, Apple Silicon) is the gating factor above ~30 B parameters — separate RAM and VRAM cannot scale to 200 B+ models, no matter how much you stack.