
Demo 3 — a simple CLI agent

Duration: 15 min
Prerequisites: chapter 06 (environment installed)

Source code

Repo: gneuroneai/ollama-demo-3-agent-java — one agent_java.py file, ~30 useful lines of agent loop.

Terminal window
git clone https://github.com/gneuroneai/ollama-demo-3-agent-java.git
cd ollama-demo-3-agent-java
.\start.ps1

This project is a self-contained agent loop of about thirty useful lines, written in a single Python file agent_java.py. It exposes four Python functions to the language model as tools (list_files, read_file, write_file, compile_java) and lets the model invoke them freely to create and compile a small Java console program from a natural-language instruction. Each tool call is logged to the terminal as it happens, which makes the tool calling mechanism introduced in chapter 03 directly observable. The output of a successful run is an actual Java project on disk — source files, compiled classes, console output — produced step by step by the model under the supervision of the Python loop.

This repository is the practical core of the course: one Python file, four tools, one loop. Running it produces, step by step, a small Java console application generated from a natural-language instruction.


You run:

Terminal window
cd ollama-demo-3-agent-java
.\.venv\Scripts\activate
python agent_java.py

The script:

  1. sends the model a system prompt describing it as a Java agent with 4 tools;
  2. asks it to create a product management app;
  3. loops until compilation succeeds or 10 turns are spent.

Typical terminal output:

--- Step 1 ---
[tool] write_file -> Product.java
File created or modified: Product.java (28 lines)
--- Step 2 ---
[tool] write_file -> ProductManager.java
File created or modified: ProductManager.java (35 lines)
--- Step 3 ---
[tool] write_file -> Main.java
File created or modified: Main.java (15 lines)
--- Step 4 ---
[tool] compile_java
Compilation successful.
--- Step 5 ---
Model: I created Product, ProductManager and Main, and compilation succeeded.
(no tool calls, agent stops)

Four tool calls are enough to generate a compilable Java project; the fifth step is only the model’s closing summary. Everything lands in ollama-demo-3-agent-java/workspace/.


| Tool | Line in agent_java.py | What it does |
| --- | --- | --- |
| list_files | ~153 | Lists files in workspace/. Useful for the model to know where it stands. |
| read_file(path) | ~165 | Reads a file already written. Useful when fixing an error. |
| write_file(path, content) | ~230 | Creates or overwrites a file in workspace/. Filters by extension. |
| compile_java() | ~259 | Runs javac -encoding UTF-8 *.java via subprocess. Returns success or errors. |

flowchart LR
  Model["llama3.1:8b"]
  subgraph tools [4 Python tools]
      T1["list_files"]
      T2["read_file"]
      T3["write_file"]
      T4["compile_java"]
  end
  World[("workspace/")]
  Javac[("javac")]

  Model -->|"tool_calls"| tools
  T1 --> World
  T2 --> World
  T3 --> World
  T4 --> Javac
  Javac --> World
  tools -->|"text result"| Model
The model can do nothing but call these 4 tools.

That’s all. The model can do nothing else. No network, no deletion, no arbitrary execution.
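A guardrail of this kind can be enforced entirely inside the tool. The sketch below is an illustration and not the code from agent_java.py: safe_write_file, the WORKSPACE constant and the extension whitelist are assumptions made for the example. It shows one way a write tool can refuse paths that escape workspace/ and extensions it does not expect.

```python
from pathlib import Path

# Assumptions for this sketch: the folder name and the extension whitelist.
WORKSPACE = Path("workspace")
ALLOWED_EXTENSIONS = {".java", ".txt", ".md"}


def safe_write_file(path: str, content: str, workspace: Path = WORKSPACE) -> str:
    """Create or overwrite a file inside the workspace; refuse anything else."""
    root = workspace.resolve()
    target = (root / path).resolve()
    # Absolute paths and "../" escapes resolve outside root and are rejected.
    if not target.is_relative_to(root):
        return f"Refused: {path} is outside the workspace."
    if target.suffix.lower() not in ALLOWED_EXTENSIONS:
        return f"Refused: extension {target.suffix or '(none)'} is not allowed."
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    line_count = len(content.splitlines())
    return f"File created or modified: {path} ({line_count} lines)"
```

Every refusal is returned as plain text rather than raised, so the model reads it on the next turn and can correct its request. Path.is_relative_to requires Python 3.9 or later.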


A common confusion in class: the four tools are not built into Ollama, into the model, or into any external library. They are our own Python code, written by hand in agent_java.py between lines 153 and 282.

| What is ours (agent_java.py) | What comes from outside |
| --- | --- |
| The four tool functions themselves (~130 lines of Python) | The ollama Python SDK (one pip install ollama, ~80 KB), which talks to the local Ollama daemon over HTTP on 127.0.0.1:11434 |
| The tools = [list_files, read_file, write_file, compile_java] registration list | Python standard library: pathlib, json, os, re, subprocess, sys, time |
| The system prompt (lines 39 – 72) | The model weights pulled by ollama pull llama3.1:8b (~4.9 GB on disk) |
| The agent loop (~30 lines, from line 486) | The javac compiler from the JDK, invoked by compile_java |
| The fallback parser parse_pseudo_tool_calls | |

Two short definitions near the top of agent_java.py wire everything together:

tools = [list_files, read_file, write_file, compile_java]
available_functions = {
    "list_files": list_files,
    "read_file": read_file,
    "write_file": write_file,
    "compile_java": compile_java,
}

When Python passes those four function references to client.chat(tools=...), the Ollama SDK automatically generates the JSON tool schema (name, parameters, types, description) by reading the function’s type hints and docstring. That JSON schema is what gets sent to the model. The Python code itself is never sent — the model only knows that a tool called write_file exists and accepts a path: str and a content: str.
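To make that conversion concrete, here is a minimal sketch of how a schema can be derived from a function. It is not the SDK's implementation: sketch_tool_schema and the JSON_TYPES mapping are hypothetical names, and the real schema carries more detail (per-parameter descriptions, for instance). The point is only that type hints and a docstring are enough to describe a tool.

```python
import inspect
from typing import get_type_hints

# Hypothetical mapping, reduced to the simple types used in this demo.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def sketch_tool_schema(fn) -> dict:
    """Build a JSON-style tool schema from a function's hints and docstring."""
    hints = get_type_hints(fn)
    hints.pop("return", None)  # the return type is not part of the parameters
    properties = {
        name: {"type": JSON_TYPES.get(hint, "string")}
        for name, hint in hints.items()
    }
    # A parameter without a default value is treated as required.
    required = [
        name
        for name, param in inspect.signature(fn).parameters.items()
        if param.default is inspect.Parameter.empty
    ]
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def write_file(path: str, content: str) -> str:
    """Create or overwrite a file in workspace/."""
    return "stub"  # the body is irrelevant: only the signature is inspected


schema = sketch_tool_schema(write_file)
print(schema["function"]["name"])                    # write_file
print(schema["function"]["parameters"]["required"])  # ['path', 'content']
```

Nothing in the resulting dictionary refers to the function body, which is why the model never sees the Python code.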

The full chain on a single tool call:

  1. We write a Python function with type hints and a docstring.
  2. The Ollama SDK turns it into a JSON tool schema and ships that schema in the chat request.
  3. The model reads the schema, decides which tool to call and with which arguments, and emits a tool_calls entry.
  4. Our loop receives tool_calls, looks up the matching Python function in available_functions, and calls it with the model-provided arguments.
  5. The tool’s return value is appended to the conversation history with role "tool", so the model can react on the next turn.

That is everything. No framework. No decorator. No central registry. If you want to add a fifth tool — say git_status() — you write a Python function, append it to the tools list and to available_functions, and the model will start using it on the next run.
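As a rough sketch of that extension, the example below registers a fifth tool and dispatches a call by name. The body of git_status and the dispatch helper are assumptions written for this illustration, and list_files is replaced by a stub so the block runs on its own.

```python
import subprocess


def list_files() -> str:
    """Stub standing in for the real tool."""
    return "Main.java"


def git_status() -> str:
    """Return the short git status of the current directory."""
    try:
        done = subprocess.run(
            ["git", "status", "--short"],
            capture_output=True, text=True, timeout=10,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as error:
        return f"git is unavailable: {error}"
    return done.stdout or done.stderr or "Working tree clean."


# Registration: the new function goes into both containers.
tools = [list_files, git_status]
available_functions = {"list_files": list_files, "git_status": git_status}


def dispatch(name: str, args: dict) -> dict:
    """Run one model-requested call and wrap the result as a 'tool' message."""
    fn = available_functions.get(name)
    if fn is None:
        result = f"Unknown tool: {name}"
    else:
        try:
            result = fn(**args)
        except Exception as error:
            result = f"Error while running the tool: {error}"
    return {"role": "tool", "tool_name": name, "content": str(result)}


print(dispatch("list_files", {}))
print(dispatch("delete_everything", {}))  # not registered, so nothing runs
```

A tool name the model invents is answered with an error message instead of an exception, which keeps the conversation going and lets the model correct itself.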


What is “built into” the model — and into Ollama


The previous section showed that the tools are our code. The natural follow-up: if we write the tools, what does the model actually contribute, and what does Ollama add in the middle? This section answers it directly. It is the heart of why llama3.1:8b works cleanly on this demo and qwen2.5-coder:7b does not — even though both models declare the same Capabilities: tools on the Ollama library.

The three layers, from the function call to the model weights


Three layers of code and data cooperate to make client.chat(model=..., tools=[...]) work.

| Layer | What it is | Who wrote it | Where it lives |
| --- | --- | --- | --- |
| 1. Our code | The 4 Python functions, the tools list, the loop, the fallback parser | Us, in agent_java.py | One file, ~700 lines total, ~30 lines for the loop itself |
| 2. The Ollama runtime | Chat template, JSON-schema generation from Python type hints, prompt formatting, and the extraction of tool_calls from the model’s raw output | The Ollama team | The ollama Python SDK (schema generation) and the ollama daemon listening on 127.0.0.1:11434 (everything else) |
| 3. The model weights | Billions of numbers that were trained to recognise the system prompt + tool schema, and to emit a tool call in a specific format when appropriate | Meta (Llama), Alibaba (Qwen), Google (Gemma)… during fine-tuning | The .gguf file pulled by ollama pull |

Each layer is necessary. Each one is independent. Misalignment between any two of them is what makes a model “declare tools” but fail to use them.

Layer 2 in detail — Ollama’s invisible work


When your Python code calls client.chat(model="llama3.1:8b", tools=[list_files, read_file, write_file, compile_java]), four things happen without you seeing them, the first inside the Python SDK and the other three inside the Ollama daemon:

  1. Schema generation. The SDK reads each Python function’s type hints and docstring and turns it into a JSON tool schema (name, parameter names, types, descriptions). That schema is what gets sent to the model — not the Python code itself.

  2. Chat-template injection. Every model in Ollama ships with a chat template (a small Go template that formats the conversation into the exact byte sequence the model was trained to consume). The template for a tool-calling model has a {{ if .Tools }}...{{ end }} block that knows where in the prompt the tool schemas must appear, and in which format that specific model family expects them.

  3. Model invocation. The formatted byte sequence is fed to the model, which generates a response token by token.

  4. Tool-call extraction. As the response streams back, Ollama watches for the special tokens or string patterns that the model was trained to emit when it wants to call a tool. If those patterns are detected, the matching substring is lifted out of the raw text and placed in the structured field response.message.tool_calls. Anything that does not match stays in response.message.content.

That fourth step is the one that succeeds with llama3.1:8b and fails with qwen2.5-coder:7b. It deserves its own paragraph.

Layer 3 in detail — what “trained for tool calling” really means


During fine-tuning, the model is shown thousands of conversations that look roughly like this (schematically — each family uses its own special tokens):

User: "What's the weather in Paris?"
Assistant: <special_tool_call_start>
{"name": "get_weather", "arguments": {"location": "Paris"}}
<special_tool_call_end>
Tool: {"temperature": 22, "conditions": "sunny"}
Assistant: "It's sunny and 22°C in Paris."

Through this kind of training, the model learns three things at once:

  1. When to emit a tool call (after seeing a request it cannot answer from its own knowledge alone).
  2. Which tool to pick from the schema list passed in the prompt.
  3. How to format the call — using the exact special tokens the family’s chat template expects.

The “special tokens” are model-family-specific:

  • Llama 3.1 uses tokens like <|python_tag|> and <|eom_id|> around tool-call JSON.
  • Qwen 2.5 uses an XML-like wrapper such as <tool_call></tool_call>.
  • Mistral has yet another convention.

The match between the special tokens the model emits and the chat template Ollama uses to parse them back is what makes a model “good at tool calling”. If the model emits something close but not identical to what the template expects, the parser fails to lift it into tool_calls, and the JSON ends up in message.content instead. That is exactly the failure mode of qwen2.5-coder:7b on this demo.
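The mismatch can be reproduced with a toy extractor. The sketch below is a simplified illustration, not Ollama's parser: it assumes a single expected wrapper, the XML-like <tool_call> tags, and split_raw_output is a hypothetical name. Output that uses the wrapper is lifted into structured calls; output that drifts from it stays in the plain text.

```python
import json
import re

# Assumed format for this sketch: JSON wrapped in <tool_call>...</tool_call>.
TOOL_CALL_PATTERN = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def split_raw_output(raw: str) -> tuple[list[dict], str]:
    """Separate recognised tool calls from the remaining plain text."""
    calls: list[dict] = []

    def lift(match: re.Match) -> str:
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            return match.group(0)  # unparseable JSON stays in the text
        return ""

    content = TOOL_CALL_PATTERN.sub(lift, raw).strip()
    return calls, content


matching = '<tool_call>{"name": "compile_java", "arguments": {}}</tool_call>'
drifted = '{"name": "compile_java", "arguments": {}}'

print(split_raw_output(matching))  # ([{'name': 'compile_java', 'arguments': {}}], '')
print(split_raw_output(drifted))   # ([], '{"name": "compile_java", "arguments": {}}')
```

The second call carries exactly the same JSON, yet the structured list stays empty because the wrapper is missing.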

What “Capabilities: tools” on the Ollama library actually means


When you visit a model’s page on the Ollama library (for example ollama.com/library/llama3.1 or ollama.com/library/qwen2.5-coder), you see a row of badges. One of them may say Capabilities: tools.

That badge means: “the model card declares that this model was fine-tuned for tool calling.” It is a necessary condition, not a sufficient one. The badge is filled in by the publisher of the model; it is not the result of an automated benchmark Ollama runs.

You can see the same declaration locally:

Terminal window
ollama show llama3.1:8b
# ...
# Capabilities completion tools
ollama show qwen2.5-coder:7b
# ...
# Capabilities completion insert tools

Both models declare tools. Yet only llama3.1:8b populates the structured tool_calls field on our demo. The badge does not guarantee that the model’s emitted format matches what Ollama’s chat template for that family expects on every kind of prompt.

Why llama3.1:8b works cleanly and qwen2.5-coder:7b does not


This is the explicit answer to the question that opens this section.

| Aspect | llama3.1:8b | qwen2.5-coder:7b |
| --- | --- | --- |
| Base family | Meta’s Llama 3.1, trained directly on tool-calling examples | Alibaba’s Qwen 2.5, then fine-tuned for code generation |
| Special-token format learned | Llama 3.1 tool-call tokens, matching Ollama’s Llama chat template byte-for-byte | Qwen 2.5 tool-call tokens, but drifted by the code-specialised fine-tune, so the format the model now emits is not the one Ollama’s Qwen chat template expects |
| What the model emits on our demo prompt | The Llama 3.1 special tokens with valid JSON inside | A JSON-shaped string inside the regular message.content, sometimes with embedded Java strings that break json.loads |
| What Ollama’s chat-template parser does with it | Lifts it into response.message.tool_calls | Leaves it in response.message.content; the parser does not recognise it |
| What our Python code sees | response.message.tool_calls = [<4 structured calls>], response.message.content = "" | response.message.tool_calls = [], response.message.content = "{...JSON...}" |
| What our code does about it | Iterates tool_calls directly | Calls parse_pseudo_tool_calls() on content to extract the calls anyway |

In one sentence: qwen2.5-coder:7b was retrained to be excellent at writing code, and that retraining slightly drifted the tool-call output away from the format Ollama’s Qwen chat template expects. Our fallback parser exists precisely to bridge that gap — see the journal of attempts in the demo README, attempts 5 to 7.
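A fallback of this kind can be sketched in a few lines. The function below is not the parse_pseudo_tool_calls from the repository, whose implementation is not reproduced here; sketch_pseudo_tool_calls is a hypothetical stand-in that scans free text for JSON objects carrying a name and an arguments (or parameters) field.

```python
import json


def sketch_pseudo_tool_calls(content: str) -> list[tuple[str, dict]]:
    """Find JSON objects shaped like {"name": ..., "arguments": {...}} in text."""
    decoder = json.JSONDecoder()
    calls = []
    position = 0
    while (start := content.find("{", position)) != -1:
        try:
            # raw_decode parses one JSON value and reports where it ended.
            obj, end = decoder.raw_decode(content, start)
        except json.JSONDecodeError:
            position = start + 1  # not valid JSON from here, keep scanning
            continue
        position = end
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            args = obj.get("arguments", obj.get("parameters", {}))
            if isinstance(args, dict):
                calls.append((obj["name"], args))
    return calls


reply = 'I will compile now.\n{"name": "compile_java", "arguments": {}}'
print(sketch_pseudo_tool_calls(reply))  # [('compile_java', {})]
```

This sketch only recovers calls whose JSON is valid. Content with badly escaped Java strings, the case mentioned in the table above, would need extra repair logic that is deliberately left out.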

Three short commands cover everything in this section.

Terminal window
# 1. See what capabilities a model declares
ollama show llama3.1:8b
ollama show qwen2.5-coder:7b
# 2. See the exact chat template used for that model
ollama show --modelfile llama3.1:8b | Select-String -Pattern 'TEMPLATE','TOOL'
# 3. Hit the API directly and see what the structured field actually contains
curl.exe -s http://127.0.0.1:11434/api/chat -d '{
"model": "qwen2.5-coder:7b",
"stream": false,
"messages": [{"role": "user", "content": "What time is it in Paris?"}],
"tools": [{"type":"function","function":{"name":"get_time","description":"Get current time","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}]
}' | ConvertFrom-Json | Select-Object -ExpandProperty message

On qwen2.5-coder:7b you will typically see the JSON of the tool call inside the content field and an empty tool_calls array. On llama3.1:8b, the opposite: content is mostly empty and tool_calls holds the structured call.

The model brings the trained capability to emit tool calls in a specific format. Ollama brings the chat template, the schema generation, and the extraction of structured tool_calls from the model’s raw output. We bring the Python functions, the agent loop, and — when the model’s format drifts from what Ollama’s parser expects — a fallback parser to bridge the gap.


Hand-coded loop vs LangChain, LangGraph and the alternatives


A natural follow-up question: why write the loop ourselves when frameworks exist for that? The answer has two parts — a clarification about what is open source and what is commercial, then a pedagogical trade-off.

Open source vs commercial — what is actually what

| Tool / library | License | Cost of the library itself | Paid services around it? |
| --- | --- | --- | --- |
| Ollama (local runtime) | MIT (open source) | Free | None |
| The course demo loop (agent_java.py) | MIT (open source) | Free | None |
| LangChain | MIT (open source) | Free | LangSmith (observability), LangGraph Platform (hosted deployment), both optional |
| LangGraph | MIT (open source) | Free | LangGraph Platform (optional) |
| OpenWebUI | MIT (open source) | Free | None |
| Continue (VS Code extension) | Apache 2.0 (open source) | Free | None |
| OpenCode (CLI agent) | MIT (open source) | Free | None |
| OpenAI function calling | Proprietary | Pay per token | The whole API is paid |
| Anthropic tool use (Claude) | Proprietary | Pay per token | The whole API is paid |

LangChain, LangGraph, OpenWebUI, Continue and OpenCode are all open source and free. What is commercial in that ecosystem are the hosted services around them (LangSmith for telemetry, LangGraph Platform for hosted agents, OpenAI’s and Anthropic’s cloud APIs) — not the libraries themselves. So the question is not “free vs paid”, it is “hand-coded vs framework”.

What LangChain would look like for the same demo


If we rewrote demo 3 in LangChain + LangGraph, the result would look roughly like this:

from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def write_file(path: str, content: str) -> str:
    """Create or overwrite a file in workspace/."""
    # ... same body as our write_file ...


@tool
def compile_java() -> str:
    """Compile every Java file in workspace/."""
    # ... same body as our compile_java ...


# list_files and read_file are wrapped with @tool in the same way (omitted here).

llm = ChatOllama(model="llama3.1:8b", num_ctx=20480)
agent = create_react_agent(
    llm,
    tools=[list_files, read_file, write_file, compile_java],
)
result = agent.invoke({"messages": [("user", DEFAULT_USER_PROMPT)]})

That is fewer lines in the calling code, but it pulls in two extra dependencies (langchain, langgraph), each with a tree of 30 – 50 transitive packages, and it hides the loop behind agent.invoke. You no longer see when the model is called, when tool_calls are parsed, when the results are fed back. You also lose the ability to plug in our fallback parser parse_pseudo_tool_calls cleanly — LangChain’s tool-calling expects the structured channel and falls back to its own internal logic when the model misbehaves.

| Criterion | Hand-coded (our choice) | LangChain + LangGraph |
| --- | --- | --- |
| Lines of code to read to understand the agent | ~30 | A framework’s worth of indirection |
| External dependencies | ollama only | langchain, langgraph, plus dozens of transitive packages |
| Visibility of the loop | Every step printed and explainable | Hidden behind agent.invoke |
| Handling quirky models (e.g. qwen2.5-coder:7b fallback parser) | Trivial: edit one helper | Requires diving into LangChain parser internals |
| Pedagogical value | High: students step through every line | Lower: students learn a framework, not the protocol |
| Production realism on large agents | Lower: you would add structure for big projects | Higher: battle-tested across many production agents |
| Cost to add a tool | Add a Python function to two lists | Add an @tool-decorated function |

The course’s goal is to understand what an agent is. A 30-line loop that prints what it does at every step is the right pedagogical artefact for that goal. Once a student has read those 30 lines and run them on their laptop, the leap to LangChain is small: they already know what the framework is doing under the hood. The reverse is not true — starting with LangChain often leaves students unable to explain what agent.invoke actually does.

When you would switch to LangChain in a real project


In a real codebase, the right question is “is the marginal benefit of the framework greater than the cost of the dependency and the loss of control?”. Honest signals that you should switch:

  • You need streaming, retries, multiple LLM backends swappable behind a flag and observability out of the box.
  • You are building a graph of agents (5+ specialised roles) where writing the orchestration by hand would be error-prone.
  • Your team already speaks LangChain and onboarding a new framework would slow everyone down.
  • You want to plug in memory backends (Redis, Postgres, vector stores) without writing the glue.

For the workshop, none of those apply. A 30-line loop is the right tool. The course keeps it. Demo 4 (chapter 10) reuses exactly the same loop to drive three specialised agents in sequence, which shows that the hand-coded approach scales further than people expect.


The loop lives in agent_java.py from line 486:

for step in range(1, MAX_STEPS + 1):
    stats["turns"] = step
    ui.step_start(step, time.monotonic() - started)
    response = client.chat(
        model=MODEL_NAME,
        messages=messages,
        tools=tools,
        options={"num_ctx": 20480},
    )
    messages.append(response.message)
    if response.message.content:
        ui.model_message(response.message.content)
    calls = list(iter_tool_calls(response.message))
    if not calls:
        break
    for name, args in calls:
        stats["tool_calls"] += 1
        ui.tool_call(name, args)
        fn = available_functions.get(name)
        if fn is None:
            result = f"Unknown tool: {name}"
        else:
            try:
                result = fn(**args)
            except Exception as error:
                result = f"Error while running the tool: {error}"
        ui.tool_result(name, result)
        messages.append(
            {"role": "tool", "tool_name": name, "content": str(result)}
        )

Line by line:

  • response = client.chat(...): we send the whole history to the model, plus the tool list. The SDK handles the JSON schema.
  • messages.append(response.message): we append the model’s reply (including any tool_calls) to the history. That’s what gives the agent “memory”.
  • calls = list(iter_tool_calls(...)): extract tool calls. iter_tool_calls is a helper that handles two cases: real tool_calls (Llama 3.1’s structured channel) and the pseudo tool_calls Qwen slips into message.content (fallback).
  • if not calls: break: if the model didn’t call any tool, it’s done. We exit.
  • fn(**args): we call the real Python function with the arguments the model gave us. This is where the bridge to the real world materialises.
  • messages.append({"role": "tool", ...}): we hand the result back to the model so it can take it into account next turn.

MAX_STEPS = 10 is the safety net: if the model keeps requesting tools without ever finishing, the loop stops after 10 turns.

Concrete examples: how the value of MAX_STEPS shapes the agent’s behaviour

The agent loop runs at most MAX_STEPS iterations before it is forced to stop. Each iteration is one round of model_reply → tool_calls → tool_results → next model_reply. The right value depends on the task complexity.

| MAX_STEPS | Typical behaviour on the “small Java console app” task | Risk profile |
| --- | --- | --- |
| 2 | The model writes one file then is cut off before compiling and fixing errors. Demo fails. | Too tight: no recovery room. |
| 5 | Often enough for a one-class app. Two files + compile + fix = 4–5 turns. | Acceptable for very simple cases. |
| 10 (course default) | Comfortable for the canonical demo (1 main + 1 helper class + compile + fix once). | Sweet spot for an 8B model on small tasks. |
| 30 | Allows multi-class iteration, several compile/fix loops, exploration. | The model may start looping if the prompt is ambiguous (rereads the same file 5 times). |
| 100+ | Unbounded exploration. | A misaligned model can burn tokens running in circles for half an hour. Always pair with a time-out. |

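MAX_STEPS = 10

for step in range(1, MAX_STEPS + 1):
    response = client.chat(
        model=MODEL_NAME,
        messages=messages,
        tools=tools,
    )
    messages.append(response.message)  # keep the model's reply in the history
    calls = list(iter_tool_calls(response.message))
    if not calls:
        break  # no tool requested: the agent is done
    for call in calls:
        # run_tool is shorthand for the lookup-and-call block of the full loop
        result = run_tool(call)
        messages.append({"role": "tool", "content": str(result)})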

Two operational rules:

  1. MAX_STEPS is a safety net, not a feature. A well-formed task should converge in 3–7 turns. If the agent regularly hits the ceiling, the system prompt or the toolset has a problem — raising the ceiling only postpones the failure.
  2. In production, always pair MAX_STEPS with a wall-clock timeout (e.g. 5 minutes total). A 10-step loop with a model stuck on a very long generation can still take 20 minutes without ever incrementing step.
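The second rule can be sketched as a step budget combined with a time budget. This is an illustration under stated assumptions: run_with_budget, step_fn and MAX_SECONDS are hypothetical names, and step_fn stands for one full turn of the loop, returning True while the model keeps requesting tools.

```python
import time

MAX_STEPS = 10
MAX_SECONDS = 300.0  # assumed budget: 5 minutes in total


def run_with_budget(step_fn, max_steps=MAX_STEPS, max_seconds=MAX_SECONDS,
                    clock=time.monotonic):
    """Call step_fn(step) until it returns False or one of the budgets runs out."""
    deadline = clock() + max_seconds
    for step in range(1, max_steps + 1):
        # The deadline is checked between turns, before the next model call.
        if clock() >= deadline:
            return f"timeout before step {step}"
        if not step_fn(step):
            return f"finished at step {step}"
    return f"step limit reached ({max_steps})"


# Simulated agent: two turns with tool calls, then a final answer.
fake_turns = iter([True, True, False])
print(run_with_budget(lambda step: next(fake_turns)))  # finished at step 3
```

Because the check runs only between turns, it cannot interrupt a single generation that hangs. Covering that case also requires a timeout on the HTTP request itself.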

The system prompt lives on lines 39-72 of agent_java.py. It does two things:

  1. defines the role: “You are an autonomous Java development agent. You MUST act through the provided tools”;
  2. lists hard rules: no private, add the java.util.* imports, write a complete file every time (never ...), don’t create empty stubs.

Why so many rules? Because llama3.1:8b is an 8-billion-parameter model: good enough to follow clear instructions, but not good enough to guess project conventions. Each rule was added to fix a real bug observed during demo development.

Writing rules like these is the topic of chapter 11.


Run the demo with screen sharing. Ask students to point at the terminal:

  1. Where do we see the model “thinking”? → at the --- Step N --- line, in the pause before the first [tool] line appears.
  2. Where does the model want to create a file vs actually create it? → The wish is the [tool] write_file -> Foo.java line (what it asked for). The reality is the File created or modified: Foo.java (N lines) line (what your code did).
  3. What happens if compilation fails? → the return value of compile_java is appended to messages, the model sees the error on the next turn, and it calls read_file and write_file to fix it.
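The third answer depends on the compile tool reporting failures as text instead of raising. Here is a minimal sketch of such a tool, assuming javac is on PATH; sketch_compile_java, its messages and the 120-second limit are illustrative choices, not the code from agent_java.py.

```python
import subprocess
from pathlib import Path


def sketch_compile_java(workspace: Path) -> str:
    """Compile every .java file in workspace and report the outcome as text."""
    # No shell is involved, so the *.java wildcard is expanded in Python.
    sources = sorted(p.name for p in workspace.glob("*.java"))
    if not sources:
        return "Nothing to compile: no .java file in the workspace."
    try:
        done = subprocess.run(
            ["javac", "-encoding", "UTF-8", *sources],
            cwd=workspace, capture_output=True, text=True, timeout=120,
        )
    except FileNotFoundError:
        return "Compilation impossible: javac was not found on PATH."
    except subprocess.TimeoutExpired:
        return "Compilation aborted: javac took longer than 120 seconds."
    if done.returncode == 0:
        return "Compilation successful."
    # The compiler's error output becomes the tool result the model reads next.
    return "Compilation failed:\n" + done.stderr
```

Whatever string comes back is what the loop appends with role "tool", so a message such as "Main.java:12: error: cannot find symbol" reaches the model verbatim on the next turn.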

Test the generated program:

Terminal window
cd workspace
java Main
cd ..

You should see the product list and the total stock value.


Edit the DEFAULT_USER_PROMPT in agent_java.py (line ~75) to ask for something other than product management. Ideas:

  • a tiny bank with Account and Bank;
  • a FizzBuzz with a FizzBuzz class (don’t dump everything into Main);
  • a Calculator with four static methods.

Empty the workspace first:

Terminal window
Remove-Item workspace\*.java, workspace\*.class -Force
python agent_java.py

Watch: which tools get called? How many turns? If compilation fails, does the model fix it on its own?


  • 4 tools, ~30-line loop, one system prompt, and you have a working agent.
  • The model only requests actions; it’s your Python functions that actually act.
  • The system-prompt rules compensate for an 8B model’s blind spots (no shortcuts, plenty of guardrails).
  • This demo is the canonical pedagogical reference used throughout the course to explain what an agent is. Everything else (demo 4, the user interface, per-project isolation) is added structure around the same loop.