Demo 3 — a simple CLI agent

Duration: 15 min
Prerequisites: chapter 06 (environment installed)

Source code

Repo: gneuroneai/ollama-demo-3-agent-java — one agent_java.py file, ~30 useful lines of agent loop.

git clone https://github.com/gneuroneai/ollama-demo-3-agent-java.git
cd ollama-demo-3-agent-java
.\start.ps1

What this demo is about

This project is a self-contained agent written in a single Python file, agent_java.py, whose core loop is about thirty useful lines. It exposes four Python functions to the language model as tools (list_files, read_file, write_file, compile_java) and lets the model invoke them freely to create and compile a small Java console program from a natural-language instruction. Each tool call is logged to the terminal as it happens, which makes the tool calling mechanism introduced in chapter 03 directly observable. The output of a successful run is an actual Java project on disk (source files, compiled classes, console output), produced step by step by the model under the supervision of the Python loop.

Key idea

This repository is the practical core of the course: one Python file, four tools, one loop. Running it produces, step by step, a small Java console application generated from a natural-language instruction.

What the demo does

You run:

cd ollama-demo-3-agent-java
.\.venv\Scripts\activate
python agent_java.py

The script:

sends the model a system prompt describing it as a Java agent with 4 tools;
asks it to create a product management app;
loops until compilation succeeds or 10 turns are spent.

Typical terminal output:

--- Step 1 ---
[tool] write_file -> Product.java
       File created or modified: Product.java (28 lines)

--- Step 2 ---
[tool] write_file -> ProductManager.java
       File created or modified: ProductManager.java (35 lines)

--- Step 3 ---
[tool] write_file -> Main.java
       File created or modified: Main.java (15 lines)

--- Step 4 ---
[tool] compile_java
       Compilation successful.

--- Step 5 ---
Model: I created Product, ProductManager and Main, and compilation succeeded.
(no tool calls, agent stops)

Four tool calls are enough to generate a compilable Java project; the fifth step is only the model's closing summary. Everything lands in ollama-demo-3-agent-java/workspace/.

The 4 tools, one by one

Tool	Line in `agent_java.py`	What it does
`list_files`	~153	Lists files in `workspace/`. Useful for the model to know where it stands.
`read_file(path)`	~165	Reads a file already written. Useful when fixing an error.
`write_file(path, content)`	~230	Creates or overwrites a file in `workspace/`. Filters by extension.
`compile_java()`	~259	Runs `javac -encoding UTF-8 *.java` via `subprocess`. Returns success or errors.

flowchart LR
  Model["llama3.1:8b"]
  subgraph tools [4 Python tools]
      T1["list_files"]
      T2["read_file"]
      T3["write_file"]
      T4["compile_java"]
  end
  World[("workspace/")]
  Javac[("javac")]

  Model -->|"tool_calls"| tools
  T1 --> World
  T2 --> World
  T3 --> World
  T4 --> Javac
  Javac --> World
  tools -->|"text result"| Model

The model can do nothing but ask these 4 tools.

That is the whole surface the model can act on: no network access, no file deletion, no arbitrary command execution.
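To make the "nothing else" guarantee concrete, here is a minimal sketch of how a write tool could confine itself to one folder and filter by extension. It is not the code from agent_java.py: the helper safe_path, the ALLOWED_EXTENSIONS set and the extra root parameter are assumptions made for illustration.

```python
from pathlib import Path

# Assumptions for this sketch: the folder name and the allowed extensions.
WORKSPACE = Path("workspace")
ALLOWED_EXTENSIONS = {".java", ".txt", ".md"}


def safe_path(path: str, root: Path) -> Path:
    """Resolve `path` inside `root` and refuse anything that escapes it."""
    root = root.resolve()
    target = (root / path).resolve()
    if not target.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"path escapes the workspace: {path}")
    return target


def write_file(path: str, content: str, root: Path = WORKSPACE) -> str:
    """Create or overwrite a text file in the workspace."""
    try:
        target = safe_path(path, root)
    except ValueError as error:
        return f"Refused: {error}"
    if target.suffix not in ALLOWED_EXTENSIONS:
        return f"Refused: extension {target.suffix!r} is not allowed."
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    line_count = len(content.splitlines())
    return f"File created or modified: {path} ({line_count} lines)"
```

Returning a refusal as text, rather than raising, lets the model read the reason on the next turn and try a different path.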

Where do these tools come from?

A common confusion in class: the four tools are not built into Ollama, into the model, or into any external library. They are our own Python code, written by hand in agent_java.py between lines 153 and 282.

What is ours (`agent_java.py`)	What comes from outside
The four tool functions themselves (~130 lines of Python)	The `ollama` Python SDK (one `pip install ollama`, ~80 KB) — talks to the local Ollama daemon over HTTP on `127.0.0.1:11434`
The `tools = [list_files, read_file, write_file, compile_java]` registration list	Python standard library: `pathlib`, `json`, `os`, `re`, `subprocess`, `sys`, `time`
The system prompt (lines 39 – 72)	The model weights pulled by `ollama pull llama3.1:8b` (~4.9 GB on disk)
The agent loop (~30 lines, from line 486)	The `javac` compiler from the JDK, invoked by `compile_java`
The fallback parser `parse_pseudo_tool_calls`

Two short definitions in agent_java.py wire everything together:

tools = [list_files, read_file, write_file, compile_java]

available_functions = {
    "list_files":    list_files,
    "read_file":     read_file,
    "write_file":    write_file,
    "compile_java":  compile_java,
}

When Python passes those four function references to client.chat(tools=...), the Ollama SDK automatically generates the JSON tool schema (name, parameters, types, description) by reading the function’s type hints and docstring. That JSON schema is what gets sent to the model. The Python code itself is never sent — the model only knows that a tool called write_file exists and accepts a path: str and a content: str.
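The schema generation described above can be approximated with the standard library alone. The sketch below is not the SDK's implementation; tool_schema and JSON_TYPES are hypothetical names, and the output shape simply mirrors the "type": "function" layout used by the chat API.

```python
import inspect
from typing import get_type_hints

# Hypothetical mapping from Python types to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_schema(fn) -> dict:
    """Build a JSON-style tool description from type hints and docstring."""
    hints = get_type_hints(fn)
    properties = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def write_file(path: str, content: str) -> str:
    """Create or overwrite a file in workspace/."""
    return "ok"


schema = tool_schema(write_file)
print(schema["function"]["name"], schema["function"]["parameters"]["required"])
# prints: write_file ['path', 'content']
```

Nothing from the function body appears in the result, which is why the model never sees the Python code.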

The full chain on a single tool call:

We write a Python function with type hints and a docstring.
The Ollama SDK turns it into a JSON tool schema and ships that schema in the chat request.
The model reads the schema, decides which tool to call and with which arguments, and emits a tool_calls entry.
Our loop receives tool_calls, looks up the matching Python function in available_functions, and calls it with the model-provided arguments.
The tool’s return value is appended to the conversation history with role "tool", so the model can react on the next turn.

That is everything. No framework. No decorator. No central registry. If you want to add a fifth tool — say git_status() — you write a Python function, append it to the tools list and to available_functions, and the model will start using it on the next run.
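As an illustration of that last point, here is a sketch of the git_status() tool and its registration. The four existing tools are replaced by empty stand-ins so the block runs on its own, and the body of git_status is an assumption, not code from the repository.

```python
import subprocess


# Stand-ins for the four existing tools, so this sketch is self-contained.
def list_files() -> str:
    """List files in workspace/."""
    return ""


def read_file(path: str) -> str:
    """Read a file from workspace/."""
    return ""


def write_file(path: str, content: str) -> str:
    """Create or overwrite a file in workspace/."""
    return ""


def compile_java() -> str:
    """Compile every Java file in workspace/."""
    return ""


tools = [list_files, read_file, write_file, compile_java]
available_functions = {fn.__name__: fn for fn in tools}


def git_status() -> str:
    """Return the short git status of the current directory."""
    try:
        completed = subprocess.run(
            ["git", "status", "--short"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as error:
        return f"git is not available: {error}"
    return completed.stdout or completed.stderr or "Working tree clean."


# The two registration steps: one for the schema, one for the dispatch.
tools.append(git_status)
available_functions["git_status"] = git_status
```

The docstring matters as much as the code: it is the only description of the tool the model will ever read.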

What is “built into” the model — and into Ollama

The previous section showed that the tools are our code. The natural follow-up: if we write the tools, what does the model actually contribute, and what does Ollama add in the middle? This section answers it directly. It is the heart of why llama3.1:8b works cleanly on this demo and qwen2.5-coder:7b does not — even though both models declare the same Capabilities: tools on the Ollama library.

The three layers, from the function call to the model weights

Three layers of code and data cooperate to make client.chat(model=..., tools=[...]) work.

Layer	What it is	Who wrote it	Where it lives
1. Our code	The 4 Python functions, the `tools` list, the loop, the fallback parser	Us, in `agent_java.py`	One file, ~700 lines total, ~30 lines for the loop itself
2. The Ollama runtime	Chat template, JSON-schema generation from Python type hints, prompt formatting, and the extraction of `tool_calls` from the model’s raw output	The Ollama team	The `ollama` daemon listening on `127.0.0.1:11434`
3. The model weights	Billions of numbers that were trained to recognise the system prompt + tool schema, and to emit a tool call in a specific format when appropriate	Meta (Llama), Alibaba (Qwen), Google (Gemma)… during fine-tuning	The `.gguf` file pulled by `ollama pull`

Each layer is necessary. Each one is independent. Misalignment between any two of them is what makes a model “declare tools” but fail to use them.

Layer 2 in detail — Ollama’s invisible work

When your Python code calls client.chat(model="llama3.1:8b", tools=[list_files, read_file, write_file, compile_java]), four things happen between the Ollama SDK and the daemon without you seeing them:

1. Schema generation. The SDK reads each Python function’s type hints and docstring and turns them into a JSON tool schema (name, parameter names, types, descriptions). That schema is what gets sent to the model, not the Python code itself.
2. Chat-template injection. Every model in Ollama ships with a chat template (a small Go template that formats the conversation into the exact byte sequence the model was trained to consume). The template for a tool-calling model has a {{ if .Tools }}...{{ end }} block that knows where in the prompt the tool schemas must appear, and in which format that specific model family expects them.
3. Model invocation. The formatted byte sequence is fed to the model, which generates a response token by token.
4. Tool-call extraction. As the response streams back, Ollama watches for the special tokens or string patterns that the model was trained to emit when it wants to call a tool. If those patterns are detected, the matching substring is lifted out of the raw text and placed in the structured field response.message.tool_calls. Anything that does not match stays in response.message.content.

That fourth step is the one that succeeds with llama3.1:8b and fails with qwen2.5-coder:7b. It deserves its own paragraph.

Layer 3 in detail — what “trained for tool calling” really means

During fine-tuning, the model is shown thousands of conversations that look roughly like this (schematically — each family uses its own special tokens):

User: "What's the weather in Paris?"
Assistant: <special_tool_call_start>
  {"name": "get_weather", "arguments": {"location": "Paris"}}
<special_tool_call_end>
Tool: {"temperature": 22, "conditions": "sunny"}
Assistant: "It's sunny and 22°C in Paris."

Through this kind of training, the model learns three things at once:

When to emit a tool call (after seeing a request it cannot answer from its own knowledge alone).
Which tool to pick from the schema list passed in the prompt.
How to format the call — using the exact special tokens the family’s chat template expects.

The “special tokens” are model-family-specific:

Llama 3.1 uses tokens like <|python_tag|> and <|eom_id|> around tool-call JSON.
Qwen 2.5 uses an XML-like wrapper such as <tool_call> … </tool_call>.
Mistral has yet another convention.

The match between the special tokens the model emits and the chat template Ollama uses to parse them back is what makes a model “good at tool calling”. If the model emits something close but not identical to what the template expects, the parser fails to lift it into tool_calls, and the JSON ends up in message.content instead. That is exactly the failure mode of qwen2.5-coder:7b on this demo.
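The matching problem can be reproduced in a few lines. The sketch below is a deliberately simplified stand-in for a runtime-side extractor, not Ollama's actual parser. It assumes an XML-like wrapper in the style of the Qwen convention mentioned above, and extract_tool_calls is a hypothetical name.

```python
import json
import re

# Assumed wrapper format for this sketch only.
TOOL_CALL_PATTERN = re.compile(
    r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL
)


def extract_tool_calls(raw: str) -> tuple[list[dict], str]:
    """Split raw model output into (structured calls, remaining content)."""
    calls: list[dict] = []

    def lift(match: re.Match) -> str:
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            return match.group(0)  # unparseable: leave it in the content
        return ""  # parsed: remove it from the content

    content = TOOL_CALL_PATTERN.sub(lift, raw).strip()
    return calls, content


well_formed = '<tool_call>{"name": "compile_java", "arguments": {}}</tool_call>'
drifted = '```json\n{"name": "compile_java", "arguments": {}}\n```'

print(extract_tool_calls(well_formed))
# prints: ([{'name': 'compile_java', 'arguments': {}}], '')
print(extract_tool_calls(drifted)[0])
# prints: []
```

The second input carries the same JSON, but without the expected wrapper nothing is lifted and the text stays in the content, which is the symptom described above.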

What “Capabilities: tools” on the Ollama library actually means

When you visit a model’s page on the Ollama library (for example ollama.com/library/llama3.1 or ollama.com/library/qwen2.5-coder), you see a row of badges. One of them may say Capabilities: tools.

That badge means: “the model card declares that this model was fine-tuned for tool calling.” It is a necessary condition, not a sufficient one. The badge is filled in by the publisher of the model; it is not the result of an automated benchmark Ollama runs.

You can see the same declaration locally:

ollama show llama3.1:8b
# ...
# Capabilities    completion    tools

ollama show qwen2.5-coder:7b
# ...
# Capabilities    completion    insert    tools

Both models declare tools. Yet only llama3.1:8b populates the structured tool_calls field on our demo. The badge does not guarantee that the model’s emitted format matches what Ollama’s chat template for that family expects on every kind of prompt.

Why `llama3.1:8b` works cleanly and `qwen2.5-coder:7b` does not

This is the explicit answer to the question that opens this section.

Aspect	`llama3.1:8b`	`qwen2.5-coder:7b`
Base family	Meta’s Llama 3.1, trained directly on tool-calling examples	Alibaba’s Qwen 2.5, then fine-tuned for code generation
Special-token format learned	Llama 3.1 tool-call tokens, matching Ollama’s Llama chat template byte-for-byte	Qwen 2.5 tool-call tokens, but shifted by the code-specialised fine-tune, so the format the model now emits is not the one Ollama’s Qwen chat template expects
What the model emits on our demo prompt	The Llama 3.1 special tokens with valid JSON inside	A JSON-shaped string inside the regular `message.content`, sometimes with embedded Java strings that break `json.loads`
What Ollama’s chat-template parser does with it	Lifts it into `response.message.tool_calls`	Leaves it in `response.message.content` — the parser does not recognise it
What our Python code sees	`response.message.tool_calls = [<4 structured calls>]`, `response.message.content = ""`	`response.message.tool_calls = []`, `response.message.content = "{...JSON...}"`
What our code does about it	Iterates `tool_calls` directly	Calls `parse_pseudo_tool_calls()` on `content` to extract the calls anyway

In one sentence: qwen2.5-coder:7b was retrained to be excellent at writing code, and that retraining shifted its tool-call output slightly away from the format Ollama’s Qwen chat template expects. Our fallback parser exists precisely to bridge that gap; see the journal of attempts in the demo README, attempts 5 to 7.
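For readers who want to see what such a bridge can look like, here is a minimal sketch of a content-side fallback. It is not the repository's parse_pseudo_tool_calls; the function name and the assumed {"name": ..., "arguments": {...}} shape are illustrative. It scans free text for JSON objects with json.JSONDecoder.raw_decode and keeps those that look like tool calls.

```python
import json


def parse_pseudo_tool_calls_sketch(content: str) -> list[tuple[str, dict]]:
    """Find JSON objects shaped like tool calls inside free text."""
    decoder = json.JSONDecoder()
    calls = []
    index = content.find("{")
    while index != -1:
        try:
            obj, end = decoder.raw_decode(content, index)
        except json.JSONDecodeError:
            # Not valid JSON from this brace: try the next one.
            index = content.find("{", index + 1)
            continue
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            args = obj.get("arguments", {})
            calls.append((obj["name"], args if isinstance(args, dict) else {}))
        index = content.find("{", end)  # skip past the object just parsed
    return calls


text = (
    'I will compile now.\n{"name": "compile_java", "arguments": {}}\n'
    'Then {"name": "read_file", "arguments": {"path": "Main.java"}}'
)
print(parse_pseudo_tool_calls_sketch(text))
# prints: [('compile_java', {}), ('read_file', {'path': 'Main.java'})]
```

This sketch does not repair invalid JSON. A call whose content argument contains unescaped Java string quotes is skipped, which is one reason a real fallback parser needs more care than this.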

How to verify any of this yourself

Three short checks cover everything in this section.

# 1. See what capabilities a model declares
ollama show llama3.1:8b
ollama show qwen2.5-coder:7b

# 2. See the exact chat template used for that model
ollama show --modelfile llama3.1:8b | Select-String -Pattern 'TEMPLATE','TOOL'

# 3. Hit the API directly and see what the structured field actually contains
curl.exe -s http://127.0.0.1:11434/api/chat -d '{
  "model": "qwen2.5-coder:7b",
  "stream": false,
  "messages": [{"role": "user", "content": "What time is it in Paris?"}],
  "tools": [{"type":"function","function":{"name":"get_time","description":"Get current time","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}]
}' | ConvertFrom-Json | Select-Object -ExpandProperty message

On qwen2.5-coder:7b you will typically see the JSON of the tool call inside the content field and an empty tool_calls array. On llama3.1:8b, the opposite: content is mostly empty and tool_calls holds the structured call.

One-line mental model

The model brings the trained capability to emit tool calls in a specific format. Ollama brings the chat template, the schema generation, and the extraction of structured tool_calls from the model’s raw output. We bring the Python functions, the agent loop, and — when the model’s format drifts from what Ollama’s parser expects — a fallback parser to bridge the gap.

Hand-coded loop vs LangChain, LangGraph and the alternatives

A natural follow-up question: why write the loop ourselves when frameworks exist for that? The answer has two parts — a clarification about what is open source and what is commercial, then a pedagogical trade-off.

Open source vs commercial — what is actually what

Tool / library	License	Cost of the library itself	Paid services around it?
Ollama (local runtime)	MIT — open source	Free	None
The course demo loop (`agent_java.py`)	MIT — open source	Free	None
LangChain	MIT — open source	Free	LangSmith (observability), LangGraph Platform (hosted deployment) — both optional
LangGraph	MIT — open source	Free	LangGraph Platform — optional
OpenWebUI	MIT — open source	Free	None
Continue (VS Code extension)	Apache 2.0 — open source	Free	None
OpenCode (CLI agent)	MIT — open source	Free	None
OpenAI function calling	Proprietary	Pay per token	The whole API is paid
Anthropic tool use (Claude)	Proprietary	Pay per token	The whole API is paid

LangChain, LangGraph, OpenWebUI, Continue and OpenCode are all open source and free. What is commercial in that ecosystem are the hosted services around them (LangSmith for telemetry, LangGraph Platform for hosted agents, OpenAI’s and Anthropic’s cloud APIs) — not the libraries themselves. So the question is not “free vs paid”, it is “hand-coded vs framework”.

What LangChain would look like for the same demo

If we rewrote demo 3 in LangChain + LangGraph, the result would look roughly like this:

from langchain_ollama import ChatOllama
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

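@tool
def list_files() -> str:
    """List the files in workspace/."""
    # ... same body as our list_files ...

@tool
def read_file(path: str) -> str:
    """Read a file from workspace/."""
    # ... same body as our read_file ...

@tool
def write_file(path: str, content: str) -> str:
    """Create or overwrite a file in workspace/."""
    # ... same body as our write_file ...

@tool
def compile_java() -> str:
    """Compile every Java file in workspace/."""
    # ... same body as our compile_java ...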

llm = ChatOllama(model="llama3.1:8b", num_ctx=20480)
agent = create_react_agent(
    llm,
    tools=[list_files, read_file, write_file, compile_java],
)
result = agent.invoke({"messages": [("user", DEFAULT_USER_PROMPT)]})

That is fewer lines in the calling code, but it pulls in two extra dependencies (langchain, langgraph), each with a tree of 30 – 50 transitive packages, and it hides the loop behind agent.invoke. You no longer see when the model is called, when tool_calls are parsed, when the results are fed back. You also lose the ability to plug in our fallback parser parse_pseudo_tool_calls cleanly — LangChain’s tool-calling expects the structured channel and falls back to its own internal logic when the model misbehaves.

Why the course keeps it hand-coded

Criterion	Hand-coded (our choice)	LangChain + LangGraph
Lines of code to read to understand the agent	~30	A framework’s worth of indirection
External dependencies	`ollama` only	`langchain`, `langgraph`, plus dozens of transitive
Visibility of the loop	Every step printed and explainable	Hidden behind `agent.invoke`
Handling quirky models (e.g. `qwen2.5-coder:7b` fallback parser)	Trivial — edit one helper	Requires diving into LangChain parser internals
Pedagogical value	High — students step through every line	Lower — students learn a framework, not the protocol
Production realism on large agents	Lower — you would add structure for big projects	Higher — battle-tested across many production agents
Cost to add a tool	Add a Python function to two lists	Add an `@tool`-decorated function

The course’s goal is to understand what an agent is. A 30-line loop that prints what it does at every step is the right pedagogical artefact for that goal. Once a student has read those 30 lines and run them on their laptop, the leap to LangChain is small: they already know what the framework is doing under the hood. The reverse is not true — starting with LangChain often leaves students unable to explain what agent.invoke actually does.

When you would switch to LangChain in a real project

In a real codebase, the right question is “is the marginal benefit of the framework greater than the cost of the dependency and the loss of control?”. Honest signals that you should switch:

You need streaming, retries, multiple LLM backends swappable behind a flag and observability out of the box.
You are building a graph of agents (5+ specialised roles) where writing the orchestration by hand would be error-prone.
Your team already speaks LangChain and onboarding a new framework would slow everyone down.
You want to plug in memory backends (Redis, Postgres, vector stores) without writing the glue.

For the workshop, none of those apply. A 30-line loop is the right tool, so the course keeps it. Demo 4 (chapter 10) reuses exactly the same loop to drive three specialised agents in sequence, which shows that the hand-coded approach scales further than people expect.

Commented walk-through of the loop

The loop lives in agent_java.py from line 486:

for step in range(1, MAX_STEPS + 1):
    stats["turns"] = step
    ui.step_start(step, time.monotonic() - started)

    response = client.chat(
        model=MODEL_NAME,
        messages=messages,
        tools=tools,
        options={"num_ctx": 20480},
    )
    messages.append(response.message)

    if response.message.content:
        ui.model_message(response.message.content)

    calls = list(iter_tool_calls(response.message))
    if not calls:
        break

    for name, args in calls:
        stats["tool_calls"] += 1
        ui.tool_call(name, args)

        fn = available_functions.get(name)
        if fn is None:
            result = f"Unknown tool: {name}"
        else:
            try:
                result = fn(**args)
            except Exception as error:
                result = f"Error while running the tool: {error}"

        ui.tool_result(name, result)
        messages.append(
            {"role": "tool", "tool_name": name, "content": str(result)}
        )

Line by line:

response = client.chat(...): we send the whole history to the model, plus the tool list. The SDK handles the JSON schema.
messages.append(response.message): we append the model’s reply (including any tool_calls) to the history. That’s what gives the agent “memory”.
calls = list(iter_tool_calls(...)): extract tool calls. iter_tool_calls is a helper that handles two cases: real tool_calls (Llama 3.1’s structured channel) and the pseudo tool_calls Qwen slips into message.content (fallback).
if not calls: break: if the model didn’t call any tool, it’s done. We exit.
fn(**args): we call the real Python function with the arguments the model gave us. This is where the bridge to the real world materialises.
messages.append({"role": "tool", ...}): we hand the result back to the model so it can take it into account next turn.

MAX_STEPS = 10 is the safety net: if the model keeps calling tools without ever converging, the loop stops after 10 turns.
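The control flow can be exercised without Ollama at all. The dry run below replaces client.chat with a scripted fake (SCRIPT and fake_chat are hypothetical names) and uses a stand-in compile_java, so it shows only how the history grows and when the loop exits.

```python
from types import SimpleNamespace

MAX_STEPS = 10


def compile_java() -> str:
    """Stand-in tool for the dry run; the real one runs javac."""
    return "Compilation successful."


available_functions = {"compile_java": compile_java}

# Scripted replies standing in for the model: one tool call, then an answer.
SCRIPT = [
    SimpleNamespace(role="assistant", content="",
                    tool_calls=[("compile_java", {})]),
    SimpleNamespace(role="assistant", content="Compilation succeeded.",
                    tool_calls=[]),
]


def fake_chat(messages: list, turn: int) -> SimpleNamespace:
    """Return the scripted reply for this turn (ignores the history)."""
    return SCRIPT[turn]


messages = [{"role": "user", "content": "Compile the project."}]
for step in range(1, MAX_STEPS + 1):
    reply = fake_chat(messages, step - 1)
    messages.append(reply)  # the reply becomes part of the history
    if not reply.tool_calls:
        break  # no tool requested: the agent stops
    for name, args in reply.tool_calls:
        fn = available_functions.get(name)
        result = fn(**args) if fn else f"Unknown tool: {name}"
        messages.append(
            {"role": "tool", "tool_name": name, "content": str(result)}
        )

print(step, len(messages))
# prints: 2 4
```

Two turns produce four messages: the user request, the tool-calling reply, the tool result and the final answer.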

Concrete examples: how the value of MAX_STEPS shapes the agent’s behaviour

The agent loop runs at most MAX_STEPS iterations before it is forced to stop. Each iteration is one round of model_reply → tool_calls → tool_results → next model_reply. The right value depends on the task complexity.

`MAX_STEPS`	Typical behaviour on the “small Java console app” task	Risk profile
2	The model writes one file then is cut off before compiling and fixing errors. Demo fails.	Too tight — no recovery room.
5	Often enough for a one-class app. Two files + compile + fix = 4–5 turns.	Acceptable for very simple cases.
10 (course default)	Comfortable for the canonical demo (1 main + 1 helper class + compile + fix once).	Sweet spot for an 8B model on small tasks.
30	Allows multi-class iteration, several compile/fix loops, exploration.	The model may start looping if the prompt is ambiguous (rereads the same file 5 times).
100+	Unbounded exploration.	A misaligned model can burn tokens running in circles for half an hour. Always pair with a time-out.

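MAX_STEPS = 10
for step in range(1, MAX_STEPS + 1):
    response = client.chat(
        model=MODEL_NAME,
        messages=messages,
        tools=tools,
    )
    # Keep the model's reply in the history, or the next turn loses context.
    messages.append(response.message)

    calls = list(iter_tool_calls(response.message))
    if not calls:
        break  # no tool requested: the model is done
    for name, args in calls:
        result = available_functions[name](**args)
        messages.append(
            {"role": "tool", "tool_name": name, "content": str(result)}
        )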
Two operational rules:

MAX_STEPS is a safety net, not a feature. A well-formed task should converge in 3–7 turns. If the agent regularly hits the ceiling, the system prompt or the toolset has a problem — raising the ceiling only postpones the failure.
In production, always pair MAX_STEPS with a wall-clock timeout (e.g. 5 minutes total). A 10-step loop with a model stuck on a very long generation can still take 20 minutes without ever incrementing step.
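One possible way to pair the two limits is sketched below. The names run_with_budget, run_one_step and MAX_SECONDS are hypothetical, and the injectable clock exists only to make the behaviour easy to test.

```python
import time

MAX_STEPS = 10
MAX_SECONDS = 300.0  # assumed budget: 5 minutes in total


def run_with_budget(run_one_step, max_steps=MAX_STEPS,
                    max_seconds=MAX_SECONDS, clock=time.monotonic) -> str:
    """Call run_one_step(step) until it returns False or a limit is hit.

    Returns "done", "max_steps" or "timeout".
    """
    deadline = clock() + max_seconds
    for step in range(1, max_steps + 1):
        if clock() >= deadline:
            return "timeout"  # wall-clock budget spent
        if not run_one_step(step):
            return "done"  # the step reported that no tool was requested
    return "max_steps"  # step budget spent


# Usage: a step function that finishes on its third turn.
print(run_with_budget(lambda step: step < 3))
# prints: done
```

A check between steps cannot interrupt a generation that is already in flight. For the slow-generation case described above, the HTTP request itself also needs a timeout; whether and how the client library exposes one should be checked against its documentation.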

The system prompt that guides the model

The system prompt lives on lines 39-72 of agent_java.py. It does two things:

defines the role: “You are an autonomous Java development agent. You MUST act through the provided tools”;
lists hard rules: no private, add the java.util.* imports, write a complete file every time (never ...), don’t create empty stubs.

Why so many rules? Because llama3.1:8b is an 8-billion-parameter model: good enough to follow clear instructions, not strong enough to guess project conventions. Each rule was added to fix a real bug observed during demo development.

Prompt rules of this kind are the topic of chapter 11.

What to watch in class

Run the demo with screen sharing. Ask students to point at the terminal:

Where do we see the model “thinking”? → In the pause after a --- Step N --- line appears and before the first [tool] line is printed.
Where does the model ask for a file, and where is it actually created? → The request is the [tool] write_file -> Foo.java line (what the model asked for). The reality is the File created or modified: Foo.java (N lines) line (what your code did).
What happens if compilation fails? → The return value of compile_java is appended to messages, the model sees the error on the next turn, and it calls read_file and write_file to fix it.
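The compile-and-feed-back behaviour depends on compile_java returning the compiler's error text rather than raising. Here is a minimal sketch of such a tool, assuming javac is on PATH. It is not the repository's code, and the workspace parameter is added only so the function can be tried on any folder.

```python
import subprocess
from pathlib import Path


def compile_java(workspace: str = "workspace") -> str:
    """Compile every .java file in the workspace and report as text."""
    # No shell is involved, so the *.java glob is expanded in Python.
    sources = sorted(p.name for p in Path(workspace).glob("*.java"))
    if not sources:
        return "No .java files to compile."
    try:
        completed = subprocess.run(
            ["javac", "-encoding", "UTF-8", *sources],
            cwd=workspace, capture_output=True, text=True, timeout=120,
        )
    except FileNotFoundError:
        return "javac was not found on PATH."
    except subprocess.TimeoutExpired:
        return "Compilation timed out."
    if completed.returncode == 0:
        return "Compilation successful."
    # The compiler's messages become the tool result the model reads next turn.
    return "Compilation failed:\n" + completed.stderr
```

Because every outcome is a string, the loop can append it to messages unchanged, and the model gets file names and line numbers to work with.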

Test the generated program:

cd workspace
java Main
cd ..

You should see the product list and the total stock value.

Small exercise (5 min)

Edit the DEFAULT_USER_PROMPT in agent_java.py (line ~75) to ask for something other than product management. Ideas:

a tiny bank with Account and Bank;
a FizzBuzz with a FizzBuzz class (don’t dump everything into Main);
a Calculator with four static methods.

Empty the workspace first:

Remove-Item workspace\*.java, workspace\*.class -Force
python agent_java.py

Watch: which tools get called? How many turns? If compilation fails, does the model fix it on its own?

Key takeaways

4 tools, ~30-line loop, one system prompt, and you have a working agent.
The model only requests actions; it’s your Python functions that actually act.
The system-prompt rules compensate for an 8B model’s blind spots (no shortcuts, plenty of guardrails).
This demo is the canonical pedagogical reference used throughout the course to explain what an agent is. Everything else (demo 4, the user interface, per-project isolation) is added structure around the same loop.