Demo 0 — minimal CLI chat

Duration: 10 min
Prerequisites: chapter 06 (environment installed) and ideally practice 1 to practice 3 (you have already chatted with the model from a bare terminal)

Source code

Repo: gneuroneai/ollama-demo-0-chat-cli — a single chat.py file (~50 useful lines).

git clone https://github.com/gneuroneai/ollama-demo-0-chat-cli.git
cd ollama-demo-0-chat-cli
.\start.ps1

What this demo is about

This project is a minimal command-line chat client written in a single Python file of roughly fifty useful lines. It keeps a Python list of messages, each with a role and a content field, and sends that list to ollama.Client.chat() on every turn; the response is streamed back to the terminal token by token. Running the program lets you observe the only data structure that ever travels between the user and the model (an ordered list of typed messages) and shows, through the /history and /system commands, how a conversation is built up and how the system prompt sets the assistant’s behaviour. It is the reference implementation: every later demo (Streamlit chat, comparator, agents) only adds layers around it.

Key idea

Before agents, before UIs, before LangChain — talking to a local LLM is just a Python loop that sends a list of messages to ollama.Client.chat() and reads the streamed response. This demo proves exactly that in ~50 useful lines.

What the demo does

You run:

cd ollama-demo-0-chat-cli
.\start.ps1

And you get a ChatGPT-style chat in the terminal, 100% local, streamed token by token:

======================================================================
  ollama-demo-0-chat-cli - chat with llama3.1:8b (local, via Ollama)
======================================================================
  Commands: /clear  /system <prompt>  /history  /quit
  Ctrl+C or /quit to exit.
----------------------------------------------------------------------
you> Hi, who are you?
assistant> I am a pedagogical assistant. I can help you with...

you> Can you code?
assistant> Yes, I can write code in several languages...

you> /history
--- 5 messages currently in context ---
  [   system] You are a pedagogical assistant...
  [     user] Hi, who are you?
  [assistant] I am a pedagogical assistant...
  [     user] Can you code?
  [assistant] Yes, I can write code in several languages...
------------------------------------------------------------

you> /quit
Bye.

First launch: ~5 min (model download ~4.7 GB). After that: ~3 seconds.

Architecture in a diagram

flowchart TB
  U["<b>You (terminal)</b><br/>type a question"] -->|"input()"| L["<b>Python loop</b><br/>chat.py"]
  L -->|"messages = [system, user, assistant, ...]"| O["<b>ollama.Client.chat()</b><br/>HTTP to 127.0.0.1:11434"]
  O -->|"streamed chunks"| L
  L -->|"print(token, flush=True)"| U
  L -.->|"append assistant"| L
  classDef user fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef code fill:#dbeafe,stroke:#2563eb
  classDef ext fill:#d1fae5,stroke:#047857
  U:::user
  L:::code
  O:::ext

One loop, one message list, one streamed HTTP call. That's it.

The core of the code in a dozen lines

chat.py literally does this:

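from ollama import Client

# SYSTEM_PROMPT: the system prompt string, defined elsewhere in chat.py.
client = Client(host="http://127.0.0.1:11434")
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

while True:
    user_input = input("you> ")
    messages.append({"role": "user", "content": user_input})
    full_reply = ""
    # Send the whole list every turn and print tokens as they arrive.
    for chunk in client.chat(model="llama3.1:8b", messages=messages, stream=True):
        token = chunk["message"]["content"]
        print(token, end="", flush=True)
        full_reply += token
    print()  # end the line once the reply is complete
    # Store the reply so the next turn sends it back to the model.
    messages.append({"role": "assistant", "content": full_reply})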

Breakdown:

messages = [{"role": "system", ...}] — the message list is the entire state of the conversation. It lives in a Python variable, not in Ollama. You maintain it.
client.chat(..., stream=True) — returns a generator of JSON chunks. You loop, you print, you accumulate.

Concrete example: stream=False vs stream=True for the same prompt

Both modes produce the same final content. The only difference is when, and in what shape, the content reaches your code.

stream=False

r = client.chat(model="llama3.1:8b", messages=messages, stream=False)
print(r.message.content)

What the user sees: silence for ~2 to 5 seconds, then the full answer printed in one go. The Python call blocks until the model has finished generating.

stream=True

full = ""
for chunk in client.chat(model="llama3.1:8b", messages=messages, stream=True):
    token = chunk["message"]["content"]
    print(token, end="", flush=True)
    full += token

What the user sees: tokens appear one by one, ChatGPT-style. The first token arrives almost immediately and the rest scrolls in real time.

Practical implications:

CLI usage: streaming is almost always preferable, because the user gets feedback right away.
Batch / scripted usage: stream=False is simpler — one call, one full string, no loop.
Tool calling: with stream=False you receive a complete tool_calls array. With stream=True you have to accumulate the chunks before deciding whether to execute a tool. This is why demo 3 uses stream=False for its agent loop and demo 0 uses stream=True for chat UX.
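The tool-calling point can be made concrete with a small sketch. It assumes each streamed chunk is a dict-like object whose message may carry a content fragment and, optionally, a tool_calls list. The helper name collect_stream, the read_file tool and the sample chunks are hypothetical and are not taken from chat.py or demo 3.

```python
from typing import Any, Iterable


def collect_stream(chunks: Iterable[dict[str, Any]]) -> tuple[str, list[Any]]:
    """Drain a stream of chat chunks and return (full_text, tool_calls).

    Assumption: each chunk looks like
    {"message": {"content": str, "tool_calls": [...]}}, both keys optional.
    """
    text_parts: list[str] = []
    tool_calls: list[Any] = []
    for chunk in chunks:
        message = chunk.get("message", {})
        text_parts.append(message.get("content") or "")
        tool_calls.extend(message.get("tool_calls") or [])
    return "".join(text_parts), tool_calls


# Hand-written chunks standing in for a real stream=True response.
fake_stream = [
    {"message": {"content": "Let me "}},
    {"message": {"content": "check."}},
    {"message": {"content": "", "tool_calls": [
        {"function": {"name": "read_file", "arguments": {"path": "main.c"}}}
    ]}},
]

text, calls = collect_stream(fake_stream)
# Only now, with the stream fully drained, can the code decide whether to run a tool.
print(text)
print([call["function"]["name"] for call in calls])
```

With stream=False none of this bookkeeping is needed, which is the trade-off described above: simpler control flow in exchange for no incremental output.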

messages.append({"role": "assistant", ...}) — you append the full reply. Next turn, it goes back to the model. That’s what gives memory.
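A minimal sketch of why appending the reply is what creates memory. It replaces the real model with a stand-in function so it runs without an Ollama server; chat_turn, fake_generate and the system prompt text are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Iterator

Message = dict[str, str]


def chat_turn(
    messages: list[Message],
    user_input: str,
    generate: Callable[[list[Message]], Iterator[str]],
) -> str:
    """Run one turn: append the user message, collect the reply, append it too."""
    messages.append({"role": "user", "content": user_input})
    reply = "".join(generate(messages))
    messages.append({"role": "assistant", "content": reply})
    return reply


def fake_generate(messages: list[Message]) -> Iterator[str]:
    """Stand-in for the model: reports how many messages it was shown."""
    yield "I can see "
    yield f"{len(messages)} messages."


history: list[Message] = [{"role": "system", "content": "You are a test assistant."}]
print(chat_turn(history, "How are you?", fake_generate))  # I can see 2 messages.
print(chat_turn(history, "And you?", fake_generate))      # I can see 4 messages.
print(len(history))                                       # 5
```

The stand-in only "remembers" the first exchange because the caller sent it again on the second turn. Skip the final append and the second call would see only the system prompt and the two user messages.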

The rest of chat.py only deals with:

handling /commands cleanly;
printing a banner and readable errors;
catching Ctrl+C to exit gracefully.

The whole chat mechanism is in the loop above. Streamlit (demo 1), the comparator (demo 2) and the agents (demos 3 and 4) only add a layer on top.
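The graceful-exit part mentioned above might look like the following sketch. It is not the code from chat.py: read_user_line is a hypothetical helper, and the reader parameter exists only so the behaviour can be exercised without a real keyboard.

```python
from typing import Callable, Optional


def read_user_line(reader: Callable[[str], str] = input) -> Optional[str]:
    """Return the next user line, or None when the user asks to leave.

    Inside input(), Ctrl+C raises KeyboardInterrupt and end-of-input
    (Ctrl+D, or Ctrl+Z on Windows) raises EOFError; both are treated
    here as a request to exit.
    """
    try:
        return reader("you> ").strip()
    except (KeyboardInterrupt, EOFError):
        print("\nBye.")
        return None


def interrupted_reader(prompt: str) -> str:
    """Test double that behaves like input() when the user presses Ctrl+C."""
    raise KeyboardInterrupt


print(read_user_line(lambda prompt: "  hello  "))  # hello
print(read_user_line(interrupted_reader))          # prints Bye. and then None
```

In a main loop, a None result would simply break out of the while True, so the program ends with a short message instead of a traceback.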

The slash commands

Command	Effect
`/clear`	Clear conversation history (keeps system prompt)
`/system <text>`	Replace the system prompt and clear history
`/history`	Show the exact list of messages currently in context
`/quit`	Exit (Ctrl+C also works)

/history is the most useful pedagogical tool: it physically reveals the Python list that the code sends to Ollama each turn. Nothing hidden. No framework. Just a list.
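The four commands in the table can be sketched as one small dispatcher that edits the message list in place. This is an illustration under assumptions, not the code from chat.py: handle_command and its return values are hypothetical, the prompt texts are placeholders, and the system message is assumed to sit at index 0.

```python
Message = dict[str, str]


def handle_command(line: str, messages: list[Message]) -> str:
    """Apply a slash command to the message list in place.

    Returns "quit", "handled", or "chat" (not a command: send it to the model).
    """
    if not line.startswith("/"):
        return "chat"
    name, _, argument = line.partition(" ")
    if name == "/quit":
        return "quit"
    if name == "/clear":
        del messages[1:]  # keep only the system prompt at index 0
    elif name == "/system" and argument.strip():
        # New system prompt, empty history.
        messages[:] = [{"role": "system", "content": argument.strip()}]
    elif name == "/history":
        print(f"--- {len(messages)} messages currently in context ---")
        for message in messages:
            print(f"  [{message['role']:>9}] {message['content']}")
    else:
        print(f"Unknown or incomplete command: {line}")
    return "handled"


messages: list[Message] = [
    {"role": "system", "content": "You are a pedagogical assistant."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]
handle_command("/history", messages)
handle_command("/system You always answer like a pirate.", messages)
print(messages)
```

Every command is an ordinary list operation, which is why /history can show exactly what will be sent on the next turn.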

Guided walk-through

Run python chat.py.
Ask the model: “How are you?”. The model answers.
Ask: “And you?”. The model answers and refers to its previous turn, which shows that the conversation context has been preserved.
Type /history. The terminal displays the five messages currently held in the Python list and sent to the model: the system prompt plus the two user/assistant exchanges.
Type /system You always answer like a pirate. This clears the history and replaces the system prompt.
Ask any question: the model now answers in the new role.

This short sequence demonstrates, without modification of the Python code, that:

An agent = a system prompt + a model. A conversation = a list of messages.

These two statements summarise the lesson and remain valid for the rest of the course.
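Read as data, the two statements fit in a few lines. The Agent class below is a hypothetical illustration, not something defined in the demo, and the prompt strings are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """An 'agent' in the sense used here: a model name plus a system prompt."""

    model: str
    system_prompt: str

    def new_conversation(self) -> list[dict[str, str]]:
        """A fresh conversation is a list holding only the system message."""
        return [{"role": "system", "content": self.system_prompt}]


teacher = Agent(model="llama3.1:8b", system_prompt="You are a pedagogical assistant.")
pirate = Agent(model="llama3.1:8b", system_prompt="You always answer like a pirate.")

# Same model, different first message: that is the whole difference between the two.
print(teacher.new_conversation())
print(pirate.new_conversation())
```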

What to understand before moving on

The model has no memory between two calls: the whole conversation is resent every turn. This is expensive (every turn rereads the full history), but it is what keeps the system stateless and easy to debug.
Streaming (stream=True) doesn’t change semantics, only UX: you see tokens arrive.
The system_prompt is nothing special — it’s just a message with role: "system" at the top of the list. You can change it any time.
The model (llama3.1:8b) is swappable: edit MODEL_NAME in chat.py and rerun. See the model library for 10+ alternatives.
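To put a rough number on the cost of resending the history, the sketch below counts how many messages go out with each request. It assumes one system message and exactly one user and one assistant message per turn; message count is only a crude proxy for tokens, and messages_sent_per_turn is a hypothetical helper.

```python
def messages_sent_per_turn(turns: int) -> list[int]:
    """Number of messages in the request for each turn of a conversation.

    Assumes one system message, then one user and one assistant message per turn.
    """
    history_length = 1  # the system message
    sent = []
    for _ in range(turns):
        history_length += 1          # the new user message
        sent.append(history_length)  # the whole list goes out with the request
        history_length += 1          # the assistant reply is appended afterwards
    return sent


per_turn = messages_sent_per_turn(5)
print(per_turn)       # [2, 4, 6, 8, 10]
print(sum(per_turn))  # 30
```

Each request grows linearly with the number of turns, so the total amount of text the model rereads over a whole conversation grows roughly quadratically. For a short chat this is negligible; for long sessions it is the reason /clear exists.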

Going further

You want…	Look at…
The same in a browser instead of the terminal	Demo 1 — Streamlit chat
Compare three models on the same prompt in parallel	Demo 2 — 3-way comparator
Give the model tools (read/write/compile code)	Demo 3 — simple CLI agent

Key takeaways

A conversation = a list of messages. The most important data structure in the entire course.
client.chat(model, messages, stream=True) is the single call that runs the model.
The system_prompt is just a message with role: "system" — change it and you change the behaviour.
Streamlit and the agents are only layers on top of this loop. Understand the loop and the rest becomes obvious.