Demo 0 — minimal CLI chat

Duration: 10 min
Prerequisites: chapter 06 (environment installed)

Source code

Repo: gneuroneai/ollama-demo-0-chat-cli — a single chat.py file (~50 useful lines).

git clone https://github.com/gneuroneai/ollama-demo-0-chat-cli.git
cd ollama-demo-0-chat-cli
.\start.ps1

This project is a minimal command-line chat client written in a single Python file of roughly fifty useful lines. It maintains a Python list of messages with the role / content structure and sends that list to ollama.Client.chat() on every turn; the response is streamed back to the terminal token by token. Running the program lets you observe directly the only data structure that ever travels between the user and the model, an ordered list of typed messages, and exposes, through the /history and /system commands, how a conversation is constructed and how the system prompt fixes the assistant’s behaviour. It serves as the reference implementation: every later demo (Streamlit chat, comparator, agents) merely adds layers around it.

Before agents, before UIs, before LangChain — talking to a local LLM is just a Python loop that sends a list of messages to ollama.Client.chat() and reads the streamed response. This demo proves exactly that in ~50 useful lines.

You run:

cd ollama-demo-0-chat-cli
.\start.ps1

And you get a ChatGPT-style chat in the terminal, 100% local, streamed token by token:

======================================================================
ollama-demo-0-chat-cli - chat with llama3.1:8b (local, via Ollama)
======================================================================
Commands: /clear /system <prompt> /history /quit
Ctrl+C or /quit to exit.
----------------------------------------------------------------------
you> Hi, who are you?
assistant> I am a pedagogical assistant. I can help you with...
you> Can you code?
assistant> Yes, I can write code in several languages...
you> /history
--- 5 messages currently in context ---
[   system] You are a pedagogical assistant...
[     user] Hi, who are you?
[assistant] I am a pedagogical assistant...
[     user] Can you code?
[assistant] Yes, I can write code in several languages...
------------------------------------------------------------
you> /quit
Bye.

First launch: ~5 min (model download ~4.7 GB). After that: ~3 seconds.

flowchart TB
  U["<b>You (terminal)</b><br/>type a question"] -->|"input()"| L["<b>Python loop</b><br/>chat.py"]
  L -->|"messages = [system, user, assistant, ...]"| O["<b>ollama.Client.chat()</b><br/>HTTP to 127.0.0.1:11434"]
  O -->|"streamed chunks"| L
  L -->|"print(token, flush=True)"| U
  L -.->|"append assistant"| L
  classDef user fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef code fill:#dbeafe,stroke:#2563eb
  classDef ext fill:#d1fae5,stroke:#047857
  U:::user
  L:::code
  O:::ext
One loop, one message list, one streamed HTTP call. That's it.

chat.py literally does this:

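from ollama import Client

# Placeholder prompt for this excerpt; chat.py defines its own SYSTEM_PROMPT.
SYSTEM_PROMPT = "You are a pedagogical assistant."

client = Client(host="http://127.0.0.1:11434")
messages = [{"role": "system", "content": SYSTEM_PROMPT}]

while True:
    user_input = input("you> ")
    messages.append({"role": "user", "content": user_input})
    full_reply = ""
    # Stream the reply: print each token as it arrives and accumulate it.
    for chunk in client.chat(model="llama3.1:8b", messages=messages, stream=True):
        token = chunk["message"]["content"]
        print(token, end="", flush=True)
        full_reply += token
    print()  # end the streamed line before the next prompt
    # Store the full reply so the next turn includes it.
    messages.append({"role": "assistant", "content": full_reply})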

Breakdown:

  • messages = [{"role": "system", ...}] — the message list is the entire state of the conversation. It lives in a Python variable, not in Ollama. You maintain it.
  • client.chat(..., stream=True) — returns a generator of JSON chunks. You loop, you print, you accumulate.

Concrete example: stream=False vs stream=True for the same prompt

Both modes produce the same final content. The only difference is when, and in what shape, the content reaches your code.

Mode: stream=False

r = client.chat(
    model="llama3.1:8b",
    messages=messages,
    stream=False,
)
print(r.message.content)

What the user sees: Silence for ~2 to 5 seconds, then the full answer printed in one go. The Python call blocks until the model has finished generating.

Mode: stream=True

full = ""
for chunk in client.chat(
    model="llama3.1:8b",
    messages=messages,
    stream=True,
):
    token = chunk["message"]["content"]
    print(token, end="", flush=True)
    full += token

What the user sees: Tokens appear one by one, ChatGPT-style. The first token arrives almost immediately and the rest scrolls in real time.
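
To check the claim that both modes produce the same final content without needing a running Ollama server, here is a minimal sketch built on a stand-in function. `fake_chat` and `FAKE_TOKENS` are hypothetical names invented for this illustration; they only mimic the two return shapes described above and are not part of the ollama library.

```python
from typing import Iterator, Union

# Hypothetical stand-in for the model: a fixed reply split into tokens.
FAKE_TOKENS = ["Hello", ",", " I", " am", " local", "."]


def fake_chat(stream: bool) -> Union[dict, Iterator[dict]]:
    """Mimic the two return shapes described above (not the real ollama client)."""
    if not stream:
        # Blocking mode: one response carrying the whole answer.
        return {"message": {"role": "assistant", "content": "".join(FAKE_TOKENS)}}
    # Streaming mode: a generator of chunks, one token each.
    return ({"message": {"role": "assistant", "content": t}} for t in FAKE_TOKENS)


# stream=False: read the answer once.
blocking_reply = fake_chat(stream=False)["message"]["content"]

# stream=True: accumulate token by token.
streamed_reply = ""
for chunk in fake_chat(stream=True):
    streamed_reply += chunk["message"]["content"]

print(blocking_reply == streamed_reply)  # True
```

The accumulation loop is the same one chat.py uses; only the source of the chunks has been replaced.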

Practical implications:

  • CLI usage: streaming is almost always preferable, because the user gets feedback right away.
  • Batch / scripted usage: stream=False is simpler — one call, one full string, no loop.
  • Tool calling: with stream=False you receive a complete tool_calls array. With stream=True you have to accumulate the chunks before deciding whether to execute a tool. This is why demo 3 uses stream=False for its agent loop and demo 0 uses stream=True for chat UX.

  • messages.append({"role": "assistant", ...}): back in the main loop, you append the full reply. Next turn, it goes back to the model. That’s what gives memory.
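
For the tool-calling case mentioned in the practical implications, the sketch below shows what "accumulate the chunks before deciding" can look like. The chunk shape (a `message` dict with an optional `tool_calls` list) is an assumption made for this example, and `collect_stream`, `fake_stream` and the `read_file` tool are hypothetical; check the chunks your Ollama version actually returns before relying on it.

```python
from typing import Iterable


def collect_stream(chunks: Iterable[dict]) -> tuple[str, list[dict]]:
    """Accumulate text and tool calls from streamed chunks.

    Assumed chunk shape: {"message": {"content": str, "tool_calls": [...]}};
    "tool_calls" may be absent from most chunks.
    """
    text = ""
    tool_calls: list[dict] = []
    for chunk in chunks:
        message = chunk.get("message", {})
        text += message.get("content") or ""
        tool_calls.extend(message.get("tool_calls") or [])
    return text, tool_calls


# Hypothetical stream: some text, then one tool call in a later chunk.
fake_stream = [
    {"message": {"content": "Let me "}},
    {"message": {"content": "check."}},
    {"message": {"content": "", "tool_calls": [
        {"function": {"name": "read_file", "arguments": {"path": "main.py"}}}
    ]}},
]

text, calls = collect_stream(fake_stream)
# Only now, with the stream exhausted, is it safe to decide what to do.
if calls:
    print("run tool:", calls[0]["function"]["name"])
else:
    print("plain answer:", text)
```

With stream=False none of this bookkeeping is needed, which is the trade-off described above.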

The rest of chat.py only deals with:

  • handling /commands cleanly;
  • printing a banner and readable errors;
  • catching Ctrl+C to exit gracefully.
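
As an illustration of the last two points, here is one possible shape for a loop that exits cleanly on Ctrl+C and prints readable errors. It is a sketch, not the code of chat.py: `run_loop`, `fake_input` and the `[error]` prefix are hypothetical, and the input function is passed in as a parameter so the example can run without a keyboard.

```python
from typing import Callable


def run_loop(read_line: Callable[[str], str], handle: Callable[[str], None]) -> str:
    """Read lines until /quit, Ctrl+C or end of input; return the exit reason."""
    while True:
        try:
            line = read_line("you> ").strip()
        except (KeyboardInterrupt, EOFError):
            # Ctrl+C (or end of input) ends the session without a traceback.
            return "interrupted"
        if not line:
            continue  # ignore empty input
        if line == "/quit":
            return "quit"
        try:
            handle(line)
        except Exception as exc:  # e.g. the Ollama server is not running
            print(f"[error] {exc}")


# Scripted stand-in for input(): three lines, then a simulated Ctrl+C.
script = iter(["hello", "", "more"])


def fake_input(prompt: str) -> str:
    try:
        return next(script)
    except StopIteration:
        raise KeyboardInterrupt from None


seen: list[str] = []
reason = run_loop(fake_input, seen.append)
print(reason, seen)  # interrupted ['hello', 'more']
```

In a real session you would call `run_loop(input, handle_turn)`, where `handle_turn` (also hypothetical) sends the line to the model.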

The whole chat mechanism is in the loop above. Streamlit (demo 1), the comparator (demo 2) and the agents (demos 3 and 4) only add a layer on top.

Command          Effect
/clear           Clear conversation history (keeps system prompt)
/system <text>   Replace the system prompt and clear history
/history         Show the exact list of messages currently in context
/quit            Exit (Ctrl+C also works)
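
The effect of /clear and /system on the message list can be sketched as a small pure function. `apply_command` is a hypothetical helper written for this page, and it assumes the system prompt is always the first message of the list.

```python
def apply_command(line: str, messages: list[dict]) -> list[dict]:
    """Return the new message list after a /clear or /system command."""
    command, _, argument = line.partition(" ")
    if command == "/clear":
        # Keep only the system prompt (assumed to be the first message).
        return messages[:1]
    if command == "/system" and argument.strip():
        # New system prompt, empty history.
        return [{"role": "system", "content": argument.strip()}]
    return messages  # unknown command or missing argument: no change


history = [
    {"role": "system", "content": "You are a pedagogical assistant."},
    {"role": "user", "content": "Hi, who are you?"},
    {"role": "assistant", "content": "I am a pedagogical assistant."},
]

cleared = apply_command("/clear", history)
pirate = apply_command("/system You always answer like a pirate.", history)
print(len(cleared), pirate[0]["content"])
```

Both commands reduce to building a new list, which is all that "resetting a conversation" means here.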

/history is the most useful pedagogical tool: it physically reveals the Python list that the code sends to Ollama each turn. Nothing hidden. No framework. Just a list.
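
A /history display of this kind takes only a few lines. The sketch below assumes a layout similar to the sample session (right-aligned role labels, a message count in the header); `format_history` and its `width` parameter are hypothetical and may differ from what chat.py does.

```python
def format_history(messages: list[dict], width: int = 60) -> str:
    """Render the message list, one line per message, long content truncated."""
    lines = [f"--- {len(messages)} messages currently in context ---"]
    for m in messages:
        content = m["content"]
        preview = content if len(content) <= width else content[:width] + "..."
        # Right-align the role in 9 columns so the labels line up.
        lines.append(f"[{m['role']:>9}] {preview}")
    lines.append("-" * 60)
    return "\n".join(lines)


sample = [
    {"role": "system", "content": "You are a pedagogical assistant."},
    {"role": "user", "content": "Hi, who are you?"},
]
rendered = format_history(sample)
print(rendered)
```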

  1. Run python chat.py.
  2. Ask the model: “How are you?”. The model answers.
  3. Ask: “And you?”. The model answers and refers to its previous turn, which shows that the conversation context has been preserved.
  4. Type /history. The terminal displays the five messages (one system, two user, two assistant) currently held in the Python list and sent to the model.
  5. Type /system You always answer like a pirate. — the history is cleared and the role is replaced.
  6. Ask any question: the model now answers in the new role.

This short sequence demonstrates, without modification of the Python code, that:

An agent = a system prompt + a model. A conversation = a list of messages.

These two statements summarise the lesson and remain valid for the rest of the course.
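
Taken literally, the two statements fit in a few lines of Python. The `Agent` class below is a hypothetical container invented for this sketch; nothing in chat.py requires it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """Hypothetical container: an agent is just these two values."""
    model: str
    system_prompt: str

    def new_conversation(self) -> list[dict]:
        # A conversation starts as a list holding only the system message.
        return [{"role": "system", "content": self.system_prompt}]


teacher = Agent("llama3.1:8b", "You are a pedagogical assistant.")
pirate = Agent("llama3.1:8b", "You always answer like a pirate.")

messages = pirate.new_conversation()
messages.append({"role": "user", "content": "How are you?"})
print(messages)
```

Same model, different system prompt: in this sketch that is the entire difference between the two agents.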

  • The model has no memory between two calls: the whole conversation is resent every turn. This is expensive (each turn reprocesses the full history), but it keeps the system stateless and easy to debug.
  • Streaming (stream=True) doesn’t change semantics, only UX: you see tokens arrive.
  • The system_prompt is nothing special — it’s just a message with role: "system" at the top of the list. You can change it any time.
  • The model (llama3.1:8b) is swappable: edit MODEL_NAME in chat.py and rerun. See the model library for 10+ alternatives.
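
To make the cost of resending the whole conversation concrete, this sketch replays the loop without a model and records how many messages each request would contain. The placeholder strings stand in for real content; only the counts matter.

```python
# Replay five turns of the chat loop and record the size of each request.
messages = [{"role": "system", "content": "system prompt"}]
sent_per_turn = []

for turn in range(1, 6):
    messages.append({"role": "user", "content": f"question {turn}"})
    # This is the list that would be passed to client.chat() on this turn.
    sent_per_turn.append(len(messages))
    messages.append({"role": "assistant", "content": f"answer {turn}"})

print(sent_per_turn)       # [2, 4, 6, 8, 10]
print(sum(sent_per_turn))  # 30
```

Each request grows by two messages per turn, so the total work over a conversation grows roughly with the square of the number of turns. /clear is the simple way to reset that.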

You want…                                              Look at…
The same in a browser instead of the terminal          Demo 1 — Streamlit chat
Compare three models on the same prompt in parallel    Demo 2 — 3-way comparator
Give the model tools (read/write/compile code)         Demo 3 — simple CLI agent

  • A conversation = a list of messages. The most important data structure in the entire course.
  • client.chat(model, messages, stream=True) is the single call that runs the model.
  • The system_prompt is just a message with role: "system" — change it and you change the behaviour.
  • Streamlit and the agents are only layers on top of this loop. Understand the loop and the rest becomes obvious.