Source code
Repo: gneuroneai/ollama-demo-0-chat-cli — a single chat.py file (~50 useful lines).

    git clone https://github.com/gneuroneai/ollama-demo-0-chat-cli.git
    cd ollama-demo-0-chat-cli
    .\start.ps1

Duration: 10 min
Prerequisites: chapter 06 (environment installed)
This project is a minimal command-line chat client written in a single Python file of roughly fifty useful lines. It maintains a Python list of messages with the role / content structure and sends that list to ollama.Client.chat() on every turn; the response is streamed back to the terminal token by token. Running the program allows direct observation of the only data structure that ever travels between the user and the model (an ordered list of typed messages) and exposes, through the /history and /system commands, how a conversation is constructed and how the system prompt fixes the assistant’s behaviour. It serves as the reference implementation that every later demo (Streamlit chat, comparator, agents) merely adds layers around.
Before agents, before UIs, before LangChain — talking to a local LLM is just a Python loop that sends a list of messages to ollama.Client.chat() and reads the streamed response. This demo proves exactly that in ~50 useful lines.
You run:

    cd ollama-demo-0-chat-cli
    .\start.ps1

And you get a ChatGPT-style chat in the terminal, 100 % local, streamed token by token:

    ======================================================================
     ollama-demo-0-chat-cli - chat with llama3.1:8b (local, via Ollama)
    ======================================================================
     Commands: /clear  /system <prompt>  /history  /quit
     Ctrl+C or /quit to exit.
    ----------------------------------------------------------------------
    you> Hi, who are you?
    assistant> I am a pedagogical assistant. I can help you with...

    you> Can you code?
    assistant> Yes, I can write code in several languages...

    you> /history
    --- 4 messages currently in context ---
     [   system] You are a pedagogical assistant...
     [     user] Hi, who are you?
     [assistant] I am a pedagogical assistant...
     [     user] Can you code?
    ------------------------------------------------------------

    you> /quit
    Bye.

First launch: ~5 min (model download ~4.7 GB). After that: ~3 seconds.

    flowchart TB
        U["<b>You (terminal)</b><br/>type a question"] -->|"input()"| L["<b>Python loop</b><br/>chat.py"]
        L -->|"messages = [system, user, assistant, ...]"| O["<b>ollama.Client.chat()</b><br/>HTTP to 127.0.0.1:11434"]
        O -->|"streamed chunks"| L
        L -->|"print(token, flush=True)"| U
        L -.->|"append assistant"| L
        classDef user fill:#fde68a,stroke:#c2410c,color:#451a03
        classDef code fill:#dbeafe,stroke:#2563eb
        classDef ext fill:#d1fae5,stroke:#047857
        U:::user
        L:::code
        O:::ext

chat.py literally does this:

    from ollama import Client

    # SYSTEM_PROMPT is a string constant defined elsewhere in chat.py (not shown here).
    client = Client(host="http://127.0.0.1:11434")
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]

    while True:
        user_input = input("you> ")
        messages.append({"role": "user", "content": user_input})

        # Stream the reply: print each token as it arrives and accumulate it.
        full_reply = ""
        for chunk in client.chat(model="llama3.1:8b", messages=messages, stream=True):
            token = chunk["message"]["content"]
            print(token, end="", flush=True)
            full_reply += token

        # Store the full reply so the model sees it again on the next turn.
        messages.append({"role": "assistant", "content": full_reply})

Breakdown:
messages = [{"role": "system", ...}]: the message list is the entire state of the conversation. It lives in a Python variable, not in Ollama. You maintain it.

client.chat(..., stream=True): returns a generator of JSON chunks. You loop, you print, you accumulate.

Both modes (stream=False and stream=True) produce the same final content. The only difference is when, and in what shape, the content reaches your code.

| Mode | Python code | What the user sees |
|---|---|---|
| stream=False | r = client.chat(..., stream=False) | Silence for ~2 to 5 seconds, then the full answer printed in one go. The Python call blocks until the model has finished generating. |
| stream=True | full = "" followed by a loop over client.chat(..., stream=True) | Tokens appear one by one, ChatGPT-style. The first token arrives almost immediately and the rest scrolls in real time. |
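
To make the two rows of the table concrete, here is a minimal sketch of both modes side by side. FakeClient is a hypothetical stand-in (it is not part of the ollama package) that only mimics the response shape used in chat.py, so the example runs without a server. The helper names reply_blocking and reply_streaming are illustrative as well.

```python
class FakeClient:
    """Hypothetical stand-in for ollama.Client so the sketch runs offline."""

    def __init__(self, tokens):
        self.tokens = tokens

    def chat(self, model, messages, stream=False):
        if stream:
            # Streaming mode: one small chunk per token.
            return (
                {"message": {"role": "assistant", "content": t}}
                for t in self.tokens
            )
        # Blocking mode: a single response holding the whole text.
        return {"message": {"role": "assistant", "content": "".join(self.tokens)}}


def reply_blocking(client, model, messages):
    """stream=False: one call, one full string, no loop."""
    r = client.chat(model=model, messages=messages, stream=False)
    return r["message"]["content"]


def reply_streaming(client, model, messages):
    """stream=True: print each token as it arrives and accumulate the text."""
    full = ""
    for chunk in client.chat(model=model, messages=messages, stream=True):
        token = chunk["message"]["content"]
        print(token, end="", flush=True)
        full += token
    print()  # finish the line once the stream ends
    return full
```

Both helpers return the same string for the same client. With a running Ollama server, passing ollama.Client(host="http://127.0.0.1:11434") in place of FakeClient should behave the same way, assuming the response shape shown in chat.py.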
Practical implications:
- stream=False is simpler: one call, one full string, no loop.
- With stream=False you receive a complete tool_calls array. With stream=True you have to accumulate the chunks before deciding whether to execute a tool. This is why demo 3 uses stream=False for its agent loop and demo 0 uses stream=True for chat UX.

messages.append({"role": "assistant", ...}): you append the full reply. Next turn, it goes back to the model. That’s what gives memory.

The remaining ~50 lines only deal with:

- handling /commands cleanly;

The whole chat mechanism is in the loop above. Streamlit (demo 1), the comparator (demo 2) and the agents (demos 3 and 4) only add a layer on top.

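
The memory effect can be observed without a model at all. The sketch below is an illustration, not code from chat.py: CountingClient is a hypothetical stand-in for ollama.Client whose reply simply reports how many messages it received, and run_turn is an assumed helper name for one iteration of the loop.

```python
class CountingClient:
    """Hypothetical stand-in for ollama.Client: the reply states the context size."""

    def chat(self, model, messages, stream=False):
        text = f"I can see {len(messages)} messages."
        return {"message": {"role": "assistant", "content": text}}


def run_turn(client, model, messages, user_input):
    """One turn: append the user message, call the model, append the reply."""
    messages.append({"role": "user", "content": user_input})
    response = client.chat(model=model, messages=messages, stream=False)
    reply = response["message"]["content"]
    # Without this append, the next turn would not contain the reply.
    messages.append({"role": "assistant", "content": reply})
    return reply


messages = [{"role": "system", "content": "You are a helpful assistant."}]
client = CountingClient()

first = run_turn(client, "llama3.1:8b", messages, "Hi")
second = run_turn(client, "llama3.1:8b", messages, "Still there?")

print(first)   # I can see 2 messages.
print(second)  # I can see 4 messages.
```

The second call receives the system prompt, the first exchange, and the new question. The server keeps nothing between calls; the list is the memory.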
| Command | Effect |
|---|---|
| /clear | Clear conversation history (keeps system prompt) |
| /system <text> | Replace the system prompt and clear history |
| /history | Show the exact list of messages currently in context |
| /quit | Exit (Ctrl+C also works) |
/history is the most useful pedagogical tool: it physically reveals the Python list that the code sends to Ollama each turn. Nothing hidden. No framework. Just a list.
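
One possible way to implement the four commands in the table is sketched below. It is an assumption about structure, not the actual code in chat.py: the function name handle_command, its return values, and the usage message are all hypothetical. The only behaviour taken from the table is what each command does to the message list.

```python
def handle_command(line, messages):
    """Apply a slash command to `messages` in place.

    Returns "chat" for ordinary input, "quit" for /quit,
    and "handled" for every other command.
    """
    if not line.startswith("/"):
        return "chat"

    cmd, _, arg = line.partition(" ")
    arg = arg.strip()

    if cmd == "/quit":
        return "quit"
    if cmd == "/clear":
        # Keep only the system prompt, which sits at index 0.
        del messages[1:]
    elif cmd == "/system":
        if arg:
            # Replace the system prompt and drop the history.
            messages[:] = [{"role": "system", "content": arg}]
        else:
            print("Usage: /system <prompt>")
    elif cmd == "/history":
        print(f"--- {len(messages)} messages currently in context ---")
        for m in messages:
            print(f" [{m['role']:>9}] {m['content']}")
    else:
        print(f"Unknown command: {cmd}")
    return "handled"
```

In a chat loop, the caller would send the input to the model only when the function returns "chat" and would break out of the loop on "quit".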
1. Run python chat.py.
2. Type /history. The terminal displays the four messages currently held in the Python list and sent to the model.
3. Type /system You always answer like a pirate. The history is cleared and the role is replaced.

This short sequence demonstrates, without modification of the Python code, that:
An agent = a system prompt + a model. A conversation = a list of messages.
These two statements summarise the lesson and remain valid for the rest of the course.
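
The two statements can be written down almost literally as data. The following is a sketch under assumptions, not part of chat.py: the Agent class and its new_conversation method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    """An agent = a system prompt + a model (hypothetical helper class)."""

    model: str
    system_prompt: str

    def new_conversation(self):
        """A conversation = a list of messages, starting with the system prompt."""
        return [{"role": "system", "content": self.system_prompt}]


teacher = Agent(model="llama3.1:8b", system_prompt="You are a pedagogical assistant.")
pirate = Agent(model="llama3.1:8b", system_prompt="You always answer like a pirate.")

# Same model, different system prompt: two different agents.
conversation = pirate.new_conversation()
conversation.append({"role": "user", "content": "Hi, who are you?"})
```

Nothing here talks to a model. The point is only that the agent's identity and the conversation state are plain values that the program owns.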
- Streaming (stream=True) doesn’t change semantics, only UX: you see tokens arrive.
- system_prompt is nothing special: it’s just a message with role: "system" at the top of the list. You can change it any time.
- The model (llama3.1:8b) is swappable: edit MODEL_NAME in chat.py and rerun. See the model library for 10+ alternatives.

| You want… | Look at… |
|---|---|
| The same in a browser instead of the terminal | Demo 1 — Streamlit chat |
| Compare three models on the same prompt in parallel | Demo 2 — 3-way comparator |
| Give the model tools (read/write/compile code) | Demo 3 — simple CLI agent |

- client.chat(model, messages, stream=True) is the single call that runs the model.
- system_prompt is just a message with role: "system"; change it and you change the behaviour.