Source code
Repo: gneuroneai/ollama-demo-2-comparator — app.py ~15 core lines, the rest is Streamlit glue.
```powershell
git clone https://github.com/gneuroneai/ollama-demo-2-comparator.git
cd ollama-demo-2-comparator
.\start.ps1
```

Duration: 10 min

Prerequisites: demo 1 (single-model Streamlit chat understood)
This project takes the chat loop from demo 1 and runs it three times in parallel: three local models receive the same user prompt and stream their answers side by side into three columns of a Streamlit interface. Each column displays the model name and family, the response in real time, and a short performance summary at the end (elapsed time, estimated number of tokens, throughput in tokens per second). Running the program allows direct comparison of how model size and model family affect the quality, the style and the speed of the response on a single prompt, on the local machine, with no external API call.
Take the Ollama loop from demo 1, multiply it by three in parallel, display it in 3 columns — and make visible in class what is usually only explained on the blackboard: a model’s size and family really change behaviour.
The pattern is simple: three models receive the same prompt (by default `llama3.1:8b`, `llama3.2:3b`, `qwen2.5:3b`).

```text
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ [B] llama3.1:8b     │ [O] llama3.2:3b     │ [G] qwen2.5:3b      │
│ (Meta, 8B)          │ (Meta, 3B)          │ (Alibaba, 3B)       │
├─────────────────────┼─────────────────────┼─────────────────────┤
│ user> Explain...    │ user> Explain...    │ user> Explain...    │
│ assistant> Here...  │ assistant> A...     │ assistant> In...    │
│ (more complete)     │ (shorter+faster)    │ (different style)   │
└─────────────────────┴─────────────────────┴─────────────────────┘
8b      : 4.1 s ≈ 220 tokens (~54 t/s)
3b Meta : 1.8 s ≈ 180 tokens (~99 t/s)
3b Qwen : 2.0 s ≈ 200 tokens (~100 t/s)
```

Three columns give you three comparisons at once:

| Pair | What it reveals |
|---|---|
| 8B vs 3B (same family, Meta) | Size effect — quality vs speed |
| 3B vs 3B (same size, different families) | Family / training effect — qwen more code-oriented than llama at the same size |
| 8B vs 3B (the two extremes: different size and different family) | Polar comparison, both effects combined |
With two columns you can only pick one axis; with three, you cover both at once.
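The three pairings in the table can be enumerated mechanically. The sketch below is only an illustration: the `MODELS` dictionary and the `differing_axes` helper are hypothetical names, and the family and size labels are copied from the column headers of the demo.

```python
from itertools import combinations

# Family and size labels as shown in the column headers of the demo.
MODELS = {
    "llama3.1:8b": ("Meta", "8B"),
    "llama3.2:3b": ("Meta", "3B"),
    "qwen2.5:3b": ("Alibaba", "3B"),
}


def differing_axes(a, b):
    """Return the axes ("family", "size") on which two models differ."""
    family_a, size_a = MODELS[a]
    family_b, size_b = MODELS[b]
    axes = []
    if family_a != family_b:
        axes.append("family")
    if size_a != size_b:
        axes.append("size")
    return axes


# Three models give exactly three unordered pairs.
comparisons = {pair: differing_axes(*pair) for pair in combinations(MODELS, 2)}

for (a, b), axes in comparisons.items():
    print(f"{a} vs {b}: differs in {' and '.join(axes)}")
```

One pair isolates size, one isolates family, and the third changes both, which is why a third column covers both axes.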
```mermaid
flowchart TB
    U["<b>You</b><br/>type a prompt"]
    L["<b>Round-robin loop</b><br/>app.py"]
    M1["<b>llama3.1:8b</b><br/>Meta, 8B"]
    M2["<b>llama3.2:3b</b><br/>Meta, 3B"]
    M3["<b>qwen2.5:3b</b><br/>Alibaba, 3B"]
    C1["<b>Column 1</b><br/>stream + metrics"]
    C2["<b>Column 2</b><br/>stream + metrics"]
    C3["<b>Column 3</b><br/>stream + metrics"]

    U --> L
    L -->|"same prompt"| M1
    L -->|"same prompt"| M2
    L -->|"same prompt"| M3
    M1 -->|"chunks"| C1
    M2 -->|"chunks"| C2
    M3 -->|"chunks"| C3

    classDef user fill:#fde68a,stroke:#c2410c,color:#451a03
    classDef code fill:#dbeafe,stroke:#2563eb
    classDef m1 fill:#bfdbfe,stroke:#1e40af
    classDef m2 fill:#fed7aa,stroke:#c2410c
    classDef m3 fill:#bbf7d0,stroke:#047857
    classDef out fill:#e0e7ff,stroke:#4338ca

    U:::user
    L:::code
    M1:::m1
    M2:::m2
    M3:::m3
    C1:::out
    C2:::out
    C3:::out
```
The heart of app.py:
```python
# client, pairs and placeholders are defined elsewhere in app.py.
# pairs = [(model_0, msgs_0), (model_1, msgs_1), (model_2, msgs_2)]
gens = [
    iter(client.chat(model=m, messages=msgs, stream=True))
    for m, msgs in pairs
]
bufs = [""] * 3       # text accumulated per column
done = [False] * 3    # True once a model's stream is exhausted

# Round-robin: pull one chunk from each unfinished stream per pass.
while not all(done):
    for i in range(3):
        if done[i]:
            continue
        try:
            chunk = next(gens[i])
            bufs[i] += chunk["message"]["content"]
            placeholders[i].markdown(bufs[i] + " ▌")  # trailing cursor while streaming
        except StopIteration:
            done[i] = True
            placeholders[i].markdown(bufs[i])  # final render without the cursor
```

Note the `done[i]` flag per column: when one model finishes, its column freezes; the others keep going.

The rest of app.py is Streamlit glue: sidebar, histories, metric computation.
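To try the round-robin pattern and the end-of-stream metrics without Ollama or Streamlit, here is a minimal, self-contained sketch. It is not the code from app.py. `fake_stream`, `round_robin` and `summarize` are hypothetical names. The chunk shape mirrors the `chunk["message"]["content"]` access shown in the loop. The token estimate assumes roughly 4 characters per token, which is only a rule of thumb; the actual estimator in app.py may differ.

```python
import time


def fake_stream(words):
    """Stand-in for a streaming chat call: yields Ollama-shaped chunks."""
    for word in words:
        yield {"message": {"content": word + " "}}


def round_robin(streams):
    """Drain several chunk iterators one chunk at a time, in turn.

    Returns the accumulated text and the elapsed seconds for each stream.
    """
    gens = [iter(s) for s in streams]
    bufs = [""] * len(gens)
    elapsed = [0.0] * len(gens)
    done = [False] * len(gens)
    start = time.perf_counter()
    while not all(done):
        for i, gen in enumerate(gens):
            if done[i]:
                continue  # this stream already finished; skip it
            try:
                chunk = next(gen)
            except StopIteration:
                done[i] = True
                elapsed[i] = time.perf_counter() - start
                continue
            bufs[i] += chunk["message"]["content"]
    return bufs, elapsed


def summarize(text, seconds, chars_per_token=4):
    """Estimate token count and tokens per second (assumed ~4 chars per token)."""
    tokens = max(1, round(len(text) / chars_per_token))
    rate = tokens / seconds if seconds > 0 else float("inf")
    return tokens, rate


streams = [
    fake_stream("a long and detailed answer".split()),
    fake_stream("short answer".split()),
    fake_stream("another style".split()),
]
bufs, elapsed = round_robin(streams)
for text, secs in zip(bufs, elapsed):
    tokens, rate = summarize(text, secs)
    print(f"{secs:.4f} s ≈ {tokens} tokens (~{rate:.0f} t/s)")
```

Each stream's elapsed time is recorded when it raises `StopIteration`, so a short answer gets its own, smaller duration even though the loop keeps serving the longer streams.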
```powershell
cd ollama-demo-2-comparator
.\start.ps1              # first run, port 8503; downloads the 3 models if missing
.\start.ps1 -SkipPull    # download nothing; use what's already local
.\start.ps1 -Port 8600   # different port
```

First run (downloads the 3 models): ~5-10 min, ~9 GB total. Then nearly instant.
Running three models in parallel needs more RAM than the single-model demo. Ollama loads each model into RAM on demand, so in principle all three need not be resident at once; because this demo queries them in parallel, in practice they are.
| Combo (3 models) | Peak RAM | Recommendation |
|---|---|---|
| `llama3.2:3b` + `qwen2.5:3b` + `gemma2:2b` | ~6 GB | OK on 8 GB RAM |
| `llama3.1:8b` + `llama3.2:3b` + `qwen2.5:3b` | ~9 GB | Recommended (default), OK on 16 GB RAM |
| `llama3.1:8b` + `qwen2.5:7b` + `mistral:7b` | ~16 GB | Tight on 16 GB, GPU strongly recommended |
| `llama3.1:8b` + `qwen2.5-coder:14b` + `qwen2.5:7b` | ~25 GB | 32 GB RAM+, or dedicated GPU |
If your machine struggles, fall back to 3B models. The pedagogical point still lands.
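The table can also be read as a simple lookup. The following sketch is an illustration only: `COMBOS` copies the approximate peak figures from the table, while `pick_combo` and its 2 GB headroom for the operating system and the app are assumptions, not values taken from the repository.

```python
# (models, approximate peak RAM in GB), copied from the table above.
COMBOS = [
    (("llama3.2:3b", "qwen2.5:3b", "gemma2:2b"), 6),
    (("llama3.1:8b", "llama3.2:3b", "qwen2.5:3b"), 9),
    (("llama3.1:8b", "qwen2.5:7b", "mistral:7b"), 16),
    (("llama3.1:8b", "qwen2.5-coder:14b", "qwen2.5:7b"), 25),
]


def pick_combo(ram_gb, headroom_gb=2):
    """Return the heaviest combo whose peak RAM plus headroom fits, or None.

    headroom_gb is an assumed margin for the OS and the app itself.
    """
    fitting = [combo for combo in COMBOS if combo[1] + headroom_gb <= ram_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda combo: combo[1])[0]


for ram in (4, 8, 16, 32):
    print(f"{ram} GB RAM -> {pick_combo(ram)}")
```

With this margin the 16 GB machine lands on the default combo rather than the ~16 GB one, which matches the "tight on 16 GB" remark in the table.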
| Prompt | What you see |
|---|---|
| “Explain recursion in 3 sentences.” | 8B more complete, 3Bs faster; qwen vs llama 3B have different styles |
| “What is the capital of Bhutan?” | Factual test: start easy, then try lesser-known countries |
| “Write a Python function that reverses a string.” | qwen2.5 (more code-oriented) tends to produce cleaner code than llama3.2:3b at the same size |
| “Continue this story: Once upon a time, a dragon…” | Narrative coherence: 8B holds the thread better over 200 words |
| “How much is 217 × 33?” | Mental math: all three may hallucinate, the 8B is usually less wrong |
| “Answer only in French. How much is 5+5?” | Instruction-following: which one switches back to English first? |
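For the mental-math prompt it helps to have the ground truth at hand: 217 × 33 = 217 × 30 + 217 × 3 = 6510 + 651 = 7161. The sketch below checks a model's answer against that value. `mentions_correct_product` is a hypothetical helper, not part of app.py, and it only handles plain digits and comma thousands separators.

```python
import re

EXPECTED = 217 * 33  # 6510 + 651 = 7161


def mentions_correct_product(answer, expected=EXPECTED):
    """True if any integer in the answer equals the expected product.

    Commas are dropped first so that "7,161" is read as 7161.
    """
    cleaned = answer.replace(",", "")
    return any(int(n) == expected for n in re.findall(r"\d+", cleaned))


print(mentions_correct_product("217 × 33 = 7,161"))
print(mentions_correct_product("217 × 33 is about 7,061"))
```

Pasting each column's answer into the helper gives a quick right-or-wrong verdict per model instead of checking by eye.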
- `qwen2.5:3b` (more code-friendly training) tends to do better than `llama3.2:3b` at the same size.

“The bigger the model, the more it knows, but the slower and hungrier it is. At equal size, the training family changes behaviour (qwen more code-oriented than llama 3B). The right choice depends on the task. That’s why there’s a dedicated chapter on model choice” (see Choosing a local model).

| You want… | Look at… |
|---|---|
| The single-model chat in the terminal (simpler) | Demo 0 — CLI chat |
| The single-model chat in the browser | Demo 1 — Streamlit chat |
| Understand how to add tools to a model | Demo 3 — CLI agent |
| Three collaborating agents in a richer UI | Demo 4 — three agents |