Demo 2 — 3-model side-by-side comparator

Duration: 10 min
Prerequisites: demo 1 (single-model Streamlit chat understood)

Source code

Repo: gneuroneai/ollama-demo-2-comparator. app.py has about 15 core lines; the rest is Streamlit glue.

git clone https://github.com/gneuroneai/ollama-demo-2-comparator.git
cd ollama-demo-2-comparator
.\start.ps1

What this demo is about

This project takes the chat loop from demo 1 and runs it three times in parallel: three local models receive the same user prompt and stream their answers side by side into three columns of a Streamlit interface. Each column displays the model name and family, the response in real time, and a short performance summary at the end (elapsed time, estimated number of tokens, throughput in tokens per second). Running the program allows direct comparison of how model size and model family affect the quality, the style and the speed of the response on a single prompt, on the local machine, with no external API call.

Key idea

Take the Ollama loop from demo 1, multiply it by three in parallel, display it in 3 columns — and make visible in class what is usually only explained on the blackboard: a model’s size and family really change behaviour.

What the demo does

The pattern is simple:

Pick three models in the sidebar (default: llama3.1:8b, llama3.2:3b, qwen2.5:3b).
Type a prompt at the bottom.
The code sends your prompt to all three models at once.
The three answers stream into three side-by-side columns.
At the end, each column shows: response time, estimated tokens, tokens/second.

┌─────────────────────┬─────────────────────┬─────────────────────┐
│  [B] llama3.1:8b    │  [O] llama3.2:3b    │  [G] qwen2.5:3b     │
│  (Meta, 8B)         │  (Meta, 3B)         │  (Alibaba, 3B)      │
├─────────────────────┼─────────────────────┼─────────────────────┤
│  user> Explain...   │  user> Explain...   │  user> Explain...   │
│  assistant> Here... │  assistant> A...    │  assistant> In...   │
│  (more complete)    │  (shorter+faster)   │  (different style)  │
└─────────────────────┴─────────────────────┴─────────────────────┘
8b       : 4.1 s ≈ 220 tokens (~54 t/s)
3b Meta  : 1.8 s ≈ 180 tokens (~100 t/s)
3b Qwen  : 2.0 s ≈ 200 tokens (~100 t/s)
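
The summary lines above come down to one division: estimated tokens over elapsed seconds. Here is a minimal sketch of how such a line could be produced. The helper names `estimate_tokens` and `summarize` are hypothetical, and the four-characters-per-token ratio is an assumed rule of thumb; the heuristic actually used in app.py is not shown in this page.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-chars-per-token ratio is an assumption."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def summarize(elapsed_s: float, tokens: int) -> str:
    """Format elapsed time, token estimate and throughput on one line."""
    rate = tokens / elapsed_s if elapsed_s > 0 else 0.0
    return f"{elapsed_s:.1f} s ≈ {tokens} tokens (~{rate:.0f} t/s)"


# Figures taken from the 8B line of the example above.
print(summarize(4.1, 220))
```

In a real run, `elapsed_s` would typically be measured with `time.perf_counter()` around the streaming loop, and the estimate would be taken on the final buffer of each column.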

Why three and not two

Three columns give you three comparisons at once:

Pair	What it reveals
8B vs 3B (same family, Meta)	Size effect — quality vs speed
3B vs 3B (same size, different families)	Family / training effect — qwen more code-oriented than llama at the same size
8B Meta vs 3B Qwen (different size and family)	Polar comparison: both effects combined

With two columns you can only pick one axis; with three, you cover both at once.
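
To make the counting explicit, the sketch below enumerates every pair among the default models and labels which axis differs. The `MODELS` dictionary and the `describe_pairs` helper are illustrative names, not part of the repository.

```python
from itertools import combinations

# Default models from the demo, tagged with (family, size in billions).
MODELS = {
    "llama3.1:8b": ("Meta", 8),
    "llama3.2:3b": ("Meta", 3),
    "qwen2.5:3b": ("Alibaba", 3),
}


def describe_pairs(models: dict) -> list[tuple[str, str, str]]:
    """Label every pair of models by which axis differs between them."""
    rows = []
    for (a, (fam_a, size_a)), (b, (fam_b, size_b)) in combinations(models.items(), 2):
        differs = []
        if size_a != size_b:
            differs.append("size")
        if fam_a != fam_b:
            differs.append("family")
        rows.append((a, b, " + ".join(differs) or "nothing"))
    return rows


for a, b, axis in describe_pairs(MODELS):
    print(f"{a} vs {b}: {axis}")
```

Three models yield three pairs, one per row of the table; dropping any one model leaves a single pair and therefore a single axis.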

Architecture in a diagram

flowchart TB
  U["<b>You</b><br/>type a prompt"]
  L["<b>Round-robin loop</b><br/>app.py"]
  M1["<b>llama3.1:8b</b><br/>Meta, 8B"]
  M2["<b>llama3.2:3b</b><br/>Meta, 3B"]
  M3["<b>qwen2.5:3b</b><br/>Alibaba, 3B"]
  C1["<b>Column 1</b><br/>stream + metrics"]
  C2["<b>Column 2</b><br/>stream + metrics"]
  C3["<b>Column 3</b><br/>stream + metrics"]

  U --> L
  L -->|"same prompt"| M1
  L -->|"same prompt"| M2
  L -->|"same prompt"| M3
  M1 -->|"chunks"| C1
  M2 -->|"chunks"| C2
  M3 -->|"chunks"| C3
  classDef user fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef code fill:#dbeafe,stroke:#2563eb
  classDef m1 fill:#bfdbfe,stroke:#1e40af
  classDef m2 fill:#fed7aa,stroke:#c2410c
  classDef m3 fill:#bbf7d0,stroke:#047857
  classDef out fill:#e0e7ff,stroke:#4338ca
  U:::user
  L:::code
  M1:::m1
  M2:::m2
  M3:::m3
  C1:::out
  C2:::out
  C3:::out

One prompt in, three streams out in parallel. Ollama serves all three requests simultaneously.

The core of the code in 15 lines

The heart of app.py:

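# client, pairs and placeholders are assumed to be defined elsewhere in app.py.
# One streaming generator per model, all fed the same user prompt.
gens = [
    iter(client.chat(model=m, messages=msgs, stream=True))
    for m, msgs in pairs        # pairs = [(model_0, msgs_0), (model_1, msgs_1), (model_2, msgs_2)]
]
bufs = [""] * 3                 # text accumulated so far, per column
done = [False] * 3              # True once a model's stream is exhausted

while not all(done):
    for i in range(3):          # round-robin: one chunk from each live stream
        if done[i]:
            continue
        try:
            chunk = next(gens[i])
            bufs[i] += chunk["message"]["content"]
            placeholders[i].markdown(bufs[i] + " ▌")   # trailing cursor while streaming
        except StopIteration:
            done[i] = True
            placeholders[i].markdown(bufs[i])          # final render, cursor removed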

Three things to notice:

Three HTTP requests open at the same time. Ollama serves them in parallel — that’s what makes the comparison fair.
Round-robin over chunks. We read a chunk from M1, then M2, then M3, then back to M1, etc. The three columns grow together.
done[i] per column. When one model finishes, its column freezes; the others keep going.

The rest of app.py is Streamlit glue: sidebar, histories, metric computation.
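
The round-robin read can be tried without Ollama or Streamlit. The sketch below replaces the three model streams with a hypothetical `fake_stream` generator that yields chunks in the same `{"message": {"content": ...}}` shape, and records the order in which streams are read. It illustrates the pattern only; it is not the code from app.py.

```python
from typing import Iterable, Iterator


def fake_stream(words: str) -> Iterator[dict]:
    """Stand-in for a streaming chat call: yields one chunk per word."""
    for word in words.split():
        yield {"message": {"content": word + " "}}


def round_robin(streams: Iterable[Iterator[dict]]) -> tuple[list[str], list[int]]:
    """Read one chunk per stream in turn; return buffers and read order."""
    gens = list(streams)
    bufs = [""] * len(gens)
    done = [False] * len(gens)
    order = []                      # index of the stream behind each chunk
    while not all(done):
        for i, gen in enumerate(gens):
            if done[i]:
                continue
            try:
                chunk = next(gen)
            except StopIteration:
                done[i] = True      # this column is finished; others go on
                continue
            bufs[i] += chunk["message"]["content"]
            order.append(i)
    return bufs, order


bufs, order = round_robin([
    fake_stream("a long detailed answer"),
    fake_stream("short"),
    fake_stream("medium answer"),
])
print(order)
```

The recorded order shows the three streams advancing together at first, then the shorter ones dropping out while the longest keeps going. Note that `next()` blocks until a chunk arrives, so with real models a slow stream delays the display of the others within each round.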

Launch and ports

cd ollama-demo-2-comparator
.\start.ps1                       # first run, port 8503; downloads the 3 models if missing
.\start.ps1 -SkipPull             # download nothing; use what's already local
.\start.ps1 -Port 8600            # different port

First run (downloads the 3 models): ~5-10 min, ~9 GB total. Then nearly instant.

Hardware requirements

Running three models in parallel needs more RAM than a single-model demo. Ollama loads each model into RAM on demand, so the three are not necessarily resident at all times; because this demo queries them in parallel, in practice all three are loaded together.

Combo (3 models)	Peak RAM	Recommendation
`llama3.2:3b` + `qwen2.5:3b` + `gemma2:2b`	~6 GB	OK on 8 GB RAM
`llama3.1:8b` + `llama3.2:3b` + `qwen2.5:3b`	~9 GB	Recommended (default), OK on 16 GB RAM
`llama3.1:8b` + `qwen2.5:7b` + `mistral:7b`	~16 GB	Tight on 16 GB, GPU strongly recommended
`llama3.1:8b` + `qwen2.5-coder:14b` + `qwen2.5:7b`	~25 GB	32 GB RAM+, or dedicated GPU

If your machine struggles, fall back to 3B models. The pedagogical point still lands.
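
As a rough aid for choosing a combo, the sketch below encodes the table and returns the heaviest combination that fits a given amount of RAM. The `largest_fitting_combo` helper and its 2 GB headroom for the operating system and Streamlit are assumptions for illustration; the peak figures are the approximate ones from the table.

```python
# Peak RAM figures copied from the table above (approximate, in GB).
COMBOS = [
    (("llama3.2:3b", "qwen2.5:3b", "gemma2:2b"), 6),
    (("llama3.1:8b", "llama3.2:3b", "qwen2.5:3b"), 9),
    (("llama3.1:8b", "qwen2.5:7b", "mistral:7b"), 16),
    (("llama3.1:8b", "qwen2.5-coder:14b", "qwen2.5:7b"), 25),
]


def largest_fitting_combo(total_ram_gb: float, headroom_gb: float = 2.0):
    """Return the heaviest (models, peak_gb) combo that fits, or None.

    headroom_gb is an assumed reserve for the OS and the Streamlit process.
    """
    budget = total_ram_gb - headroom_gb
    fitting = [combo for combo in COMBOS if combo[1] <= budget]
    return max(fitting, key=lambda combo: combo[1], default=None)


for ram in (8, 16, 32):
    print(ram, "GB ->", largest_fitting_combo(ram))
```

With the assumed headroom, a 16 GB machine is steered to the default combo, not the 7B/8B trio, which matches the "tight on 16 GB" note in the table.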

What to compare? Six revealing prompts

Prompt	What you see
“Explain recursion in 3 sentences.”	8B more complete, 3Bs faster; qwen vs llama 3B have different styles
“What is the capital of Bhutan?”	Factual test — start easy, then try lesser-known countries
“Write a Python function that reverses a string.”	`qwen2.5` (more code-oriented) tends to produce cleaner code than `llama3.2:3b` at the same size
“Continue this story: Once upon a time, a dragon…”	Narrative coherence — 8B holds the thread better over 200 words
“How much is 217 × 33?”	Mental math: all three may hallucinate, the 8B is usually less wrong
“Answer only in French. How much is 5+5?”	Instruction-following — which one switches back to English first?
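
For the mental-math prompt, the reference value is 217 × 33 = 217 × 30 + 217 × 3 = 6510 + 651 = 7161. A small sketch for checking whether a column's answer contains that number follows. `mentions_correct_product` is a hypothetical helper; it only looks for the number in the text and says nothing about the reasoning around it.

```python
import re


def mentions_correct_product(answer: str, a: int = 217, b: int = 33) -> bool:
    """True if the answer text contains a*b, written as 7161 or 7,161."""
    expected = a * b
    # Match comma-grouped numbers first, then plain digit runs.
    for candidate in re.findall(r"\d{1,3}(?:,\d{3})+|\d+", answer):
        if int(candidate.replace(",", "")) == expected:
            return True
    return False


print(mentions_correct_product("217 × 33 = 7161"))
print(mentions_correct_product("That would be about 7061."))
```

The operands echoed back by the model (217 and 33) do not trigger a false positive, because each extracted number is compared as a whole against the product.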

Guided walk-through

Launch the demo with the three default models.
Ask a simple question: “Hello, how are you?”. All three answer roughly the same. The 3Bs finish visibly faster.
Ask a medium question: “Give me 3 algorithms to detect duplicates in a list.”. The 8B is more structured; qwen2.5:3b (more code-friendly training) tends to do better than llama3.2:3b at the same size.
Ask a hard question: “Explain why a spring obeys Hooke’s law in terms of atomic structure.”. The 8B stays coherent longer; the 3Bs collapse around the middle.
The closing line:

“The bigger the model, the more it knows, but the slower and hungrier it is. At equal size, the training family changes behaviour (qwen more code-oriented than llama 3B). The right choice depends on the task. That’s why there’s a dedicated chapter on model choice.” See Choosing a local model.

Going further

You want…	Look at…
The single-model chat in the terminal (simpler)	Demo 0 — CLI chat
The single-model chat in the browser	Demo 1 — Streamlit chat
Understand how to add tools to a model	Demo 3 — CLI agent
Three collaborating agents in a richer UI	Demo 4 — three agents

Key takeaways

One prompt, three models, three parallel streams. Ollama multiplexes on its own.
Three columns give you size effect + family effect in a single experiment.
At equal size, the training family matters (qwen vs llama vs gemma).
Within a family, size buys reasoning depth at the cost of generation speed.
This demo is the bridge to the chapter Choosing a local model.