Local vs commercial models

Duration: 8 min Prerequisites: chapter 04b (ideally) or chapter 04

Key idea

A commercial model (ChatGPT, Claude, Grok) runs on a company’s servers — you talk to it over the internet and pay per token. A local model (Llama, Qwen, Mistral) runs on your machine — you download the weights once, and you never pay nor depend on the internet again. For this course, we stay local.

The table that sums it up

	Local model	Commercial model
Where compute happens	On your machine (CPU/GPU)	On the provider’s servers
How you access it	SDK + Ollama server (`localhost:11434`)	HTTPS API with a secret key
Cost	Free after download	Pay per token (input + output)
Internet required	No (except initial `pull`)	Yes, otherwise nothing works
Data sent out	None	Everything you send
Speed	Depends on your hardware	Very fast (huge GPUs server-side)
Raw quality	Good (8B-14B locally)	Excellent (200B+ models, big GPUs)
Updates	When you `ollama pull`	Automatic, sometimes silent
Reproducibility	Stable (same weights forever)	API may change tomorrow
Audit / weight inspection	Possible (open source)	Impossible (black box)
Examples	Llama 3.1, Qwen2.5, Mistral, Gemma	GPT-5, Claude 4.7, Gemini 2.5, Grok

Model size: why 2 GB, 8 GB, or 500 GB?

This is the question that surprises everyone. The answer is two words: parameters and quantization.

Parameters = the weights of the network

A language model is a big neural network. Each connection between neurons is a parameter (a number). When we say llama3.1:8b, the 8b means 8 billion parameters. The more there are, the “smarter” the model (roughly), the bigger on disk and the more RAM it needs to run.

Model	Parameters	Disk size (FP16 → Q4)	Runnable on your PC?	Notes
`llama3.2:3b` (Meta)	3 B	~6 GB → ~2 GB	Yes (4 GB RAM, no GPU)	open weights
`llama3.1:8b` (Meta)	8 B	~16 GB → ~5 GB	Yes (8 GB RAM, no GPU)	← used in this course
`qwen2.5-coder:14b` (Alibaba)	14 B	~28 GB → ~9 GB	Yes (12 GB RAM or 12 GB GPU)	open weights
`llama3.1:70b` (Meta)	70 B	~140 GB → ~40 GB	No — GPU farm (2-4 H100)	open weights
Grok-1 (xAI, March 2024)	314 B (MoE, 86 active)	~318 GB	No — multi-GPU 8×80 GB	open Apache 2.0
Grok-2.5 (xAI, Aug 2025)	n/a (FP8 shipped)	~500 GB	No — 8 × 40 GB VRAM GPUs	open custom licence
`gemma2:9b` (Google)	9 B	~18 GB → ~6 GB	Yes (8 GB RAM)	open weights
GPT-4o / GPT-5 (OpenAI)	not published*	n/a	API only	closed
Claude 4.7 (Anthropic)	not published*	n/a	API only	closed
Gemini 2.5 Pro (Google)	not published*	n/a	API only	closed

* Sizes not officially published; press estimates range from a few hundred billion to over a trillion.

Legend for “Runnable on your PC?”:

Yes: open weights and small enough for a standard PC. ollama pull and it works.
No (with weights): open weights but too big — you need several professional GPUs (RTX A6000, H100…). Technically possible, not at home.
No (closed): closed weights — only the hosted API is available. No legal way to download.

For the demo: we use llama3.1:8b (~5 GB on disk, ~6 GB in RAM at runtime). This is the only model you need to download to run both demos in the repo. The other rows are there just to give orders of magnitude and clarify who can run what on what hardware.

Quantization = the model’s diet

Originally, each parameter is a 16-bit floating-point number (FP16) — 2 bytes in memory. For 8 billion parameters that’s 8 × 10⁹ × 2 = 16 GB. Too much for a standard PC’s RAM.

Quantization means storing each parameter on 4 bits instead of 16 (Q4) — the same information with slightly less precision. The model gets ~4× smaller and quality barely drops. That’s why llama3.1:8b is only ~5 GB, not 16 GB, when you ollama pull:

ollama pull llama3.1:8b
# pulling manifest...
# pulling 8 layers... 4.9 GB

Ollama ships a Q4_K_M version by default (a balanced quantization scheme). You can choose other levels (:Q8, :fp16) if you have the RAM or want maximum quality.

Concrete examples: the same model under different quantization levels

The table below shows the same llama3.1:8b model under the main quantization levels available on Ollama. The “8b” stays constant (8 billion parameters); only the number of bits used per parameter changes.

Level	Bits / parameter	File size on disk	RAM at runtime	Quality vs full precision	Typical use
FP16	16	~16 GB	~17 GB	Reference (100%)	Research, when an RTX 4090 or 1× H100 is available
Q8_0	8	~8 GB	~9 GB	~99% (loss not detectable in conversation)	Workstations with 16 GB+ RAM, demanding tasks
Q4_K_M (Ollama default)	~4.5	~5 GB	~6 GB	~96–98% (a few percentage points on benchmarks)	Course default — fits standard laptops
Q4_0	4	~4.5 GB	~5 GB	~95% (slight quality drop)	Constrained environments, edge devices
Q2_K	~2.5	~3 GB	~3.5 GB	~85–90% (noticeable degradation)	Last resort, low-end hardware. Not recommended for code generation.

ollama pull llama3.1:8b           # Q4_K_M default (~5 GB)
ollama pull llama3.1:8b-q8_0      # Q8 (~8 GB)
ollama pull llama3.1:8b-fp16      # FP16 (~16 GB, needs GPU)

Two practical takeaways:

Q4_K_M is the sweet spot for personal hardware: 3× smaller than FP16, 2-3 percentage points lost on most benchmarks, runs without a GPU. This is why every demo in this course relies on it.
Below Q4, quality drops visibly, especially on tasks requiring precision (code generation, mathematical reasoning, structured tool calls). For demos 3 and 4 (Java code generation), Q4_K_M is the floor — Q2_K would generate broken code more often than not.

Why Grok, GPT-4 and Claude are gigantic

Top-tier models run on GPU farms: dozens (or hundreds) of A100/H100 cards connected by ultra-fast networking. A single NVIDIA H100 costs ~$30 000, and you typically need 8 to 64 of them to serve a single large model.

Grok case (good for culture): xAI open-sourced the weights of several versions. Grok-1 (March 2024, 314 billion parameters, ~318 GB) under Apache 2.0, and Grok-2.5 (August 2025, ~500 GB in FP8) under a Community licence. So you can download Grok legally — but running it needs 8 GPUs with 40 GB of VRAM each, around $250 000 of hardware. Open source doesn’t mean runnable at home.

GPT-4 / Claude / Gemini case: weights never published. You can only use them via the paid API. No local version is possible, legally or illegally.

You cannot have a GPU farm at home. But here’s the good news: for 95 % of the tasks of a course or a prototype, a well-fine-tuned 7B-14B local model gets the job done. And for the remaining 5 %, you probably shouldn’t be doing it in production with an LLM anyway.

Machine requirements — do you need a GPU?

Short answer: no, a GPU is not required for this course. llama3.1:8b runs on a standard PC without a dedicated graphics card. With a GPU it’s faster, that’s all.

Minimum config (no GPU)

Resource	Minimum	Recommended	Why
RAM	8 GB	16 GB	`llama3.1:8b` quantized Q4 takes ~5–6 GB in memory while running
Free disk	8 GB	20 GB	5 GB for the model + 1 GB Ollama + 1 GB Python/venv + 200 MB JDK + headroom
CPU	x86_64, 4 recent cores	8 cores (Ryzen 5/7, Intel i5/i7 recent)	CPU generation = ~5 to 15 tokens/s on 8B Q4
OS	Windows 10/11, macOS 12+, recent Linux	Windows 11 + PowerShell 7	tested on Windows
GPU	none	NVIDIA 8 GB VRAM or Apple Silicon	speeds up ×3 to ×10 depending on the card

With a GPU (optional but nice)

If you have a recent NVIDIA card (RTX 3060 or newer) with at least 8 GB of VRAM, Ollama automatically loads the model onto it. You have nothing to configure: Ollama detects the GPU at startup and uses it if there’s room.

GPU type	Ollama support	Does `llama3.1:8b` fit?	Typical speed
NVIDIA 8 GB VRAM (RTX 3050/3060/4060)	native (CUDA)	yes	30–60 tokens/s
NVIDIA 12+ GB VRAM (RTX 3060 12G, 4070+)	native (CUDA)	yes, plenty of room	60–100 tokens/s
Apple Silicon M1/M2/M3	native (Metal)	yes (unified memory)	20–50 tokens/s
Recent AMD Radeon (RX 6000+)	partial (ROCm on Linux)	yes	variable
Integrated Intel/AMD GPU (no dedicated card)	not used by Ollama	n/a	falls back to CPU
No GPU	n/a	runs on CPU	5–15 tokens/s

Check what you have

# Available RAM
Get-CimInstance Win32_ComputerSystem | Select-Object @{N="RAM_GB";E={[math]::Round($_.TotalPhysicalMemory/1GB, 1)}}

# Free disk on C:
Get-PSDrive C | Select-Object @{N="Free_GB";E={[math]::Round($_.Free/1GB, 1)}}

# NVIDIA GPU detected?
nvidia-smi  # if the command exists, you have a working NVIDIA GPU

# Check Ollama sees your GPU
ollama ps  # after a call, shows which device is active

If you only have 4 GB of RAM

llama3.1:8b won’t fit. Switch to llama3.2:3b:

ollama pull llama3.2:3b

And change MODEL_NAME = "llama3.2:3b" in ollama-demo-4-trio-agents-java/agent.py. Quality drops a bit, tool calling stays decent, and it runs. See chapter 05b for the quality/RAM trade-off.

For the classroom

For a demo in front of 30 students, the ideal setup is:

you (the teacher): a PC with an NVIDIA 8 GB VRAM GPU or Apple Silicon → you see generation in real time, it’s more impressive;
students: 8 GB of RAM is enough, no GPU needed. If some laptops only have 4 GB, they switch to llama3.2:3b.

All hardware tested during development: Windows 11 laptops, 16 GB RAM, no GPU. It runs. It’s slow (~10 s to generate a Java class), but that slowness is exactly what makes the demo readable — you have time to point at each tool call on screen.

Why we choose local for this course

For a classroom demo, local wins on five dimensions:

1. Zero cost

A single licence: your initial download time. No token cap, no surprise bill. A class of 30 students can each run the demo at no extra cost.

2. Privacy

You can show the demo with the school’s proprietary code without it leaving the machine. No snippet ends up in OpenAI’s or xAI’s logs. GDPR: no international transfer of data.

3. Works offline

Wifi-less classroom, conference on the metro, demo on a plane: it works. You’re never blocked by “API rate limit exceeded”.

4. Pedagogically transparent

You can open ollama-demo-3-agent-java/agent_java.py, point at client = Client(host="http://127.0.0.1:11434"), and tell the student: “look, everything goes through localhost, the model lives in ~/.ollama/models/. There’s the demystification.” With a remote API, everything is in the cloud, untouchable.

5. Reproducible over time

If you re-run the course in 3 years, llama3.1:8b will answer exactly the same as today (identical weights, deterministic at temperature=0). A commercial API changes behind your back: a prompt that worked yesterday may break tomorrow without notice.

When to switch to a commercial model

Local isn’t the answer to everything. You might prefer a hosted model if:

you need all the quality available (production, real product);
you have no GPU and want the speed of an H100;
you’re building a multi-user service where latency and throughput matter;
you want advanced features open source doesn’t have yet (very fine multimodal vision, long reasoning, etc.);
your company already has a cloud-provider contract.

The 2026 best practice: prototype locally, deploy commercial if needed. That’s what this course does — learn everything locally, and you’ll know what to do if you ever need to migrate.

Key takeaways

Model used in this course: llama3.1:8b (~5 GB disk, ~6 GB RAM, no GPU needed).
Local model = weights on your machine, free, offline, private.
Commercial model = remote API, pay per token, internet required, higher quality but opaque.
Size (2 GB vs 5 GB vs 9 GB vs 40 GB) depends on number of parameters × precision (FP16 → Q4 divides by ~4).
Ollama ships Q4-quantized versions by default: llama3.1:8b is ~5 GB instead of 16 GB, with no noticeable loss.
Open source ≠ runnable at home. Grok-1 (March 2024, ~318 GB) and Grok-2.5 (August 2025, ~500 GB) are open but need 8 pro GPUs. GPT-4 / Claude / Gemini stay closed (API only).
Minimum requirements: 8 GB RAM, 8 GB disk, recent x86_64 CPU. NVIDIA 8+ GB VRAM GPU = bonus (×3 to ×10 faster).
For this course: we stay local. Cost, privacy, offline, pedagogy, reproducibility.