Local vs commercial models
Duration: 8 min Prerequisites: chapter 04b (ideally) or chapter 04
Key idea
Section titled “Key idea”A commercial model (ChatGPT, Claude, Grok) runs on a company’s servers — you talk to it over the internet and pay per token. A local model (Llama, Qwen, Mistral) runs on your machine — you download the weights once, and you never pay nor depend on the internet again. For this course, we stay local.
The table that sums it up
Section titled “The table that sums it up”| Local model | Commercial model | |
|---|---|---|
| Where compute happens | On your machine (CPU/GPU) | On the provider’s servers |
| How you access it | SDK + Ollama server (localhost:11434) | HTTPS API with a secret key |
| Cost | Free after download | Pay per token (input + output) |
| Internet required | No (except initial pull) | Yes, otherwise nothing works |
| Data sent out | None | Everything you send |
| Speed | Depends on your hardware | Very fast (huge GPUs server-side) |
| Raw quality | Good (8B-14B locally) | Excellent (200B+ models, big GPUs) |
| Updates | When you ollama pull | Automatic, sometimes silent |
| Reproducibility | Stable (same weights forever) | API may change tomorrow |
| Audit / weight inspection | Possible (open source) | Impossible (black box) |
| Examples | Llama 3.1, Qwen2.5, Mistral, Gemma | GPT-5, Claude 4.7, Gemini 2.5, Grok |
Model size: why 2 GB, 8 GB, or 500 GB?
Section titled “Model size: why 2 GB, 8 GB, or 500 GB?”This is the question that surprises everyone. The answer is two words: parameters and quantization.
Parameters = the weights of the network
Section titled “Parameters = the weights of the network”A language model is a big neural network. Each connection between neurons is a parameter (a number). When we say llama3.1:8b, the 8b means 8 billion parameters. The more there are, the “smarter” the model (roughly), the bigger on disk and the more RAM it needs to run.
| Model | Parameters | Disk size (FP16 → Q4) | Runnable on your PC? | Notes |
|---|---|---|---|---|
llama3.2:3b (Meta) | 3 B | ~6 GB → ~2 GB | Yes (4 GB RAM, no GPU) | open weights |
llama3.1:8b (Meta) | 8 B | ~16 GB → ~5 GB | Yes (8 GB RAM, no GPU) | ← used in this course |
qwen2.5-coder:14b (Alibaba) | 14 B | ~28 GB → ~9 GB | Yes (12 GB RAM or 12 GB GPU) | open weights |
llama3.1:70b (Meta) | 70 B | ~140 GB → ~40 GB | No — GPU farm (2-4 H100) | open weights |
| Grok-1 (xAI, March 2024) | 314 B (MoE, 86 active) | ~318 GB | No — multi-GPU 8×80 GB | open Apache 2.0 |
| Grok-2.5 (xAI, Aug 2025) | n/a (FP8 shipped) | ~500 GB | No — 8 × 40 GB VRAM GPUs | open custom licence |
gemma2:9b (Google) | 9 B | ~18 GB → ~6 GB | Yes (8 GB RAM) | open weights |
| GPT-4o / GPT-5 (OpenAI) | not published* | n/a | API only | closed |
| Claude 4.7 (Anthropic) | not published* | n/a | API only | closed |
| Gemini 2.5 Pro (Google) | not published* | n/a | API only | closed |
* Sizes not officially published; press estimates range from a few hundred billion to over a trillion.
Legend for “Runnable on your PC?”:
- Yes: open weights and small enough for a standard PC.
ollama pulland it works. - No (with weights): open weights but too big — you need several professional GPUs (RTX A6000, H100…). Technically possible, not at home.
- No (closed): closed weights — only the hosted API is available. No legal way to download.
For the demo: we use
llama3.1:8b(~5 GB on disk, ~6 GB in RAM at runtime). This is the only model you need to download to run both demos in the repo. The other rows are there just to give orders of magnitude and clarify who can run what on what hardware.
Quantization = the model’s diet
Section titled “Quantization = the model’s diet”Originally, each parameter is a 16-bit floating-point number (FP16) — 2 bytes in memory. For 8 billion parameters that’s 8 × 10⁹ × 2 = 16 GB. Too much for a standard PC’s RAM.
Quantization means storing each parameter on 4 bits instead of 16 (Q4) — the same information with slightly less precision. The model gets ~4× smaller and quality barely drops. That’s why llama3.1:8b is only ~5 GB, not 16 GB, when you ollama pull:
ollama pull llama3.1:8b# pulling manifest...# pulling 8 layers... 4.9 GBOllama ships a Q4_K_M version by default (a balanced quantization scheme). You can choose other levels (:Q8, :fp16) if you have the RAM or want maximum quality.
Concrete examples: the same model under different quantization levels
The table below shows the same llama3.1:8b model under the main quantization levels available on Ollama. The “8b” stays constant (8 billion parameters); only the number of bits used per parameter changes.
| Level | Bits / parameter | File size on disk | RAM at runtime | Quality vs full precision | Typical use |
|---|---|---|---|---|---|
| FP16 | 16 | ~16 GB | ~17 GB | Reference (100%) | Research, when an RTX 4090 or 1× H100 is available |
| Q8_0 | 8 | ~8 GB | ~9 GB | ~99% (loss not detectable in conversation) | Workstations with 16 GB+ RAM, demanding tasks |
| Q4_K_M (Ollama default) | ~4.5 | ~5 GB | ~6 GB | ~96–98% (a few percentage points on benchmarks) | Course default — fits standard laptops |
| Q4_0 | 4 | ~4.5 GB | ~5 GB | ~95% (slight quality drop) | Constrained environments, edge devices |
| Q2_K | ~2.5 | ~3 GB | ~3.5 GB | ~85–90% (noticeable degradation) | Last resort, low-end hardware. Not recommended for code generation. |
ollama pull llama3.1:8b # Q4_K_M default (~5 GB)ollama pull llama3.1:8b-q8_0 # Q8 (~8 GB)ollama pull llama3.1:8b-fp16 # FP16 (~16 GB, needs GPU)Two practical takeaways:
- Q4_K_M is the sweet spot for personal hardware: 3× smaller than FP16, 2-3 percentage points lost on most benchmarks, runs without a GPU. This is why every demo in this course relies on it.
- Below Q4, quality drops visibly, especially on tasks requiring precision (code generation, mathematical reasoning, structured tool calls). For demos 3 and 4 (Java code generation), Q4_K_M is the floor — Q2_K would generate broken code more often than not.
Why Grok, GPT-4 and Claude are gigantic
Section titled “Why Grok, GPT-4 and Claude are gigantic”Top-tier models run on GPU farms: dozens (or hundreds) of A100/H100 cards connected by ultra-fast networking. A single NVIDIA H100 costs ~$30 000, and you typically need 8 to 64 of them to serve a single large model.
Grok case (good for culture): xAI open-sourced the weights of several versions. Grok-1 (March 2024, 314 billion parameters, ~318 GB) under Apache 2.0, and Grok-2.5 (August 2025, ~500 GB in FP8) under a Community licence. So you can download Grok legally — but running it needs 8 GPUs with 40 GB of VRAM each, around $250 000 of hardware. Open source doesn’t mean runnable at home.
GPT-4 / Claude / Gemini case: weights never published. You can only use them via the paid API. No local version is possible, legally or illegally.
You cannot have a GPU farm at home. But here’s the good news: for 95 % of the tasks of a course or a prototype, a well-fine-tuned 7B-14B local model gets the job done. And for the remaining 5 %, you probably shouldn’t be doing it in production with an LLM anyway.
Machine requirements — do you need a GPU?
Section titled “Machine requirements — do you need a GPU?”Short answer: no, a GPU is not required for this course. llama3.1:8b runs on a standard PC without a dedicated graphics card. With a GPU it’s faster, that’s all.
Minimum config (no GPU)
Section titled “Minimum config (no GPU)”| Resource | Minimum | Recommended | Why |
|---|---|---|---|
| RAM | 8 GB | 16 GB | llama3.1:8b quantized Q4 takes ~5–6 GB in memory while running |
| Free disk | 8 GB | 20 GB | 5 GB for the model + 1 GB Ollama + 1 GB Python/venv + 200 MB JDK + headroom |
| CPU | x86_64, 4 recent cores | 8 cores (Ryzen 5/7, Intel i5/i7 recent) | CPU generation = ~5 to 15 tokens/s on 8B Q4 |
| OS | Windows 10/11, macOS 12+, recent Linux | Windows 11 + PowerShell 7 | tested on Windows |
| GPU | none | NVIDIA 8 GB VRAM or Apple Silicon | speeds up ×3 to ×10 depending on the card |
With a GPU (optional but nice)
Section titled “With a GPU (optional but nice)”If you have a recent NVIDIA card (RTX 3060 or newer) with at least 8 GB of VRAM, Ollama automatically loads the model onto it. You have nothing to configure: Ollama detects the GPU at startup and uses it if there’s room.
| GPU type | Ollama support | Does llama3.1:8b fit? | Typical speed |
|---|---|---|---|
| NVIDIA 8 GB VRAM (RTX 3050/3060/4060) | native (CUDA) | yes | 30–60 tokens/s |
| NVIDIA 12+ GB VRAM (RTX 3060 12G, 4070+) | native (CUDA) | yes, plenty of room | 60–100 tokens/s |
| Apple Silicon M1/M2/M3 | native (Metal) | yes (unified memory) | 20–50 tokens/s |
| Recent AMD Radeon (RX 6000+) | partial (ROCm on Linux) | yes | variable |
| Integrated Intel/AMD GPU (no dedicated card) | not used by Ollama | n/a | falls back to CPU |
| No GPU | n/a | runs on CPU | 5–15 tokens/s |
Check what you have
Section titled “Check what you have”# Available RAMGet-CimInstance Win32_ComputerSystem | Select-Object @{N="RAM_GB";E={[math]::Round($_.TotalPhysicalMemory/1GB, 1)}}
# Free disk on C:Get-PSDrive C | Select-Object @{N="Free_GB";E={[math]::Round($_.Free/1GB, 1)}}
# NVIDIA GPU detected?nvidia-smi # if the command exists, you have a working NVIDIA GPU
# Check Ollama sees your GPUollama ps # after a call, shows which device is activeIf you only have 4 GB of RAM
Section titled “If you only have 4 GB of RAM”llama3.1:8b won’t fit. Switch to llama3.2:3b:
ollama pull llama3.2:3bAnd change MODEL_NAME = "llama3.2:3b" in ollama-demo-4-trio-agents-java/agent.py. Quality drops a bit, tool calling stays decent, and it runs. See chapter 05b for the quality/RAM trade-off.
For the classroom
Section titled “For the classroom”For a demo in front of 30 students, the ideal setup is:
- you (the teacher): a PC with an NVIDIA 8 GB VRAM GPU or Apple Silicon → you see generation in real time, it’s more impressive;
- students: 8 GB of RAM is enough, no GPU needed. If some laptops only have 4 GB, they switch to
llama3.2:3b.
All hardware tested during development: Windows 11 laptops, 16 GB RAM, no GPU. It runs. It’s slow (~10 s to generate a Java class), but that slowness is exactly what makes the demo readable — you have time to point at each tool call on screen.
Why we choose local for this course
Section titled “Why we choose local for this course”For a classroom demo, local wins on five dimensions:
1. Zero cost
Section titled “1. Zero cost”A single licence: your initial download time. No token cap, no surprise bill. A class of 30 students can each run the demo at no extra cost.
2. Privacy
Section titled “2. Privacy”You can show the demo with the school’s proprietary code without it leaving the machine. No snippet ends up in OpenAI’s or xAI’s logs. GDPR: no international transfer of data.
3. Works offline
Section titled “3. Works offline”Wifi-less classroom, conference on the metro, demo on a plane: it works. You’re never blocked by “API rate limit exceeded”.
4. Pedagogically transparent
Section titled “4. Pedagogically transparent”You can open ollama-demo-3-agent-java/agent_java.py, point at client = Client(host="http://127.0.0.1:11434"), and tell the student: “look, everything goes through localhost, the model lives in ~/.ollama/models/. There’s the demystification.” With a remote API, everything is in the cloud, untouchable.
5. Reproducible over time
Section titled “5. Reproducible over time”If you re-run the course in 3 years, llama3.1:8b will answer exactly the same as today (identical weights, deterministic at temperature=0). A commercial API changes behind your back: a prompt that worked yesterday may break tomorrow without notice.
When to switch to a commercial model
Section titled “When to switch to a commercial model”Local isn’t the answer to everything. You might prefer a hosted model if:
- you need all the quality available (production, real product);
- you have no GPU and want the speed of an H100;
- you’re building a multi-user service where latency and throughput matter;
- you want advanced features open source doesn’t have yet (very fine multimodal vision, long reasoning, etc.);
- your company already has a cloud-provider contract.
The 2026 best practice: prototype locally, deploy commercial if needed. That’s what this course does — learn everything locally, and you’ll know what to do if you ever need to migrate.
Key takeaways
Section titled “Key takeaways”- Model used in this course:
llama3.1:8b(~5 GB disk, ~6 GB RAM, no GPU needed). - Local model = weights on your machine, free, offline, private.
- Commercial model = remote API, pay per token, internet required, higher quality but opaque.
- Size (2 GB vs 5 GB vs 9 GB vs 40 GB) depends on number of parameters × precision (FP16 → Q4 divides by ~4).
- Ollama ships Q4-quantized versions by default:
llama3.1:8bis ~5 GB instead of 16 GB, with no noticeable loss. - Open source ≠ runnable at home. Grok-1 (March 2024, ~318 GB) and Grok-2.5 (August 2025, ~500 GB) are open but need 8 pro GPUs. GPT-4 / Claude / Gemini stay closed (API only).
- Minimum requirements: 8 GB RAM, 8 GB disk, recent x86_64 CPU. NVIDIA 8+ GB VRAM GPU = bonus (×3 to ×10 faster).
- For this course: we stay local. Cost, privacy, offline, pedagogy, reproducibility.