Skip to content

Local vs commercial models

Duration: 8 min Prerequisites: chapter 04b (ideally) or chapter 04

A commercial model (ChatGPT, Claude, Grok) runs on a company’s servers — you talk to it over the internet and pay per token. A local model (Llama, Qwen, Mistral) runs on your machine — you download the weights once, and you never pay nor depend on the internet again. For this course, we stay local.


Local modelCommercial model
Where compute happensOn your machine (CPU/GPU)On the provider’s servers
How you access itSDK + Ollama server (localhost:11434)HTTPS API with a secret key
CostFree after downloadPay per token (input + output)
Internet requiredNo (except initial pull)Yes, otherwise nothing works
Data sent outNoneEverything you send
SpeedDepends on your hardwareVery fast (huge GPUs server-side)
Raw qualityGood (8B-14B locally)Excellent (200B+ models, big GPUs)
UpdatesWhen you ollama pullAutomatic, sometimes silent
ReproducibilityStable (same weights forever)API may change tomorrow
Audit / weight inspectionPossible (open source)Impossible (black box)
ExamplesLlama 3.1, Qwen2.5, Mistral, GemmaGPT-5, Claude 4.7, Gemini 2.5, Grok

This is the question that surprises everyone. The answer is two words: parameters and quantization.

A language model is a big neural network. Each connection between neurons is a parameter (a number). When we say llama3.1:8b, the 8b means 8 billion parameters. The more there are, the “smarter” the model (roughly), the bigger on disk and the more RAM it needs to run.

ModelParametersDisk size (FP16 → Q4)Runnable on your PC?Notes
llama3.2:3b (Meta)3 B~6 GB → ~2 GBYes (4 GB RAM, no GPU)open weights
llama3.1:8b (Meta)8 B~16 GB → ~5 GBYes (8 GB RAM, no GPU)← used in this course
qwen2.5-coder:14b (Alibaba)14 B~28 GB → ~9 GBYes (12 GB RAM or 12 GB GPU)open weights
llama3.1:70b (Meta)70 B~140 GB → ~40 GBNo — GPU farm (2-4 H100)open weights
Grok-1 (xAI, March 2024)314 B (MoE, 86 active)~318 GBNo — multi-GPU 8×80 GBopen Apache 2.0
Grok-2.5 (xAI, Aug 2025)n/a (FP8 shipped)~500 GBNo — 8 × 40 GB VRAM GPUsopen custom licence
gemma2:9b (Google)9 B~18 GB → ~6 GBYes (8 GB RAM)open weights
GPT-4o / GPT-5 (OpenAI)not published*n/aAPI onlyclosed
Claude 4.7 (Anthropic)not published*n/aAPI onlyclosed
Gemini 2.5 Pro (Google)not published*n/aAPI onlyclosed

* Sizes not officially published; press estimates range from a few hundred billion to over a trillion.

Legend for “Runnable on your PC?”:

  • Yes: open weights and small enough for a standard PC. ollama pull and it works.
  • No (with weights): open weights but too big — you need several professional GPUs (RTX A6000, H100…). Technically possible, not at home.
  • No (closed): closed weights — only the hosted API is available. No legal way to download.

For the demo: we use llama3.1:8b (~5 GB on disk, ~6 GB in RAM at runtime). This is the only model you need to download to run both demos in the repo. The other rows are there just to give orders of magnitude and clarify who can run what on what hardware.

Originally, each parameter is a 16-bit floating-point number (FP16) — 2 bytes in memory. For 8 billion parameters that’s 8 × 10⁹ × 2 = 16 GB. Too much for a standard PC’s RAM.

Quantization means storing each parameter on 4 bits instead of 16 (Q4) — the same information with slightly less precision. The model gets ~4× smaller and quality barely drops. That’s why llama3.1:8b is only ~5 GB, not 16 GB, when you ollama pull:

Terminal window
ollama pull llama3.1:8b
# pulling manifest...
# pulling 8 layers... 4.9 GB

Ollama ships a Q4_K_M version by default (a balanced quantization scheme). You can choose other levels (:Q8, :fp16) if you have the RAM or want maximum quality.

Concrete examples: the same model under different quantization levels

The table below shows the same llama3.1:8b model under the main quantization levels available on Ollama. The “8b” stays constant (8 billion parameters); only the number of bits used per parameter changes.

LevelBits / parameterFile size on diskRAM at runtimeQuality vs full precisionTypical use
FP1616~16 GB~17 GBReference (100%)Research, when an RTX 4090 or 1× H100 is available
Q8_08~8 GB~9 GB~99% (loss not detectable in conversation)Workstations with 16 GB+ RAM, demanding tasks
Q4_K_M (Ollama default)~4.5~5 GB~6 GB~96–98% (a few percentage points on benchmarks)Course default — fits standard laptops
Q4_04~4.5 GB~5 GB~95% (slight quality drop)Constrained environments, edge devices
Q2_K~2.5~3 GB~3.5 GB~85–90% (noticeable degradation)Last resort, low-end hardware. Not recommended for code generation.
Terminal window
ollama pull llama3.1:8b # Q4_K_M default (~5 GB)
ollama pull llama3.1:8b-q8_0 # Q8 (~8 GB)
ollama pull llama3.1:8b-fp16 # FP16 (~16 GB, needs GPU)

Two practical takeaways:

  1. Q4_K_M is the sweet spot for personal hardware: 3× smaller than FP16, 2-3 percentage points lost on most benchmarks, runs without a GPU. This is why every demo in this course relies on it.
  2. Below Q4, quality drops visibly, especially on tasks requiring precision (code generation, mathematical reasoning, structured tool calls). For demos 3 and 4 (Java code generation), Q4_K_M is the floor — Q2_K would generate broken code more often than not.

Top-tier models run on GPU farms: dozens (or hundreds) of A100/H100 cards connected by ultra-fast networking. A single NVIDIA H100 costs ~$30 000, and you typically need 8 to 64 of them to serve a single large model.

Grok case (good for culture): xAI open-sourced the weights of several versions. Grok-1 (March 2024, 314 billion parameters, ~318 GB) under Apache 2.0, and Grok-2.5 (August 2025, ~500 GB in FP8) under a Community licence. So you can download Grok legally — but running it needs 8 GPUs with 40 GB of VRAM each, around $250 000 of hardware. Open source doesn’t mean runnable at home.

GPT-4 / Claude / Gemini case: weights never published. You can only use them via the paid API. No local version is possible, legally or illegally.

You cannot have a GPU farm at home. But here’s the good news: for 95 % of the tasks of a course or a prototype, a well-fine-tuned 7B-14B local model gets the job done. And for the remaining 5 %, you probably shouldn’t be doing it in production with an LLM anyway.


Machine requirements — do you need a GPU?

Section titled “Machine requirements — do you need a GPU?”

Short answer: no, a GPU is not required for this course. llama3.1:8b runs on a standard PC without a dedicated graphics card. With a GPU it’s faster, that’s all.

ResourceMinimumRecommendedWhy
RAM8 GB16 GBllama3.1:8b quantized Q4 takes ~5–6 GB in memory while running
Free disk8 GB20 GB5 GB for the model + 1 GB Ollama + 1 GB Python/venv + 200 MB JDK + headroom
CPUx86_64, 4 recent cores8 cores (Ryzen 5/7, Intel i5/i7 recent)CPU generation = ~5 to 15 tokens/s on 8B Q4
OSWindows 10/11, macOS 12+, recent LinuxWindows 11 + PowerShell 7tested on Windows
GPUnoneNVIDIA 8 GB VRAM or Apple Siliconspeeds up ×3 to ×10 depending on the card

If you have a recent NVIDIA card (RTX 3060 or newer) with at least 8 GB of VRAM, Ollama automatically loads the model onto it. You have nothing to configure: Ollama detects the GPU at startup and uses it if there’s room.

GPU typeOllama supportDoes llama3.1:8b fit?Typical speed
NVIDIA 8 GB VRAM (RTX 3050/3060/4060)native (CUDA)yes30–60 tokens/s
NVIDIA 12+ GB VRAM (RTX 3060 12G, 4070+)native (CUDA)yes, plenty of room60–100 tokens/s
Apple Silicon M1/M2/M3native (Metal)yes (unified memory)20–50 tokens/s
Recent AMD Radeon (RX 6000+)partial (ROCm on Linux)yesvariable
Integrated Intel/AMD GPU (no dedicated card)not used by Ollaman/afalls back to CPU
No GPUn/aruns on CPU5–15 tokens/s
Terminal window
# Available RAM
Get-CimInstance Win32_ComputerSystem | Select-Object @{N="RAM_GB";E={[math]::Round($_.TotalPhysicalMemory/1GB, 1)}}
# Free disk on C:
Get-PSDrive C | Select-Object @{N="Free_GB";E={[math]::Round($_.Free/1GB, 1)}}
# NVIDIA GPU detected?
nvidia-smi # if the command exists, you have a working NVIDIA GPU
# Check Ollama sees your GPU
ollama ps # after a call, shows which device is active

llama3.1:8b won’t fit. Switch to llama3.2:3b:

Terminal window
ollama pull llama3.2:3b

And change MODEL_NAME = "llama3.2:3b" in ollama-demo-4-trio-agents-java/agent.py. Quality drops a bit, tool calling stays decent, and it runs. See chapter 05b for the quality/RAM trade-off.

For a demo in front of 30 students, the ideal setup is:

  • you (the teacher): a PC with an NVIDIA 8 GB VRAM GPU or Apple Silicon → you see generation in real time, it’s more impressive;
  • students: 8 GB of RAM is enough, no GPU needed. If some laptops only have 4 GB, they switch to llama3.2:3b.

All hardware tested during development: Windows 11 laptops, 16 GB RAM, no GPU. It runs. It’s slow (~10 s to generate a Java class), but that slowness is exactly what makes the demo readable — you have time to point at each tool call on screen.


For a classroom demo, local wins on five dimensions:

A single licence: your initial download time. No token cap, no surprise bill. A class of 30 students can each run the demo at no extra cost.

You can show the demo with the school’s proprietary code without it leaving the machine. No snippet ends up in OpenAI’s or xAI’s logs. GDPR: no international transfer of data.

Wifi-less classroom, conference on the metro, demo on a plane: it works. You’re never blocked by “API rate limit exceeded”.

You can open ollama-demo-3-agent-java/agent_java.py, point at client = Client(host="http://127.0.0.1:11434"), and tell the student: “look, everything goes through localhost, the model lives in ~/.ollama/models/. There’s the demystification.” With a remote API, everything is in the cloud, untouchable.

If you re-run the course in 3 years, llama3.1:8b will answer exactly the same as today (identical weights, deterministic at temperature=0). A commercial API changes behind your back: a prompt that worked yesterday may break tomorrow without notice.


Local isn’t the answer to everything. You might prefer a hosted model if:

  • you need all the quality available (production, real product);
  • you have no GPU and want the speed of an H100;
  • you’re building a multi-user service where latency and throughput matter;
  • you want advanced features open source doesn’t have yet (very fine multimodal vision, long reasoning, etc.);
  • your company already has a cloud-provider contract.

The 2026 best practice: prototype locally, deploy commercial if needed. That’s what this course does — learn everything locally, and you’ll know what to do if you ever need to migrate.


  • Model used in this course: llama3.1:8b (~5 GB disk, ~6 GB RAM, no GPU needed).
  • Local model = weights on your machine, free, offline, private.
  • Commercial model = remote API, pay per token, internet required, higher quality but opaque.
  • Size (2 GB vs 5 GB vs 9 GB vs 40 GB) depends on number of parameters × precision (FP16 → Q4 divides by ~4).
  • Ollama ships Q4-quantized versions by default: llama3.1:8b is ~5 GB instead of 16 GB, with no noticeable loss.
  • Open sourcerunnable at home. Grok-1 (March 2024, ~318 GB) and Grok-2.5 (August 2025, ~500 GB) are open but need 8 pro GPUs. GPT-4 / Claude / Gemini stay closed (API only).
  • Minimum requirements: 8 GB RAM, 8 GB disk, recent x86_64 CPU. NVIDIA 8+ GB VRAM GPU = bonus (×3 to ×10 faster).
  • For this course: we stay local. Cost, privacy, offline, pedagogy, reproducibility.