The open-source LLM landscape

Duration: 10 min
Prerequisites: chapter 04 (you’ve understood what an LLM is and how to drive it)

Key idea

“Open-source LLM” is not a single model — it’s a galaxy of families, each with its publisher, licence, architecture and specialty. What distinguishes these models is not the tooling (the tooling is Ollama / LangChain / vLLM), it’s: the licence, the architecture (Dense vs MoE), the modality (text, image, audio, code), and the quality of the initial fine-tuning.

“Is Mistral the best?” Honest answer: there is no single best model. Mistral excels at European multilingual work and at MoE architectures; Meta (Llama) remains the reference for reliable tool calling; Alibaba (Qwen) dominates code and multimodal; DeepSeek has become unavoidable for reasoning. Which one is right depends on your task. The rest of this chapter gives you the keys to choose.

The 8 major open-source families (May 2026)

Publisher	Model family	Country / lab	Headline innovation	Licence
Meta	Llama (2 → 3.1 → 3.2 → 3.3 → 4)	USA	Native tool calling since 3.1; Llama 3.2 Vision (multimodal); Llama 4 MoE	Llama Community License (open with restrictions)
Mistral AI	Mistral / Mixtral / Codestral / Pixtral / Ministral	France	MoE (Mixtral 8x7B, 8x22B), European multilingual, Pixtral 12B (image)	Apache 2.0 (most)
Google DeepMind	Gemma (1 → 2 → 3)	USA	Very efficient small models, multimodal Gemma 3	Gemma Terms (open but conditional)
Alibaba	Qwen (2 → 2.5 → 3), Qwen-Coder, Qwen-VL, QwQ	China	Qwen2.5-Coder (top on code), Qwen-VL (vision), QwQ (reasoning)	Apache 2.0
DeepSeek	DeepSeek-V3, DeepSeek-R1, DeepSeek-Coder	China	DeepSeek-R1 (open “think” reasoning), MoE 671B (37B active)	MIT (very permissive)
Microsoft	Phi (3, 3.5, 4)	USA	Small models (≤ 14B) that beat 5× bigger models on light reasoning	MIT
xAI	Grok (1 → 1.5+)	USA	Grok-1 = 314B MoE open (largest open weights at release)	Apache 2.0 (Grok-1)
AllenAI	OLMo (1 → 2)	USA (academic)	“Truly” open: weights + data + training recipes	Apache 2.0

What about IBM and Cohere? IBM Granite (Apache 2.0, enterprise-oriented) and Cohere Command (some sizes open, the rest commercial) are serious outsiders but less common in the classroom. Honourable mention for Stability AI (StableLM) and TII (Falcon).

Hugging Face: the hub, not a publisher

Hugging Face does not train its own flagship LLMs (apart from a few smaller projects such as Zephyr and SmolLM). It is a platform:

over 1 million models hosted (the “12 000” figure you’ll see in some courses has been outdated since ~2023);
standard file formats (safetensors, gguf);
the transformers library that everyone uses;
Datasets, Spaces (one-click deploy), leaderboards (which model ranks where, and on which metric).

For this course: you download models via Ollama (which serves them from its own registry), but when you want to fine-tune (chapter 14), you go through Hugging Face. The two worlds coexist.

Yes, there are LLMs for images. And audio. And code.

This is probably the most common confusion. Let’s clear it up.

Image-IN — Vision-Language Models (VLM)

You give an image, the model describes what it sees or answers a question about it.

VLM	Publisher	Size	Available in Ollama
Llama 3.2 Vision	Meta	11B / 90B	`ollama pull llama3.2-vision`
Pixtral	Mistral	12B	`ollama pull pixtral` (when available)
Qwen2-VL / Qwen2.5-VL	Alibaba	2B / 7B / 72B (Qwen2-VL), 3B / 7B / 32B / 72B (Qwen2.5-VL)	`ollama pull qwen2.5vl`
Gemma 3	Google	4B / 12B / 27B	`ollama pull gemma3`
MiniCPM-V	OpenBMB	8B	`ollama pull minicpm-v`

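from ollama import Client

# Connects to the local Ollama server on its default port.
client = Client()
resp = client.chat(
    model="llama3.2-vision",  # pull it first: ollama pull llama3.2-vision
    messages=[{"role": "user",
               "content": "What does this image show?",
               "images": ["./photo.jpg"]}],  # path to a local image file
)
print(resp["message"]["content"])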

Image-OUT — these are NOT LLMs

Stable Diffusion, FLUX, DALL·E, Imagen 3, Midjourney: these are diffusion models, not LLMs. Different architecture, different library (diffusers, ComfyUI). They’re often confused because they’re also “generative” and “AI”, but under the hood they work very differently: they denoise an image step by step instead of predicting the next token.

Audio-IN — Speech-to-Text

ASR	Publisher	Note
Whisper	OpenAI (open!)	De-facto standard, multilingual
Distil-Whisper	HF	6× faster
Qwen2-Audio	Alibaba	Chat by speaking, not just transcription
Voxtral	Mistral	Audio-first (announced 2025)

Audio-OUT — Text-to-Speech

Coqui XTTS, F5-TTS, OuteTTS — these are also not LLMs (dedicated audio architecture), but they are open source and usable locally.

Code

“Code” models are normal LLMs heavily fine-tuned on code. That’s why we have a dedicated chapter on model choice (05b) — for our Java demo, we go with Llama 3.1 (reliable tool calling) or Qwen2.5-Coder (clean Java).

Code model	Publisher	Specialty
Qwen2.5-Coder (1.5B → 32B)	Alibaba	Most versatile and accurate across 40+ languages
Codestral (22B)	Mistral	Very strong on C/C++/Python, specific commercial licence
DeepSeek-Coder (6.7B → 33B)	DeepSeek	Strong on Python/JS; code under MIT, weights under DeepSeek’s own model licence (commercial use allowed)
StarCoder2 (3B / 7B / 15B)	BigCode (HF + ServiceNow)	Trained on The Stack v2, transparent about data

Reasoning (“thinking models”)

Recent generation (late 2024 → 2026): models that generate an explicit chain of thought before answering.

Model	Publisher
DeepSeek-R1 + its distilled models (DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-7B, …)	DeepSeek
QwQ-32B	Alibaba
Phi-4 reasoning	Microsoft

Cost: these models are slow (5× to 20× more tokens generated because of the visible reasoning) but much better at maths/logic.
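To make that cost concrete, here is a back-of-the-envelope sketch. Only the 5× to 20× multiplier comes from the text above; the 200-token answer and the 30 tokens/s generation speed are illustrative assumptions, not measurements.

```python
def generation_seconds(answer_tokens: int, tokens_per_second: float,
                       reasoning_multiplier: float = 1.0) -> float:
    """Estimate wall-clock generation time.

    reasoning_multiplier is how many times more tokens the model emits
    once its visible chain of thought is counted (1.0 = no reasoning).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    total_tokens = answer_tokens * reasoning_multiplier
    return total_tokens / tokens_per_second


# Assumed figures: a 200-token answer on a machine generating 30 tokens/s.
baseline = generation_seconds(200, 30.0)
low = generation_seconds(200, 30.0, reasoning_multiplier=5)
high = generation_seconds(200, 30.0, reasoning_multiplier=20)

print(f"plain model:    {baseline:.1f} s")
print(f"thinking model: {low:.1f} s to {high:.1f} s")
```

Under these assumptions, an answer that takes about 7 seconds from a plain model takes between roughly half a minute and more than two minutes from a thinking model. Plug in your own measured tokens/s to see whether that is acceptable for your use case.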

Embeddings (for RAG)

Not generative, but central to RAG:

nomic-embed-text (Apache 2.0)
bge-m3 (multilingual, all-in-one)
mxbai-embed-large

ollama pull nomic-embed-text
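Why are embeddings central to RAG? An embedding model turns each text into a vector, and retrieval boils down to finding the stored vectors closest to the query vector. The sketch below shows that comparison step with the standard library only. The three-dimensional vectors and the document names are hand-made placeholders; real embeddings have hundreds of dimensions and would come from the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]],
          k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors standing in for real embeddings (hypothetical data).
docs = {
    "licence-faq": [0.9, 0.1, 0.0],
    "gpu-sizing": [0.1, 0.9, 0.2],
    "tool-calls": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend this embeds a question about licences

print(top_k(query, docs, k=2))
```

In a real pipeline the retrieved documents are then pasted into the prompt of the generative model, which is the part chapter-level RAG tooling usually hides from you.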

What distinguishes open-source models from each other

Not the tooling — the tooling (Ollama, vLLM, llama.cpp) is shared by all. The real differences, in practical order of importance:

Axis	Concrete consequence
Licence	Can you use it commercially / redistribute it / embed it in a product? Apache 2.0 / MIT = free. Llama licence = restrictions on very large usage. Gemma = conditional. Always read the `LICENSE` before a professional project.
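Architecture	Dense (Llama 3.1 8B = 8B parameters, all active) or MoE (Mixtral 8x7B = 46B total, but 13B active per token → roughly the speed of a 13B, with quality closer to a much larger dense model). MoE = better quality/speed ratio, but needs lots of RAM, because every expert must be loaded even though only a few are used per token.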
Modality	Text-only / + image (VLM) / + audio / + specialised code. No “universal” model — each modality has a cost.
Native tool calling	This is fine-tuned into the model. Llama 3.1+ and Qwen 2.5+ have structured `tool_calls`. Phi, Gemma 2, Mistral 7B need a fallback parser. See chapter 05b.
Specialty	Code / reasoning / multilingual / instruction-following / chat. A “code” model is just a text model heavily fine-tuned on code.
Size	1B / 3B / 7B / 8B / 14B / 32B / 70B / 100B+. Bigger reasons better, but slower and hungrier on RAM/VRAM.
Quantization	The same model exists in FP16 (~2 bytes/weight), Q8 (~1 byte), Q4 (~0.5 byte). Q4_K_M = good compromise for Ollama.
Training data	Influences what it “knows”: The Stack for StarCoder, filtered Common Crawl, GitHub code, etc. Most do not say precisely. OLMo is the exception (everything is public).
Knowledge cutoff	Llama 3.1 = December 2023. Llama 3.3 = December 2023. Llama 4 = August 2024. Beyond its cutoff, the model knows nothing. RAG is needed for anything that changes.
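The “fallback parser” mentioned in the tool-calling row can be pictured as follows: when a model has no structured `tool_calls` field, you ask it to answer with a JSON object and then dig that object out of its free-text reply. This is a minimal sketch, not the parser used in chapter 05b. The `{"name": ..., "arguments": ...}` shape and the function name `parse_tool_call` are assumptions made for illustration.

```python
import json


def parse_tool_call(text: str):
    """Return (name, arguments) for the first JSON object in `text` that
    looks like a tool call, or None if there is none.

    Assumed (hypothetical) shape: {"name": "...", "arguments": {...}}
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # Try to decode one JSON value starting exactly at `start`.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if (isinstance(obj, dict)
                and isinstance(obj.get("name"), str)
                and isinstance(obj.get("arguments"), dict)):
            return obj["name"], obj["arguments"]
        # Not a tool call: move on to the next opening brace.
        start = text.find("{", start + 1)
    return None


reply = ('Sure, let me check. '
         '{"name": "get_weather", "arguments": {"city": "Lyon"}} Done.')
print(parse_tool_call(reply))
print(parse_tool_call("I don't need a tool for that."))
```

The scan-and-retry loop matters because models without native tool calling tend to wrap the JSON in chatter or emit a malformed object first. A model with native support spares you this guesswork, which is why the table ranks it as a selection criterion.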

Decision tree: which model to pick?

flowchart TD
  Start([What is your need]) --> Modal{Which modality}

  Modal -->|Text only| Task{Task type}
  Modal -->|Text plus image| VLM[/Vision-Language Models<br/>Llama 3.2 Vision<br/>Pixtral 12B<br/>Qwen2.5-VL<br/>Gemma 3/]
  Modal -->|Audio to text| ASR[/Speech-to-Text<br/>Whisper<br/>Distil-Whisper/]
  Modal -->|Code mostly| Code[/Code-specialised<br/>Qwen2.5-Coder<br/>Codestral<br/>DeepSeek-Coder/]
  Modal -->|Image generation| Diff([Diffusion models<br/>Stable Diffusion / FLUX<br/>Out of LLM scope])

  Task -->|Reasoning<br/>math, logic, multi-step| Reason[/Thinking models<br/>DeepSeek-R1<br/>QwQ-32B<br/>Phi-4-reasoning/]
  Task -->|Multilingual<br/>French important| Multi[/Mistral Small or Large<br/>Llama 3.x<br/>Qwen 2.5/]
  Task -->|Reliable tool calling<br/>for an agent| Agents[/Llama 3.1 plus 8B<br/>Qwen 2.5 7B plus<br/>Mistral Large/]
  Task -->|General conversation| HW{Which machine}

  HW -->|8 GB RAM or less<br/>modest laptop| Small[/Small models<br/>Llama 3.2:3b<br/>Gemma 2:2b<br/>Phi-3 mini/]
  HW -->|16 GB RAM<br/>solid laptop| Mid[/Mid-range<br/>Llama 3.1:8b<br/>Qwen 2.5:7b<br/>Mistral 7B/]
  HW -->|Dedicated GPU 24 GB VRAM or more| Big[/Big models<br/>Llama 3.x:70b<br/>Mixtral 8x22B<br/>DeepSeek-V3/]

  Small --> License{Commercial use}
  Mid --> License
  Big --> License
  Agents --> License
  Multi --> License
  Reason --> License
  VLM --> License
  Code --> License

  License -->|Yes unrestricted| FreeLic[/Prefer Apache 2.0 or MIT<br/>Qwen, Mistral OSS,<br/>DeepSeek, Phi, OLMo/]
  License -->|No internal only| AnyLic([Any open model works])

Modality → task → hardware → licence. Three or four candidates at the bottom.

How to read the tree: start at the top and follow your constraints in order (modality → task → hardware → licence). You’ll end up with 3 or 4 candidates. Test, measure, choose. Chapter 05b helps you measure; the comparator in demo 2 lets you pit two or three models head-to-head.
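The first three levels of the tree can also be written as a small lookup function, which is handy if you want to script the shortlist. This is a sketch that copies the candidates from the diagram. The function name, the argument names and the decision to treat anything above 8 GB of RAM without a large GPU as “mid-range” are assumptions.

```python
def shortlist(modality: str, task: str = "chat",
              ram_gb: int = 16, gpu_vram_gb: int = 0) -> list[str]:
    """Walk modality -> task -> hardware and return the candidates."""
    by_modality = {
        "image+text": ["Llama 3.2 Vision", "Pixtral 12B",
                       "Qwen2.5-VL", "Gemma 3"],
        "audio-to-text": ["Whisper", "Distil-Whisper"],
        "code": ["Qwen2.5-Coder", "Codestral", "DeepSeek-Coder"],
    }
    if modality in by_modality:
        return by_modality[modality]
    if modality != "text":
        raise ValueError(f"unknown modality: {modality!r}")

    by_task = {
        "reasoning": ["DeepSeek-R1", "QwQ-32B", "Phi-4-reasoning"],
        "multilingual": ["Mistral Small or Large", "Llama 3.x", "Qwen 2.5"],
        "tool-calling": ["Llama 3.1+ 8B", "Qwen 2.5 7B+", "Mistral Large"],
    }
    if task in by_task:
        return by_task[task]
    if task != "chat":
        raise ValueError(f"unknown task: {task!r}")

    # General conversation: the hardware decides.
    if gpu_vram_gb >= 24:
        return ["Llama 3.x:70b", "Mixtral 8x22B", "DeepSeek-V3"]
    if ram_gb <= 8:
        return ["Llama 3.2:3b", "Gemma 2:2b", "Phi-3 mini"]
    return ["Llama 3.1:8b", "Qwen 2.5:7b", "Mistral 7B"]


print(shortlist("text", task="chat", ram_gb=8))
print(shortlist("code"))
```

The licence step is deliberately left out: it depends on reading each model’s `LICENSE` file, which a lookup table of family names cannot replace.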

“OK, but as of May 2026, who wins?”

Honest answer: nobody wins everywhere. Here’s who dominates each dimension, based on leaderboards and the ecosystem:

Dimension	Dominant family (May 2026)	Why
Reliable local tool calling	Meta (Llama 3.x, Llama 4)	Native `tool_calls` format since Llama 3.1, mature ecosystem.
General-purpose code	Alibaba (Qwen2.5-Coder)	Covers 40+ languages, remarkable Java/Python/Go quality.
Reasoning	DeepSeek (R1 and its distillations)	Redefined what’s expected of an open reasoning model.
European multilingual	Mistral AI	French remains their core market, plus the Mixtral MoE.
Efficient small models (≤ 4B)	Microsoft (Phi-4-mini) + Google (Gemma 3)	Best quality / VRAM ratio at this size.
Multimodal (image + text)	Google (Gemma 3) + Meta (Llama Vision) + Alibaba (Qwen-VL)	Three families converging, hard to rank.
Truly open (weights + data + recipes)	AllenAI (OLMo 2)	The only one to publish everything. The standard for academic research.
Largest open weights	xAI (Grok-1) or DeepSeek (DeepSeek-V3)	314B MoE and 671B MoE respectively; very few users have the hardware to run them locally.

Mistral is NOT necessarily the best at everything. It’s the best at multilingual + MoE + French, which makes it the logical choice for a course in France or Quebec. Elsewhere, the answer changes.
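Why do so few users have the hardware for the largest open weights? A rough estimate of the memory taken by the weights alone answers that. The sketch below reuses the approximate bytes-per-weight figures quoted earlier in this chapter (FP16 ≈ 2, Q8 ≈ 1, Q4 ≈ 0.5) and the parameter counts from the tables. It ignores the KV cache and runtime overhead, so real requirements are higher.

```python
# Approximate bytes per weight for each precision (figures from this chapter).
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8": 1.0, "Q4": 0.5}


def weights_gb(total_params_billions: float, quant: str) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes).

    For a MoE, pass the TOTAL parameter count: every expert has to be
    loaded, even though only some of them are active for a given token.
    """
    return total_params_billions * BYTES_PER_WEIGHT[quant]


# Parameter counts in billions, as listed in the tables of this chapter.
models = {
    "Llama 3.1 8B (dense)": 8,
    "Mixtral 8x7B (MoE, 13B active)": 46,
    "Grok-1 (MoE)": 314,
    "DeepSeek-V3 (MoE, 37B active)": 671,
}

for name, params in models.items():
    sizes = ", ".join(f"{quant}: {weights_gb(params, quant):.0f} GB"
                      for quant in BYTES_PER_WEIGHT)
    print(f"{name:32s} {sizes}")
```

Even at Q4, DeepSeek-V3 needs roughly 335 GB for its weights, far beyond any laptop or single consumer GPU, whereas an 8B dense model fits in about 4 GB. The active-parameter count tells you how fast a MoE runs, not how much memory it occupies.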

The leaderboard trap

When you read a benchmark saying “model X beats model Y by 2 points on MMLU”, be careful:

Date: a benchmark published 6 months ago is already stale.
Metric: MMLU, HumanEval, GSM8K, MT-Bench, Arena-Hard… do not measure the same thing. A model can dominate MMLU and flop on real agent tasks.
Contamination: some models saw benchmark test sets during training, because those sets leak into web-scale training data. The cheating is usually accidental, but the inflated scores are very real.
Inference: a model that wins in FP16 on an H100 can collapse in Q4 on your laptop.

The only benchmark that really matters: your own prompts, on your own machine, on your own task. Demo 2 lets you do this in 30 seconds.
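If you want to see what “your own benchmark” amounts to, here is a minimal harness. It is not the code of demo 2: the `compare` function, the pass/fail check and the two stand-in “models” are hypothetical, and they exist only so the sketch runs without any server. In practice each callable would wrap a call to your local runtime.

```python
import time
from typing import Callable


def compare(models: dict[str, Callable[[str], str]],
            prompts: list[str],
            passes: Callable[[str, str], bool]) -> dict[str, dict]:
    """Run every prompt through every model; record pass rate and time."""
    if not prompts:
        raise ValueError("need at least one prompt")
    results = {}
    for name, generate in models.items():
        passed = 0
        started = time.perf_counter()
        for prompt in prompts:
            if passes(prompt, generate(prompt)):
                passed += 1
        elapsed = time.perf_counter() - started
        results[name] = {"pass_rate": passed / len(prompts),
                         "seconds": elapsed}
    return results


# Stand-in models (hypothetical): plain functions instead of real LLMs.
fake_models = {
    "model-a": lambda prompt: "4" if "2+2" in prompt else "I am not sure",
    "model-b": lambda prompt: "four",
}
prompts = ["What is 2+2? Answer with a digit.",
           "What is 3+1? Answer with a digit."]

# Pass criterion for this toy task: the reply is exactly the digit 4.
report = compare(fake_models, prompts,
                 passes=lambda prompt, reply: reply.strip() == "4")
for name, row in report.items():
    print(f"{name}: {row['pass_rate']:.0%} passed in {row['seconds']:.4f} s")
```

The important design choice is that the pass criterion is yours: it encodes what “correct” means for your task, which no public leaderboard can do for you.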

Key takeaways

“Open-source LLM” = about 8 publisher families, not a single model. Meta, Mistral, Google, Alibaba, DeepSeek, Microsoft, xAI, AllenAI are the May 2026 pillars.
Mistral isn’t best everywhere: it dominates European multilingual and MoE, period. For tool calling we pick Llama, for code Qwen-Coder, for reasoning DeepSeek-R1.
Hugging Face is not a publisher, it’s the hub where everyone uploads (over a million models today).
Yes, there are LLMs for images (Llama 3.2 Vision, Pixtral, Qwen-VL, Gemma 3). But image generators (Stable Diffusion, FLUX) are not LLMs.
What distinguishes models is NOT the tooling (shared by all via Ollama / vLLM / llama.cpp). The real axes: licence, architecture (Dense/MoE), modality, native tool-calling, specialty, size, quantization, data.
How to choose: follow the decision tree (modality → task → hardware → licence), keep 3 or 4 candidates, measure them with demo 2 on your prompts. Chapter 05b walks you through that final step.