
The open-source LLM landscape

Duration: 10 min
Prerequisites: chapter 04 (you’ve understood what an LLM is and how to drive it)

“Open-source LLM” is not a single model — it’s a galaxy of families, each with its publisher, licence, architecture and specialty. What distinguishes these models is not the tooling (the tooling is Ollama / LangChain / vLLM), it’s: the licence, the architecture (Dense vs MoE), the modality (text, image, audio, code), and the quality of the initial fine-tuning.

“Is Mistral the best?” Honest answer: there isn’t one best. Mistral excels at European multilingual work and MoE; Meta (Llama) remains the reference for reliable tool calling; Alibaba (Qwen) dominates code and multimodal; DeepSeek has become unavoidable for reasoning. The right pick depends on your task, and the rest of this chapter gives you the keys to choose.


The 8 major open-source families (May 2026)

| Publisher | Model family | Country / lab | Headline innovation | Licence |
|---|---|---|---|---|
| Meta | Llama (2 → 3.1 → 3.2 → 3.3 → 4) | USA | Native tool calling since 3.1; Llama 3.2 Vision (multimodal); Llama 4 MoE | Llama Community License (open with restrictions) |
| Mistral AI | Mistral / Mixtral / Codestral / Pixtral / Ministral | France | MoE (Mixtral 8x7B, 8x22B), European multilingual, Pixtral 12B (image) | Apache 2.0 (most) |
| Google DeepMind | Gemma (1 → 2 → 3) | USA | Very efficient small models, multimodal Gemma 3 | Gemma Terms (open but conditional) |
| Alibaba | Qwen (2 → 2.5 → 3), Qwen-Coder, Qwen-VL, QwQ | China | Qwen2.5-Coder (top on code), Qwen-VL (vision), QwQ (reasoning) | Apache 2.0 |
| DeepSeek | DeepSeek-V3, DeepSeek-R1, DeepSeek-Coder | China | DeepSeek-R1 (open “think” reasoning), MoE 671B (37B active) | MIT (very permissive) |
| Microsoft | Phi (3, 3.5, 4) | USA | Small models (≤ 14B) that beat 5× bigger models on light reasoning | MIT |
| xAI | Grok (1 → 1.5+) | USA | Grok-1 = 314B MoE open (largest open weights at release) | Apache 2.0 (Grok-1) |
| AllenAI | OLMo (1 → 2) | USA (academic) | “Truly” open: weights + data + training recipes | Apache 2.0 |

What about IBM and Cohere? IBM Granite (Apache 2.0, enterprise-oriented) and Cohere Command (some sizes open, the rest commercial) are serious outsiders but less common in the classroom. Honourable mention for Stability AI (StableLM) and TII (Falcon).


Hugging Face does not train flagship LLMs of its own (its teams publish a few smaller models such as Zephyr and SmolLM, and HuggingChat is a chat interface rather than a model). It is a platform:

  • over 1 million models hosted (the “12 000” figure you’ll see in some courses has been outdated since ~2023);
  • standard file format (safetensors, gguf);
  • the transformers library that everyone uses;
  • Datasets, Spaces (one-click deploy), leaderboards (which show how each model ranks on each metric).

For this course: you download models via Ollama (which serves them from its own model registry), but when you want to fine-tune (chapter 14), you go through Hugging Face. The two worlds coexist.


Yes, there are LLMs for images. And audio. And code.


This is probably the most common confusion. Let’s clear it up.

You give an image, the model describes what it sees or answers a question about it.

| VLM | Publisher | Size | Available in Ollama |
|---|---|---|---|
| Llama 3.2 Vision | Meta | 11B / 90B | `ollama pull llama3.2-vision` |
| Pixtral | Mistral | 12B | `ollama pull pixtral` (when available) |
| Qwen 2-VL / 2.5-VL | Alibaba | 2B / 7B / 72B | `ollama pull qwen2.5vl` |
| Gemma 3 | Google | 4B / 12B / 27B | `ollama pull gemma3` |
| MiniCPM-V | OpenBMB | 8B | `ollama pull minicpm-v` |
```python
from ollama import Client

client = Client()  # talks to the local Ollama server by default
resp = client.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "What does this image show?",
        "images": ["./photo.jpg"],  # path to a local image file
    }],
)
print(resp["message"]["content"])
```

Stable Diffusion, FLUX, DALL·E, Imagen 3, Midjourney: these are diffusion models, not LLMs. Different architecture, different library (diffusers, ComfyUI). They’re often confused because they’re also “generative” and “AI”, but under the hood there’s no relation.

| ASR | Publisher | Note |
|---|---|---|
| Whisper | OpenAI (open!) | De-facto standard, multilingual |
| Distil-Whisper | HF | 6× faster |
| Qwen2-Audio | Alibaba | Chat by speaking, not just transcription |
| Voxtral | Mistral | Audio-first (announced 2025) |

Coqui XTTS, F5-TTS, OuteTTS — these are also not LLMs (dedicated audio architecture), but they are open source and usable locally.

“Code” models are normal LLMs heavily fine-tuned on code. That’s why we have a dedicated chapter on model choice (05b) — for our Java demo, we go with Llama 3.1 (reliable tool calling) or Qwen2.5-Coder (clean Java).

| Code model | Publisher | Specialty |
|---|---|---|
| Qwen2.5-Coder (1.5B → 32B) | Alibaba | Most versatile and accurate across 40+ languages |
| Codestral (22B) | Mistral | Very strong on C/C++/Python, specific commercial licence |
| DeepSeek-Coder (6.7B → 33B) | DeepSeek | Strong on Python/JS, fully free (MIT) |
| StarCoder2 (3B / 7B / 15B) | BigCode (HF + ServiceNow) | Trained on The Stack v2, transparent about data |

Recent generation (late 2024 → 2026): models that generate an explicit chain of thought before answering.

| Model | Publisher |
|---|---|
| DeepSeek-R1 and its distilled models (DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-7B, …) | DeepSeek |
| QwQ-32B | Alibaba |
| Phi-4-reasoning | Microsoft |

Cost: these models are slow (5× to 20× more tokens generated because of the visible reasoning) but much better at maths/logic.
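To make that cost concrete, here is a rough back-of-the-envelope sketch. Only the 5× to 20× multiplier comes from the text above; the answer length (200 tokens) and the throughput (25 tokens/s) are illustrative assumptions, not measurements.

```python
def generation_seconds(answer_tokens: int, tokens_per_second: float,
                       reasoning_multiplier: float = 1.0) -> float:
    """Estimate wall-clock generation time.

    reasoning_multiplier is the factor by which visible reasoning
    inflates the number of generated tokens (1.0 = no reasoning).
    """
    total_tokens = answer_tokens * reasoning_multiplier
    return total_tokens / tokens_per_second


# Hypothetical numbers: a 200-token answer on a machine producing 25 tokens/s.
plain = generation_seconds(200, 25.0)
low = generation_seconds(200, 25.0, reasoning_multiplier=5)
high = generation_seconds(200, 25.0, reasoning_multiplier=20)
print(f"plain: {plain:.0f}s, reasoning: {low:.0f}s to {high:.0f}s")
# prints "plain: 8s, reasoning: 40s to 160s"
```

With these assumed numbers, an answer that takes a few seconds on a standard model can take minutes on a reasoning model, which is worth knowing before wiring one into an interactive application.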

Not generative, but central to RAG:

  • nomic-embed-text (Apache 2.0)
  • bge-m3 (multilingual, all-in-one)
  • mxbai-embed-large
```
ollama pull nomic-embed-text
```
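Why embeddings matter for RAG can be sketched in a few lines: each text becomes a vector, and retrieval means finding the stored vector closest to the query vector. The example below is a minimal illustration using cosine similarity. The 3-dimensional vectors and the document names are invented for readability; a real embedding model returns vectors with hundreds of dimensions, and how you obtain them is outside this sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec: list[float], doc_vecs: dict[str, list[float]]) -> str:
    """Return the name of the document most similar to the query."""
    return max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))


# Toy vectors standing in for real embeddings (hypothetical values).
docs = {
    "licences": [0.9, 0.1, 0.0],
    "quantization": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]
print(top_match(query, docs))  # prints "licences"
```

In a RAG pipeline the retrieved document is then pasted into the prompt of a generative model, which is why the embedding model and the chat model are usually two separate downloads.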

What distinguishes open-source models from each other


Not the tooling — the tooling (Ollama, vLLM, llama.cpp) is shared by all. The real differences, in practical order of importance:

| Axis | Concrete consequence |
|---|---|
| Licence | Can you use it commercially / redistribute it / embed it in a product? Apache 2.0 / MIT = free. Llama licence = restrictions on very large usage. Gemma = conditional. Always read the LICENSE before a professional project. |
| Architecture | Dense (Llama 3.1 8B = 8B all-active parameters) or MoE (Mixtral 8x7B = 46B total, but 13B active per token → speed of a 13B, quality of a 46B). MoE = better quality/speed ratio, but needs lots of RAM because all 46B parameters must still be loaded. |
| Modality | Text-only / + image (VLM) / + audio / + specialised code. No “universal” model: each modality has a cost. |
| Native tool calling | This is fine-tuned into the model. Llama 3.1+ and Qwen 2.5+ have structured tool_calls. Phi, Gemma 2, Mistral 7B need a fallback parser. See chapter 05b. |
| Specialty | Code / reasoning / multilingual / instruction-following / chat. A “code” model is just a text model heavily fine-tuned on code. |
| Size | 1B / 3B / 7B / 8B / 14B / 32B / 70B / 100B+. Bigger models reason better, but are slower and hungrier on RAM/VRAM. |
| Quantization | The same model exists in FP16 (~2 bytes/weight), Q8 (~1 byte), Q4 (~0.5 byte). Q4_K_M = good compromise for Ollama. |
| Training data | Influences what it “knows”: The Stack for StarCoder, filtered Common Crawl, GitHub code, etc. Most publishers do not say precisely. OLMo is the exception (everything is public). |
| Knowledge cutoff | Per Meta’s model cards: Llama 3.1 = December 2023. Llama 3.3 = December 2023. Llama 4 = August 2024. Beyond that, the model does not know. RAG is needed for anything that changes. |
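The size, quantization and architecture rows combine into a simple rule of thumb: memory for the weights is roughly the total parameter count times the bytes per weight. The sketch below applies that rule with the approximate figures from the table. It deliberately ignores the KV cache and runtime overhead, so treat the results as lower bounds rather than exact requirements.

```python
# Approximate bytes per weight, taken from the quantization row above.
BYTES_PER_WEIGHT = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}


def weights_gb(total_params_billion: float, quant: str) -> float:
    """Rough size of the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the KV cache and runtime overhead, so real usage is higher.
    """
    return total_params_billion * BYTES_PER_WEIGHT[quant]


# Dense: all 8B parameters are stored and all are used for each token.
print(weights_gb(8, "fp16"))  # 16.0
print(weights_gb(8, "q4"))    # 4.0

# MoE: memory follows the TOTAL parameter count (46B for Mixtral 8x7B),
# even though only about 13B parameters are active per token.
print(weights_gb(46, "q4"))   # 23.0
```

This is why an 8B model in Q4 fits comfortably on a 16 GB laptop, while Mixtral 8x7B in Q4 does not, despite generating tokens at roughly the speed of a 13B model.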

flowchart TD
  Start([What is your need]) --> Modal{Which modality}

  Modal -->|Text only| Task{Task type}
  Modal -->|Text plus image| VLM[/Vision-Language Models<br/>Llama 3.2 Vision<br/>Pixtral 12B<br/>Qwen2.5-VL<br/>Gemma 3/]
  Modal -->|Audio to text| ASR[/Speech-to-Text<br/>Whisper<br/>Distil-Whisper/]
  Modal -->|Code mostly| Code[/Code-specialised<br/>Qwen2.5-Coder<br/>Codestral<br/>DeepSeek-Coder/]
  Modal -->|Image to image| Diff([Diffusion models<br/>Stable Diffusion / FLUX<br/>Out of LLM scope])

  Task -->|Reasoning<br/>math, logic, multi-step| Reason[/Thinking models<br/>DeepSeek-R1<br/>QwQ-32B<br/>Phi-4-reasoning/]
  Task -->|Multilingual<br/>French important| Multi[/Mistral Small or Large<br/>Llama 3.x<br/>Qwen 2.5/]
  Task -->|Reliable tool calling<br/>for an agent| Agents[/Llama 3.1 plus 8B<br/>Qwen 2.5 7B plus<br/>Mistral Large/]
  Task -->|General conversation| HW{Which machine}

  HW -->|8 GB RAM or less<br/>modest laptop| Small[/Small models<br/>Llama 3.2:3b<br/>Gemma 2:2b<br/>Phi-3 mini/]
  HW -->|16 GB RAM<br/>solid laptop| Mid[/Mid-range<br/>Llama 3.1:8b<br/>Qwen 2.5:7b<br/>Mistral 7B/]
  HW -->|Dedicated GPU 24 GB VRAM or more| Big[/Big models<br/>Llama 3.x:70b<br/>Mixtral 8x22B<br/>DeepSeek-V3/]

  Small --> License{Commercial use}
  Mid --> License
  Big --> License
  Agents --> License
  Multi --> License
  Reason --> License
  VLM --> License
  Code --> License

  License -->|Yes unrestricted| FreeLic[/Prefer Apache 2.0 or MIT<br/>Qwen, Mistral OSS,<br/>DeepSeek, Phi, OLMo/]
  License -->|No internal only| AnyLic([Any open model works])
Modality → task → hardware → licence. Three or four candidates at the bottom.

How to read the tree: start at the top and follow your constraints in order: modality → task → hardware → licence. You’ll end up with 3 or 4 candidates. Test, measure, choose. Chapter 05b helps you measure; the demo 2 comparator lets you pit two or three models head-to-head.
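The first three levels of the tree can also be written as a small lookup, which some readers find easier to follow than a diagram. This is only a sketch: the candidate lists are copied from the tree, the key names (`"image"`, `"tool_calling"`, `"gpu24"`, ...) are invented for this example, and the licence step is left out because it depends on the exact model version you end up with.

```python
# Candidate lists copied from the decision tree; key names are hypothetical.
BY_MODALITY = {
    "image": ["Llama 3.2 Vision", "Pixtral 12B", "Qwen2.5-VL", "Gemma 3"],
    "audio": ["Whisper", "Distil-Whisper"],
    "code": ["Qwen2.5-Coder", "Codestral", "DeepSeek-Coder"],
}
BY_TASK = {
    "reasoning": ["DeepSeek-R1", "QwQ-32B", "Phi-4-reasoning"],
    "multilingual": ["Mistral Small or Large", "Llama 3.x", "Qwen 2.5"],
    # Sizes read as "this size and up" in the tree.
    "tool_calling": ["Llama 3.1 (8B and up)", "Qwen 2.5 (7B and up)", "Mistral Large"],
}
BY_HARDWARE = {
    "8gb": ["Llama 3.2:3b", "Gemma 2:2b", "Phi-3 mini"],
    "16gb": ["Llama 3.1:8b", "Qwen 2.5:7b", "Mistral 7B"],
    "gpu24": ["Llama 3.x:70b", "Mixtral 8x22B", "DeepSeek-V3"],
}


def shortlist(modality: str, task: str = "chat", hardware: str = "16gb") -> list[str]:
    """Walk modality -> task -> hardware and return the candidates."""
    if modality != "text":
        return BY_MODALITY[modality]   # non-text modalities decide immediately
    if task != "chat":
        return BY_TASK[task]           # specialised text tasks decide next
    return BY_HARDWARE[hardware]       # general chat is decided by the machine


print(shortlist("text", task="reasoning"))
print(shortlist("text", hardware="8gb"))
```

Whatever the function returns is a starting list, not a verdict: the licence check and the measurement on your own prompts still come afterwards.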


Honest answer: nobody wins everywhere. Here’s who dominates each dimension, based on leaderboards and the ecosystem:

| Dimension | Dominant family (May 2026) | Why |
|---|---|---|
| Reliable local tool calling | Meta (Llama 3.x, Llama 4) | Native tool_calls format since Llama 3.1, mature ecosystem. |
| General-purpose code | Alibaba (Qwen2.5-Coder) | Covers 40+ languages, remarkable Java/Python/Go quality. |
| Reasoning | DeepSeek (R1 and its distillations) | Redefined what’s expected of an open reasoning model. |
| European multilingual | Mistral AI | French remains their core market, plus the Mixtral MoE. |
| Efficient small models (≤ 4B) | Microsoft (Phi-4-mini) + Google (Gemma 3) | Quality / VRAM ratio unbeatable. |
| Multimodal (image + text) | Google (Gemma 3) + Meta (Llama Vision) + Alibaba (Qwen-VL) | Three families converging, hard to rank. |
| Truly open (weights + data + recipes) | AllenAI (OLMo 2) | The only one to publish everything. The standard for academic research. |
| Largest open weights | xAI (Grok) or DeepSeek-V3 | 314B MoE / 671B MoE: very few users have the hardware to run them locally. |

Mistral is NOT necessarily the best at everything. It’s the best at multilingual + MoE + French, which makes it the logical choice for a course in France or Quebec. Elsewhere, the answer changes.


When you read a benchmark saying “model X beats model Y by 2 points on MMLU”, be careful:

  • Date: a benchmark published 6 months ago is already stale.
  • Metric: MMLU, HumanEval, GSM8K, MT-Bench, Arena-Hard… do not measure the same thing. A model can dominate MMLU and flop on real agent tasks.
  • Contamination: some families saw the test sets during training. It’s accidental cheating but very real.
  • Inference: a model that wins in FP16 on an H100 can collapse in Q4 on your laptop.

The only benchmark that really matters: your own prompts, on your own machine, on your own task. Demo 2 lets you do this in 30 seconds.
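If you want to see what such a personal benchmark boils down to, here is a minimal harness, written independently of demo 2 (whose code is not shown in this chapter). The two "models" are stand-in functions so the sketch runs without any server; in practice each callable would wrap a real request to a local model. The function and variable names are assumptions made for this example.

```python
import time
from typing import Callable


def compare(models: dict[str, Callable[[str], str]],
            prompts: list[str]) -> dict[str, dict]:
    """Run every prompt through every model; record answers and total time."""
    results = {}
    for name, ask in models.items():
        start = time.perf_counter()
        answers = [ask(p) for p in prompts]
        elapsed = time.perf_counter() - start
        results[name] = {"answers": answers, "seconds": elapsed}
    return results


# Stand-in "models" (hypothetical): replace each with a function that
# sends the prompt to a real model and returns its text answer.
def fake_model_a(prompt: str) -> str:
    return prompt.upper()


def fake_model_b(prompt: str) -> str:
    return prompt[::-1]


report = compare(
    {"model-a": fake_model_a, "model-b": fake_model_b},
    ["List three sorting algorithms.", "Translate 'bonjour' to English."],
)
for name, r in report.items():
    print(name, f"{r['seconds']:.4f}s", r["answers"][0])
```

Timing is the easy part; judging the answers is still up to you, which is exactly why your own prompts are more informative than a public leaderboard.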


  1. “Open-source LLM” = about 8 publisher families, not a single model. Meta, Mistral, Google, Alibaba, DeepSeek, Microsoft, xAI, AllenAI are the May 2026 pillars.
  2. Mistral isn’t best everywhere: it dominates European multilingual and MoE, period. For tool calling we pick Llama, for code Qwen-Coder, for reasoning DeepSeek-R1.
  3. Hugging Face is not a publisher, it’s the hub where everyone uploads (over a million models today).
  4. Yes, there are LLMs for images (Llama 3.2 Vision, Pixtral, Qwen-VL, Gemma 3). But image generators (Stable Diffusion, FLUX) are not LLMs.
  5. What distinguishes models is NOT the tooling (shared by all via Ollama / vLLM / llama.cpp). The real axes: licence, architecture (Dense/MoE), modality, native tool-calling, specialty, size, quantization, data.
  6. How to choose: follow the decision tree (modality → task → hardware → licence), keep 3 candidates, measure them with demo 2 on your prompts. Chapter 05b walks you through that final step.