
Fine-tuning a code model

Duration: 15 min
Prerequisites: chap 05b (knowing how to pick a model), chap 09-10 (you’ve run the agent demos)

Yes, you can train a code model to speak your style, your conventions, your in-house code. But before you do: 90 % of the time the prompt and RAG are enough — fine-tuning without exhausting those options first is like buying a car when you haven’t tried the bus.


The specialisation spectrum (cheapest to most expensive)

| Level | Cost | What it does | When to use it |
|---|---|---|---|
| 1. Prompt engineering | $0 | You give clear instructions (style, format, language). | Always start here. |
| 2. Few-shot in prompt | ~$0 | You stuff 3 to 10 examples into the system prompt. | Precise style/format, light domain vocabulary. |
| 3. RAG (Retrieval-Augmented Generation) | $ | You give the model a search_docs(query) tool that hits a vector DB. | Knowledge that changes often (API docs, live codebase). |
| 4. LoRA fine-tuning | $$ (1 GPU + a few hours) | You change ~1 % of the model’s weights. The rest is frozen. | Fixed style/format, specific vocabulary, latency optimisation. |
| 5. Full fine-tuning | $$$$ (multi-GPU + days) | You change all the weights. | Research, new domain, big budget. Not for the classroom. |

Rule of thumb: if you can solve your problem by editing SYSTEM_PROMPT (chap 11), do it. If not, try few-shot. If not, RAG. If nothing works, then LoRA.
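To make level 2 concrete, here is a minimal sketch of building a few-shot system prompt from a handful of examples. The function name, the house rules and the three examples are hypothetical, chosen only for illustration; the 3-to-10 bound comes from the table above.

```python
def build_few_shot_prompt(rules: str, examples: list[tuple[str, str]]) -> str:
    """Return a system prompt: house rules first, then worked examples."""
    if not 3 <= len(examples) <= 10:
        raise ValueError("few-shot in prompt is meant for 3 to 10 examples")
    parts = [rules.strip(), "", "Examples:"]
    for i, (request, answer) in enumerate(examples, start=1):
        parts.append(f"### Example {i}")
        parts.append(f"Request: {request}")
        parts.append(f"Answer:\n{answer}")
    return "\n".join(parts)


# Hypothetical rules and examples, for illustration only.
SYSTEM_PROMPT = build_few_shot_prompt(
    "Use snake_case. Never use wildcard imports.",
    [
        ("Import the json module.", "import json"),
        ("Name a variable holding a user id.", "user_id = 42"),
        ("Import sqrt from math.", "from math import sqrt"),
    ],
)
print(SYSTEM_PROMPT)
```

If a prompt like this already gives you the style you want, there is nothing to fine-tune.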


The 4 cases where LoRA is the right answer

  1. Your team’s style and conventions. “Our code uses snake_case, i18n_ as a prefix for any translated text, never var, never * in imports.” You can write these 4 rules in the prompt, or you can show the model 1000 examples and let it internalise them.

  2. A very specific domain. Z80 assembly, an internal language, a homemade DSL, a proprietary framework GitHub never saw: fine-tuning is required, because the model cannot guess.

  3. A strict output format. Generating JSON for a precise schema? SQL targeting exactly your 47 tables? LoRA reduces mistakes a lot.

  4. Latency vs quality. You want quality close to a 14B model at the speed of a 3B? Fine-tune the 3B on the 14B’s outputs (“distillation”). The small model mimics the big one on your domain.

The 3 cases where LoRA is NOT the right answer

| Symptom | Real solution |
|---|---|
| “The docs change every day” | RAG, not LoRA. Re-fine-tuning every week is too expensive. |
| “I have 50 examples” | Few-shot. 50 examples = too few to move weights. |
| “I want it to be better” | Too vague. Define a metric (e.g. “code compiles 90 % of the time, vs 70 % today”) before touching a GPU. |
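A metric such as “code compiles 90 % of the time” can be as simple as a pass rate over a fixed set of model outputs. The sketch below is only an illustration: `pass_rate` and `parses_as_python` are hypothetical helpers, and Python's `ast.parse` stands in for a real compiler so the example runs on its own. For Java output you would swap in a check that calls your compiler.

```python
import ast
from typing import Callable


def pass_rate(outputs: list[str], check: Callable[[str], bool]) -> float:
    """Fraction of outputs accepted by the given checker."""
    if not outputs:
        raise ValueError("need at least one output to measure")
    return sum(1 for out in outputs if check(out)) / len(outputs)


def parses_as_python(source: str) -> bool:
    """Stand-in checker: True when the snippet is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


# Four illustrative model outputs; the second one has a syntax error.
samples = ["x = 1", "def f(:\n    pass", "print('ok')", "for i in range(3): pass"]
rate = pass_rate(samples, parses_as_python)
print(f"pass rate: {rate:.0%}")
```

Run the same function on the same prompts before and after any change, and you have a number to compare instead of an impression.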

GPU: yes, you need one. Realistic table below.


Fine-tuning in practice means LoRA in 4-bit (bnb-4bit): we load the model in 4 bits to save VRAM, and we only train the LoRA adapters (~1 % of parameters).
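The VRAM saving from 4-bit loading is plain arithmetic on the weights. The sketch below counts weight storage only; activations, optimizer state, the LoRA adapters and quantization overhead come on top, so treat the result as a lower bound rather than a VRAM requirement. The helper name is an assumption made for this example.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 7-billion-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"7B weights at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Going from 16-bit to 4-bit divides the weight footprint by four, which is what lets a 7B model fit on a consumer card.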

| Available VRAM | Realistic model | Typical duration (1000 ex., 3 epochs) | Indicative price |
|---|---|---|---|
| ≤ 6 GB | Not really doable locally; use Colab | - | - |
| 8 GB (RTX 3060 8 GB, RTX 4060) | LoRA qwen2.5-coder:1.5b or :3b in Q4 | 1–2 h | GPU ~$300 |
| 12 GB (RTX 3060 12 GB, RTX 4070) | LoRA qwen2.5-coder:7b in Q4 | 2–3 h | ~$500–700 |
| 16 GB (RTX 4060 Ti 16 GB, RTX 4080) | Comfortable :7b LoRA, tight :14b in Q4 | 1.5–4 h | ~$700–1200 |
| 24 GB (RTX 3090, RTX 4090) | LoRA :14b in Q8, distillation | 1.5–5 h | ~$1500–2000 (used 3090) |
| 40 GB+ (A100, H100) | Anything, including full fine-tune of 7-13B | minutes-hours | Cloud only |

| Option | Cost | What works |
|---|---|---|
| Google Colab free (T4, 16 GB VRAM) | $0 | LoRA 7B in Q4, ~12 h/day max; sessions of 6 h max |
| Google Colab Pro | ~$10/month | A100 40 GB, longer sessions |
| Runpod / Vast.ai (per-minute GPU) | $0.30–0.80/h for a 4090 | Anything, on demand |

Why not a Mac? Apple Silicon (M1/M2/M3) has huge unified memory, but the standard fine-tuning ecosystem (bitsandbytes, xformers, unsloth) is CUDA-only. Possible via mlx, but much less documented. For a course, stay on CUDA (Colab or NVIDIA GPU).


Step-by-step: LoRA on qwen2.5-coder:7b with Unsloth


We aim for a light fine-tune on 1000-3000 instruction → answer examples, exported as GGUF, and loaded into Ollama. Then we reuse all the code from demos 3 and 4 — a single MODEL_NAME line to change.

Stack used: Unsloth (2× faster than vanilla Hugging Face, handles 4-bit for you) + datasets + trl/SFTTrainer.

Format: a .jsonl file (one JSON object per line).

{"instruction": "Write a Java class for a 2D Point with equals/hashCode.", "input": "", "output": "public final class Point {\n private final double x, y;\n public Point(double x, double y) { this.x = x; this.y = y; }\n @Override public boolean equals(Object o) { ... }\n @Override public int hashCode() { return Objects.hash(x, y); }\n}"}
{"instruction": "Compute the median of an int list in Java.", "input": "", "output": "public static double median(int[] a) { ... }"}
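Before training, it is worth checking that every line really follows this format. The sketch below is one possible validator, not part of any library: `find_problems` and `REQUIRED_KEYS` are names invented for this example, and the three sample lines are made up.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def find_problems(lines):
    """Return (line_number, problem) pairs for records that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, f"missing keys: {sorted(missing)}"))
        elif not all(isinstance(record[k], str) for k in REQUIRED_KEYS):
            problems.append((number, "all three fields must be strings"))
        elif not record["instruction"].strip() or not record["output"].strip():
            problems.append((number, "empty instruction or output"))
    return problems


# Illustrative lines: one valid record, one incomplete, one that is not JSON.
sample = [
    '{"instruction": "Reverse a string in Java.", "input": "", '
    '"output": "new StringBuilder(s).reverse().toString()"}',
    '{"instruction": "Missing output"}',
    "not json at all",
]
problems = find_problems(sample)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

On a real file you would pass the open file object, for example `find_problems(open("train.jsonl", encoding="utf-8"))`.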

Golden rules:

  • Aim for 500 to 3000 examples; 500 is the minimum to see an effect. Below that, do few-shot.
  • Quality > quantity: 1000 hand-reviewed examples >> 10000 noisy scraped ones.
  • 90 / 10: 90 % in train.jsonl, 10 % in eval.jsonl. Measure eval to detect overfitting.
  • No duplicates. The model memorises repeated examples.
  • Mix long/short: not only 5-line examples, not only 200-line ones.
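Two of these rules, no duplicates and the 90 / 10 split, are easy to automate. The following is a minimal sketch under stated assumptions: records are dicts with the three fields shown earlier, duplicates are exact matches after trimming whitespace, and the function name is hypothetical. Near-duplicates would need a fuzzier comparison.

```python
import random


def dedupe_and_split(records, eval_fraction=0.1, seed=0):
    """Drop exact duplicates, shuffle, and return (train, eval) lists."""
    seen = set()
    unique = []
    for rec in records:
        key = (
            rec["instruction"].strip(),
            rec.get("input", "").strip(),
            rec["output"].strip(),
        )
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(unique)
    n_eval = max(1, round(len(unique) * eval_fraction))
    return unique[n_eval:], unique[:n_eval]


# 20 distinct illustrative records plus one exact duplicate.
records = [
    {"instruction": f"Task {i}", "input": "", "output": f"answer {i}"}
    for i in range(20)
]
records.append(dict(records[0]))
train, eval_set = dedupe_and_split(records)
print(len(train), len(eval_set))
```

Write the two lists to train.jsonl and eval.jsonl once, and keep the eval file out of every later training run.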

Possible sources for 1000 examples:

  • Your own pull requests (commit message → diff).
  • Your in-house framework docs (docstring → usage example).
  • Synthetic: take qwen2.5-coder:14b (the big one), ask it 1000 questions, hand-review answers, keep 800. That’s distillation.
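The synthetic route can be sketched as a small loop: ask the teacher, filter, keep the survivors in the dataset format. Everything named here is hypothetical. `teacher` stands for whatever function calls your big model, `keep` stands for your review step (which the text above says should be done by hand), and `fake_teacher` is a stub so the example runs without a model.

```python
import json
from typing import Callable


def build_distillation_records(
    questions: list[str],
    teacher: Callable[[str], str],
    keep: Callable[[str, str], bool],
) -> list[dict]:
    """Ask the teacher each question and keep only the approved answers."""
    records = []
    for question in questions:
        answer = teacher(question)
        if keep(question, answer):
            records.append({"instruction": question, "input": "", "output": answer})
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialise records as one JSON object per line."""
    return "\n".join(json.dumps(rec, ensure_ascii=False) for rec in records)


def fake_teacher(question: str) -> str:
    """Stub standing in for a call to the large model."""
    if "median" in question:
        return "TODO"
    return f"// answer to: {question}"


questions = ["Write a Point class.", "Compute the median.", "Reverse a list."]
records = build_distillation_records(
    questions, fake_teacher, keep=lambda q, a: "TODO" not in a
)
print(to_jsonl(records))
```

An automatic filter like the lambda above only removes obvious failures; it does not replace reading the answers.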

On Colab (free T4 16 GB), new notebook:

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

On your PC with an NVIDIA GPU (CUDA already installed, that’s another topic):

py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
pip install datasets trl peft accelerate bitsandbytes

Save the training script as finetune.py:

from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# 1. Load base model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,         # auto-detect
    load_in_4bit=True,  # massively saves VRAM
)

# 2. Add LoRA adapters (~1 % of weights)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: 8 to 32, 16 is the sweet spot
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# 3. Load and format data
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def to_chatml(ex):
    # User turn = instruction, plus the optional input on its own line
    user = ex["instruction"]
    if ex.get("input"):
        user += "\n" + ex["input"]
    return {"text":
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{ex['output']}<|im_end|>"
    }

dataset = dataset.map(to_chatml)

# 4. Trainer (argument names match the trl version pinned in the install step)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch = 8
        num_train_epochs=3,              # 1-3 max; beyond that, it overfits
        learning_rate=2e-4,              # standard for LoRA, don't go higher
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        output_dir="outputs",
        bf16=is_bfloat16_supported(),      # True on recent GPUs
        fp16=not is_bfloat16_supported(),  # fallback, e.g. on a Colab T4
    ),
)
trainer.train()

# 5. Export to GGUF for Ollama (Q4_K_M = good quality/disk ratio)
model.save_pretrained_gguf("qwen-mycorp-7b", tokenizer, quantization_method="q4_k_m")

Run with python finetune.py. On a free Colab T4, count ~2 h for 1000 examples × 3 epochs.
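To relate the batch settings to the run length, here is a small arithmetic sketch using the values from the script (batch 2, accumulation 4, 3 epochs, 1000 examples). The helper name is an assumption, and the trainer's own rounding may differ by a step per epoch when the dataset size is not a multiple of the effective batch.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective_batch, total_optimizer_steps) for a training run."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = optimizer_steps(
    n_examples=1000, per_device_batch=2, grad_accum=4, epochs=3
)
print(f"effective batch = {effective_batch}, total steps = {total_steps}")
```

With logging_steps=10 you therefore get a loss line roughly every 10 of these steps, and a 2 h run over 375 steps works out to about 19 s per step.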

At the end of step C you have a qwen-mycorp-7b/unsloth.Q4_K_M.gguf file (~4 GB).

Create a Modelfile:

FROM ./qwen-mycorp-7b/unsloth.Q4_K_M.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
SYSTEM """You are an assistant that codes in my team's style."""

Then:

ollama create qwen-mycorp:7b -f Modelfile
ollama run qwen-mycorp:7b
# you can now talk to it like any other Ollama model
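The Modelfile TEMPLATE only helps if it produces the same layout the model saw during training. The sketch below mirrors both layouts in plain Python so the match can be checked offline; `render_prompt` and `training_text` are hypothetical helpers written for this example, not Ollama or Unsloth functions.

```python
def render_prompt(user: str, system: str = "") -> str:
    """Mirror the Modelfile TEMPLATE up to where the model starts answering."""
    text = ""
    if system:
        text += f"<|im_start|>system\n{system}<|im_end|>\n"
    text += f"<|im_start|>user\n{user}<|im_end|>\n"
    text += "<|im_start|>assistant\n"
    return text


def training_text(user: str, answer: str) -> str:
    """Same ChatML layout as a training example (no system turn)."""
    return (
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{answer}<|im_end|>"
    )


user_turn = "Compute the median of an int list in Java."
prompt = render_prompt(user_turn)
example = training_text(user_turn, "public static double median(int[] a) { ... }")

# The inference prompt should be an exact prefix of the training text.
matches = example.startswith(prompt)
print(matches)
```

Note that the Modelfile also sets a SYSTEM message, which adds a system turn the training examples did not contain; it may be worth comparing results with and without it.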

A single line changes. In ollama-demo-3-agent-java/agent_java.py or ollama-demo-4-trio-agents-java/agent.py:

MODEL_NAME = "qwen-mycorp:7b" # was: "llama3.1:8b"

Rerun the demo. Compare:

| Metric | llama3.1:8b (base) | qwen-mycorp:7b (fine-tuned) |
|---|---|---|
| Valid tool_calls / total | | |
| Code that compiles (over 8 prompts) | | |
| Tests passing | | |
| House-style conformity (eyeball) | | |

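Write that table in a README, and that is the “scientific proof” that your fine-tune adds something. For a fair comparison, also run the un-tuned qwen2.5-coder:7b: it separates the effect of the fine-tune from the effect of switching base model. If the numbers do not improve, go back to the prompt.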
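The first row of the table, valid tool_calls over total, can be counted automatically. The sketch below assumes each tool call has been flattened to a dict with a "name" and "arguments" (a dict or a JSON string); that shape, the tool names and the required arguments are all hypothetical, so adapt them to what your demo actually logs.

```python
import json

# Hypothetical tool schema: tool name -> required argument names.
KNOWN_TOOLS = {
    "write_file": {"path", "content"},
    "compile_java": {"path"},
}


def is_valid_tool_call(call: dict, known_tools: dict[str, set[str]]) -> bool:
    """True when the tool exists and all its required arguments are present."""
    name = call.get("name")
    if name not in known_tools:
        return False
    args = call.get("arguments")
    if isinstance(args, str):  # some models emit arguments as a JSON string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return False
    if not isinstance(args, dict):
        return False
    return known_tools[name] <= args.keys()


# Illustrative calls: two valid, one unknown tool, one with broken JSON.
calls = [
    {"name": "write_file", "arguments": {"path": "A.java", "content": "class A {}"}},
    {"name": "compile_java", "arguments": '{"path": "A.java"}'},
    {"name": "delete_everything", "arguments": {}},
    {"name": "compile_java", "arguments": "{path: A.java"},
]
valid = sum(is_valid_tool_call(c, KNOWN_TOOLS) for c in calls)
print(f"valid tool_calls: {valid} / {len(calls)}")
```

Run it over the logs of each model with the same prompts to fill in one cell per column.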


| Trap | Symptom | Fix |
|---|---|---|
| Catastrophic forgetting | The model forgot how to speak English, or how to count. | Low LR (2e-4 max), low epochs (1-3), keep ~5 % “generalist” examples in the dataset. |
| Overfitting | Eval loss rises while train loss falls. The model recites your examples. | More data, or LoRA r=8 instead of 16, or fewer epochs. |
| Broken tool calling | After fine-tune, no more valid tool_call. | Include 100-200 tool-call examples in the dataset (ChatML format with `< |
| Licence | You fine-tune Llama 3.1 for a commercial product without re-reading the licence. | Llama 3.1 has a non-standard licence (it restricts very large-scale usage). Qwen2.5-Coder-7B is Apache 2.0, which is more permissive, but not every Qwen size uses the same licence. Read LICENSE before commercialising. |
| The prompt template | You use the wrong ChatML template, the model gets confused. | Check the official template on the Ollama Library / HuggingFace page of your base model. |
| No eval | You deploy, “looks good”, two weeks later it crashes in prod. | Always keep a separate eval set. Measure before/after in numbers, not feeling. |
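The overfitting symptom in the table (eval loss rising while train loss keeps falling) can be spotted from per-epoch losses. This is a minimal sketch with a hypothetical function name and made-up loss values; in practice the numbers would come from your training logs.

```python
def first_overfit_epoch(train_loss: list[float], eval_loss: list[float]):
    """Return the 1-based epoch where eval loss rises while train loss still
    falls, or None if that never happens."""
    for i in range(1, min(len(train_loss), len(eval_loss))):
        eval_rises = eval_loss[i] > eval_loss[i - 1]
        train_falls = train_loss[i] < train_loss[i - 1]
        if eval_rises and train_falls:
            return i + 1
    return None


# Illustrative per-epoch losses, not measured values.
train_loss = [1.2, 0.8, 0.5, 0.3]
eval_loss = [1.3, 1.0, 1.1, 1.3]
epoch = first_overfit_epoch(train_loss, eval_loss)
print(f"overfitting starts at epoch {epoch}")
```

When the function returns an epoch, the checkpoint saved just before it is usually the one to keep.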

How much does it really cost? Three scenarios

| Scenario | Setup | Total cost | Total time |
|---|---|---|---|
| Student / discovery | Free Colab, 1000 examples, qwen2.5-coder:7b in Q4 | $0 | 1 day (including data prep) |
| Small team | RTX 4060 Ti 16 GB (~$700), 3000 examples, qwen2.5-coder:7b | ~$700 one-off, then $0/run | 1 week |
| Serious production | RTX 4090 or cloud A100, 10 000 hand-reviewed examples, versioned dataset | $2000–5000 + human time | 1 month |

90 % of fine-tuning time = preparing data. 10 % = code and waiting for the GPU. Invest where it counts.


| Word | Real meaning |
|---|---|
| LoRA | “Low-Rank Adaptation”. We add two small matrices alongside each big one, and only those small matrices are trained. ~1 % of parameters. |
| PEFT | “Parameter-Efficient Fine-Tuning”. The family of methods that includes LoRA, QLoRA, etc. |
| QLoRA | LoRA + 4-bit quantization of the base model: the weights take roughly a quarter of the memory they need in 16-bit. That’s what the script does. |
| bnb-4bit | bitsandbytes in 4 bits: the lib that enables on-the-fly quantization. |
| GGUF | File format optimised for llama.cpp / Ollama. What you load locally after fine-tune. |
| Distillation | Have a big model answer, then fine-tune a small model on those answers. |
| DPO / RLHF | Methods where you feed the model “this answer is better than that one” pairs. Out of scope here. |
| SFT | “Supervised Fine-Tuning”: what we do in this chapter (instruction → expected answer). |
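The “~1 % of parameters” figure for LoRA follows from the shapes of the two small matrices. For a weight matrix of size d_out × d_in, rank-r LoRA adds one r × d_in matrix and one d_out × r matrix. The sketch below counts them; the 4096 dimension is an illustrative assumption, not the actual layer size of any specific model.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by rank-r LoRA on one d_out x d_in matrix."""
    return r * d_in + d_out * r


d = 4096  # illustrative layer width
r = 16    # same rank as in the training script
full = d * d
extra = lora_extra_params(d, d, r)
ratio = extra / full
print(f"frozen: {full:,}  trainable: {extra:,}  ratio: {ratio:.2%}")
```

Halving the rank to r=8 halves the trainable parameters, which is why a lower rank is listed as a remedy for overfitting.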

| You want to… | Look at… |
|---|---|
| An official Unsloth tutorial (ready-to-run notebook) | github.com/unslothai/unsloth |
| Understand QLoRA in depth | QLoRA paper, Dettmers et al. 2023 |
| Automatically evaluate a code model | HumanEval, MBPP |
| Do DPO to tune preferences (not just outputs) | trl.DPOTrainer |
| Deploy your fine-tuned model with something other than Ollama | vllm, text-generation-inference (production) |

  1. Before fine-tuning, try prompt → few-shot → RAG. 90 % of problems find their answer there.
  2. LoRA is what you’re looking for: ~1 % of weights, standard format, reasonable duration.
  3. GPU mandatory: ≥ 8 GB CUDA VRAM, or free Google Colab (T4 16 GB) otherwise.
  4. Minimal pipeline: Unsloth + qwen2.5-coder:7b-bnb-4bit + ~1000 examples + 1-2 h on a T4 → GGUF model loadable into Ollama.
  5. One line changes in our demos: MODEL_NAME = "qwen-mycorp:7b". The rest (loop, tools, security) stays identical.
  6. If it doesn’t work, it’s almost always the data: too little, too noisy, or too mono-thematic. Not the fine-tuning code.
  7. Measure before/after with an eval set, otherwise you’re fine-tuning by feel — and that always ends badly.

The course took you from:

“What is an LLM?” (chap 01)

to:

“I can fine-tune my own code model in my team’s style, deploy it locally with Ollama, and plug it into the agents of demos 3 and 4 without changing the rest of the code.” (chap 14)

That’s plenty for a term-long course, a final project, or kicking off an internal product. Your move.

The model thinks. The tools act. The compiler verifies. The human validates. And now, the model can also speak your dialect.