Fine-tuning a code model

Duration: 15 min. Prerequisites: chap 05b (knowing how to pick a model), chap 09-10 (you’ve run the agent demos).

Key idea

Yes, you can train a code model to speak your style, your conventions, your in-house code. But before you do: 90 % of the time the prompt and RAG are enough — fine-tuning without exhausting those options first is like buying a car when you haven’t tried the bus.

The specialisation spectrum (cheapest to most expensive)

Level	Cost	What it does	When to use it
1. Prompt engineering	$0	You give clear instructions (style, format, language).	Always start here.
2. Few-shot in prompt	~$0	You stuff 3 to 10 examples into the system prompt.	Precise style/format, light domain vocabulary.
3. RAG (Retrieval-Augmented Generation)	$	You give the model a `search_docs(query)` tool that hits a vector DB.	Knowledge that changes often (API docs, live codebase).
4. LoRA fine-tuning	$$ (1 GPU + a few hours)	You train small adapter matrices (~1 % of the parameter count). The original weights stay frozen.	Fixed style/format, specific vocabulary, latency optimisation.
5. Full fine-tuning	$$$$ (multi-GPU + days)	You change all the weights.	Research, new domain, big budget. Not for the classroom.

Rule of thumb: if you can solve your problem by editing SYSTEM_PROMPT (chap 11), do it. If not, try few-shot. If not, RAG. If nothing works, then LoRA.
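
To make level 2 concrete, here is a minimal sketch of how a few-shot system prompt could be assembled. The helper name `build_few_shot_prompt` and the sample rule and example are assumptions made for illustration; the actual SYSTEM_PROMPT from chap 11 may be structured differently.

```python
def build_few_shot_prompt(rules, examples):
    """Assemble a system prompt from house rules and (request, answer) pairs."""
    lines = ["You are a coding assistant. Follow these house rules:"]
    lines += [f"- {rule}" for rule in rules]
    lines.append("")
    lines.append("Examples of the expected style:")
    for i, (request, answer) in enumerate(examples, start=1):
        lines.append(f"Example {i} request: {request}")
        lines.append(f"Example {i} answer:\n{answer}")
    return "\n".join(lines)


# Illustrative content only: replace with your own rules and 3 to 10 examples.
SYSTEM_PROMPT = build_few_shot_prompt(
    rules=["Use snake_case for names.", "Prefix translated text with i18n_."],
    examples=[("Greet the user.", 'i18n_greeting = "Hello"\nprint(i18n_greeting)')],
)
print(SYSTEM_PROMPT)
```

If this kind of prompt already gives you the style you want, you can stop here: no GPU needed.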

The 4 cases where LoRA is the right answer

Your team’s style and conventions: “Our code uses snake_case, i18n_ as a prefix for any translated text, never var, never * in imports.” You can write these 4 rules in the prompt, or you can show the model 1000 examples so that it internalises them.
A very specific domain: Z80 assembly, an internal language, a homemade DSL, a proprietary framework GitHub never saw. Fine-tuning is required here; the model cannot guess.
A strict output format: generating JSON for a precise schema? SQL targeting exactly your 47 tables? LoRA reduces format mistakes considerably.
Latency vs quality: you want the quality of a 14B model with the speed of a 3B? Fine-tune the 3B on the 14B’s outputs (“distillation”). The small model mimics the big one on your domain.

The 3 cases where LoRA is NOT the right answer

Symptom	Real solution
“The docs change every day”	RAG, not LoRA. Re-fine-tuning every week is too expensive.
“I have 50 examples”	Few-shot. 50 examples = too few to move weights.
“I want it to be better”	Too vague. Define a metric (e.g. “code compiles 90 % of the time, vs 70 % today”) before touching a GPU.

GPU: yes, you need one. Realistic table below.

Fine-tuning in practice means LoRA in 4-bit (bnb-4bit): we load the model in 4 bits to save VRAM, and we only train the LoRA adapters (~1 % of parameters).
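
Where does a figure like “~1 %” come from? The arithmetic below is a sketch for a single weight matrix. The layer size (4096 × 4096) is a hypothetical value chosen for illustration, not the actual shape of any Qwen layer; the rank r=16 matches the script later in this chapter.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns the update as B @ A, with A of shape (r, d_in) and
    B of shape (d_out, r); the original matrix stays frozen.
    """
    return r * d_in + d_out * r


# Hypothetical layer size, chosen only for illustration.
d_in, d_out, r = 4096, 4096, 16
full = d_in * d_out
added = lora_param_count(d_in, d_out, r)
ratio = added / full
print(f"frozen: {full:,}  trainable: {added:,}  ratio: {ratio:.2%}")
# frozen: 16,777,216  trainable: 131,072  ratio: 0.78%
```

The exact ratio for a whole model depends on the rank and on which layers receive adapters, so treat “~1 %” as an order of magnitude.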

Available VRAM	Realistic model	Typical duration (1000 ex., 3 epochs)	Indicative price
≤ 6 GB	Not really doable locally	—	Use Colab
8 GB (RTX 3060 8 GB, RTX 4060)	LoRA `qwen2.5-coder:1.5b` or `:3b` in Q4	1–2 h	GPU ~$300
12 GB (RTX 3060 12 GB, RTX 4070)	LoRA `qwen2.5-coder:7b` in Q4	2–3 h	~$500–700
16 GB (RTX 4060 Ti 16 GB, RTX 4080)	Comfortable `:7b` LoRA, tight `:14b` in Q4	1.5–4 h	~$700–1200
24 GB (RTX 3090, RTX 4090)	LoRA `:14b` in Q8, distillation	1.5–5 h	~$1500–2000 (used 3090)
40 GB+ (A100, H100)	Anything, including full fine-tune of 7-13B	minutes-hours	Cloud only
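
As a rough sanity check on the table, you can estimate the memory taken by the weights alone. This is a back-of-the-envelope sketch: it uses nominal parameter counts, ignores quantization overhead, and leaves out everything else training needs (adapters, gradients, optimizer state, activations), which is why the real VRAM requirement is noticeably higher.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


# Nominal sizes from the table; real parameter counts differ slightly.
for size in (1.5, 3, 7, 14):
    q4 = weight_memory_gb(size, 4)
    fp16 = weight_memory_gb(size, 16)
    print(f"{size:>4}B  4-bit: {q4:5.2f} GB   16-bit: {fp16:5.1f} GB")
```

A 7B model takes about 3.5 GB in 4-bit versus 14 GB in 16-bit, which is the reason 4-bit loading is what makes it fit on a 12 GB card at all.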

No local GPU? Three options

Option	Cost	What works
Google Colab free (T4, 16 GB VRAM)	$0	LoRA 7B in Q4, ~12 h/day max; sessions of 6 h max
Google Colab Pro	~$10/month	Access to faster GPUs (A100 40 GB when available), longer sessions
Runpod / Vast.ai (per-minute GPU)	$0.30–0.80/h for a 4090	Anything, on demand

Why not a Mac? Apple Silicon (M1/M2/M3) has huge unified memory, but the standard fine-tuning ecosystem (bitsandbytes, xformers, unsloth) is CUDA-only. Possible via mlx, but much less documented. For a course, stay on CUDA (Colab or NVIDIA GPU).

Step-by-step: LoRA on `qwen2.5-coder:7b` with Unsloth

We aim for a light fine-tune on 1000-3000 instruction → answer examples, exported as GGUF, and loaded into Ollama. Then we reuse all the code from demos 3 and 4 — a single MODEL_NAME line to change.

Stack used: Unsloth (advertised as about 2× faster than vanilla Hugging Face, handles 4-bit for you) + datasets + trl/SFTTrainer.

A. Prepare the data (the longest part)

Format: a .jsonl file (one JSON object per line).

{"instruction": "Write a Java class for a 2D Point with equals/hashCode.", "input": "", "output": "public final class Point {\n  private final double x, y;\n  public Point(double x, double y) { this.x = x; this.y = y; }\n  @Override public boolean equals(Object o) { ... }\n  @Override public int hashCode() { return Objects.hash(x, y); }\n}"}
{"instruction": "Compute the median of an int list in Java.", "input": "", "output": "public static double median(int[] a) { ... }"}
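
Before spending GPU time, it is worth checking that every line really is a valid record. The sketch below is one possible validator; it assumes `instruction` and `output` are mandatory and `input` is optional, and the function name `parse_jsonl` is made up for this example.

```python
import json

REQUIRED_KEYS = {"instruction", "output"}  # assumption: "input" is optional


def parse_jsonl(text: str) -> list:
    """Parse JSONL text and reject records that would break training."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        if not record["instruction"].strip() or not record["output"].strip():
            raise ValueError(f"line {line_no}: empty instruction or output")
        records.append(record)
    return records


sample = (
    '{"instruction": "Compute the median of an int list in Java.", "input": "", '
    '"output": "public static double median(int[] a) { ... }"}\n'
)
records = parse_jsonl(sample)
print(len(records), "valid record(s)")
```

In practice you would read the text from your .jsonl file and fix the reported lines by hand.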

Golden rules:

At least 500 examples to see an effect; 1000 to 3000 is the target range. Below 500, do few-shot.
Quality > quantity: 1000 hand-reviewed examples >> 10000 noisy scraped ones.
90 / 10: 90 % in train.jsonl, 10 % in eval.jsonl. Measure eval to detect overfitting.
No duplicates. The model memorises repeated examples.
Mix long/short: not only 5-line examples, not only 200-line ones.
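
Two of these rules (no duplicates, 90 / 10 split) are easy to automate. The following is a minimal sketch, not the only way to do it: the function names, the fixed seed and the exact-match definition of “duplicate” are choices made for this example.

```python
import json
import random


def dedupe_and_split(records, eval_fraction=0.1, seed=42):
    """Drop exact duplicates, shuffle, and return (train, eval) lists."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["instruction"].strip(), rec.get("input", "").strip(), rec["output"].strip())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)  # fixed seed keeps the split reproducible
    n_eval = max(1, round(len(unique) * eval_fraction))
    return unique[n_eval:], unique[:n_eval]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Toy data: 20 distinct records plus one exact duplicate.
records = [{"instruction": f"Task {i}", "input": "", "output": f"answer {i}"} for i in range(20)]
records.append(dict(records[0]))
train, eval_set = dedupe_and_split(records)
print(len(train), "train /", len(eval_set), "eval")
# 18 train / 2 eval
```

You would then call `write_jsonl("train.jsonl", train)` and `write_jsonl("eval.jsonl", eval_set)`. Near-duplicates (same code, different whitespace) are not caught by this exact-match check.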

Possible sources for 1000 examples:

Your own pull requests (commit message → diff).
Your in-house framework docs (docstring → usage example).
Synthetic: take qwen2.5-coder:14b (the big one), ask it 1000 questions, hand-review answers, keep 800. That’s distillation.
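
The synthetic route can be sketched as follows. It assumes a local Ollama server on its default port with the big model already pulled; `ask_ollama` and `build_candidates` are names invented for this example. The demo at the bottom uses a stand-in teacher so that the sketch runs without a server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def ask_ollama(prompt: str, model: str = "qwen2.5-coder:14b") -> str:
    """Send one prompt to a local Ollama server and return the full answer."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def build_candidates(questions, ask=ask_ollama):
    """Turn questions into UNREVIEWED training records, one per question."""
    return [{"instruction": q, "input": "", "output": ask(q)} for q in questions]


# Stand-in teacher so this sketch runs offline; pass ask=ask_ollama for real use.
def fake_teacher(question: str) -> str:
    return f"// answer to: {question}"


candidates = build_candidates(["Reverse a string in Java."], ask=fake_teacher)
print(candidates[0])
```

The hand-review step stays manual: the point of keeping 800 out of 1000 is that a human rejects the weak answers before they reach train.jsonl.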

B. Environment setup

On Colab (free T4 16 GB), new notebook:

!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

On your Windows PC with an NVIDIA GPU (PowerShell; CUDA already installed, that’s another topic):

py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
pip install datasets trl peft accelerate bitsandbytes

C. The fine-tuning script (`finetune.py`)

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# 1. Load base model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,            # auto-detect
    load_in_4bit=True,     # massively saves VRAM
)

# 2. Add LoRA adapters (~1 % of the parameter count)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank: 8 to 32, 16 is the sweet spot
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# 3. Load and format data
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def to_chatml(ex):
    # Append the optional "input" field only when it is non-empty
    user = ex["instruction"]
    if ex.get("input"):
        user += "\n" + ex["input"]
    return {"text":
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n{ex['output']}<|im_end|>"
    }

dataset = dataset.map(to_chatml)

# bf16 needs compute capability 8.0+ (Ampere or newer);
# a Colab T4 (7.5) must fall back to fp16
use_bf16 = torch.cuda.get_device_capability()[0] >= 8

# 4. Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,    # effective batch = 8
        num_train_epochs=3,               # 1-3 max; beyond that it overfits
        learning_rate=2e-4,               # standard for LoRA, don't go higher
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        output_dir="outputs",
        bf16=use_bf16,
        fp16=not use_bf16,
    ),
)
trainer.train()

# 5. Export to GGUF for Ollama (Q4_K_M = good quality/disk ratio)
model.save_pretrained_gguf("qwen-mycorp-7b", tokenizer, quantization_method="q4_k_m")

Run with python finetune.py. On a free Colab T4, allow about 2 h for 1000 examples × 3 epochs.

D. Load the fine-tuned model into Ollama

At the end of step C you have a qwen-mycorp-7b/unsloth.Q4_K_M.gguf file (~4 GB).

Create a Modelfile:

FROM ./qwen-mycorp-7b/unsloth.Q4_K_M.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
SYSTEM """You are an assistant that codes in my team's style."""

Then:

ollama create qwen-mycorp:7b -f Modelfile
ollama run qwen-mycorp:7b
# you can now talk to it like any other Ollama model

E. Test inside the repo demos

A single line changes. In ollama-demo-3-agent-java/agent_java.py or ollama-demo-4-trio-agents-java/agent.py:

MODEL_NAME = "qwen-mycorp:7b"   # was: "llama3.1:8b"

Rerun the demo. Compare:

Metric	`llama3.1:8b` (base)	`qwen-mycorp:7b` (fine-tuned)
Valid `tool_calls` / total	…	…
Code that compiles (over 8 prompts)	…	…
Tests passing	…	…
House-style conformity (eyeball)	…	…

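Fill in that table in a README: it is the “scientific proof” that your fine-tune adds something. For a fair comparison, also run the untuned qwen2.5-coder:7b, so you can separate the effect of the fine-tune from the effect of switching base model. If the numbers don’t improve, go back to the prompt.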
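
If you log pass/fail counts while rerunning the demo, a few lines are enough to produce the comparison. This is a sketch only: the function names are invented, and the counts below are PLACEHOLDERS that show the expected shape of the data, not measurements.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of successes; 0.0 when nothing was run."""
    return passed / total if total else 0.0


def format_comparison(results: dict) -> str:
    """Render one row per metric and one column per model, tab-separated."""
    models = list(results)
    metrics = list(next(iter(results.values())))
    lines = ["\t".join(["Metric", *models])]
    for metric in metrics:
        cells = []
        for model in models:
            passed, total = results[model][metric]
            cells.append(f"{passed}/{total} ({pass_rate(passed, total):.0%})")
        lines.append("\t".join([metric, *cells]))
    return "\n".join(lines)


# PLACEHOLDER counts for illustration only, not real results.
results = {
    "llama3.1:8b": {"Code that compiles": (5, 8), "Tests passing": (4, 8)},
    "qwen-mycorp:7b": {"Code that compiles": (7, 8), "Tests passing": (6, 8)},
}
table = format_comparison(results)
print(table)
```

With only 8 prompts, one extra success moves a rate by 12.5 points, so small differences should not be over-interpreted.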

Classic traps (and how to avoid them)

Trap	Symptom	Fix
Catastrophic forgetting	The model forgot how to speak English, or how to count.	Low LR (`2e-4` max), low epochs (1-3), keep ~5 % “generalist” examples in the dataset.
Overfitting	Eval loss rises while train loss falls. The model recites your examples.	More data, or LoRA `r=8` instead of 16, or fewer epochs.
Broken tool calling	After fine-tune, no more valid `tool_call`.	Include 100-200 tool-call examples in the dataset (in ChatML format).
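Licence	You fine-tune Llama 3.1 for a commercial product without re-reading the licence.	Llama 3.1 ships under its own community licence, with conditions such as a separate agreement above 700 million monthly active users. Most Qwen 2.5 sizes, including the 7B Coder model, are Apache 2.0, which is more permissive; a few sizes (the 3B, for instance) use a more restrictive Qwen licence. Read LICENSE before commercialising.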
The prompt template	You use the wrong ChatML template, the model gets confused.	Check the official template on the Ollama Library / HuggingFace page of your base model.
No eval	You deploy, “looks good”, two weeks later it crashes in prod.	Always a separate eval set. Measure before/after in numbers, not feeling.
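
The overfitting symptom in the table (eval loss rises while train loss falls) can be checked mechanically once you have one loss value per epoch. A minimal sketch, with an invented helper name and illustrative loss values that do not come from a real run:

```python
def first_overfit_epoch(train_loss, eval_loss):
    """Return the 1-based epoch where eval loss rises while train loss
    still falls, or None if that never happens."""
    for i in range(1, min(len(train_loss), len(eval_loss))):
        if eval_loss[i] > eval_loss[i - 1] and train_loss[i] < train_loss[i - 1]:
            return i + 1
    return None


# Illustrative values only, not taken from a real training run.
train_loss = [1.20, 0.80, 0.50]
eval_loss = [1.25, 0.95, 1.05]
epoch = first_overfit_epoch(train_loss, eval_loss)
print("overfitting starts at epoch:", epoch)
# overfitting starts at epoch: 3
```

In that situation, the checkpoint saved at the end of the previous epoch is usually the one to keep.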

How much does it really cost? Three scenarios

Scenario	Setup	Total cost	Total time
Student / discovery	Free Colab, 1000 examples, qwen2.5-coder:7b in Q4	$0	1 day (including data prep)
Small team	RTX 4060 Ti 16 GB (~$700), 3000 examples, qwen2.5-coder:7b	~$700 one-shot, then $0/run	1 week
Serious production	RTX 4090 or cloud A100, 10 000 hand-reviewed examples, versioned dataset	$2000–5000 + human time	1 month

90 % of fine-tuning time = preparing data. 10 % = code and waiting for the GPU. Invest where it counts.
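
For the “buy or rent” question, a quick break-even calculation helps. The sketch below reuses the chapter’s indicative prices (a ~$700 card, rental at $0.30–0.80/h) and assumes 2 h per training run. It ignores electricity, resale value, and the fact that the rental price quoted earlier is for a 4090, a faster card than the 4060 Ti.

```python
def break_even_hours(gpu_price_usd: float, rental_usd_per_hour: float) -> float:
    """GPU hours after which buying costs less than renting."""
    return gpu_price_usd / rental_usd_per_hour


GPU_PRICE = 700.0                  # indicative purchase price from the table
RENTAL_LOW, RENTAL_HIGH = 0.30, 0.80
HOURS_PER_RUN = 2.0                # assumption: one fine-tuning run

for rate in (RENTAL_LOW, RENTAL_HIGH):
    hours = break_even_hours(GPU_PRICE, rate)
    runs = hours / HOURS_PER_RUN
    print(f"${rate:.2f}/h: break-even after {hours:.0f} h (~{runs:.0f} runs)")
```

Under these assumptions you need several hundred runs before the purchase pays for itself, which is one more reason to start on Colab or a rented GPU.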

Mini fine-tuning glossary

Word	Real meaning
LoRA	“Low-Rank Adaptation”. We add two small matrices alongside the big ones; those are what we train. ~1 % of parameters.
PEFT	“Parameter-Efficient Fine-Tuning”. The family of methods that includes LoRA, QLoRA, etc.
QLoRA	LoRA + 4-bit quantization of the base model = LoRA with roughly 1/4 the VRAM for the base weights, compared with 16-bit. That’s what the script does.
bnb-4bit	`bitsandbytes` in 4 bits: the lib that enables on-the-fly quantization.
GGUF	File format optimised for `llama.cpp` / Ollama. What you load locally after fine-tune.
Distillation	Have a big model answer, then fine-tune a small model on those answers.
DPO / RLHF	Methods where you feed the model “this answer is better than that one” pairs. Out of scope here.
SFT	“Supervised Fine-Tuning”: what we do in this chapter (instruction → expected answer).

If you want to go even further

You want to…	Look at…
Run an official Unsloth tutorial (ready-to-run notebook)	github.com/unslothai/unsloth
Understand QLoRA in depth	QLoRA paper, Dettmers et al. 2023
Automatically evaluate a code model	HumanEval, MBPP
Do DPO to tune preferences (not just outputs)	`trl.DPOTrainer`
Deploy your fine-tuned model with something other than Ollama	`vllm`, `text-generation-inference` (production)

Key takeaways

Before fine-tuning, try prompt → few-shot → RAG. 90 % of problems find their answer there.
LoRA is what you’re looking for: ~1 % of weights, standard format, reasonable duration.
GPU mandatory: ≥ 8 GB CUDA VRAM, or free Google Colab (T4 16 GB) otherwise.
Minimal pipeline: Unsloth + unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit + ~1000 examples + 1-2 h on a T4 → GGUF model loadable into Ollama.
One line changes in our demos: MODEL_NAME = "qwen-mycorp:7b". The rest (loop, tools, security) stays identical.
If it doesn’t work, it’s almost always the data: too little, too noisy, or too mono-thematic. Not the fine-tuning code.
Measure before/after with an eval set, otherwise you’re fine-tuning by feel — and that always ends badly.

Closing words (for real this time)

The course took you from:

“What is an LLM?” (chap 01)

to:

“I can fine-tune my own code model in my team’s style, deploy it locally with Ollama, and plug it into the agents of demos 3 and 4 without changing the rest of the code.” (chap 14)

That’s plenty for a term-long course, a final project, or kicking off an internal product. Your move.

The model thinks. The tools act. The compiler verifies. The human validates. And now, the model can also speak your dialect.