Fine-tuning a code model
Duration: 15 min
Prerequisites: chap 05b (knowing how to pick a model), chap 09-10 (you’ve run the agent demos)
Key idea
Yes, you can train a code model to speak your style, your conventions, your in-house code. But before you do: 90 % of the time the prompt and RAG are enough. Fine-tuning without exhausting those options first is like buying a car when you haven’t tried the bus.
The specialisation spectrum (cheapest to most expensive)
| Level | Cost | What it does | When to use it |
|---|---|---|---|
| 1. Prompt engineering | $0 | You give clear instructions (style, format, language). | Always start here. |
| 2. Few-shot in prompt | ~$0 | You stuff 3 to 10 examples into the system prompt. | Precise style/format, light domain vocabulary. |
| 3. RAG (Retrieval-Augmented Generation) | $ | You give the model a search_docs(query) tool that hits a vector DB. | Knowledge that changes often (API docs, live codebase). |
| 4. LoRA fine-tuning | $$ (1 GPU + a few hours) | You change ~1 % of the model’s weights. The rest is frozen. | Fixed style/format, specific vocabulary, latency optimisation. |
| 5. Full fine-tuning | $$$$ (multi-GPU + days) | You change all the weights. | Research, new domain, big budget. Not for the classroom. |
Rule of thumb: if you can solve your problem by editing `SYSTEM_PROMPT` (chap 11), do it. If not, try few-shot. If that is not enough, try RAG. Only if none of these works, reach for LoRA.
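Level 2 of the table (few-shot in the prompt) fits in a few lines. The sketch below is only an illustration: `build_few_shot_prompt`, the house rules and the two example pairs are hypothetical and do not come from the course demos.

```python
def build_few_shot_prompt(base_instructions, examples, max_examples=10):
    """Append up to max_examples (request, answer) pairs to a system prompt."""
    parts = [base_instructions.strip(), "", "Examples:"]
    for i, (request, answer) in enumerate(examples[:max_examples], start=1):
        parts.append(f"### Example {i}")
        parts.append(f"Request: {request}")
        parts.append(f"Answer:\n{answer}")
    return "\n".join(parts)


# Hypothetical house rules and examples, for illustration only.
SYSTEM_PROMPT = build_few_shot_prompt(
    "You write Java in our house style: snake_case names, no wildcard imports.",
    [
        ("Add two ints.", "public static int add_ints(int a, int b) { return a + b; }"),
        ("Negate a boolean.", "public static boolean negate(boolean b) { return !b; }"),
    ],
)
print(SYSTEM_PROMPT)
```

The cap of 10 mirrors the "3 to 10 examples" guideline from the table. Past that point the prompt gets long and slow, which is usually the signal to look at RAG or LoRA.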
The 4 cases where LoRA is the right answer
- Your team’s style and conventions. “Our code uses `snake_case`, `i18n_` as a prefix for any translated text, never `var`, never `*` in imports.” You can write these 4 rules in the prompt, or you can show the model 1000 examples and let it internalise them.
- A very specific domain. Z80 assembly, an internal language, a homemade DSL, a proprietary framework GitHub never saw: fine-tuning is required, the model cannot guess.
- A strict output format. Generating JSON for a precise schema? SQL targeting exactly your 47 tables? LoRA reduces mistakes a lot.
- Latency vs quality. You want the quality of a 14B model with the speed of a 3B? You fine-tune the 3B on the 14B’s outputs (“distillation”). The small model mimics the big one on your domain.
The 3 cases where LoRA is NOT the right answer
| Symptom | Real solution |
|---|---|
| “The docs change every day” | RAG, not LoRA. Re-fine-tuning after every change is too expensive. |
| “I have 50 examples” | Few-shot. 50 examples are too few to move the weights reliably. |
| “I want it to be better” | Too vague. Define a metric (e.g. “code compiles 90 % of the time, vs 70 % today”) before touching a GPU. |
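A metric such as "code compiles N % of the time" is easy to turn into a number. The sketch below is an assumption-laden illustration: `pass_rate` and `python_compiles` are hypothetical helpers, and Python's built-in `compile` stands in for the real Java compiler you would call in the demos.

```python
def pass_rate(outputs, check):
    """Fraction of generated outputs accepted by `check` (a callable returning bool)."""
    if not outputs:
        raise ValueError("need at least one output to measure")
    return sum(1 for out in outputs if check(out)) / len(outputs)


def python_compiles(source):
    """Stand-in checker: True if `source` is syntactically valid Python.

    For the Java demos you would swap in a checker that calls your real compiler.
    """
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


# Toy outputs, written by hand to exercise the metric.
before = ["x = 1", "def f(:", "print('ok')", "for i in range(3) pass"]
after = ["x = 1", "def f(): pass", "print('ok')", "for i in range(3) pass"]
print(f"before: {pass_rate(before, python_compiles):.0%}")  # before: 50%
print(f"after:  {pass_rate(after, python_compiles):.0%}")   # after:  75%
```

Fix the prompt set and the checker before any training run, so that the "before" and "after" numbers are measured the same way.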
GPU: yes, you need one. Realistic table below.
Fine-tuning in practice means LoRA in 4-bit (bnb-4bit): we load the
model in 4 bits to save VRAM, and we only train the LoRA adapters (~1 %
of parameters).
| Available VRAM | Realistic model | Typical duration (1000 ex., 3 epochs) | Indicative price |
|---|---|---|---|
| ≤ 6 GB | Not really doable locally | — | Use Colab |
| 8 GB (RTX 3060 8 GB, RTX 4060) | LoRA qwen2.5-coder:1.5b or :3b in Q4 | 1–2 h | GPU ~$300 |
| 12 GB (RTX 3060 12 GB, RTX 4070) | LoRA qwen2.5-coder:7b in Q4 | 2–3 h | ~$500–700 |
| 16 GB (RTX 4060 Ti 16 GB, RTX 4080) | Comfortable :7b LoRA, tight :14b in Q4 | 1.5–4 h | ~$700–1200 |
| 24 GB (RTX 3090, RTX 4090) | LoRA :14b in Q8, distillation | 1.5–5 h | ~$1500–2000 (used 3090) |
| 40 GB+ (A100, H100) | Anything in this chapter; full fine-tune of 7-13B typically needs several such GPUs | minutes to hours | Cloud only |
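The VRAM column can be sanity-checked with back-of-the-envelope arithmetic: the weights alone take (number of parameters × bits per weight / 8) bytes. The helper below is a rough sketch with a hypothetical name. It uses the nominal model sizes from the table and deliberately ignores everything else that lives on the GPU (quantization overhead, LoRA adapters, gradients, optimizer state, activations), so treat the result as a lower bound.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Memory taken by the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (1.5, 3, 7, 14):
    fp16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size:>4}B  fp16 = {fp16:5.1f} GB   4-bit = {q4:4.1f} GB")
```

A 7B model drops from about 14 GB of weights in 16-bit to about 3.5 GB in 4-bit, which is why it becomes trainable on a 12 to 16 GB card once the remaining overhead is added.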
No local GPU? Three options
| Option | Cost | What works |
|---|---|---|
| Google Colab free (T4, 16 GB VRAM) | $0 | LoRA 7B in Q4, ~12 h/day max; sessions of 6 h max |
| Google Colab Pro | ~$10/month | A100 40 GB, longer sessions |
| Runpod / Vast.ai (per-minute GPU) | $0.30–0.80/h for a 4090 | Anything, on demand |
Why not a Mac? Apple Silicon (M1/M2/M3) has huge unified memory, but the standard fine-tuning ecosystem (`bitsandbytes`, `xformers`, `unsloth`) is CUDA-only. It is possible via `mlx`, but much less documented. For a course, stay on CUDA (Colab or NVIDIA GPU).
Step-by-step: LoRA on qwen2.5-coder:7b with Unsloth
We aim for a light fine-tune on 1000-3000 instruction → answer examples,
exported as GGUF, and loaded into Ollama. Then we reuse all the code
from demos 3 and 4, with a single MODEL_NAME line to change.
Stack used: Unsloth (advertised as about 2× faster than vanilla Hugging Face, handles 4-bit for you) + datasets + trl/SFTTrainer.
A. Prepare the data (the longest part)
Format: a .jsonl file (one JSON object per line).

```json
{"instruction": "Write a Java class for a 2D Point with equals/hashCode.", "input": "", "output": "public final class Point {\n private final double x, y;\n public Point(double x, double y) { this.x = x; this.y = y; }\n @Override public boolean equals(Object o) { ... }\n @Override public int hashCode() { return Objects.hash(x, y); }\n}"}
{"instruction": "Compute the median of an int list in Java.", "input": "", "output": "public static double median(int[] a) { ... }"}
```

Golden rules:
- At least 500 examples to see an effect; 1000 to 3000 is a comfortable target. Below that, use few-shot.
- Quality > quantity: 1000 hand-reviewed examples >> 10000 noisy scraped ones.
- 90 / 10: 90 % in train.jsonl, 10 % in eval.jsonl. Track the loss on the eval set to detect overfitting.
- No duplicates. The model memorises repeated examples.
- Mix long/short: not only 5-line examples, not only 200-line ones.
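Two of these rules (no duplicates, 90 / 10 split) are mechanical and can be scripted. This is a minimal sketch using only the standard library. `dedupe_and_split` and `write_jsonl` are hypothetical helper names, and only exact duplicates are caught; near-duplicates still need a human eye.

```python
import json
import random


def dedupe_and_split(records, eval_fraction=0.1, seed=42):
    """Drop exact duplicates, shuffle, and return (train, eval) lists."""
    seen, unique = set(), []
    for rec in records:
        # Canonical JSON so key order does not hide a duplicate.
        key = json.dumps(rec, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)
    n_eval = max(1, round(len(unique) * eval_fraction))
    return unique[n_eval:], unique[:n_eval]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


# Toy data: 20 distinct examples plus 5 repeats.
raw = [{"instruction": f"Task {i}", "input": "", "output": f"// answer {i}"} for i in range(20)]
raw += raw[:5]
train, eval_set = dedupe_and_split(raw)
print(len(train), len(eval_set))  # 18 2
```

On real data you would finish with `write_jsonl("train.jsonl", train)` and `write_jsonl("eval.jsonl", eval_set)`. The fixed seed keeps the split reproducible from one run to the next.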
Possible sources for 1000 examples:
- Your own pull requests (commit message → diff).
- Your in-house framework docs (docstring → usage example).
- Synthetic: take qwen2.5-coder:14b (the big one), ask it 1000 questions, hand-review the answers, keep 800. That’s distillation.
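The synthetic route can be sketched as follows, assuming a local Ollama server on its default port with the teacher model already pulled. `ask_ollama` and `build_candidates` are hypothetical helpers. The demo at the bottom uses a fake teacher so that it runs without a server, and the hand-review step stays manual.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint


def ask_ollama(model, prompt):
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def build_candidates(questions, ask, teacher="qwen2.5-coder:14b"):
    """Ask the teacher model every question and return records in the .jsonl schema."""
    return [{"instruction": q, "input": "", "output": ask(teacher, q)} for q in questions]


# Offline demonstration with a fake teacher; pass ask_ollama to query a real server.
def fake_teacher(model, prompt):
    return f"// {model} would answer: {prompt}"


for rec in build_candidates(["Reverse a String in Java."], fake_teacher):
    print(json.dumps(rec))
```

Passing the `ask` function as a parameter keeps the generation loop testable and makes it easy to swap the teacher for another model or API.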
B. Environment setup
On Colab (free T4 16 GB), new notebook:

```
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes
```

On your PC with an NVIDIA GPU (CUDA already installed, that’s another topic):

```powershell
py -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install "unsloth @ git+https://github.com/unslothai/unsloth.git"
pip install datasets trl peft accelerate bitsandbytes
```

C. The fine-tuning script (finetune.py)
```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# 1. Load base model in 4-bit
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-Coder-7B-Instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # auto-detect
    load_in_4bit=True,   # massively saves VRAM
)

# 2. Add LoRA adapters (~1 % of weights)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: 8 to 32, 16 is the sweet spot
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# 3. Load and format data
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def to_chatml(ex):
    # Join instruction and optional input; strip() avoids a stray newline
    # before <|im_end|> when "input" is empty.
    user = f"{ex['instruction']}\n{ex.get('input') or ''}".strip()
    return {"text": f"<|im_start|>user\n{user}<|im_end|>\n"
                    f"<|im_start|>assistant\n{ex['output']}<|im_end|>"}

dataset = dataset.map(to_chatml)

# 4. Trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,     # effective batch = 8
        num_train_epochs=3,                # 1-3 max; beyond that, it overfits
        learning_rate=2e-4,                # standard for LoRA, don't go higher
        warmup_ratio=0.03,
        logging_steps=10,
        save_strategy="epoch",
        output_dir="outputs",
        bf16=is_bfloat16_supported(),      # Ampere or newer GPUs
        fp16=not is_bfloat16_supported(),  # fallback, e.g. Colab T4
    ),
)
trainer.train()

# 5. Export to GGUF for Ollama (Q4_K_M = good quality/disk ratio)
model.save_pretrained_gguf("qwen-mycorp-7b", tokenizer, quantization_method="q4_k_m")
```

Run with python finetune.py. On a free Colab T4, count ~2 h for 1000
examples × 3 epochs.
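To read the training logs, it helps to know how many optimizer steps the settings above imply. The sketch below is plain arithmetic with a hypothetical helper name. The real trainer may differ by a step per epoch when the dataset size is not a multiple of the effective batch.

```python
import math


def training_steps(n_examples, per_device_batch, grad_accum, epochs, n_gpus=1):
    """Return (effective batch size, approximate optimizer steps for the whole run)."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


batch, steps = training_steps(n_examples=1000, per_device_batch=2, grad_accum=4, epochs=3)
print(f"effective batch = {batch}, optimizer steps = {steps}")
# effective batch = 8, optimizer steps = 375
```

With `logging_steps=10`, that is a few dozen loss values to watch over the run, and `warmup_ratio=0.03` covers roughly the first dozen steps.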
D. Load the fine-tuned model into Ollama
At the end of step C you have a qwen-mycorp-7b/unsloth.Q4_K_M.gguf
file (~4 GB).
Create a Modelfile:
```
FROM ./qwen-mycorp-7b/unsloth.Q4_K_M.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
SYSTEM """You are an assistant that codes in my team's style."""
```

Then:

```bash
ollama create qwen-mycorp:7b -f Modelfile
ollama run qwen-mycorp:7b
# you can now talk to it like any other Ollama model
```

E. Test inside the repo demos
A single line changes. In ollama-demo-3-agent-java/agent_java.py or
ollama-demo-4-trio-agents-java/agent.py:

```python
MODEL_NAME = "qwen-mycorp:7b"  # was: "llama3.1:8b"
```

Rerun the demo. Compare:
| Metric | llama3.1:8b (base) | qwen-mycorp:7b (fine-tuned) |
|---|---|---|
| Valid tool_calls / total | … | … |
| Code that compiles (over 8 prompts) | … | … |
| Tests passing | … | … |
| House-style conformity (eyeball) | … | … |
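Write that table in a README: that is the “scientific proof” that your fine-tune adds something. If the numbers do not improve, go back to the prompt. For a fair reading, also run the untuned qwen2.5-coder:7b, so you can tell the effect of the fine-tune apart from the effect of switching base model.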
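Filling in the README table can be automated once you have pass counts per metric. The snippet below is a small sketch with hypothetical helper names. The counts are made up to show the layout and are not measurements.

```python
def rate(passed, total):
    """Format a count as 'passed/total (percent)'."""
    return f"{passed}/{total} ({passed / total:.0%})"


def comparison_table(rows, base_name, tuned_name):
    """rows: (metric label, (base passed, base total), (tuned passed, tuned total))."""
    lines = [f"| Metric | {base_name} | {tuned_name} |", "|---|---|---|"]
    for label, base, tuned in rows:
        lines.append(f"| {label} | {rate(*base)} | {rate(*tuned)} |")
    return "\n".join(lines)


# Made-up counts to show the layout; replace them with your own measurements.
rows = [
    ("Code that compiles (over 8 prompts)", (4, 8), (6, 8)),
    ("Tests passing", (2, 8), (4, 8)),
]
table = comparison_table(rows, "llama3.1:8b (base)", "qwen-mycorp:7b (fine-tuned)")
print(table)
```

Keeping the raw counts next to the percentages matters with only 8 prompts, where a single extra success moves the rate by more than 12 points.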
Classic traps (and how to avoid them)
| Trap | Symptom | Fix |
|---|---|---|
| Catastrophic forgetting | The model forgot how to speak English, or how to count. | Low learning rate (2e-4 max), few epochs (1-3), keep ~5 % “generalist” examples in the dataset. |
| Overfitting | Eval loss rises while train loss falls. The model recites your examples. | More data, or LoRA r=8 instead of 16, or fewer epochs. |
| Broken tool calling | After fine-tune, no more valid tool_call. | Include 100-200 tool-call examples in the dataset (ChatML format with `< |
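| Licence | You fine-tune Llama 3.1 for a commercial product without re-reading the licence. | Llama 3.1 ships under Meta’s own community licence (among other conditions, a separate licence is required above 700 million monthly active users). Most Qwen 2.5 sizes, including the 7B used here, are Apache 2.0; the 3B and 72B variants use a different Qwen licence. Read LICENSE before commercialising. |
| The prompt template | You use the wrong chat template, the model gets confused. | Check the official template on the Ollama Library / Hugging Face page of your base model, and use the same one for training and in the Modelfile. |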
| No eval | You deploy, “looks good”, two weeks later it crashes in prod. | Always a separate eval set. Measure before/after in numbers, not feeling. |
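The prompt-template trap can be caught with a quick consistency check: the text sent at inference should be exactly the prefix the model saw during training. The sketch below assumes the ChatML layout used in this chapter. The three function names are hypothetical, and the runtime side is simulated in Python rather than read from a real Modelfile.

```python
def training_text(instruction, output):
    """One training example in ChatML, without a system turn."""
    return (f"<|im_start|>user\n{instruction}<|im_end|>\n"
            f"<|im_start|>assistant\n{output}<|im_end|>")


def inference_prompt(prompt, system=""):
    """What a ChatML runtime template sends before the model starts to answer."""
    head = f"<|im_start|>system\n{system}<|im_end|>\n" if system else ""
    return head + f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"


def templates_agree(instruction, output, system=""):
    """True if the runtime prompt is exactly the prefix seen during training."""
    return training_text(instruction, output).startswith(inference_prompt(instruction, system))


answer = "return new StringBuilder(s).reverse().toString();"
print(templates_agree("Reverse a string.", answer))                         # True
print(templates_agree("Reverse a string.", answer, system="Be concise."))   # False
```

The second call returns False because the training examples have no system turn. If you rely on a system prompt at inference, one way to remove the mismatch is to include it in the training examples as well.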
How much does it really cost? Three scenarios
| Scenario | Setup | Total cost | Total time |
|---|---|---|---|
| Student / discovery | Free Colab, 1000 examples, qwen2.5-coder:7b in Q4 | $0 | 1 day (including data prep) |
| Small team | RTX 4060 Ti 16 GB (~$700), 3000 examples, qwen2.5-coder:7b | ~$700 one-off, then $0/run | 1 week |
| Serious production | RTX 4090 or cloud A100, 10 000 hand-reviewed examples, versioned dataset | $2000–5000 + human time | 1 month |
90 % of fine-tuning time = preparing data. 10 % = code and waiting for the GPU. Invest where it counts.
Mini fine-tuning glossary
| Word | Real meaning |
|---|---|
| LoRA | “Low-Rank Adaptation”. We add two small matrices alongside each big one and train only those. ~1 % of parameters. |
| PEFT | “Parameter-Efficient Fine-Tuning”. The family of methods that includes LoRA, QLoRA, etc. |
| QLoRA | LoRA + 4-bit quantization of the base model: the frozen weights take roughly a quarter of the VRAM they need in 16-bit. That’s what the script does. |
| bnb-4bit | bitsandbytes in 4 bits: the lib that enables on-the-fly quantization. |
| GGUF | File format optimised for llama.cpp / Ollama. What you load locally after fine-tune. |
| Distillation | Have a big model answer, then fine-tune a small model on those answers. |
| DPO / RLHF | Methods where you feed the model “this answer is better than that one” pairs. Out of scope here. |
| SFT | “Supervised Fine-Tuning”: what we do in this chapter (instruction → expected answer). |
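The LoRA entry can be made concrete by counting the trainable parameters. For a weight matrix of shape d_out × d_in, rank-r LoRA adds two matrices of shapes r × d_in and d_out × r, so r × (d_in + d_out) parameters. The layer shapes below are assumed values for Qwen2.5-7B, written from memory; check the `config.json` of your base model before trusting the total.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one rank-r adapter on a d_out x d_in weight matrix."""
    return r * (d_in + d_out)


# Assumed Qwen2.5-7B shapes (verify against the model's config.json).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 3584, 18944, 512, 28
SHAPES = {  # module name -> (d_in, d_out)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def total_lora_params(r=16):
    """Trainable parameters when every module in SHAPES gets an adapter in each layer."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in SHAPES.values())
    return per_layer * LAYERS


print(f"r=16: {total_lora_params(16):,} trainable parameters")  # r=16: 40,370,176 trainable parameters
print(f"r=8:  {total_lora_params(8):,} trainable parameters")   # r=8:  20,185,088 trainable parameters
```

Under these assumptions, r=16 trains about 40 million parameters, which is well under 1 % of a 7B-class model. Halving the rank halves the count, which is why dropping to r=8 is a cheap way to fight overfitting.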
If you want to go even further
| You want to… | Look at… |
|---|---|
| Run an official Unsloth tutorial (ready-to-run notebook) | github.com/unslothai/unsloth |
| Understand QLoRA in depth | QLoRA paper, Dettmers et al. 2023 |
| Automatically evaluate a code model | HumanEval, MBPP |
| Do DPO to tune preferences (not just outputs) | trl.DPOTrainer |
| Deploy your fine-tuned model with something other than Ollama | vllm, text-generation-inference (production) |
Key takeaways
- Before fine-tuning, try prompt → few-shot → RAG. 90 % of problems find their answer there.
- LoRA is what you’re looking for: ~1 % of weights, standard format, reasonable duration.
- GPU mandatory: ≥ 8 GB CUDA VRAM, or free Google Colab (T4 16 GB) otherwise.
- Minimal pipeline: Unsloth + qwen2.5-coder:7b-bnb-4bit + ~1000 examples + about 2 h on a T4 → GGUF model loadable into Ollama.
- One line changes in our demos: `MODEL_NAME = "qwen-mycorp:7b"`. The rest (loop, tools, security) stays identical.
- If it doesn’t work, it’s almost always the data: too little, too noisy, or too mono-thematic. Not the fine-tuning code.
- Measure before/after with an eval set, otherwise you’re fine-tuning by feel — and that always ends badly.
Closing words (for real this time)
The course took you from:
“What is an LLM?” (chap 01)
to:
“I can fine-tune my own code model in my team’s style, deploy it locally with Ollama, and plug it into the agents of demos 3 and 4 without changing the rest of the code.” (chap 14)
That’s plenty for a term-long course, a final project, or kicking off an internal product. Your move.
The model thinks. The tools act. The compiler verifies. The human validates. And now, the model can also speak your dialect.