Day 25 — Practical Fine-Tuning — LoRA / QLoRA, PEFT, Instruction Datasets, DPO
Fine-tuning is back as the way to specialise models for your domain and reduce inference cost. LoRA + QLoRA make it tractable on commodity GPUs; DPO / ORPO have…
Fine-tuning lets you specialise a foundation model for a domain, style or task. It is not for adding raw knowledge (that's RAG); it's for shaping behaviour and format. Modern PEFT methods make this cheap.
🧠 Concept
Why it matters & the mental model.
1. The fine-tuning hierarchy
- Prompting: free, instant, but limited.
- Few-shot in context: better, but costs context tokens.
- RAG: gives knowledge, doesn't change behaviour.
- Fine-tuning (full): changes all weights, expensive (8×A100 for 7B).
- PEFT (LoRA, etc.): trains <1% of params, almost as good for most tasks.
2. LoRA — the trick
For each weight matrix W ∈ R^{d×k}, LoRA writes the fine-tuning update as ΔW = AB, where A ∈ R^{d×r}, B ∈ R^{r×k} and r ≪ min(d, k). We freeze W and train only A and B. At inference, either keep the adapter separate and compute Wx + A(Bx), or merge once into W' = W + AB.
The assumption: fine-tuning updates lie in a low-rank subspace of the full weight space. Empirically true: r=8-32 captures most of the gain.
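A toy sketch of the two inference paths, in plain Python with made-up sizes (d=6, k=4, r=2) and random numbers standing in for trained weights. It follows the notation above (A is d×r, B is r×k) and leaves out the alpha/r scaling that real implementations apply to the update. The helper names are illustrative only.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

rng = random.Random(0)
d, k, r = 6, 4, 2  # toy sizes; real layers have d and k in the thousands

W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen base weight, d x k
A = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # trainable factor, d x r
B = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # trainable factor, r x k
x = [rng.gauss(0, 1) for _ in range(k)]                      # one input vector

# Path 1: adapter kept separate, y = W x + A (B x)
y_adapter = [wx + ax for wx, ax in zip(matvec(W, x), matvec(A, matvec(B, x)))]

# Path 2: update folded into the weight once, W' = W + A B
AB = matmul(A, B)
W_merged = [[w + u for w, u in zip(rw, ru)] for rw, ru in zip(W, AB)]
y_merged = matvec(W_merged, x)

# Both paths agree up to floating-point rounding
max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))

full_params = d * k            # what full fine-tuning would train for this matrix
adapter_params = r * (d + k)   # what LoRA trains instead
```

At these toy sizes the saving is small (20 vs 24 parameters); it only becomes dramatic once d and k are in the thousands and r stays at 8-32.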
Memory: a 7B model in BF16 = ~14 GB; LoRA adapters at r=16 add < 100 MB. Trainable params drop from 7B → ~50M.
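The adapter size can be checked with a back-of-envelope count. The sketch below assumes Llama-2-7B-like shapes (hidden size 4096, MLP size 11008, 32 layers, no grouped-query attention) and adapters on all seven projection matrices per layer; other 7B models will give somewhat different numbers.

```python
def lora_param_count(shapes, r):
    """Trainable parameters for LoRA factors A (d x r) and B (r x k) on each (d, k) matrix."""
    return sum(r * (d + k) for d, k in shapes)

# Assumed architecture (Llama-2-7B-like); not taken from the text above.
hidden, mlp, layers = 4096, 11008, 32
per_layer_shapes = (
    [(hidden, hidden)] * 4   # q_proj, k_proj, v_proj, o_proj
    + [(mlp, hidden)] * 2    # gate_proj, up_proj
    + [(hidden, mlp)]        # down_proj
)

trainable = layers * lora_param_count(per_layer_shapes, r=16)
adapter_mb = trainable * 2 / 1e6  # BF16 stores 2 bytes per parameter

print(f"trainable params: {trainable:,}")
print(f"adapter size: {adapter_mb:.0f} MB")
```

Under these assumptions the count comes to roughly 40M parameters and about 80 MB, in line with the figures quoted above.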
3. QLoRA — adding 4-bit quantisation
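The base model is loaded in NF4 (4-bit NormalFloat), which cuts weight memory roughly 4× relative to BF16. LoRA adapters stay in BF16. Result: a 7B model is fine-tuneable on a 24GB GPU; a 70B model fits on a single A100 80GB, or on 2×24GB with device_map="auto" if batch size and sequence length are kept small.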
Quality cost vs full fine-tune: < 1pt on MMLU-style benchmarks for SFT. For RLHF/DPO the gap is similarly small.
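The memory claims follow from simple arithmetic on the weights alone. This is a rough sketch: it ignores quantisation constants, activations, gradients and optimizer state for the adapters, all of which add to the real footprint.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    bf16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{name}: BF16 {bf16:.1f} GB, NF4 {nf4:.1f} GB")
```

A 7B model drops from 14 GB to 3.5 GB of weights, and a 70B model from 140 GB to 35 GB, which is why the latter becomes feasible on one 80GB card.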
🛠 Deep Dive
Internals, code, architecture.
4. SFT (supervised fine-tuning) recipe
- Pick base model (usually instruct / chat variant of Llama 3, Qwen, Mistral).
- Format dataset as (prompt, completion) with the model's chat template.
- Set: lr=2e-4 (LoRA), batch=4, grad_accum=8, epochs=3, lora_r=16, alpha=32, dropout=0.05, target modules = q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
- Use trl.SFTTrainer or unsloth (2-4× faster, less memory).
- Eval on held-out before merge; merge with model.merge_and_unload() if you want a single deployable.
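The hyperparameters in the recipe can be collected in one place and sanity-checked before launching a run. The sketch below is plain Python, not a trainer invocation: the dictionary layout and helper functions are illustrative, the keys under "lora" mirror the argument names of peft's LoraConfig, and the 50,000-example dataset size is a hypothetical input.

```python
import math

# Values from the recipe; the structure of this dict is an assumption.
sft_config = {
    "learning_rate": 2e-4,
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "num_epochs": 3,
    "lora": {
        "r": 16,
        "lora_alpha": 32,
        "lora_dropout": 0.05,
        "target_modules": [
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
    },
}

def effective_batch_size(cfg, n_gpus=1):
    """Examples consumed per optimizer step."""
    return cfg["per_device_batch_size"] * cfg["gradient_accumulation_steps"] * n_gpus

def optimizer_steps(cfg, n_examples, n_gpus=1):
    """Total optimizer steps over all epochs (last partial batch counts as a step)."""
    steps_per_epoch = math.ceil(n_examples / effective_batch_size(cfg, n_gpus))
    return steps_per_epoch * cfg["num_epochs"]

# LoRA scales the low-rank update by alpha / r
lora_scale = sft_config["lora"]["lora_alpha"] / sft_config["lora"]["r"]

print("effective batch:", effective_batch_size(sft_config))
print("optimizer steps for 50k examples:", optimizer_steps(sft_config, 50_000))
print("LoRA scale:", lora_scale)
```

With batch=4 and grad_accum=8 on one GPU, each optimizer step sees 32 examples, so three epochs over 50k examples is a few thousand steps.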
5. Dataset quality > size
1k high-quality examples > 100k noisy ones. Curate ruthlessly:
- De-duplicate (MinHash).
- Filter for response quality (rubric-based or model-based).
- Balance task types and difficulty.
- Hold out a strict eval set (no leakage).
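A minimal sketch of the MinHash de-duplication step, using only the standard library. The shingle size, number of hash functions and 0.8 threshold are assumptions chosen for illustration, and the pairwise comparison is quadratic; production pipelines typically add locality-sensitive hashing on top of the signatures.

```python
import hashlib

def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams; texts shorter than n words become a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash(text: str, num_hashes: int = 64) -> list:
    """Signature: for each hash function, the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(num_hashes)]

def similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedupe(examples: list, threshold: float = 0.8) -> list:
    """Keep the first example of every near-duplicate group."""
    kept, kept_sigs = [], []
    for ex in examples:
        sig = minhash(ex)
        if all(similarity(sig, other) < threshold for other in kept_sigs):
            kept.append(ex)
            kept_sigs.append(sig)
    return kept

raw = [
    "Reset your password from the account settings page",
    "Reset your password from the account settings page",
    "Refunds are processed within five business days",
]
clean = dedupe(raw)
```

Run the same de-duplication across train and eval together before splitting, so a near-duplicate of a training example cannot end up in the held-out set.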
6. Alignment / preference tuning
After SFT, align to human preferences:
- RLHF (PPO): train reward model on (prompt, win, lose), then PPO on policy. Powerful, finicky, GPU-heavy.
- DPO: directly optimises policy on preference pairs with a simple ranking loss — no reward model, no RL loop. Has largely replaced PPO for production.
- ORPO: combines SFT and preference in one loss — even simpler.
- KTO: works with single-label preferences (👍/👎) without pairs.
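DPO loss: -log σ(β · [(log π(y_w|x) - log π_ref(y_w|x)) - (log π(y_l|x) - log π_ref(y_l|x))]), where y_w is the preferred completion, y_l the rejected one, π the policy being trained and π_ref the frozen reference model (typically the SFT checkpoint). β=0.1 typical.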
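The loss for a single preference pair can be written out in a few lines. This sketch assumes the four inputs are sequence log-probabilities (the sum of token log-probs of each completion given the prompt) that have already been computed; the function and argument names are illustrative.

```python
import math

def dpo_loss(pi_w: float, pi_l: float, ref_w: float, ref_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (prompt, preferred, rejected) triple.

    pi_w, pi_l: log pi(y_w|x), log pi(y_l|x) under the policy being trained
    ref_w, ref_l: the same log-probs under the frozen reference model
    """
    # How much more the policy favours y_w over y_l than the reference does
    margin = (pi_w - ref_w) - (pi_l - ref_l)
    z = beta * margin
    # -log sigmoid(z) = log(1 + exp(-z)), written to avoid overflow for large |z|
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

at_reference = dpo_loss(-15.0, -15.0, -15.0, -15.0)  # policy identical to reference
improved = dpo_loss(-10.0, -20.0, -15.0, -15.0)      # policy now prefers y_w
worsened = dpo_loss(-20.0, -10.0, -15.0, -15.0)      # policy now prefers y_l
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy shifts probability toward the preferred completion and rises in the opposite case.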
7. When NOT to fine-tune
- Need fresh knowledge → use RAG, retrieval is cheap to update.
- Task is sufficiently solved by prompting + few-shot.
- < 500 examples available → likely overfit or hurt the base model.
- You can't measure success → don't ship unmeasured changes.
🚀 In Practice
Trade-offs, exercises, what to ship today.
8. Catastrophic forgetting
Fine-tuning on a narrow task can degrade general capabilities. Mitigations:
- Mix general instruction data with task data (e.g. 10% Alpaca / Tulu).
- Lower learning rate, fewer epochs.
- LoRA helps (base weights frozen).
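The data-mixing mitigation can be sketched as a small helper that samples enough general instruction examples to make up a target share of the final training set. The function name, the seed handling and the placeholder datasets are assumptions for illustration.

```python
import random

def mix_with_general(task_data, general_data, general_fraction=0.1, seed=0):
    """Return a shuffled training set where about general_fraction of examples are general."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general
    n_general = round(len(task_data) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_data))  # cannot sample more than we have
    mixed = list(task_data) + rng.sample(list(general_data), n_general)
    rng.shuffle(mixed)
    return mixed

# Placeholder datasets: tuples of (source, index) instead of real examples
task = [("task", i) for i in range(90)]
general = [("general", i) for i in range(500)]

mixed = mix_with_general(task, general, general_fraction=0.1)
n_general_in_mix = sum(1 for source, _ in mixed if source == "general")
```

With 90 task examples and a 10% target, the helper adds 10 general examples for a total of 100.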
9. Serving fine-tuned models
- Merge LoRA into base for single artifact → standard vLLM serving.
- Or keep adapters separate and swap at runtime (vLLM 0.5+ multi-LoRA, BentoML) → one base model serves many tenants.
10. Cost model
QLoRA SFT of 8B on 50k examples ≈ 4-12 GPU hours on A100, ~$10-50 on spot. DPO adds another 4-8 hours. That is often cheaper than the engineer time spent prompt-engineering around the same problem.
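The estimate is just GPU hours times an hourly rate. In the sketch below the hour ranges come from the text, while the spot price range of $2.50-4.00 per A100-hour is an assumption picked so that the SFT range roughly reproduces the figure above; check current provider pricing before relying on it.

```python
def cost_range_usd(hours_range, rate_range):
    """(low, high) cost from (low, high) GPU hours and (low, high) USD per GPU hour."""
    return (hours_range[0] * rate_range[0], hours_range[1] * rate_range[1])

sft_hours = (4, 12)       # from the text
dpo_hours = (4, 8)        # from the text
spot_rate = (2.5, 4.0)    # assumed USD per A100-hour, hypothetical

sft_cost = cost_range_usd(sft_hours, spot_rate)
total_hours = (sft_hours[0] + dpo_hours[0], sft_hours[1] + dpo_hours[1])
total_cost = cost_range_usd(total_hours, spot_rate)

print(f"SFT only:  ${sft_cost[0]:.0f}-{sft_cost[1]:.0f}")
print(f"SFT + DPO: ${total_cost[0]:.0f}-{total_cost[1]:.0f}")
```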
11. What to take away
"When would you fine-tune vs RAG?" Strong answers: fine-tune for behaviour/format, RAG for knowledge; PEFT for cost; mention DPO over PPO as modern alignment; cite a real eval before/after.
Resources
- 🎥 HuggingFace — Fine-tuning LLMs with LoRA / QLoRA (Sebastian Raschka)
- 📖 LoRA paper — Hu et al.
- 📖 QLoRA paper — Dettmers et al.
- 📖 DPO paper — Rafailov et al.
Practice Problem: Container With Most Water (Medium)