ai ml · advanced · 12m · 2026-06-23

Day 25 — Practical Fine-Tuning — LoRA / QLoRA, PEFT, Instruction Datasets, DPO

Fine-tuning lets you specialise a foundation model for a domain, style or task. It is not for adding raw knowledge (that's RAG); it's for shaping behaviour and format. Modern PEFT methods make this cheap.

🧠 Concept

Why it matters & the mental model.

1. The fine-tuning hierarchy

  • Prompting: free, instant, but limited.
  • Few-shot in context: better, but costs context tokens.
  • RAG: gives knowledge, doesn't change behaviour.
  • Fine-tuning (full): changes all weights, expensive (on the order of an 8×A100 node for a 7B model).
  • PEFT (LoRA, etc.): trains <1% of params, often close to full fine-tuning for most tasks.

2. LoRA — the trick

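For each weight matrix W ∈ R^{d×k}, fine-tuning learns an update ΔW = AB, where A ∈ R^{d×r}, B ∈ R^{r×k} and r ≪ min(d, k). W stays frozen; only A and B are trained. At inference, either keep the adapter separate and compute Wx + A(Bx), or merge once into W' = W + AB and serve a single matrix. Implementations usually scale the update by α/r, where α is the alpha hyperparameter in the recipe in section 4.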

The assumption: fine-tuning updates lie in a low-rank subspace of the full weight space. Empirically this holds well enough in practice: r = 8 to 32 usually captures most of the gain.
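A minimal sketch of the low-rank update, using toy sizes and plain Python lists so it runs without any dependencies. The helper names (matmul, add, rand_matrix) are illustrative, and the α/r scaling that real implementations apply is left out for brevity. It checks that keeping the adapter separate and merging it into W give the same output.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r = 6, 5, 2                 # toy sizes; real layers have d, k in the thousands

W = rand_matrix(d, k, rng)        # frozen base weight, d x k
A = rand_matrix(d, r, rng)        # trainable factor, d x r
B = rand_matrix(r, k, rng)        # trainable factor, r x k
x = rand_matrix(k, 1, rng)        # one input column vector

# Unmerged path: base output plus the low-rank correction A(Bx).
unmerged = add(matmul(W, x), matmul(A, matmul(B, x)))

# Merged path: fold the adapter into the weight once, then do a single matmul.
W_merged = add(W, matmul(A, B))
merged = matmul(W_merged, x)

max_diff = max(abs(u[0] - m[0]) for u, m in zip(unmerged, merged))
full_params, lora_params = d * k, r * (d + k)
print(f"max difference between paths: {max_diff:.2e}")
print(f"trainable params: {lora_params} (LoRA) vs {full_params} (full)")
```

At these toy sizes the saving is small (22 vs 30 parameters); the ratio r(d + k) / (dk) only becomes dramatic when d and k are in the thousands.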

Memory: a 7B model in BF16 = ~14 GB; LoRA adapters at r=16 add < 100 MB. Trainable params drop from 7B → roughly 40-50M, depending on the architecture and which modules are adapted.
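The adapter size can be checked with back-of-envelope arithmetic. The sketch below assumes Llama-2-7B-style shapes (hidden size 4096, MLP size 11008, 32 layers), which are not stated in the text, and adapts the seven projection matrices listed in the SFT recipe. Other architectures will give somewhat different totals.

```python
# Assumed Llama-2-7B-style shapes (an assumption, not taken from the post).
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32
RANK = 16

# (input_dim, output_dim) of each adapted projection in one transformer block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(d_in, d_out, r):
    """A rank-r adapter on a d_in x d_out matrix adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

per_layer = sum(lora_param_count(i, o, RANK) for i, o in PROJECTIONS.values())
total = per_layer * LAYERS
adapter_mb = total * 2 / 1e6      # BF16 stores 2 bytes per parameter

print(f"trainable LoRA params: {total:,}")          # 39,976,960
print(f"adapter size in BF16: {adapter_mb:.1f} MB")  # 80.0 MB
```

Under these assumptions the adapter comes to about 40M parameters and 80 MB, consistent with the "< 100 MB" figure above.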

3. QLoRA — adding 4-bit quantisation

The base model is loaded in NF4 (4-bit NormalFloat), cutting weight memory roughly 4× versus BF16, while the LoRA adapters stay in BF16. Result: a 7B model is fine-tuneable on a 24GB GPU, and a 70B model on a single A100 80GB, or (tightly) on 2×24GB with device_map="auto".
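A rough weights-only estimate shows where these hardware claims come from. This sketch deliberately ignores quantisation constants, adapter weights, optimizer state and activations, so treat the numbers as lower bounds rather than a sizing guide; weight_gb is an illustrative helper, not a library function.

```python
def weight_gb(n_params, bits_per_param):
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

MODELS = {"7B": 7e9, "70B": 70e9}

for name, n_params in MODELS.items():
    bf16 = weight_gb(n_params, 16)
    nf4 = weight_gb(n_params, 4)
    print(f"{name}: BF16 {bf16:.1f} GB -> 4-bit {nf4:.1f} GB ({bf16 / nf4:.0f}x smaller)")

# Headroom left for everything else (activations, adapters, optimizer state).
headroom_70b_2x24 = 2 * 24 - weight_gb(70e9, 4)
print(f"70B on 2x24GB leaves about {headroom_70b_2x24:.0f} GB of headroom")
```

The 70B case on 2×24GB leaves little room beyond the weights, which is why short sequences and gradient checkpointing tend to matter there.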

Quality cost vs full fine-tune: typically reported as under 1 point on MMLU-style benchmarks for SFT. For RLHF/DPO the gap is generally reported to be similarly small, but measure it on your own eval set.

🛠 Deep Dive

Internals, code, architecture.

4. SFT (supervised fine-tuning) recipe

  1. Pick base model (usually instruct / chat variant of Llama 3, Qwen, Mistral).
  2. Format dataset as (prompt, completion) with the model's chat template.
  3. Set: lr=2e-4 (LoRA), batch=4 per device, grad_accum=8 (effective batch 32 per GPU), epochs=3, lora_r=16, alpha=32, dropout=0.05, target modules = q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
  4. Use trl.SFTTrainer or unsloth (advertised as 2-4× faster with less memory).
  5. Evaluate on a held-out set before merging; merge with model.merge_and_unload() if you want a single deployable artifact.
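The hyperparameters in step 3 can be written down as plain dictionaries. The keys below are chosen to mirror the argument names of peft.LoraConfig and transformers.TrainingArguments, so they could be unpacked into those classes, but check them against the versions you have installed. The dataset size of 10,000 is a hypothetical figure, and the step count is an approximation of what a trainer would report.

```python
import math

# Keys mirror peft.LoraConfig argument names (verify against your peft version).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",
}

# Keys mirror transformers.TrainingArguments argument names.
train_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 3,
}

def optimizer_steps(n_examples, n_gpus=1):
    """Approximate effective batch size and total optimizer steps for a run."""
    effective_batch = (
        train_kwargs["per_device_train_batch_size"]
        * train_kwargs["gradient_accumulation_steps"]
        * n_gpus
    )
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * train_kwargs["num_train_epochs"]

effective_batch, total_steps = optimizer_steps(n_examples=10_000)  # hypothetical size
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]          # update is scaled by alpha / r

print(f"effective batch: {effective_batch}, optimizer steps: {total_steps}")
print(f"LoRA scaling factor alpha / r: {lora_scale}")
```

Knowing the total step count up front helps when choosing warmup length and evaluation intervals.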

5. Dataset quality > size

1k high-quality examples often beat 100k noisy ones. Curate ruthlessly:

  • De-duplicate (MinHash).
  • Filter for response quality (rubric-based or model-based).
  • Balance task types and difficulty.
  • Hold out a strict eval set (no leakage).
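The de-duplication step can be sketched with a small MinHash written against the standard library only. This is an illustrative toy, not a production pipeline: it compares every new example against all kept ones (quadratic cost), whereas real pipelines typically add locality-sensitive hashing on top. The function names, the 3-word shingles and the 0.8 threshold are all assumptions made for the example.

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles, lowercased and whitespace-normalised."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_perm=64):
    """For each of num_perm salted hash functions, keep the minimum hash value."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedupe(texts, threshold=0.8):
    """Keep a text only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = minhash_signature(shingles(text))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(text)
            kept_sigs.append(sig)
    return kept

examples = [
    "Explain LoRA in one sentence please",
    "explain  lora in one sentence PLEASE",       # same text, different case and spacing
    "Write a haiku about gradient descent today",
]
unique = dedupe(examples)
print(f"kept {len(unique)} of {len(examples)} examples")
```

Run the same check between the training set and the eval set to catch leakage, not only within the training set.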

6. Alignment / preference tuning

After SFT, align to human preferences:

  • RLHF (PPO): train reward model on (prompt, win, lose), then PPO on policy. Powerful, finicky, GPU-heavy.
  • DPO: directly optimises policy on preference pairs with a simple ranking loss — no reward model, no RL loop. Has largely replaced PPO for production.
  • ORPO: combines SFT and preference in one loss — even simpler.
  • KTO: works with single-label preferences (👍/👎) without pairs.

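DPO loss: -log σ(β · [log π(y_w|x) - log π(y_l|x) - log π_ref(y_w|x) + log π_ref(y_l|x)]), where y_w is the preferred response, y_l the rejected one, π the policy being trained and π_ref the frozen reference (usually the SFT model). β=0.1 typical.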
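The per-pair loss above is a few lines of arithmetic once the sequence log-probabilities are available. The sketch below takes them as plain floats (in practice they are summed token log-probs from the policy and the frozen reference model) and uses a numerically stable form of -log σ(z). The function and argument names are illustrative.

```python
import math

def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair.

    pi_w, pi_l:   log-prob of the preferred / rejected response under the policy
    ref_w, ref_l: the same two log-probs under the frozen reference model
    """
    z = beta * ((pi_w - pi_l) - (ref_w - ref_l))
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the margin is zero, so the loss is log(2).
start = dpo_loss(pi_w=-12.0, pi_l=-12.0, ref_w=-12.0, ref_l=-12.0)

# Policy now favours the preferred response more than the reference does.
improved = dpo_loss(pi_w=-10.0, pi_l=-14.0, ref_w=-12.0, ref_l=-12.0)

print(f"loss at initialisation: {start:.4f}")
print(f"loss after shifting towards y_w: {improved:.4f}")
```

Because the loss depends only on how the policy's margin differs from the reference's margin, training starts at log 2 and falls as the policy separates preferred from rejected responses.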

7. When NOT to fine-tune

  • Need fresh knowledge → use RAG, retrieval is cheap to update.
  • Task is sufficiently solved by prompting + few-shot.
  • < 500 examples available → likely overfit or hurt the base model.
  • You can't measure success → don't ship unmeasured changes.

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Catastrophic forgetting

Fine-tuning on a narrow task can degrade general capabilities. Mitigations:

  • Mix general instruction data with task data (e.g. 10% Alpaca / Tulu).
  • Lower learning rate, fewer epochs.
  • LoRA helps: base weights stay frozen, and an unmerged adapter can be disabled or dropped to recover the original model.
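The data-mixing mitigation can be sketched as a small helper that tops up the task set so that general examples make up a chosen fraction of the final mix. The function name, the tuple-based example format and the dataset sizes are hypothetical; the 10% fraction comes from the bullet above.

```python
import random

def mix_with_general(task_examples, general_pool, general_fraction=0.1, seed=0):
    """Return a shuffled mix in which general examples form `general_fraction` of the total."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_pool))   # cannot sample more than we have
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(general_pool, n_general)
    rng.shuffle(mixed)
    return mixed

# Hypothetical datasets: 900 task examples, a pool of 500 general instructions.
task = [("task", i) for i in range(900)]
general = [("general", i) for i in range(500)]

mixed = mix_with_general(task, general, general_fraction=0.1)
n_general_in_mix = sum(1 for source, _ in mixed if source == "general")
print(f"{len(mixed)} examples, {n_general_in_mix} general ({n_general_in_mix / len(mixed):.0%})")
```

Fixing the seed keeps the mix reproducible, which matters when comparing runs on the same eval set.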

9. Serving fine-tuned models

  • Merge LoRA into base for single artifact → standard vLLM serving.
  • Or keep adapters separate and swap at runtime (vLLM 0.5+ multi-LoRA, BentoML) → one base model serves many tenants.

10. Cost model

QLoRA SFT of 8B on 50k examples ≈ 4-12 GPU hours on A100, ~$10-50 on spot. DPO adds another 4-8 hours. That is often cheaper than the engineering time spent prompt-engineering around the same problem.
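The cost range is simple arithmetic: GPU hours times an hourly rate. The sketch below plugs in the hour ranges quoted above with a hypothetical spot price of $2.50 per A100-hour; that rate is an assumption for illustration, so substitute your provider's actual price.

```python
def cost_range(hours_low, hours_high, usd_per_gpu_hour):
    """Low and high cost estimate in USD for a range of GPU hours."""
    return hours_low * usd_per_gpu_hour, hours_high * usd_per_gpu_hour

SPOT_RATE = 2.50   # hypothetical USD per A100-hour on spot; not a quoted price

sft = cost_range(4, 12, SPOT_RATE)     # SFT: 4-12 GPU hours
dpo = cost_range(4, 8, SPOT_RATE)      # DPO: another 4-8 GPU hours
total = (sft[0] + dpo[0], sft[1] + dpo[1])

print(f"SFT only:  ${sft[0]:.0f}-${sft[1]:.0f}")
print(f"SFT + DPO: ${total[0]:.0f}-${total[1]:.0f}")
```

Failed runs and hyperparameter sweeps multiply these figures, so budget for several attempts rather than one.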

11. What to take away

"When would you fine-tune vs RAG?" Strong answers: fine-tune for behaviour/format, RAG for knowledge; PEFT for cost; mention DPO over PPO as modern alignment; cite a real eval before/after.
