ai mladvanced 12m2026-06-23

Day 25 — Practical Fine-Tuning — LoRA / QLoRA, PEFT, Instruction Datasets, DPO

Fine-tuning is back as the way to specialise models for your domain and reduce inference cost. LoRA + QLoRA make it tractable on commodity GPUs; DPO / ORPO have…

Fine-tuning lets you specialise a foundation model for a domain, style or task. It is not for adding raw knowledge (that's RAG); it's for shaping behaviour and format. Modern PEFT methods make this cheap.

🧠 Concept

Why it matters & the mental model.

1. The fine-tuning hierarchy

Prompting: free, instant, but limited.
Few-shot in context: better, but costs context tokens.
RAG: gives knowledge, doesn't change behaviour.
Fine-tuning (full): changes all weights, expensive (8×A100 for 7B).
PEFT (LoRA, etc.): trains <1% of params, almost as good for most tasks.

2. LoRA — the trick

For each weight matrix W ∈ R^\{d×k\}, the update during fine-tuning is ΔW = AB where A ∈ R^\{d×r\}, B ∈ R^\{r×k\}, r ≪ d, k. We freeze W, train only A, B. At inference, either compute (W + AB)x or merge W' = W + AB.

The assumption: fine-tuning updates lie in a low-rank subspace of the full weight space. Empirically true: r=8-32 captures most of the gain.

Memory: a 7B model in BF16 = ~14 GB; LoRA adapters at r=16 add < 100 MB. Trainable params drop from 7B → ~50M.

3. QLoRA — adding 4-bit quantisation

Base model loaded in NF4 (normal float 4-bit) — 4× memory cut. LoRA adapters stay in BF16. Result: 7B model fine-tuneable on a 24GB GPU; 70B on a single A100 80GB or 2×24GB with device_map="auto".

Quality cost vs full fine-tune: < 1pt on MMLU-style benchmarks for SFT. For RLHF/DPO the gap is similarly small.

🛠 Deep Dive

Internals, code, architecture.

4. SFT (supervised fine-tuning) recipe

Pick base model (usually instruct / chat variant of Llama 3, Qwen, Mistral).
Format dataset as (prompt, completion) with the model's chat template.
Set: lr=2e-4 (LoRA), batch=4, grad_accum=8, epochs=3, lora_r=16, alpha=32, dropout=0.05, target modules = q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
Use trl.SFTTrainer or unsloth (2-4× faster, less memory).
Eval on held-out before merge; merge with model.merge_and_unload() if you want a single deployable.

5. Dataset quality > size

1k high-quality examples > 100k noisy ones. Curate ruthlessly:

De-duplicate (MinHash).
Filter for response quality (rubric-based or model-based).
Balance task types and difficulty.
Hold out a strict eval set (no leakage).

6. Alignment / preference tuning

After SFT, align to human preferences:

RLHF (PPO): train reward model on (prompt, win, lose), then PPO on policy. Powerful, finicky, GPU-heavy.
DPO: directly optimises policy on preference pairs with a simple ranking loss — no reward model, no RL loop. Has largely replaced PPO for production.
ORPO: combines SFT and preference in one loss — even simpler.
KTO: works with single-label preferences (👍/👎) without pairs.

DPO loss: -log σ(β · [log π(y_w|x) - log π(y_l|x) - log π_ref(y_w|x) + log π_ref(y_l|x)]). β=0.1 typical.

7. When NOT to fine-tune

Need fresh knowledge → use RAG, retrieval is cheap to update.
Task is sufficiently solved by prompting + few-shot.
< 500 examples available → likely overfit or hurt the base model.
You can't measure success → don't ship unmeasured changes.

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Catastrophic forgetting

Fine-tuning on a narrow task can degrade general capabilities. Mitigations:

Mix general instruction data with task data (e.g. 10% Alpaca / Tulu).
Lower learning rate, fewer epochs.
LoRA helps (base weights frozen).

9. Serving fine-tuned models

Merge LoRA into base for single artifact → standard vLLM serving.
Or keep adapters separate and swap at runtime (vLLM 0.5+ multi-LoRA, BentoML) → one base model serves many tenants.

Day 24 — Data Governance, Lineage, Quality — Catalogs, Contracts, Observability

Day 26 — Caching Strategies — CDN, Application Cache, Cache-Aside, Read-Through, Write-Through