Practical Fine-Tuning — LoRA, QLoRA, PEFT, Instruction Datasets
Session 33 of the 48-session learning series.
Date: Sun, 2026-07-05 · Time: 09:00–11:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 25 · Est. read: 2 h
Why this session matters
Fine-tuning has become roughly 1000x cheaper over the past 24 months. A 7B model can now be domain-adapted on a single consumer GPU in under a day with LoRA + 4-bit quantisation. The question has shifted from "can we fine-tune?" to "should we?", and knowing how to answer that for a given problem is the new core skill.
Agenda
- Fine-tuning vs RAG vs prompting — the decision matrix
- LoRA — why low-rank adapters work; what to set rank to
- QLoRA — 4-bit base model + LoRA adapters; the consumer-GPU breakthrough
- Instruction datasets — quality, format, contamination
- Training pipeline — Axolotl, TRL, Unsloth, evaluation gates
Pre-read (skim before the session)
- LoRA paper (Hu et al., 2021)
- QLoRA paper (Dettmers et al., 2023)
- HuggingFace PEFT docs
- Sebastian Raschka — LoRA practical insights
Deep dive
1. Fine-tune vs RAG vs prompt — decision
| You need | Use |
|---|---|
| Knowledge about your private docs | RAG |
| Specific output format / style | Prompt first, then fine-tune |
| Domain vocabulary / terminology | Fine-tune |
| Better reasoning on niche tasks | Fine-tune (often w/ synthetic data) |
| Safety policy enforcement | Fine-tune + system prompt |
| Latency reduction (smaller distilled model) | Fine-tune smaller base |
| Up-to-date info | RAG (always) |
90% of teams should try RAG + prompt before fine-tuning. Fine-tuning is harder, slower, and costs more iteration time. But it unlocks abilities you can't prompt your way into.
2. The fine-tuning spectrum
- Continued pre-training — domain corpus, unsupervised. Adapts language model to your distribution. Most expensive.
- Supervised fine-tuning (SFT) — instruction → answer pairs. The base of any RLHF stack.
- DPO / IPO / KTO — preference tuning from (chosen, rejected) pairs. No reward model needed. Now the default for alignment.
- RLHF (PPO) — reward model + RL. Still used by frontier labs; rarely worth the complexity downstream.
For application teams: SFT → DPO is the modern pipeline.
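To make "no reward model needed" concrete, here is a minimal sketch of the per-example DPO loss from Rafailov et al. (2023) in plain Python. The function name `dpo_loss`, the example log-probabilities, and `beta = 0.1` are illustrative assumptions; in a real run the four inputs would be summed token log-probs from the policy being trained and from a frozen reference copy.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow for large |x|.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Hypothetical log-probs. Before any training the policy equals the reference.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After some training the policy prefers the chosen answer more than the reference does.
loss_improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(loss_at_init, loss_improved)
```

The loss only needs log-probabilities from two copies of the same model, which is why the separate reward model of RLHF drops out.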
3. Full fine-tune vs PEFT
Full fine-tune of a 7B model:
- Updates all 7B params.
- Needs ~14 GB just for fp16 weights, plus ~56 GB for fp32 Adam optimiser state (two moments × 4 bytes per parameter), before counting gradients and activations.
- Doesn't fit on a single 24 GB consumer GPU.
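The memory figures above are simple arithmetic, sketched below. The function name and the decimal-gigabyte convention are assumptions for illustration; gradients, activations, and any fp32 master weights are deliberately left out, so real usage is higher.

```python
GB = 1e9  # decimal gigabytes, matching the rough figures in the text

def full_finetune_memory_gb(n_params, weight_bytes=2, adam_state_bytes=8):
    """Rough memory for weights plus Adam state; ignores gradients and activations."""
    weights = n_params * weight_bytes / GB
    # Adam keeps two fp32 moments per parameter: 2 * 4 bytes = 8 bytes.
    adam_state = n_params * adam_state_bytes / GB
    return {"weights_gb": weights,
            "adam_state_gb": adam_state,
            "total_gb": weights + adam_state}

estimate = full_finetune_memory_gb(7e9)
fits_on_24gb = estimate["total_gb"] <= 24
print(estimate, fits_on_24gb)
```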
PEFT (Parameter-Efficient Fine-Tuning) freezes the base and trains a tiny set of new params:
- LoRA, QLoRA, prefix tuning, adapters.
- 0.1–1% of params trained.
- Comparable quality to a full fine-tune on most tasks.
- Adapters can be swapped at runtime → one base, many specialisations.
4. LoRA — the math (1 paragraph)
For each weight matrix W (d×k), instead of updating W directly, train two low-rank matrices B (d×r) and A (r×k), so that the effective weight is W + B·A. Rank r is typically 8–64, far smaller than d or k. Forward pass: (W + B·A)·x = W·x + B·(A·x). Only A and B are trained, so gradient and optimiser memory are tiny.
Why it works: empirically, the "update" learned during fine-tuning is low-rank. You don't need to move every weight; you need to nudge the right subspace.
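A toy forward pass in plain Python, following the shape convention of Hu et al. (2021): W is d×k, B is d×r, A is r×k. The sizes, the random values, and the helper names are assumptions for illustration, and the alpha/r scaling is left out to match the formula above. The sketch checks that applying the adapter separately gives the same output as materialising W + B·A.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r = 6, 5, 2             # toy sizes; real layers have d and k in the thousands
W = random_matrix(d, k, rng)  # frozen pretrained weight, d x k
B = random_matrix(d, r, rng)  # trainable, d x r (random here so the update is non-zero)
A = random_matrix(r, k, rng)  # trainable, r x k
x = [rng.uniform(-1, 1) for _ in range(k)]

# Path 1: materialise the effective weight W + B.A, then apply it.
BA = matmul(B, A)
W_eff = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
y_merged = matvec(W_eff, x)

# Path 2: keep the adapter separate, W.x + B.(A.x); no d x k update is ever formed.
y_adapter = [wx + bax for wx, bax in zip(matvec(W, x), matvec(B, matvec(A, x)))]

max_diff = max(abs(a - b) for a, b in zip(y_merged, y_adapter))
print(max_diff)
```

Path 2 is what training uses: the only tensors receiving gradients are B and A.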
5. QLoRA — fitting big models on small GPUs
QLoRA combines:
- 4-bit base model (NF4 quantisation) — frozen, dequantised on the fly during forward pass.
- LoRA adapters in 16-bit precision (bf16 in the paper) — trainable.
- Paged optimiser — Adam state is paged out to CPU RAM when GPU memory spikes, and paged back in as needed.
Result: 65B model fine-tunable on a single 48 GB GPU; 7B on a single 24 GB consumer GPU.
Quality penalty vs full fp16 fine-tune: usually < 1 point on benchmarks. Negligible for almost all use cases.
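The "frozen, dequantised on the fly" idea can be illustrated with a much simpler scheme than NF4. The sketch below uses plain blockwise absmax quantisation to 4-bit integer codes, which is not the NF4 data type from the paper; the block size of 64, the function names, and the random toy weights are all assumptions for illustration.

```python
import random

def quantize_block(values):
    """Absmax-quantise one block to integer codes in [-7, 7] plus a float scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def quantize(weights, block_size=64):
    """Split a flat weight list into blocks and quantise each one independently."""
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]

def dequantize(quantized):
    """Rebuild approximate floats from (codes, scale) blocks."""
    return [code * scale for codes, scale in quantized for code in codes]

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]  # toy stand-in for a weight tensor
quantized = quantize(weights)
restored = dequantize(quantized)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
worst_case = max(scale for _, scale in quantized) / 2  # half a quantisation step
print(max_error, worst_case)
```

At 4 bits per weight, 7B parameters take roughly 3.5 GB before the per-block scale overhead, which is what leaves room for adapters and activations on a 24 GB card.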
6. Setting LoRA hyperparameters
- Rank r — 8 for small tasks, 16–32 default, 64+ when underfitting. Higher rank = more capacity = more risk of overfitting.
- Alpha — scales adapter contribution; typical alpha = r or alpha = 2r.
- Target modules — at minimum q_proj, v_proj. Better: q_proj, k_proj, v_proj, o_proj. Best (slowest): all linear layers.
- Dropout — 0.05–0.1.
Start with rank 16, target qkvo, alpha 32, dropout 0.05. Tune from there.
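To see what the starting point costs, the sketch below counts trainable parameters for that configuration. The layer shapes (hidden size 4096, 32 layers, every attention projection square) are an assumption modelled on a 7B Llama-style model, and the dict is a plain Python dict whose key names merely mirror PEFT's LoraConfig arguments.

```python
def lora_trainable_params(hidden_size, num_layers, target_modules, rank):
    """Count LoRA parameters, assuming every target module is hidden_size x hidden_size."""
    # Per module: B is (hidden x r) and A is (r x hidden).
    per_module = rank * (hidden_size + hidden_size)
    return per_module * len(target_modules) * num_layers

config = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}

trainable = lora_trainable_params(hidden_size=4096, num_layers=32,
                                  target_modules=config["target_modules"],
                                  rank=config["r"])
scaling = config["lora_alpha"] / config["r"]  # factor applied to the B.A update
fraction = trainable / 7e9                    # share of a nominal 7B base
print(trainable, scaling, f"{fraction:.2%}")
```

Under these assumptions the adapters come to about 16.8M parameters, roughly a quarter of a percent of the base, and doubling the rank doubles that count.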
7. Instruction dataset quality
The single biggest predictor of fine-tune success.
Rules:
- Quality > quantity. 1k high-quality > 100k low-quality, by a wide margin.
- Format consistency. Pick one chat template; stick to it.
- No contamination. Strip eval-set examples from training.
- Diverse instructions. Cover the breadth you want at inference.
- Match output style. Want concise? Train on concise.
- System prompt training. Always include in the format you'll use at inference.
LIMA paper (Meta, 2023): a 65B LLaMA fine-tuned on 1,000 carefully curated examples outperformed the same base fine-tuned on Alpaca's 52k examples in human preference evaluations.
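The contamination rule is easy to automate. Below is a minimal sketch that drops training examples whose instruction matches an eval instruction after light normalisation, and also removes duplicates. The `instruction` / `output` field names and the sample records are hypothetical, and exact matching after normalisation is only a first line of defence (n-gram overlap checks catch more).

```python
import re

def normalise(text):
    """Lowercase and collapse punctuation/whitespace so trivial edits do not hide a match."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def decontaminate(train_examples, eval_examples):
    """Drop training examples that match an eval instruction or repeat an earlier one."""
    eval_keys = {normalise(ex["instruction"]) for ex in eval_examples}
    seen, kept = set(), []
    for ex in train_examples:
        key = normalise(ex["instruction"])
        if key in eval_keys or key in seen:
            continue
        seen.add(key)
        kept.append(ex)
    return kept

train = [
    {"instruction": "Summarise the refund policy.", "output": "Refunds within 30 days."},
    {"instruction": "summarise the  refund policy", "output": "30-day refunds."},   # duplicate
    {"instruction": "What is the SLA for P1 tickets?", "output": "One hour."},      # in eval
    {"instruction": "Draft a reply to a late-delivery complaint.", "output": "Sorry for the delay."},
]
held_out = [{"instruction": "What is the SLA for P1 tickets", "output": "One hour."}]

clean = decontaminate(train, held_out)
print(len(train), "->", len(clean))
```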
8. Synthetic data
When real instruction data is scarce:
- Use GPT-4 / Claude to generate examples.
- Self-Instruct, Evol-Instruct patterns.
- WizardLM-style instruction evolution (rewrite simple → complex).
- Constitutional AI for safety dataset.
Caveats: model collapse if you only train on synthetic; mode collapse if generator is too narrow. Always blend with real examples (≥ 20%).
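One way to enforce the blending rule is to cap the synthetic share when assembling the training set. The sketch below is an assumption about how that might look; `blend`, the `source` field, and the dataset sizes are hypothetical, and the percentage is kept as an integer so the cap is computed exactly.

```python
import random

def blend(real, synthetic, min_real_percent=20, seed=0):
    """Cap synthetic examples so real ones make up at least min_real_percent of the mix."""
    if not real:
        raise ValueError("need at least some real examples to blend")
    # real / (real + synthetic) >= p/100  =>  synthetic <= real * (100 - p) / p
    max_synthetic = len(real) * (100 - min_real_percent) // min_real_percent
    rng = random.Random(seed)
    sampled = rng.sample(synthetic, min(max_synthetic, len(synthetic)))
    mixed = real + sampled
    rng.shuffle(mixed)
    return mixed

real = [{"source": "real", "id": i} for i in range(50)]
synthetic = [{"source": "synthetic", "id": i} for i in range(1000)]

mixed = blend(real, synthetic)
real_share = sum(ex["source"] == "real" for ex in mixed) / len(mixed)
print(len(mixed), real_share)
```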
9. Training pipeline
Modern stack:
- Axolotl — YAML-driven, batteries-included. Multi-GPU, DeepSpeed integration.
- TRL (HuggingFace) — SFTTrainer, DPOTrainer, GRPOTrainer. Most flexibility.
- Unsloth — fastest single-GPU option; custom kernels; roughly 2× speedup claimed by the project.
- LLaMA-Factory — UI-driven; popular in China; good for non-engineers.
Example Axolotl config snippet:
base_model: meta-llama/Meta-Llama-3-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
  - path: ./my-dataset.jsonl
    type: chat_template
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
num_epochs: 3
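It helps to translate the batch settings in that config into optimiser steps before launching a run. The sketch below does the arithmetic; the dataset size of 2,000 examples and the single GPU are assumptions, and the step count ignores sample packing and any dropped partial batches, so treat it as an estimate.

```python
import math

# Values copied from the config above.
CONFIG = {"sequence_len": 4096, "micro_batch_size": 2,
          "gradient_accumulation_steps": 4, "num_epochs": 3}

def training_budget(config, num_examples, num_gpus=1):
    """Derive optimiser-step counts from the batch settings."""
    effective_batch = (config["micro_batch_size"]
                       * config["gradient_accumulation_steps"] * num_gpus)
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * config["num_epochs"],
        "max_tokens_per_step": effective_batch * config["sequence_len"],
    }

budget = training_budget(CONFIG, num_examples=2000)  # hypothetical dataset size
print(budget)
```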
10. Evaluation gates
Before declaring success:
- Held-out task accuracy — domain eval set, never seen in training.
- General capability — MMLU, HellaSwag — confirm you didn't degrade the base.
- Safety eval — refusals on harmful prompts; LLM-as-judge on a red-team set.
- Format compliance — does it return the JSON / structure you asked for? 100% is the target.
- Latency — adapter merge vs adapter-on-the-fly; serving impact.
Always compare against the base model with your best prompt — if prompt-engineering closes the gap, ship the prompt, not the tune.
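The five gates can be wired into a single promote / don't-promote check. The sketch below is one possible shape, not a standard tool: the metric names, thresholds, latency budget, and every number in the example are made up for illustration, and the baseline is the base model with your best prompt.

```python
import json

def format_compliance(outputs, required_keys):
    """Fraction of outputs that parse as JSON objects containing all required keys."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and required_keys.issubset(parsed.keys()):
            ok += 1
    return ok / len(outputs)

def eval_gate(tuned, baseline, max_general_drop=0.01, latency_budget_ms=800):
    """Return (promote, per-check results); thresholds are illustrative only."""
    checks = {
        "task_accuracy": tuned["task_accuracy"] > baseline["task_accuracy"],
        "general_capability": baseline["mmlu"] - tuned["mmlu"] <= max_general_drop,
        "safety": tuned["refusal_rate"] >= baseline["refusal_rate"],
        "format": tuned["format_compliance"] >= 1.0,
        "latency": tuned["p95_latency_ms"] <= latency_budget_ms,
    }
    return all(checks.values()), checks

sample_outputs = ['{"label": "refund", "confidence": 0.93}',
                  '{"label": "billing", "confidence": 0.71}']
baseline = {"task_accuracy": 0.71, "mmlu": 0.62, "refusal_rate": 0.97}
tuned = {"task_accuracy": 0.84, "mmlu": 0.615, "refusal_rate": 0.97,
         "format_compliance": format_compliance(sample_outputs, {"label", "confidence"}),
         "p95_latency_ms": 640}

promote, checks = eval_gate(tuned, baseline)
print(promote, checks)
```

Returning the per-check dict alongside the verdict makes it obvious which gate blocked a release.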
11. Serving fine-tuned models
Options:
- Merged weights — fold adapter into base; ship one model file. Simplest serving, can't swap.
- Adapter at runtime — load base once, swap adapters per request. Multi-tenant; vLLM and TGI support this. Tiny per-tenant cost.
- Distilled smaller model — fine-tune a 7B to match a 70B's behaviour on your task. Most $$ saved long-term.
For a multi-customer SaaS: adapter-at-runtime is the winning pattern. Caveat: it ties you to one base model; switching base means re-tuning all adapters.
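The difference between the first two options can be shown on a toy 2×2 layer. The tenant names, the hand-picked matrices, and the helper functions below are hypothetical; real servers such as vLLM do the equivalent on GPU tensors. The sketch keeps one shared base weight, applies a per-tenant adapter at request time, and checks that merging gives the same output.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

BASE_W = [[1.0, 0.0], [0.0, 1.0]]  # shared frozen base weight (toy 2 x 2)

# Hypothetical per-tenant adapters: B is d x r, A is r x k, with r = 1 here.
ADAPTERS = {
    "tenant_a": {"B": [[1.0], [0.0]], "A": [[0.0, 2.0]], "scaling": 2.0},
    "tenant_b": {"B": [[0.0], [1.0]], "A": [[3.0, 0.0]], "scaling": 2.0},
}

def forward(x, tenant=None):
    """Adapter at runtime: base output plus the tenant's low-rank correction."""
    y = matvec(BASE_W, x)
    if tenant is None:
        return y
    ad = ADAPTERS[tenant]
    delta = matvec(ad["B"], matvec(ad["A"], x))
    return [yi + ad["scaling"] * di for yi, di in zip(y, delta)]

def merge(tenant):
    """Merged weights: fold scaling * B.A into a copy of the base."""
    ad = ADAPTERS[tenant]
    return [[w + ad["scaling"] * sum(b * a for b, a in zip(b_row, a_col))
             for w, a_col in zip(w_row, zip(*ad["A"]))]
            for w_row, b_row in zip(BASE_W, ad["B"])]

x = [1.0, 1.0]
runtime_a = forward(x, "tenant_a")
merged_a = matvec(merge("tenant_a"), x)
runtime_b = forward(x, "tenant_b")
print(runtime_a, merged_a, runtime_b)
```

The runtime path stores only B and A per tenant, while each merged model is a full copy of the base.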
12. Reality check
A 2-week fine-tune plan:
- Days 1–2: build a 200-example eval set; freeze it.
- Days 3–4: try prompting + RAG; record baseline.
- Days 5–8: curate a 1–5k SFT dataset; train QLoRA on Llama-3-8B.
- Days 9–10: eval; iterate dataset, not hyperparams.
- Days 11–12: DPO from production thumbs-down/thumbs-up.
- Days 13–14: serving integration; production canary.
If the baseline beats your fine-tune, ship the baseline. Most of the value is in the eval set + dataset curation, not the training itself.
Reading material
- LoRA paper (Hu et al., 2021)
- QLoRA paper (Dettmers et al., 2023)
- DPO paper (Rafailov et al., 2023)
- LIMA paper (Meta, 2023) — Less is More for Alignment
In-depth research material
- HuggingFace PEFT docs
- Axolotl repo
- TRL — Transformer Reinforcement Learning
- Sebastian Raschka — LLMs from scratch (book)
Video reference
▶︎ LoRA / QLoRA from First Principles (Yannic Kilcher)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Matrix Multiplication
- Link: https://leetcode.com/problems/matrix-multiplication/
- Difficulty: Medium
- Why this problem: LoRA is literally a low-rank matrix factorisation of the weight update, W' = W + B·A. Implementing matmul once cements the intuition.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Pick between full FT, LoRA, QLoRA for a given GPU + model combo.
- Explain why low-rank updates work for fine-tuning empirically.
- Configure LoRA hyperparameters (rank, alpha, target modules) sensibly.
- Curate an SFT dataset that respects format, diversity, and contamination rules.
- Design a 5-step eval gate before promoting a fine-tuned model.
- Solve matrix-multiplication — the primitive operation behind LoRA's B·A product.