ai ml · intermediate · 12m · 2026-06-09

Practical Fine-Tuning — LoRA, QLoRA, PEFT, Instruction Datasets

Session 33 of the 48-session learning series.

Date: Sun, 2026-07-05 · Time: 09:00–11:00 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 25 · Est. read: 2 h

Why this session matters

Fine-tuning has become far cheaper over the past two years. A 7B model can now be domain-adapted on a single consumer GPU in under a day with LoRA and 4-bit quantisation. The question has shifted from "can we fine-tune?" to "should we?", and knowing how to answer that for a given problem is the new core skill.

Agenda

  • Fine-tuning vs RAG vs prompting — the decision matrix
  • LoRA — why low-rank adapters work; what to set rank to
  • QLoRA — 4-bit base model + LoRA adapters; the consumer-GPU breakthrough
  • Instruction datasets — quality, format, contamination
  • Training pipeline — Axolotl, TRL, Unsloth, evaluation gates

Deep dive

1. Fine-tune vs RAG vs prompt — decision

You need                                     | Use
---------------------------------------------|-------------------------------------
Knowledge about your private docs            | RAG
Specific output format / style               | Prompt first, then fine-tune
Domain vocabulary / terminology              | Fine-tune
Better reasoning on niche tasks              | Fine-tune (often w/ synthetic data)
Safety policy enforcement                    | Fine-tune + system prompt
Latency reduction (smaller distilled model)  | Fine-tune smaller base
Up-to-date info                              | RAG (always)

90% of teams should try RAG + prompt before fine-tuning. Fine-tuning is harder, slower, and costs more iteration time. But it unlocks abilities you can't prompt your way into.

2. The fine-tuning spectrum

  • Continued pre-training: domain corpus, unsupervised. Adapts the language model to your distribution. Most expensive.
  • Supervised fine-tuning (SFT): instruction → answer pairs. The base of any RLHF stack.
  • DPO / IPO / KTO: preference tuning with no separate reward model. DPO and IPO learn from (chosen, rejected) pairs; KTO needs only a good/bad label per response. Now a common default for alignment.
  • RLHF (PPO): reward model + RL. Still used by frontier labs; rarely worth the complexity downstream.

For application teams: SFT → DPO is the modern pipeline.
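
To make the "no reward model needed" point concrete, here is a minimal sketch of the DPO loss for a single preference pair, using only Python's standard library. The log-probabilities and beta = 0.1 are made-up illustration values; in a real run they come from the trainable policy and a frozen reference model, and a trainer such as TRL's DPOTrainer computes the loss for you.

```python
import math

def neg_log_sigmoid(x):
    # -log(sigmoid(x)) equals softplus(-x); this form avoids overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a whole response
    under the trainable policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return neg_log_sigmoid(beta * (chosen_margin - rejected_margin))

# Policy identical to the reference: no preference learned yet.
baseline_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy now rates the chosen answer higher and the rejected one lower.
improved_loss = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"baseline={baseline_loss:.4f} improved={improved_loss:.4f}")
```

The loss starts at log 2 when policy and reference agree and falls as the policy widens the gap between chosen and rejected responses, which is the whole training signal.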

3. Full fine-tune vs PEFT

Full fine-tune of a 7B model:

  • Updates all 7B params.
  • Needs ~14 GB just for fp16 weights, plus ~56 GB for Adam's two fp32 moment buffers (2 × 4 bytes × 7B), before counting gradients and activations.
  • Doesn't fit on a single 24 GB consumer GPU.
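
The arithmetic behind these numbers can be sketched as follows. It assumes decimal gigabytes, fp16 weights and gradients, and two fp32 Adam moment buffers per parameter. Activations and any fp32 master copy of the weights are left out, so real usage is higher.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text

def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2,
                            adam_state_bytes=4):
    """Rough memory budget for full fine-tuning with Adam."""
    weights = n_params * weight_bytes / GB
    gradients = n_params * grad_bytes / GB
    # Adam keeps two buffers per parameter: first and second moments.
    adam_states = 2 * n_params * adam_state_bytes / GB
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_states": adam_states,
        "total": weights + gradients + adam_states,
    }

budget = full_finetune_memory_gb(7e9)
for name, gigabytes in budget.items():
    print(f"{name:12s} {gigabytes:6.1f} GB")
```

Even this optimistic total is several times the 24 GB on a consumer card, which is why the rest of the session focuses on PEFT.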

PEFT (Parameter-Efficient Fine-Tuning) freezes the base and trains a tiny set of new params:

  • LoRA, QLoRA, prefix tuning, adapters.
  • 0.1–1% of params trained.
  • Quality close to a full fine-tune on many tasks.
  • Adapters can be swapped at runtime → one base, many specialisations.
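
As a rough check on the "0.1–1% of params" figure, the sketch below counts LoRA parameters for a 7B-class model. The shapes are assumptions for illustration: hidden size 4096, 32 transformer blocks, and four square 4096×4096 attention projections per block. Models with grouped-query attention have smaller k/v projections, so their counts come out lower.

```python
def lora_param_count(d, k, r):
    """Parameters in one low-rank pair: a d x r factor plus an r x k factor."""
    return r * (d + k)

HIDDEN = 4096                  # assumed hidden size
N_LAYERS = 32                  # assumed number of transformer blocks
TARGETS_PER_LAYER = 4          # q_proj, k_proj, v_proj, o_proj
BASE_PARAMS = 7_000_000_000    # nominal "7B"

per_matrix = lora_param_count(HIDDEN, HIDDEN, r=16)
total = per_matrix * TARGETS_PER_LAYER * N_LAYERS
fraction = total / BASE_PARAMS

print(f"{total:,} trainable params = {fraction:.2%} of the base")
# prints: 16,777,216 trainable params = 0.24% of the base
```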

4. LoRA — the math (1 paragraph)

For each weight matrix W (d×k), instead of updating W directly, train two low-rank matrices B (d×r) and A (r×k), so that the effective weight is W + B·A (the LoRA paper also scales the update by alpha/r). Rank r is typically 8–64. Forward pass: (W + B·A)·x. W stays frozen and only A and B are trained, so gradient and optimiser memory is tiny.

Why it works: empirically, the "update" learned during fine-tuning is low-rank. You don't need to move every weight; you need to nudge the right subspace.
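
A toy version of that forward pass, in plain Python with no dependencies. It follows the LoRA paper's convention (B is d×r, A is r×k, update scaled by alpha/r). The 3×3 weight, the rank-1 adapter, and alpha = 2 are made-up values chosen so the arithmetic is easy to follow by hand.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add_scaled(X, Y, scale=1.0):
    """Element-wise X + scale * Y."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Frozen base weight W (d = 3, k = 3).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [3.0, 0.0, 1.0]]

# Trainable rank-1 adapter: B is d x r, A is r x k.
B = [[1.0], [0.0], [2.0]]
A = [[0.5, 0.0, -1.0]]
alpha, r = 2, 1
scale = alpha / r

x = [[1.0], [2.0], [3.0]]  # input as a column vector

# Option 1: fold the adapter into the weight, then apply it.
merged = matmul(add_scaled(W, matmul(B, A), scale), x)

# Option 2: keep W untouched and add the adapter path at run time.
unmerged = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

print(merged, unmerged)
```

Both routes give the same output, which is why an adapter can either be merged for serving or kept separate and swapped.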

5. QLoRA — fitting big models on small GPUs

QLoRA combines:

  • 4-bit base model (NF4 quantisation): frozen, dequantised on the fly during the forward pass.
  • LoRA adapters in 16-bit precision (bf16 in the QLoRA paper): trainable.
  • Paged optimiser: Adam state is paged out to CPU RAM when GPU memory spikes and paged back in as needed.

Result: 65B model fine-tunable on a single 48 GB GPU; 7B on a single 24 GB consumer GPU.

Quality penalty vs full fp16 fine-tune: usually < 1 point on benchmarks. Negligible for almost all use cases.
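
To show what "4-bit, dequantised on the fly" means mechanically, here is a simplified sketch of block-wise quantisation. It is not NF4: it uses 16 evenly spaced levels and one absmax scale per block, whereas NF4 places its 16 levels non-uniformly to match normally distributed weights. The weight values are made up.

```python
def quantize_block(values, levels=16):
    """Map floats to integer codes 0..levels-1 with one absmax scale per block."""
    absmax = max(abs(v) for v in values) or 1.0  # all-zero block: any scale works
    step = 2 * absmax / (levels - 1)
    codes = [round((v + absmax) / step) for v in values]
    return codes, absmax

def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from 4-bit codes and the block scale."""
    step = 2 * absmax / (levels - 1)
    return [c * step - absmax for c in codes]

weights = [0.12, -0.53, 0.98, -0.07, 0.31, -0.88, 0.0, 0.45]
codes, absmax = quantize_block(weights)      # stored: 4 bits each + one scale
restored = dequantize_block(codes, absmax)   # rebuilt when the layer runs

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

The rounding error is bounded by half a quantisation step. LoRA adapters trained on top of the frozen, quantised base can compensate for part of it.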

6. Setting LoRA hyperparameters

  • Rank r — 8 for small tasks, 16–32 default, 64+ when underfitting. Higher rank = more capacity = more risk of overfitting.
  • Alpha — scales adapter contribution; typical alpha = r or alpha = 2r.
  • Target modules — at minimum q_proj, v_proj. Better: q_proj, k_proj, v_proj, o_proj. Best (slowest): all linear layers.
  • Dropout — 0.05–0.1.

Start with rank 16, target qkvo, alpha 32, dropout 0.05. Tune from there.
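
One way to keep these starting values together is a small validated settings helper. `make_lora_settings` below is a hypothetical function written for this post, not a library API. The `scaling` entry assumes the LoRA paper's convention that the adapter output is multiplied by alpha / r.

```python
def make_lora_settings(r=16, alpha=32, dropout=0.05,
                       target_modules=("q_proj", "k_proj", "v_proj", "o_proj")):
    """Bundle LoRA hyperparameters and reject obviously bad values."""
    if r <= 0:
        raise ValueError("rank r must be a positive integer")
    if not 0.0 <= dropout < 1.0:
        raise ValueError("dropout must be in [0, 1)")
    if not target_modules:
        raise ValueError("at least one target module is required")
    return {
        "r": r,
        "alpha": alpha,
        "dropout": dropout,
        "target_modules": list(target_modules),
        "scaling": alpha / r,  # multiplier applied to the adapter output
    }

default = make_lora_settings()               # the suggested starting point
small_task = make_lora_settings(r=8, alpha=8)

print(default["scaling"], small_task["scaling"])
```

With alpha = 2r the adapter contribution is doubled and with alpha = r it is applied as-is. Tying alpha to r keeps the update size comparable as the rank changes. In Hugging Face PEFT the matching `LoraConfig` fields are `r`, `lora_alpha`, `lora_dropout` and `target_modules`.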

7. Instruction dataset quality

The single biggest predictor of fine-tune success.

Rules:

  • Quality > quantity. 1k high-quality > 100k low-quality, by a wide margin.
  • Format consistency. Pick one chat template; stick to it.
  • No contamination. Strip eval-set examples from training.
  • Diverse instructions. Cover the breadth you want at inference.
  • Match output style. Want concise? Train on concise.
  • System prompt training. Always include in the format you'll use at inference.

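LIMA paper (Meta, 2023): a 65B LLaMA model fine-tuned on 1,000 carefully curated examples was preferred by human raters over the same base tuned on Alpaca's 52k machine-generated examples.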
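
The contamination rule from the list above can be sketched as a small filter. This is a naive version that only catches exact matches after normalisation; paraphrased leaks need n-gram or embedding overlap checks. The `instruction` / `output` field names and the sample rows are assumptions for illustration.

```python
import re

def normalise(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())

def decontaminate(train_rows, eval_prompts):
    """Drop training rows that duplicate an eval prompt or an earlier row."""
    eval_keys = {normalise(p) for p in eval_prompts}
    seen = set()
    kept = []
    for row in train_rows:
        key = normalise(row["instruction"])
        if key in eval_keys or key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

train = [
    {"instruction": "Summarise the refund policy.", "output": "Refunds within 30 days."},
    {"instruction": "summarise the REFUND policy", "output": "30-day refunds."},
    {"instruction": "What is the SLA for P1 tickets?", "output": "One hour."},
    {"instruction": "Draft a polite follow-up email.", "output": "Hi, just checking in."},
]
eval_prompts = ["What is the SLA for P1 tickets?"]

clean = decontaminate(train, eval_prompts)
print([row["instruction"] for row in clean])
```

The second row goes as a near-duplicate and the third because it appears in the eval set, leaving two rows.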

8. Synthetic data

When real instruction data is scarce:

  • Use GPT-4 / Claude to generate examples.
  • Self-Instruct, Evol-Instruct patterns.
  • WizardLM-style instruction evolution (rewrite simple → complex).
  • Constitutional AI for safety dataset.

Caveats: model collapse if you train only on synthetic data; mode collapse if the generator is too narrow. Blend in real examples; as a rule of thumb, keep them at ≥ 20% of the mix.
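
A minimal sketch of enforcing that blend ratio when assembling a training set. The `blend` helper, the example pool sizes, and the fixed seed are illustrative assumptions. Integer percentages are used so the cap on synthetic examples is exact.

```python
import random

def blend(real, synthetic, min_real_percent=20, seed=0):
    """Return a shuffled mix where real examples are at least min_real_percent."""
    if not 1 <= min_real_percent <= 100:
        raise ValueError("min_real_percent must be between 1 and 100")
    # Largest synthetic count that keeps the real share at or above the floor.
    max_synthetic = len(real) * (100 - min_real_percent) // min_real_percent
    rng = random.Random(seed)
    picked = rng.sample(synthetic, min(max_synthetic, len(synthetic)))
    mix = list(real) + picked
    rng.shuffle(mix)
    return mix

real = [f"real-{i}" for i in range(200)]
synthetic = [f"syn-{i}" for i in range(5000)]

mix = blend(real, synthetic)
real_share = sum(item.startswith("real-") for item in mix) / len(mix)
print(len(mix), real_share)
```

With 200 real examples and a 20% floor, at most 800 synthetic examples are admitted regardless of how many were generated.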

9. Training pipeline

Modern stack:

  • Axolotl: YAML-driven, batteries-included. Multi-GPU, DeepSpeed integration.
  • TRL (HuggingFace): SFTTrainer, DPOTrainer, GRPOTrainer. Most flexibility.
  • Unsloth: optimised for single-GPU training; custom kernels; the project advertises roughly 2× speedups.
  • LLaMA-Factory: UI-driven; popular in China; good for non-engineers.

Example Axolotl config snippet:

base_model: meta-llama/Meta-Llama-3-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
  - path: ./my-dataset.jsonl
    type: chat_template
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
num_epochs: 3
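
The batch-related values in this config interact, so it helps to work out what one optimiser step actually covers. The sketch below assumes a single GPU and a hypothetical 2,000-row dataset, neither of which is stated in the config. If sample packing is enabled the step counts will differ.

```python
import math

config = {
    "sequence_len": 4096,
    "micro_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_epochs": 3,
}
NUM_GPUS = 1          # assumption: single-GPU run
DATASET_ROWS = 2000   # hypothetical dataset size

# Sequences contributing to each optimiser step.
effective_batch = (config["micro_batch_size"]
                   * config["gradient_accumulation_steps"]
                   * NUM_GPUS)

# Upper bound on tokens per step (every sequence at full length).
max_tokens_per_step = effective_batch * config["sequence_len"]

steps_per_epoch = math.ceil(DATASET_ROWS / effective_batch)
total_steps = steps_per_epoch * config["num_epochs"]

print(effective_batch, max_tokens_per_step, steps_per_epoch, total_steps)
```

Raising `gradient_accumulation_steps` grows the effective batch without extra GPU memory. Raising `micro_batch_size` does cost memory.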

10. Evaluation gates

Before declaring success:

  • Held-out task accuracy — domain eval set, never seen in training.
  • General capability — MMLU, HellaSwag — confirm you didn't degrade the base.
  • Safety eval — refusals on harmful prompts; LLM-as-judge on a red-team set.
  • Format compliance — does it return the JSON / structure you asked for? 100% is the target.
  • Latency — adapter merge vs adapter-on-the-fly; serving impact.

Always compare against the base model with your best prompt — if prompt-engineering closes the gap, ship the prompt, not the tune.
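
The gates above can be wired into a simple pass/fail check. The metric names, thresholds, and sample numbers below are hypothetical; the point is that every gate is compared against the prompted baseline and that all of them must pass.

```python
import json

def format_compliance(outputs, required_keys):
    """Share of outputs that parse as a JSON object containing required_keys."""
    ok = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and set(required_keys).issubset(parsed):
            ok += 1
    return ok / len(outputs)

def passes_gates(candidate, baseline, max_general_drop=0.01,
                 latency_budget_ms=500):
    """Return (overall verdict, per-gate results) for a fine-tuned candidate."""
    checks = {
        "task_accuracy": candidate["task_accuracy"] > baseline["task_accuracy"],
        "general_capability": candidate["general_score"]
                              >= baseline["general_score"] - max_general_drop,
        "safety": candidate["refusal_rate"] >= baseline["refusal_rate"],
        "format": candidate["format_compliance"] == 1.0,
        "latency": candidate["p95_latency_ms"] <= latency_budget_ms,
    }
    return all(checks.values()), checks

sample_outputs = [
    '{"intent": "refund", "confidence": 0.9}',
    'Sure! {"intent": "refund"}',   # chatty prefix breaks JSON parsing
    '{"intent": "cancel"}',
]
compliance = format_compliance(sample_outputs, {"intent"})

baseline = {"task_accuracy": 0.71, "general_score": 0.62, "refusal_rate": 0.97}
candidate = {"task_accuracy": 0.83, "general_score": 0.615, "refusal_rate": 0.97,
             "format_compliance": compliance, "p95_latency_ms": 410}

verdict, checks = passes_gates(candidate, baseline)
print(verdict, checks)
```

Here the candidate wins on task accuracy but fails the format gate because one of three outputs is not valid JSON, so it is not promoted.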

11. Serving fine-tuned models

Options:

  • Merged weights — fold adapter into base; ship one model file. Simplest serving, can't swap.
  • Adapter at runtime — load base once, swap adapters per request. Multi-tenant; vLLM and TGI support this. Tiny per-tenant cost.
  • Distilled smaller model — fine-tune a 7B to match a 70B's behaviour on your task. Most $$ saved long-term.

For a multi-customer SaaS: adapter-at-runtime is the winning pattern. Caveat: it commits you to one base model; switching the base means re-training every adapter.
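
To put a number on "tiny per-tenant cost", the sketch below compares weight storage for 50 tenants under the two serving patterns. It assumes fp16 storage, a 14 GB 7B base, rank-16 adapters on four square 4096×4096 projections across 32 layers, and a hypothetical tenant count.

```python
def adapter_megabytes(hidden=4096, layers=32, targets=4, r=16,
                      bytes_per_param=2):
    """Size of one LoRA adapter in decimal megabytes."""
    params = r * (hidden + hidden) * targets * layers
    return params * bytes_per_param / 1e6

BASE_GB = 14.0   # fp16 weights of a 7B base
TENANTS = 50     # hypothetical number of customers

per_adapter_mb = adapter_megabytes()

# Pattern 1: one merged model per tenant.
merged_total_gb = TENANTS * BASE_GB

# Pattern 2: one shared base plus one adapter per tenant.
shared_total_gb = BASE_GB + TENANTS * per_adapter_mb / 1000

print(f"adapter: {per_adapter_mb:.1f} MB")
print(f"merged per tenant: {merged_total_gb:.0f} GB, "
      f"shared base: {shared_total_gb:.1f} GB")
```

Under these assumptions each extra tenant adds tens of megabytes rather than a full copy of the model.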

12. Reality check

A 2-week fine-tune plan:

  1. Days 1–2: build a 200-example eval set; freeze it.
  2. Days 3–4: try prompting + RAG; record baseline.
  3. Days 5–8: curate a 1–5k SFT dataset; train QLoRA on Llama-3-8B.
  4. Days 9–10: eval; iterate dataset, not hyperparams.
  5. Days 11–12: DPO from production thumbs-down/thumbs-up.
  6. Days 13–14: serving integration; production canary.

If the baseline beats your fine-tune, ship the baseline. Most of the value is in the eval set + dataset curation, not the training itself.

Video reference

▶︎ LoRA / QLoRA from First Principles (Yannic Kilcher)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Matrix Multiplication

Post-session checklist

By the end of this session you should be able to:

  • Pick between full FT, LoRA, QLoRA for a given GPU + model combo.
  • Explain why low-rank updates work for fine-tuning empirically.
  • Configure LoRA hyperparameters (rank, alpha, target modules) sensibly.
  • Curate an SFT dataset that respects format, diversity, and contamination rules.
  • Design a 5-step eval gate before promoting a fine-tuned model.
  • Solve matrix multiplication, the primitive operation underlying LoRA's low-rank update.
