ai · ml · intermediate · 12m · 2026-06-09

LLM Serving Part 2 — Speculative Decoding, Quantisation, Throughput

Session 34 of the 48-session learning series.

Date: Sun, 2026-07-05 · Time: 14:30–16:30 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 21 · Est. read: 2 h

Why this session matters

Part 1 covered the engine (vLLM, KV cache, batching). Part 2 covers the levers (quantisation, speculative decoding, throughput tuning) that take that engine from "works" to "$0.50 per million tokens". If you operate any non-trivial LLM workload, these techniques are no longer optional.

Agenda

Quantisation — int8, int4, FP8, AWQ, GPTQ; what to pick
Speculative decoding — small draft + big verifier
Chunked prefill, prefix caching, lookahead decoding
Multi-LoRA serving — many adapters, one base
Throughput tuning — batch size, KV pool, parallelism

Pre-read (skim before the session)

Deep dive

1. Quantisation — the headline lever

Reducing the precision of weights (and optionally activations) shrinks the memory footprint and speeds up inference, because fewer bytes have to be read per token.

Format	Bits	Memory ratio	Quality loss	Use
fp32	32	1.0×	reference	training
fp16 / bf16	16	0.5×	none	training + serving
fp8	8	0.25×	minimal (H100+)	serving (latest GPUs)
int8	8	0.25×	minimal	serving (any GPU)
int4	4	0.125×	0.5–1% benchmarks	serving
int3	3	0.094×	2–5%	experimental

Memory savings cascade: smaller weights → more room for KV cache → more concurrent requests → higher throughput.
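The cascade can be made concrete with some rough arithmetic. The sketch below is illustrative only: the 7B model, the 24 GB GPU and the layer/head shape are hypothetical, and it ignores quantisation scales, activations and runtime overhead.

```python
# Bits per weight for the formats in the table above.
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16, "fp8": 8, "int8": 8, "int4": 4}


def weight_gb(n_params, fmt):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


def kv_cache_tokens(gpu_gb, n_params, fmt, kv_bytes_per_token):
    """How many tokens of KV cache fit once the weights are loaded."""
    free_bytes = (gpu_gb - weight_gb(n_params, fmt)) * 1e9
    return max(0, int(free_bytes // kv_bytes_per_token))


# Hypothetical shape: 32 layers, 32 KV heads, head dim 128, fp16 KV cache.
# Per token: K and V (2) x layers x KV heads x head dim x 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 32 * 32 * 128 * 2

fp16_tokens = kv_cache_tokens(24, 7e9, "fp16", KV_BYTES_PER_TOKEN)
int4_tokens = kv_cache_tokens(24, 7e9, "int4", KV_BYTES_PER_TOKEN)
print(f"fp16 weights: {weight_gb(7e9, 'fp16'):.1f} GB, KV room for {fp16_tokens} tokens")
print(f"int4 weights: {weight_gb(7e9, 'int4'):.1f} GB, KV room for {int4_tokens} tokens")
```

Under these assumptions the int4 model leaves room for roughly twice as many cached tokens, which is where the extra concurrency comes from.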

2. Weight-only vs weight-and-activation quantisation

Weight-only (W8A16, W4A16) — weights are quantised, activations stay fp16. Simplest to get right and loses the least quality.
Weight + activation (W8A8, FP8) — both are quantised. Requires careful calibration, but gives the bigger throughput win because the matrix multiplies themselves run in low precision.

Most production deployments are weight-only int8/int4. FP8 weights+activations on H100 is the rising standard.
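To show what "weight-only" means mechanically, here is a minimal pure-Python sketch of symmetric absmax int8 quantisation with one scale per weight row. It is a teaching sketch, not how GPTQ, AWQ or any serving kernel is implemented; the function names are made up for illustration.

```python
def quantise_row_int8(row):
    """Symmetric absmax quantisation: returns (int8 codes, float scale)."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale


def matvec_w8a16(q_rows, x):
    """Weight-only matvec: weights are dequantised on the fly,
    activations x are used at full precision."""
    return [sum(c * scale * xi for c, xi in zip(codes, x)) for codes, scale in q_rows]


W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, 3.0]

q_rows = [quantise_row_int8(row) for row in W]
y_quant = matvec_w8a16(q_rows, x)
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(y_ref, y_quant)
```

A W8A8 scheme would additionally quantise `x`, which is where the extra calibration effort goes.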

3. The popular int4 methods

GPTQ — calibration-set post-training quantisation. Layer-by-layer optimisation. Standard for years; widely supported.
AWQ — Activation-aware. Identifies "salient" weight channels (those that see large activation magnitudes) and protects them by rescaling them before quantisation, so all weights stay at the same low bit width. Often smaller quality loss than GPTQ at int4.
GGUF / Q4_K_M / Q5_K_S (llama.cpp ecosystem) — GGUF is the file format; the Q*_K variants mix bit widths within one model. Great for CPU + Apple Silicon.

For server-side GPU: AWQ int4 is currently the sweet spot.

4. Calibration data matters

Quantisation algorithms learn from a small calibration set. Use:

~128–512 samples.
Representative of your serving distribution (not random web text).
Diverse: mix instruction, chat, domain-specific samples.

The same model quantised with an unrepresentative calibration set can be noticeably worse on your workload.
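One way to assemble such a mix is to sample from each source in proportion to a weight. The helper below is a hypothetical sketch (the name `build_calibration_set`, the source names and the weights are all made up); the resulting list would then be handed to whichever quantisation tool you use.

```python
import random


def build_calibration_set(sources, n_samples=256, seed=0):
    """sources maps a name to (weight, list_of_texts).
    Returns a shuffled mix drawn roughly in proportion to the weights."""
    rng = random.Random(seed)
    total_weight = sum(weight for weight, _ in sources.values())
    picked = []
    for name, (weight, texts) in sources.items():
        k = min(len(texts), round(n_samples * weight / total_weight))
        picked.extend(rng.sample(texts, k))  # sample without replacement
    rng.shuffle(picked)
    return picked


# Placeholder texts standing in for real serving traffic.
sources = {
    "chat": (2, [f"chat-{i}" for i in range(300)]),
    "instruction": (1, [f"instr-{i}" for i in range(300)]),
    "domain": (1, [f"domain-{i}" for i in range(300)]),
}
calibration = build_calibration_set(sources, n_samples=256)
print(len(calibration))
```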

5. Speculative decoding

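Decoding is memory-bound: each step generates one token per sequence, so the GPU's compute units sit largely idle while weights and KV cache are read from memory. Idea: use that spare compute to check several tokens per step.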

A small draft model proposes the next N tokens (cheap, fast).
The large target model verifies all N in one forward pass.
Accepted tokens are kept; first rejection truncates and continues from there.

If the draft model is good (most proposals are accepted), you get ~2–3× speedup with no quality loss: the target model still decides every token, so the output matches what it would have produced on its own.

draft proposes: "the quick brown fox jumps"
target verifies all 5 in parallel:
   accepts "the quick brown" — rejects "fox"
   re-samples token 4 itself: "dog"
   draft restarts from "the quick brown dog"
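The walkthrough above can be sketched in code for the greedy case. This is a toy under stated assumptions: `draft` and `target` are stand-in functions that return the next token for a context, and the target is called in a loop for clarity, whereas a real engine scores all draft positions in one batched forward pass. With sampling instead of greedy decoding, acceptance uses a rejection-sampling rule rather than exact match.

```python
def speculative_step(prefix, draft, target, n_draft):
    """One draft-then-verify round. Returns the tokens to append to prefix."""
    # 1. The draft model proposes n_draft tokens autoregressively.
    proposed = []
    for _ in range(n_draft):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal against its own greedy choice.
    accepted = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # target's token replaces the rejected one
            return accepted
        accepted.append(token)

    # 3. Everything accepted: the verify pass also yields one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def scripted(sentence):
    """Toy 'model' that always continues a fixed sentence."""
    words = sentence.split()
    return lambda ctx: words[len(ctx)] if len(ctx) < len(words) else "<eos>"


draft = scripted("the quick brown fox jumps")
target = scripted("the quick brown dog sleeps")
print(speculative_step([], draft, target, n_draft=5))
# ['the', 'quick', 'brown', 'dog']
```

Four tokens come out of a single verification round, and every one of them is a token the target would have chosen itself.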

Variants:

Self-speculative — same model with early-layer shortcut as draft (no separate model needed).
Lookahead decoding — draft via n-gram pattern matching (no separate draft model needed).
EAGLE / Medusa — small "head" branches on top of base model; cheap draft.

6. Chunked prefill

Prefill of a long prompt blocks all in-flight decodes because it monopolises the GPU. Chunked prefill splits the prefill into segments and interleaves them with decode steps. Result: smoother inter-token latency for running requests and less head-of-line blocking for short queries that arrive during a long prefill.

vLLM and TGI both support it. In vLLM, max_num_batched_tokens sets the per-step token budget and therefore the chunk size.
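A simplified model of the scheduling idea: each step has a token budget, running decodes take one token each, and whatever is left goes to the next slice of the long prompt. This sketch assumes decodes are scheduled first and a single prompt is being prefilled; it is not the scheduler of any particular engine.

```python
def schedule_steps(prompt_len, n_decoding, max_num_batched_tokens):
    """Return a list of (decode_tokens, prefill_tokens) per step
    until the whole prompt has been prefilled."""
    if n_decoding >= max_num_batched_tokens:
        raise ValueError("token budget leaves no room for prefill")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Decodes are served first; the leftover budget is the prefill chunk.
        chunk = min(remaining, max_num_batched_tokens - n_decoding)
        steps.append((n_decoding, chunk))
        remaining -= chunk
    return steps


# Hypothetical numbers: a 1000-token prompt arrives while 12 requests decode.
for decode_tokens, prefill_tokens in schedule_steps(1000, 12, 512):
    print(f"decode={decode_tokens} prefill={prefill_tokens}")
```

A smaller budget means more, shorter steps: in-flight decodes stall less, at the cost of a longer time to first token for the long prompt.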

7. Prefix caching (already touched in S30)

Cache KV blocks for repeated prefixes. Hits common with:

System prompts (every request shares).
Few-shot exemplars.
Conversation history (each chat turn extends from previous).
RAG with same chunks for similar queries.

In vLLM, --enable-prefix-caching turns on automatic prefix caching. The cache lives in GPU memory and is evicted in LRU order. Chat workloads typically see a 2–5× throughput improvement, depending on the hit rate.
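The mechanism can be sketched as a block-level cache keyed by the full token prefix, with LRU eviction. This is a simplified illustration (the `PrefixCache` class is hypothetical, uses whole prefixes as keys where real engines use chained hashes, and stores a placeholder instead of KV tensors).

```python
from collections import OrderedDict


class PrefixCache:
    def __init__(self, block_size=4, capacity=1024):
        self.block_size = block_size
        self.capacity = capacity      # max number of cached blocks
        self.blocks = OrderedDict()   # prefix tuple -> KV placeholder, in LRU order

    def _keys(self, tokens):
        # Only full blocks are cacheable. Each key covers the entire prefix,
        # so a block is reused only when everything before it matches too.
        n_full = len(tokens) // self.block_size
        return [tuple(tokens[: (i + 1) * self.block_size]) for i in range(n_full)]

    def lookup_and_insert(self, tokens):
        """Return how many prompt tokens were served from cache,
        then cache the remaining full blocks of this prompt."""
        keys = self._keys(tokens)
        hit_blocks = 0
        for key in keys:
            if key not in self.blocks:
                break
            self.blocks.move_to_end(key)  # mark as recently used
            hit_blocks += 1
        for key in keys[hit_blocks:]:
            self.blocks[key] = "kv-placeholder"
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        return hit_blocks * self.block_size


cache = PrefixCache(block_size=4)
system_prompt = list(range(8))                            # shared by every request
first = cache.lookup_and_insert(system_prompt + [100, 101, 102])
second = cache.lookup_and_insert(system_prompt + [200, 201, 202, 203, 204])
print(first, second)
```

The first request misses entirely; the second reuses the two blocks that hold the shared system prompt and only has to prefill its own suffix.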

8. Multi-LoRA serving

Scenario: 100 customers, each with their own fine-tuned adapter. Don't deploy 100 separate models.

Pattern:

Load base model once.
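Adapter (LoRA delta) selected per request; each one is small (~50 MB).
Use kernels that compute W·x + B·(A·x) on the fly, which equals (W + B·A)·x without ever merging the adapter into the base weights.

vLLM supports this with --enable-lora, and TGI has an equivalent feature. A single GPU can serve hundreds of LoRAs.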

Big architectural win: one fleet for all customers, vs N fleets. Cost down 10–50×.
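The "on the fly" computation is worth seeing once in miniature. The sketch below uses plain Python lists and tiny made-up matrices; the tenant names and the `adapters` dictionary are hypothetical. It shows that applying the low-rank path separately gives the same result as merging the adapter into the base weights, while the base matrix `W` stays shared.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x):
    """y = W.x + B.(A.x). The full-size delta B.A is never materialised."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # project down to rank r, then back up
    return [b + l for b, l in zip(base, low_rank)]


W = [[1, 2, 3], [4, 5, 6]]            # shared base weights (2 x 3)
adapters = {                          # per-tenant (A, B) pairs, rank 1
    "tenant-a": ([[1, 0, -1]], [[2], [3]]),
    "tenant-b": ([[0, 1, 0]], [[-1], [1]]),
}
x = [1, 2, 4]

A, B = adapters["tenant-a"]
y = lora_forward(W, A, B, x)

# Reference: merge the adapter into the weights, as a dedicated deployment would.
merged = [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(len(A)))
           for j in range(len(W[0]))] for i in range(len(W))]
print(y, matvec(merged, x))
```

Because only `A` and `B` differ per tenant, requests for different adapters can share one copy of `W` in GPU memory.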

9. Parallelism for serving

Recap from S30:

Tensor parallel (TP) — splits weights across GPUs in one node. Low latency, high bandwidth (NVLink). Default for inference.
Pipeline parallel (PP) — layers on different nodes. Adds latency per pipeline stage; suited to networks with weak interconnect.
Expert parallel (EP) — MoE; experts across GPUs.

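For 70B on 4× A100: TP=4. For 405B on 8× H100: TP=8 (or TP=4 PP=2) with FP8 or lower-precision weights, since fp16 weights alone (~810 GB) exceed the 640 GB available on 8× 80 GB GPUs.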
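A quick way to sanity-check a parallelism plan is to divide the weight bytes across the GPUs and see what is left for KV cache. The sketch below assumes 80 GB GPUs, an even split of weights across TP × PP ranks, and a hypothetical rule of thumb that at least 30% of each GPU should stay free for KV cache and activations.

```python
def weights_per_gpu_gb(n_params, bytes_per_param, tp, pp=1):
    """Weight memory per GPU in GB, assuming an even split across tp * pp ranks."""
    return n_params * bytes_per_param / (tp * pp) / 1e9


def fits(n_params, bytes_per_param, tp, pp=1, gpu_gb=80, kv_headroom=0.30):
    """True if the weights leave at least kv_headroom of each GPU free."""
    per_gpu = weights_per_gpu_gb(n_params, bytes_per_param, tp, pp)
    return per_gpu <= gpu_gb * (1 - kv_headroom)


plans = [
    ("70B fp16, TP=4", 70e9, 2, 4, 1),
    ("405B fp16, TP=8", 405e9, 2, 8, 1),
    ("405B fp8, TP=8", 405e9, 1, 8, 1),
    ("405B fp8, TP=4 PP=2", 405e9, 1, 4, 2),
]
for name, n_params, bytes_per_param, tp, pp in plans:
    per_gpu = weights_per_gpu_gb(n_params, bytes_per_param, tp, pp)
    print(f"{name}: {per_gpu:.1f} GB/GPU, fits={fits(n_params, bytes_per_param, tp, pp)}")
```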

10. Throughput tuning checklist

max_num_batched_tokens — per-step token budget; sets the chunked-prefill chunk size.
max_num_seqs — max in-flight requests.
gpu_memory_utilization — fraction of GPU memory the engine may claim (0.85–0.92 typical).
kv_cache_dtype — fp16 vs fp8 (fp8 doubles KV capacity).
Quantised weights (int8/int4).
Prefix caching on.
Speculative decoding on if you have a good draft.

The levers interact. Tune one at a time, measure, retune. A spreadsheet of (config, throughput, p50, p99) is your friend.
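The spreadsheet can be as simple as a list of measured runs plus a rule for picking the winner: highest throughput among configs that still meet the latency SLO. The sketch below uses made-up config labels and numbers purely for illustration; they are not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    config: str
    tokens_per_s: float
    p50_ms: float
    p99_ms: float


def best_under_slo(runs, p99_slo_ms):
    """Highest-throughput run whose p99 latency meets the SLO, or None."""
    eligible = [r for r in runs if r.p99_ms <= p99_slo_ms]
    return max(eligible, key=lambda r: r.tokens_per_s) if eligible else None


# Placeholder measurements, one lever changed per row.
runs = [
    Run("baseline", 2000, 40, 120),
    Run("max_num_seqs=256", 3100, 55, 210),
    Run("max_num_seqs=256 + fp8 KV", 3600, 52, 180),
    Run("max_num_seqs=512 + fp8 KV", 4200, 70, 340),
]
winner = best_under_slo(runs, p99_slo_ms=250)
print(winner.config if winner else "no config meets the SLO")
```

Picking by throughput alone would choose the last row, which blows the p99 budget; recording both columns is what keeps the tuning honest.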

11. Observability for inference

Token-level metrics — input tokens, output tokens, accepted-speculative-tokens.
Phase timings — prefill ms, decode ms/token.
Queue depth — pending requests.
KV cache occupancy — fraction full; alert at 90%.
Prefix-cache hit rate — should be 30–80% for chat workloads.

Standard exporters (Prometheus) ship with vLLM/TGI. Wire to Grafana, set SLO alerts.
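The ratios above are derived from raw counters. A minimal sketch of that derivation follows; the counter names are hypothetical and do not correspond to the metric names any engine actually exports, and the thresholds are the ones quoted in the list.

```python
def derive_metrics(counters):
    """Turn raw counters into the ratios worth alerting on."""
    return {
        "spec_acceptance_rate":
            counters["spec_accepted_tokens"] / max(1, counters["spec_proposed_tokens"]),
        "prefix_hit_rate":
            counters["prefix_cached_tokens"] / max(1, counters["prompt_tokens"]),
        "kv_occupancy":
            counters["kv_blocks_used"] / counters["kv_blocks_total"],
    }


def alerts(metrics, kv_threshold=0.90, min_prefix_hit=0.30):
    """Return human-readable alerts for the thresholds mentioned above."""
    fired = []
    if metrics["kv_occupancy"] >= kv_threshold:
        fired.append("KV cache nearly full")
    if metrics["prefix_hit_rate"] < min_prefix_hit:
        fired.append("prefix-cache hit rate below expected range")
    return fired


# Placeholder counter values.
counters = {
    "spec_accepted_tokens": 700, "spec_proposed_tokens": 1000,
    "prefix_cached_tokens": 400, "prompt_tokens": 1000,
    "kv_blocks_used": 950, "kv_blocks_total": 1000,
}
metrics = derive_metrics(counters)
print(metrics, alerts(metrics))
```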

12. Reality check

Cost-minimised stack circa 2026:

vLLM with int4 AWQ weights + fp8 KV cache.
Prefix caching enabled.
Speculative decoding via Medusa or EAGLE heads.
Multi-LoRA for tenant isolation.
TP=4 on a node of 4× A100 or H100.
Chunked prefill for smooth tail latency.

That setup serves 4–8× more tokens/sec than naive fp16 vLLM. Same hardware, same SLOs, ~5× cheaper per token. The math is obvious; the engineering is real work.
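The per-token cost follows directly from the hourly GPU price and the sustained throughput. The sketch below uses hypothetical prices and throughputs to show the arithmetic; it is not a measurement of any real deployment.

```python
def cost_per_million_tokens(gpu_hourly_usd, n_gpus, tokens_per_s):
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_s * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1e6


# Hypothetical: 4 GPUs at $2/hour each, before and after a 5x throughput gain.
baseline = cost_per_million_tokens(gpu_hourly_usd=2.0, n_gpus=4, tokens_per_s=2000)
tuned = cost_per_million_tokens(gpu_hourly_usd=2.0, n_gpus=4, tokens_per_s=10000)
print(f"baseline: ${baseline:.2f}/M tokens, tuned: ${tuned:.2f}/M tokens")
```

With hardware cost fixed, cost per token scales inversely with throughput, so a 5× throughput gain is a 5× cost reduction.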

Link: https://leetcode.com/problems/longest-palindromic-substring/
Difficulty: Medium
Why this problem: Speculative-decode verification is essentially "longest accepted prefix of draft"; both problems are about matching segments efficiently.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Pick the right quantisation method (int8/int4/FP8/AWQ/GPTQ) for a given GPU and workload.
Explain why decoding is memory-bound and how speculative decoding exploits it.
Configure prefix caching and predict the throughput impact for a chat workload.
Stand up multi-LoRA serving with vLLM/TGI.
Tune max_num_batched_tokens, gpu_memory_utilization, and KV-cache dtype.
Solve longest-palindromic-substring, as practice for the segment-matching reasoning behind spec verification.
