LLM Serving Part 2 — Speculative Decoding, Quantisation, Throughput
Session 34 of the 48-session learning series.
Date: Sun, 2026-07-05 · Time: 14:30–16:30 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 21 · Est. read: 2 h
Why this session matters
This is Session 34 of 48 in the LLM track. Part 1 covered the engine (vLLM, KV cache, batching). Part 2 covers the levers — quantisation, speculative decoding, throughput tuning — that turn that engine from "works" to "$0.50 per million tokens". If you operate any non-trivial LLM workload, these techniques are no longer optional.
Agenda
- Quantisation — int8, int4, FP8, AWQ, GPTQ; what to pick
- Speculative decoding — small draft + big verifier
- Chunked prefill, prefix caching, lookahead decoding
- Multi-LoRA serving — many adapters, one base
- Throughput tuning — batch size, KV pool, parallelism
Pre-read (skim before the session)
- Speculative Decoding (Leviathan et al., 2023)
- GPTQ paper (Frantar et al., 2022)
- AWQ paper (Lin et al., 2023)
- vLLM docs — prefix caching
Deep dive
1. Quantisation — the headline lever
Reducing precision of weights (and optionally activations) shrinks memory + accelerates compute.
| Format | Bits | Memory ratio | Quality loss | Use |
|---|---|---|---|---|
| fp32 | 32 | 1.0× | reference | training |
| fp16 / bf16 | 16 | 0.5× | none | training + serving |
| fp8 | 8 | 0.25× | minimal (H100+) | serving (latest GPUs) |
| int8 | 8 | 0.25× | minimal | serving (any GPU) |
| int4 | 4 | 0.125× | 0.5–1% benchmarks | serving |
| int3 | 3 | 0.094× | 2–5% | experimental |
Memory savings cascade: smaller weights → more room for KV cache → more concurrent requests → higher throughput.
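The cascade above can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the 70B parameter count, the 80-layer / 8-KV-head / 128-dim shape, and the 288 GB budget (4× 80 GB at 0.9 utilisation) are assumptions, and it ignores quantisation scales, activations and fragmentation.

```python
# Back-of-envelope memory budget. Every number here is an illustrative assumption.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion, fmt):
    """Weight memory in GB (1 GB = 1e9 bytes), ignoring scales and zero-points."""
    return params_billion * BYTES_PER_PARAM[fmt]

def kv_bytes_per_token(layers, kv_heads, head_dim, kv_bytes):
    """K and V vectors for every layer and KV head, for one token."""
    return 2 * layers * kv_heads * head_dim * kv_bytes

def kv_token_capacity(total_gb, params_billion, fmt,
                      layers, kv_heads, head_dim, kv_bytes=2.0):
    """Tokens of KV cache that fit in whatever memory the weights leave free."""
    free_gb = total_gb - weight_gb(params_billion, fmt)
    per_token = kv_bytes_per_token(layers, kv_heads, head_dim, kv_bytes)
    return max(0, int(free_gb * 1e9 // per_token))

# Hypothetical 70B model with grouped-query attention on a 288 GB budget.
shape = dict(layers=80, kv_heads=8, head_dim=128)
cap_fp16 = kv_token_capacity(288, 70, "fp16", **shape)
cap_int8 = kv_token_capacity(288, 70, "int8", **shape)
cap_int4 = kv_token_capacity(288, 70, "int4", **shape)
cap_fp16_fp8kv = kv_token_capacity(288, 70, "fp16", kv_bytes=1.0, **shape)

for name, cap in [("fp16", cap_fp16), ("int8", cap_int8), ("int4", cap_int4)]:
    print(f"{name} weights: room for about {cap:,} cached tokens")
```

More cached tokens means more concurrent sequences at a given context length, which is where the throughput gain comes from.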
2. Weight-only vs weight-and-activation quantisation
- Weight-only (W8A16, W4A16) — weights quantised, activations stay fp16. Less quality loss and the easiest option to get right.
- Weight + activation (W8A8, FP8) — both quantised. Requires careful calibration; bigger throughput win.
Most production deployments are weight-only int8/int4. FP8 weights+activations on H100 is the rising standard.
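To make the weight-only idea concrete, here is a minimal pure-Python sketch of symmetric absmax int8 quantisation with one scale per weight row. It is a teaching toy, not how a production kernel is written; the function names are made up for this example.

```python
def quantize_row_int8(row):
    """Symmetric absmax quantisation of one weight row: int8 codes plus one scale."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale

def dequantize_row(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]

def matvec_w8a16(q_rows, x):
    """Weight-only path: codes are dequantised on the fly, activations stay in float."""
    return [sum(c * scale * xi for c, xi in zip(codes, x)) for codes, scale in q_rows]

W = [[0.50, -1.00, 0.25], [0.03, 0.40, -0.22]]
x = [1.0, 2.0, -1.0]
q_rows = [quantize_row_int8(row) for row in W]
y_quant = matvec_w8a16(q_rows, x)
y_exact = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print(y_exact, y_quant)
```

A W8A8 scheme would additionally quantise `x`, which is why it needs activation calibration and is harder to get right.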
3. The popular int4 methods
- GPTQ — calibration-set post-training quantisation. Layer-by-layer optimisation. Standard for years; widely supported.
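- AWQ — Activation-aware. Identifies "salient" weight channels (driven by activation magnitude) and protects them by scaling them up before quantisation, with the inverse scale folded into the activations, rather than storing them at higher precision. Often smaller quality loss than GPTQ at int4.
- GGUF k-quants such as Q4_K_M / Q5_K_S (llama.cpp ecosystem) — presets that mix bit widths across tensors within one model file; great for CPU + Apple Silicon.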
For server-side GPU: AWQ int4 is currently the sweet spot.
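The intuition behind activation-aware scaling can be shown on a single hand-picked weight group. This is a toy, not the AWQ algorithm: AWQ searches for per-channel scales using calibration data, while the scale of 2 and all values below are assumptions chosen for illustration.

```python
def fake_quant(group, qmax=7):
    """Round-trip a weight group through symmetric int4 (codes in [-qmax, qmax])."""
    step = max(abs(w) for w in group) / qmax
    return [round(w / step) * step for w in group]

def output_error(weights, acts, scales):
    """|exact dot product - quantised dot product| with per-channel scales folded in."""
    scaled_w = [w * s for w, s in zip(weights, scales)]   # weights scaled up
    scaled_x = [x / s for x, s in zip(acts, scales)]      # activations scaled down
    exact = sum(w * x for w, x in zip(weights, acts))
    approx = sum(w * x for w, x in zip(fake_quant(scaled_w), scaled_x))
    return abs(exact - approx)

weights = [0.9, -0.31, 0.2, 0.47]
acts = [0.1, 0.1, 10.0, 0.1]   # channel 2 sees a large activation, so it is "salient"

plain = output_error(weights, acts, [1, 1, 1, 1])
protected = output_error(weights, acts, [1, 1, 2, 1])
print(f"error without scaling: {plain:.4f}, with salient channel scaled: {protected:.4f}")
```

Scaling helps here because the group's largest weight, and therefore the quantisation step, is unchanged while the salient weight's rounding error is divided by its scale. Too large a scale would raise the group maximum and hurt the other channels, which is why real methods search for the scale.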
4. Calibration data matters
Quantisation algorithms learn from a small calibration set. Use:
- ~128–512 samples.
- Representative of your serving distribution (not random web text).
- Diverse: mix instruction, chat, domain-specific samples.
Same model quantised with wrong calibration = noticeably worse on your workload.
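One way to assemble such a set is to sample evenly from pools of real serving traffic. The sketch below assumes you already have lists of prompts per category; the pool names and sizes are hypothetical.

```python
import random

def build_calibration_set(pools, total=256, seed=0):
    """Draw a roughly even mix from each pool of serving-like samples."""
    rng = random.Random(seed)            # fixed seed so the set is reproducible
    per_pool = total // len(pools)
    picked = []
    for name, samples in pools.items():
        k = min(per_pool, len(samples))  # never ask for more than the pool holds
        picked.extend((name, s) for s in rng.sample(samples, k))
    rng.shuffle(picked)
    return picked

# Hypothetical pools; in practice these come from logged production requests.
pools = {
    "instruction": [f"instruction sample {i}" for i in range(200)],
    "chat": [f"chat sample {i}" for i in range(200)],
    "domain": [f"domain sample {i}" for i in range(200)],
}
calib = build_calibration_set(pools, total=256)
print(len(calib), calib[0])
```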
5. Speculative decoding
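Decoding is memory-bound: each step reads all the weights plus the KV cache to produce a single token, so the GPU's compute units sit largely idle. Idea: use that idle compute to check several candidate tokens per weight read.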
- A small draft model proposes the next N tokens (cheap, fast).
- The large target model verifies all N in one forward pass.
- Accepted tokens are kept; first rejection truncates and continues from there.
If the draft model is good, you get ~2–3× speedup with no quality loss: the acceptance rule keeps the output distribution identical to the target model's, so the target is mathematically still the authority.
draft proposes: "the quick brown fox jumps"
target verifies all 5 in parallel:
accepts "the quick brown" — rejects "fox"
re-samples token 4 itself: "dog"
draft restarts from "the quick brown dog"
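The trace above can be reproduced with a small sketch. It uses greedy acceptance (keep a draft token only if it equals the target's own choice), which is a simplification of the probabilistic acceptance rule in the speculative decoding paper. The two "models" are toy lookup functions invented for this example.

```python
def speculative_step(prefix, draft_next, target_next, n_draft=5):
    """One draft-then-verify round with greedy acceptance.

    draft_next / target_next map a token list to the next token (toy stand-ins
    for model calls). A real engine scores all draft positions in one batched
    target forward pass; here that pass is simulated position by position.
    """
    # 1. Draft proposes n_draft tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(n_draft):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. Target checks each position; keep the longest agreeing prefix.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        target_tok = target_next(ctx)
        if target_tok != tok:
            accepted.append(target_tok)  # first rejection: target's token replaces it
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Everything accepted: the same target pass also yields one bonus token.
    accepted.append(target_next(ctx))
    return accepted

TARGET = "the quick brown dog jumps over".split()
DRAFT = "the quick brown fox jumps over".split()
target_next = lambda ctx: TARGET[len(ctx)]
draft_next = lambda ctx: DRAFT[len(ctx)]

result = speculative_step([], draft_next, target_next, n_draft=5)
print(result)
```

One round here emits four tokens for a single target pass; with a perfect draft it would emit six (five accepted plus the bonus token).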
Variants:
- Self-speculative — same model with early-layer shortcut as draft (no separate model needed).
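- Lookahead decoding — the target model generates and verifies its own n-gram candidates in parallel, so no draft model is needed. The simpler prompt-lookup variant drafts by n-gram matching against text already in the context.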
- EAGLE / Medusa — small "head" branches on top of base model; cheap draft.
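The model-free n-gram idea from the variants list can be sketched in a few lines. This follows the prompt-lookup style (match the most recent tokens against earlier context and propose what followed); the function name and defaults are assumptions for illustration.

```python
def ngram_draft(tokens, ngram=2, n_draft=4):
    """Propose draft tokens by finding the latest earlier match of the last `ngram` tokens."""
    if len(tokens) < ngram:
        return []
    tail = tokens[-ngram:]
    # Scan right to left, starting just before the tail's own position.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if tokens[start:start + ngram] == tail:
            # Propose whatever followed the earlier occurrence.
            return tokens[start + ngram:start + ngram + n_draft]
    return []  # no match: fall back to ordinary decoding for this step

context = ["a", "b", "c", "d", "e", "a", "b"]
draft = ngram_draft(context)
print(draft)
```

The proposal is then verified by the target model exactly like a draft from a small model. It works best on repetitive output such as code edits or summaries that quote the input.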
6. Chunked prefill
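Prefill of a long prompt blocks all in-flight decodes because it monopolises the GPU. Chunked prefill splits the prefill into segments and interleaves them with decode steps. Result: steadier inter-token latency for running requests and less head-of-line blocking for short queries arriving during a long prefill, at the cost of a slightly longer TTFT for the long prompt itself.
vLLM and TGI both support it. In vLLM, max_num_batched_tokens sets the per-step token budget and therefore the chunk size.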
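A simplified view of the scheduling policy, assuming decodes are served first and the leftover token budget goes to the next prefill chunk. This is a sketch of the idea, not any engine's actual scheduler, and the numbers are hypothetical.

```python
def schedule_steps(prompt_len, decoding_seqs, token_budget):
    """Return (prefill_tokens, decode_tokens) for each engine step.

    Each running sequence gets its one decode token first; the remaining
    budget is spent on the next chunk of the waiting prompt.
    """
    if token_budget <= decoding_seqs:
        raise ValueError("token budget must leave room for prefill")
    remaining = prompt_len
    steps = []
    while remaining > 0:
        chunk = min(remaining, token_budget - decoding_seqs)
        steps.append((chunk, decoding_seqs))
        remaining -= chunk
    return steps

# Hypothetical load: a 5,000-token prompt arrives while 48 sequences are decoding.
plan = schedule_steps(prompt_len=5000, decoding_seqs=48, token_budget=2048)
for i, (prefill, decode) in enumerate(plan):
    print(f"step {i}: {prefill} prefill tokens + {decode} decode tokens")
```

Without chunking, the 48 running sequences would stall for one long 5,000-token step; with it they keep producing a token every step.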
7. Prefix caching (already touched in S30)
Cache KV blocks for repeated prefixes. Hits common with:
- System prompts (every request shares).
- Few-shot exemplars.
- Conversation history (each chat turn extends from previous).
- RAG with same chunks for similar queries.
vLLM turns on automatic prefix caching with --enable-prefix-caching. The cache lives in GPU memory and blocks are evicted LRU. Expect roughly 2–5× throughput improvement on chat workloads with high hit rates.
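The mechanism can be sketched as a block-level cache keyed by a chained hash, so a block only matches when everything before it matches too. This is an illustrative toy (block size 4, a string standing in for KV tensors), not vLLM's implementation.

```python
from collections import OrderedDict

BLOCK = 4  # tokens per KV block (illustrative; real engines use larger blocks)

class PrefixCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block key -> placeholder for cached KV

    def lookup_and_fill(self, tokens):
        """Return how many leading tokens were served from cache, then cache the rest."""
        hit_tokens, parent, still_hitting = 0, None, True
        full = len(tokens) - len(tokens) % BLOCK      # only whole blocks are cached
        for i in range(0, full, BLOCK):
            # Chained key: identifies this block AND the entire prefix before it.
            key = hash((parent, tuple(tokens[i:i + BLOCK])))
            if key in self.blocks:
                self.blocks.move_to_end(key)          # mark as recently used
                if still_hitting:
                    hit_tokens += BLOCK
            else:
                still_hitting = False
                self.blocks[key] = "kv"
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)   # evict least recently used
            parent = key
        return hit_tokens

cache = PrefixCache(capacity_blocks=100)
system_prompt = list(range(8))                 # shared by every request
req1 = system_prompt + [100, 101, 102, 103]
req2 = system_prompt + [200, 201, 202, 203]
hits = [cache.lookup_and_fill(r) for r in (req1, req2, req1)]
print(hits)
```

The second request reuses the shared system prompt, and a repeat of the first request is served entirely from cache, which is the pattern behind the chat-workload gains.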
8. Multi-LoRA serving
Scenario: 100 customers, each with their own fine-tuned adapter. Don't deploy 100 separate models.
Pattern:
- Load base model once.
- Adapter (LoRA delta) loaded per request, small (~50 MB).
- Use kernels that compute (W + B·A)·x on the fly.
vLLM supports this with --enable-lora; TGI offers an equivalent multi-LoRA feature. A single GPU can serve hundreds of LoRAs.
Big architectural win: one fleet for all customers, vs N fleets. Cost down 10–50×.
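The pattern above relies on never materialising W + B·A. A minimal sketch with plain lists shows that computing W·x + B·(A·x) per request gives the same result as the merged weight; the tiny matrices and tenant names are made up for illustration.

```python
def matvec(m, v):
    """Matrix-vector product on plain lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, adapter, x):
    """y = W·x + B·(A·x): the low-rank path is applied per request, W is never modified."""
    A, B = adapter                       # A: r x d_in, B: d_out x r, with r small
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]

W = [[1, 0], [0, 1]]                     # shared base weights, loaded once
adapters = {                             # hypothetical per-tenant adapters (rank 1)
    "tenant_a": ([[1, 1]], [[2], [3]]),
    "tenant_b": ([[0, 1]], [[1], [0]]),
}
x = [1, 2]
y_a = lora_forward(W, adapters["tenant_a"], x)
y_b = lora_forward(W, adapters["tenant_b"], x)

# Reference: explicitly merged weights for tenant_a, W + B·A = [[3, 2], [3, 4]].
merged_a = matvec([[3, 2], [3, 4]], x)
print(y_a, y_b, merged_a)
```

Because only the small A and B differ per tenant, requests for different adapters can share one batch over the same base weights.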
9. Parallelism for serving
Recap from S30:
- Tensor parallel (TP) — splits weights across GPUs in one node. Low latency, high bandwidth (NVLink). Default for inference.
- Pipeline parallel (PP) — layers on different nodes. Adds latency per pipeline stage; suited to networks with weak interconnect.
- Expert parallel (EP) — MoE; experts across GPUs.
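For 70B on 4× A100 (80 GB): TP=4. For 405B on 8× H100: TP=8 (or TP=4 PP=2) with FP8 or lower-precision weights, since 405B at 16-bit is ~810 GB against 640 GB of total HBM.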
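A quick feasibility check for a parallelism plan is to divide the weight bytes across GPUs and see what is left for KV cache. The sketch assumes an even split, 80 GB GPUs, and a 20% headroom target; all three are assumptions, and real deployments also need room for activations and runtime overhead.

```python
def per_gpu_weight_gb(params_billion, bytes_per_param, tp, pp=1):
    """Weights split evenly across tp * pp GPUs (a simplification)."""
    return params_billion * bytes_per_param / (tp * pp)

def fits(params_billion, bytes_per_param, tp, pp=1, gpu_gb=80, kv_headroom=0.2):
    """True if weights leave at least `kv_headroom` of each GPU free for KV cache."""
    budget = gpu_gb * (1 - kv_headroom)
    return per_gpu_weight_gb(params_billion, bytes_per_param, tp, pp) <= budget

plans = {
    "70B 16-bit, TP=4": fits(70, 2, tp=4),
    "405B 16-bit, TP=8": fits(405, 2, tp=8),
    "405B 8-bit, TP=8": fits(405, 1, tp=8),
    "405B 8-bit, TP=4 PP=2": fits(405, 1, tp=4, pp=2),
}
for name, ok in plans.items():
    print(f"{name}: {'fits' if ok else 'does not fit'}")
```

Bytes per parameter is the variable that decides whether a single 8-GPU node is enough for the largest models.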
10. Throughput tuning checklist
- max_num_batched_tokens — chunked prefill threshold.
- max_num_seqs — max in-flight requests.
- gpu_memory_utilization — fraction of GPU reserved (0.85–0.92 typical).
- kv_cache_dtype — fp16 vs fp8 (fp8 doubles KV capacity).
- Quantised weights (int8/int4).
- Prefix caching on.
- Speculative decoding on if you have a good draft.
Each lever interacts. Tune one at a time, measure, retune. Spreadsheet of (config, throughput, p50, p99) is your friend.
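"Tune one at a time" can be turned into a small sweep generator that feeds the spreadsheet. The knob names below match the checklist; the baseline and candidate values are illustrative assumptions, and running each config through a benchmark is left to you.

```python
BASELINE = {
    "max_num_batched_tokens": 2048,
    "max_num_seqs": 128,
    "gpu_memory_utilization": 0.90,
    "kv_cache_dtype": "auto",
}
CANDIDATES = {                       # illustrative values, not recommendations
    "max_num_batched_tokens": [1024, 4096, 8192],
    "max_num_seqs": [64, 256],
    "gpu_memory_utilization": [0.85, 0.92],
    "kv_cache_dtype": ["fp8"],
}

def one_at_a_time(baseline, candidates):
    """Yield the baseline, then configs that differ from it in exactly one knob."""
    yield dict(baseline)
    for knob, values in candidates.items():
        for value in values:
            yield {**baseline, knob: value}

configs = list(one_at_a_time(BASELINE, CANDIDATES))
for cfg in configs:
    changed = {k: v for k, v in cfg.items() if BASELINE[k] != v}
    print(changed or "baseline")
```

Once the best single-knob changes are known, fold them into a new baseline and repeat, since the levers interact.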
11. Observability for inference
- Token-level metrics — input tokens, output tokens, accepted-speculative-tokens.
- Phase timings — prefill ms, decode ms/token.
- Queue depth — pending requests.
- KV cache occupancy — fraction full; alert at 90%.
- Prefix-cache hit rate — should be 30–80% for chat workloads.
Standard exporters (Prometheus) ship with vLLM/TGI. Wire to Grafana, set SLO alerts.
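The ratios and thresholds in the list above are simple to derive from raw counters. The field names in this sketch are hypothetical; map them to whatever your exporter actually emits.

```python
from dataclasses import dataclass

@dataclass
class EngineSnapshot:
    # Hypothetical counter names, not the names of any real exporter's metrics.
    kv_blocks_used: int
    kv_blocks_total: int
    prefix_hit_tokens: int
    prefix_query_tokens: int
    spec_accepted_tokens: int
    spec_draft_tokens: int

def derived(s):
    """Turn raw counters into the ratios worth plotting."""
    return {
        "kv_occupancy": s.kv_blocks_used / s.kv_blocks_total,
        "prefix_hit_rate": s.prefix_hit_tokens / max(1, s.prefix_query_tokens),
        "spec_acceptance": s.spec_accepted_tokens / max(1, s.spec_draft_tokens),
    }

def alerts(s, kv_threshold=0.90, min_prefix_hit=0.30):
    """Apply the thresholds from the list above: KV at 90%, prefix hits below 30%."""
    d = derived(s)
    out = []
    if d["kv_occupancy"] >= kv_threshold:
        out.append("kv_cache_nearly_full")
    if d["prefix_hit_rate"] < min_prefix_hit:
        out.append("prefix_hit_rate_low")
    return out

snap = EngineSnapshot(920, 1000, 200, 1000, 300, 500)
print(derived(snap), alerts(snap))
```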
12. Reality check
Cost-minimised stack circa 2026:
- vLLM with int4 AWQ weights + fp8 KV cache.
- Prefix caching enabled.
- Speculative decoding via Medusa or EAGLE draft heads.
- Multi-LoRA for tenant isolation.
- TP=4 on a node of 4× A100 or H100.
- Chunked prefill for smooth tail latency.
That setup serves 4–8× more tokens/sec than naive fp16 vLLM. Same hardware, same SLOs, ~5× cheaper per token. The math is obvious; the engineering is real work.
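The cost claim is straightforward arithmetic: price per token is node cost divided by sustained throughput. The hourly price and token rates below are hypothetical placeholders; substitute measured values.

```python
def usd_per_million_tokens(node_usd_per_hour, tokens_per_second):
    """Cost of generating one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return node_usd_per_hour / tokens_per_hour * 1_000_000

# Hypothetical node at $8/hour: naive fp16 baseline vs a stack with 5x throughput.
baseline_cost = usd_per_million_tokens(8.0, 1_000)
tuned_cost = usd_per_million_tokens(8.0, 5_000)
print(f"baseline: ${baseline_cost:.2f}/M tokens, tuned: ${tuned_cost:.2f}/M tokens")
```

Because the hardware cost is fixed, every multiple of throughput at the same SLO divides the per-token price by the same factor.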
Reading material
- Speculative Decoding (Leviathan et al., 2023)
- GPTQ paper (Frantar et al., 2022)
- AWQ paper (Lin et al., 2023)
- vLLM Performance Optimisation guide
In-depth research material
- EAGLE (Speculative Sampling)
- Medusa (Multiple Decoding Heads)
- FlashAttention-3 paper
- vLLM source — speculative decoding
Video reference
▶︎ Speculative Decoding & Quantisation Explained (Mark Saroufim)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Longest Palindromic Substring
- Link: https://leetcode.com/problems/longest-palindromic-substring/
- Difficulty: Medium
- Why this problem: Speculative-decode verification is essentially "longest accepted prefix of draft"; both problems are about matching segments efficiently.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Pick the right quantisation method (int8/int4/FP8/AWQ/GPTQ) for a given GPU and workload.
- Explain why decoding is memory-bound and how speculative decoding exploits it.
- Configure prefix caching and predict the throughput impact for a chat workload.
- Stand up multi-LoRA serving with vLLM/TGI.
- Tune max_num_batched_tokens, gpu_memory_utilization, and KV-cache dtype.
- Solve longest-palindromic-substring — prefix-matching primitive used in spec verification.