ai ml · intermediate · 12m · 2026-06-09

LLM Serving Part 2 — Speculative Decoding, Quantisation, Throughput

Session 34 of the 48-session learning series.

Date: Sun, 2026-07-05 · Time: 14:30–16:30 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 21 · Est. read: 2 h

Why this session matters

Part 1 covered the engine (vLLM, KV cache, batching). Part 2 covers the levers (quantisation, speculative decoding, throughput tuning) that turn that engine from "works" to "$0.50 per million tokens". If you operate any non-trivial LLM workload, these techniques are no longer optional.

Agenda

  • Quantisation — int8, int4, FP8, AWQ, GPTQ; what to pick
  • Speculative decoding — small draft + big verifier
  • Chunked prefill, prefix caching, lookahead decoding
  • Multi-LoRA serving — many adapters, one base
  • Throughput tuning — batch size, KV pool, parallelism

Deep dive

1. Quantisation — the headline lever

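Reducing the precision of weights (and optionally activations) shrinks memory and speeds up inference: fewer bytes per weight means less memory traffic during decode, and low-precision matmuls run faster on hardware that supports them.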

Format        Bits   Memory vs fp32   Quality loss            Use
fp32          32     1.0×             reference               training
fp16 / bf16   16     0.5×             none                    training + serving
fp8           8      0.25×            minimal (H100+)         serving (latest GPUs)
int8          8      0.25×            minimal                 serving (any GPU)
int4          4      0.125×           0.5–1% on benchmarks    serving
int3          3      0.094×           2–5%                    experimental

Memory savings cascade: smaller weights → more room for KV cache → more concurrent requests → higher throughput.
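
The cascade is easy to put numbers on. Below is a back-of-envelope sketch, not a sizing tool: the 70B model shape (80 layers, 8 KV heads of dimension 128), the 4× 80 GB node and the 0.90 memory fraction are all assumptions chosen for illustration, and quantisation scales, activations and framework overhead are ignored.

```python
# Back-of-envelope sketch: how weight precision changes the room left for KV cache.
# Every size and shape here is an illustrative assumption, not a measurement.

def weight_gb(params_billion: float, bits: int) -> float:
    """Weight memory in GB (1e9 bytes), ignoring quantisation scales/zero-points."""
    return params_billion * bits / 8


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, kv_bits: int = 16) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer."""
    return 2 * layers * kv_heads * head_dim * kv_bits // 8


def kv_token_capacity(total_gpu_gb, util, params_billion, weight_bits,
                      layers, kv_heads, head_dim, kv_bits=16):
    """Tokens of KV cache that fit once the weights are loaded."""
    kv_gb = total_gpu_gb * util - weight_gb(params_billion, weight_bits)
    return int(kv_gb * 1e9 // kv_bytes_per_token(layers, kv_heads, head_dim, kv_bits))


# Assumed setup: 70B parameters on 4 GPUs of 80 GB each.
shape = dict(layers=80, kv_heads=8, head_dim=128)
for bits in (16, 8, 4):
    tokens = kv_token_capacity(320, 0.90, 70, bits, **shape)
    print(f"{bits:>2}-bit weights: {weight_gb(70, bits):6.1f} GB, ~{tokens:,} KV tokens")
```

More KV tokens means more sequences in flight at once, which is where the throughput gain comes from.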

2. Weight-only vs weight-and-activation quantisation

  • Weight-only (W8A16, W4A16) — weights quantised, activations stay fp16. Less quality loss and the easiest to get right.
  • Weight + activation (W8A8, FP8) — both quantised. Requires careful calibration; bigger throughput win.
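
To make the W8A16 case concrete, here is a minimal sketch of weight-only quantisation using symmetric per-row absmax scaling. This is the simplest round-to-nearest scheme, shown for intuition only; it is not how GPTQ or AWQ choose their quantised values, and the function names are made up for this example.

```python
# Minimal sketch of weight-only int8 (W8A16-style) quantisation.
# Symmetric per-row absmax scaling; names are illustrative, not a library API.

def quantize_row_int8(row):
    """Map one weight row to int8 codes plus a single float scale."""
    scale = max(abs(w) for w in row) / 127 or 1.0  # all-zero row: avoid scale 0
    codes = [max(-127, min(127, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Recover approximate float weights from codes and scale."""
    return [c * scale for c in codes]


def matvec_w8a16(code_rows, scales, x):
    """y = W_hat @ x with int8 weight codes and float activations."""
    return [s * sum(c * xi for c, xi in zip(codes, x))
            for codes, s in zip(code_rows, scales)]


W = [[0.5, -1.27, 0.0, 0.63], [2.0, 0.1, -0.4, 1.0]]
x = [1.0, 2.0, -1.0, 0.5]
quantised = [quantize_row_int8(row) for row in W]
y = matvec_w8a16([q for q, _ in quantised], [s for _, s in quantised], x)
print(y)
```

Each weight lands within half a scale step of its original value, which is why the activations can stay in fp16 without any calibration of their range.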

Weight-only int8/int4 is the common production choice. FP8 weights and activations on H100-class GPUs is the rising standard.

3. Quantisation methods: GPTQ, AWQ, GGUF

  • GPTQ — calibration-set post-training quantisation. Layer-by-layer optimisation. Standard for years; widely supported.
  • AWQ — Activation-aware. Identifies "salient" weight channels (driven by activation magnitude) and protects them by rescaling before quantisation, so everything still ends up in int4. Often reported to lose less quality than GPTQ at int4.
  • GGUF / Q4_K_M / Q5_K_S (llama.cpp ecosystem) — multiple bit widths within one model; great for CPU + Apple Silicon.

For server-side GPU: AWQ int4 is currently the sweet spot.

4. Calibration data matters

Quantisation algorithms learn from a small calibration set. Use:

  • ~128–512 samples.
  • Representative of your serving distribution (not random web text).
  • Diverse: mix instruction, chat, domain-specific samples.

The same model quantised with unrepresentative calibration data performs noticeably worse on your workload.
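
One way to follow the three rules above is to draw the calibration set evenly from several pools of your own traffic. The sketch below assumes you already have those pools as lists of strings; the pool names and the helper itself are hypothetical, not part of any quantisation library.

```python
import random

# Sketch: build a mixed calibration set from several pools of real prompts.
# Pool names and sizes are hypothetical; plug in samples from your own serving logs.

def build_calibration_set(pools, n_samples=256, seed=0):
    """Draw n_samples prompts, split as evenly as possible across the pools."""
    rng = random.Random(seed)           # fixed seed keeps the set reproducible
    names = sorted(pools)
    per_pool, extra = divmod(n_samples, len(names))
    picked = []
    for i, name in enumerate(names):
        k = per_pool + (1 if i < extra else 0)
        if k > len(pools[name]):
            raise ValueError(f"pool {name!r} has fewer than {k} samples")
        picked.extend(rng.sample(pools[name], k))
    rng.shuffle(picked)
    return picked


pools = {
    "chat": [f"chat-{i}" for i in range(100)],
    "instruction": [f"instruction-{i}" for i in range(100)],
    "domain": [f"domain-{i}" for i in range(100)],
}
calibration = build_calibration_set(pools, n_samples=256)
print(len(calibration), calibration[:3])
```

An even split is only a starting point; weighting the pools by their share of production traffic is an equally reasonable choice.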

5. Speculative decoding

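Decoding is memory-bound: each step reads all the weights and the KV cache to produce a single token, so the GPU's compute units sit largely idle. Idea: spend that spare compute on checking several tokens per weight read.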

  • A small draft model proposes the next N tokens (cheap, fast).
  • The large target model verifies all N in one forward pass.
  • Accepted tokens are kept; at the first rejection the rest of the draft is discarded, the target supplies its own token for that position, and drafting continues from there.

If the draft model is good, you get ~2–3× speedup with no quality loss: with the standard accept/reject rule, the output distribution is exactly the target model's.

draft proposes: "the quick brown fox jumps"
target verifies all 5 in parallel:
   accepts "the quick brown" — rejects "fox"
   re-samples token 4 itself: "dog"
   draft restarts from "the quick brown dog"
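
The walkthrough above can be sketched as a single propose-and-verify step. This is a simplified greedy version in which a draft token is accepted only if it equals the target's own next token; production systems that sample use a rejection-sampling rule instead, and they verify all positions in one batched forward pass rather than a Python loop. The two "models" here are stand-in functions invented for the example.

```python
# Sketch of one greedy speculative-decoding step with toy stand-in models.
# draft_next / target_next map a token list to the next token.

def speculative_step(prefix, draft_next, target_next, n_draft=5):
    """Return (new_sequence, number_of_draft_tokens_accepted)."""
    # 1. The draft proposes n_draft tokens autoregressively (cheap).
    proposal = []
    for _ in range(n_draft):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target checks each position. A real engine does this in one
    #    forward pass over all positions; the loop here is for clarity.
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # target's token replaces the rejection
            return prefix + accepted, len(accepted) - 1
        accepted.append(token)

    # 3. Everything matched: the same target pass yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted, n_draft


TARGET = ["the", "quick", "brown", "dog", "barks", "loudly"]
DRAFT = ["the", "quick", "brown", "fox", "jumps"]
sequence, n_ok = speculative_step(
    [], lambda s: DRAFT[len(s)], lambda s: TARGET[len(s)]
)
print(sequence, n_ok)
```

Even in the rejection case the step emits four tokens for one target pass, which is the whole point: the target's weights are read once instead of four times.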

Variants:

  • Self-speculative — same model with early-layer shortcut as draft (no separate model needed).
  • Lookahead decoding — drafts from an n-gram pool built up during decoding (no draft model needed).
  • EAGLE / Medusa — small "head" branches on top of base model; cheap draft.

6. Chunked prefill

Prefill of a long prompt blocks all in-flight decodes (it monopolises the GPU). Chunked prefill splits the prefill into segments and interleaves them with decode steps. Result: steadier inter-token latency for running requests and less head-of-line blocking for short queries arriving during a long prefill; the long prompt's own TTFT may rise slightly.

vLLM and TGI both support it. In vLLM, max_num_batched_tokens sets the per-step token budget, which bounds the chunk size.
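
A toy scheduler shows how a per-step token budget turns one long prefill into several chunks. The model below is a deliberate simplification and an assumption about scheduling order: every running decode gets its one token first, and whatever budget remains goes to the pending prefill. It is not vLLM's actual scheduler code.

```python
# Toy model of chunked prefill under a per-step token budget.
# Assumption: decodes are scheduled first, the prefill takes the leftover budget.

def schedule_steps(prefill_tokens, n_decoding, token_budget):
    """Return a list of (decode_tokens, prefill_chunk) pairs, one per engine step."""
    if token_budget <= n_decoding:
        raise ValueError("budget leaves no room for prefill")
    steps = []
    remaining = prefill_tokens
    while remaining > 0:
        chunk = min(remaining, token_budget - n_decoding)
        steps.append((n_decoding, chunk))
        remaining -= chunk
    return steps


# A 1000-token prompt arrives while 8 requests are decoding, budget 512 tokens/step.
for step, (decodes, chunk) in enumerate(schedule_steps(1000, 8, 512)):
    print(f"step {step}: {decodes} decode tokens + {chunk} prefill tokens")
```

A smaller budget gives the running decodes shorter steps at the cost of more steps before the new prompt produces its first token, which is the trade-off the setting controls.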

7. Prefix caching (already touched in S30)

Cache KV blocks for repeated prefixes. Hits common with:

  • System prompts (every request shares).
  • Few-shot exemplars.
  • Conversation history (each chat turn extends from previous).
  • RAG with same chunks for similar queries.

vLLM enables automatic prefix caching with --enable-prefix-caching. The cache lives in GPU memory; blocks are evicted in LRU order. Throughput gains of 2–5× are typical on chat workloads with long shared prefixes.
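
The mechanism can be sketched in a few lines: split the prompt into fixed-size blocks, key each block by its tokens plus the key of the block before it, and keep the keys in an LRU structure. This is a toy model written for this post, with a made-up class name and a placeholder where the KV tensors would live; real engines differ in block size, hashing and eviction details.

```python
from collections import OrderedDict

# Toy block-level prefix cache with LRU eviction. Illustrative only.

class PrefixCache:
    def __init__(self, capacity_blocks, block_size=4):
        self.capacity = capacity_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()   # chained block key -> placeholder for KV data

    def lookup_and_insert(self, tokens):
        """Return how many leading tokens were served from the cache."""
        hit_tokens = 0
        parent = None
        still_hitting = True
        full = len(tokens) - len(tokens) % self.block_size  # only whole blocks
        for start in range(0, full, self.block_size):
            block = tuple(tokens[start:start + self.block_size])
            key = hash((parent, block))   # key depends on the entire prefix
            if key in self.blocks:
                self.blocks.move_to_end(key)          # mark as recently used
                if still_hitting:
                    hit_tokens += self.block_size
            else:
                still_hitting = False                 # reuse must be contiguous
                self.blocks[key] = None
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)   # evict least recently used
            parent = key
        return hit_tokens


cache = PrefixCache(capacity_blocks=8)
system_prompt = list(range(8))                        # shared by every request
print(cache.lookup_and_insert(system_prompt + [100, 101, 102, 103]))
print(cache.lookup_and_insert(system_prompt + [200, 201, 202, 203]))
```

Chaining the key on the parent block is what makes this a prefix cache: identical tokens at a different position, or after a different history, do not match.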

8. Multi-LoRA serving

Scenario: 100 customers, each with their own fine-tuned adapter. Don't deploy 100 separate models.

Pattern:

  • Load base model once.
  • Adapters (LoRA deltas) are small (~50 MB) and selected per request.
  • Use kernels that compute W·x + B·(A·x) on the fly, so the adapter is never merged into the base weights.

vLLM supports this with --enable-lora, and TGI has an equivalent feature. A single GPU can serve hundreds of LoRAs.
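
The reason one base model can serve many adapters is plain linear algebra: (W + B·A)·x equals W·x + B·(A·x), so the base product is shared and only the small rank-r path differs per tenant. A pure-Python sketch with tiny hand-picked matrices (a real kernel batches this across requests on the GPU):

```python
# Sketch: applying a LoRA adapter without merging it into the base weights.
# Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r. Values are made up.

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x):
    """y = W·x + B·(A·x): shared base path plus a cheap low-rank path."""
    base = matvec(W, x)                 # identical for every tenant
    delta = matvec(B, matvec(A, x))     # per-tenant, only r-dimensional inside
    return [b + d for b, d in zip(base, delta)]


def merged_forward(W, A, B, x):
    """y = (W + B·A)·x with the merged matrix materialised, for comparison."""
    BA = [[sum(B[i][k] * A[k][j] for k in range(len(A)))
           for j in range(len(A[0]))] for i in range(len(B))]
    merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]
    return matvec(merged, x)


W = [[1, 2], [3, 4]]
A = [[1, 1]]          # rank r = 1
B = [[2], [0]]
x = [1, 2]
print(lora_forward(W, A, B, x), merged_forward(W, A, B, x))
```

Merging would need a full copy of W per tenant; the unmerged form needs only A and B, which is why the adapters stay at tens of megabytes.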

Big architectural win: one fleet for all customers instead of N fleets. Cost can drop 10–50×, depending on how many tenants share each GPU.

9. Parallelism for serving

Recap from S30:

  • Tensor parallel (TP) — splits each layer's weights across GPUs in one node. Low latency, but needs a high-bandwidth interconnect (NVLink). Default for inference.
  • Pipeline parallel (PP) — consecutive layers placed on different GPUs or nodes. Adds latency per pipeline stage; tolerates a weaker interconnect.
  • Expert parallel (EP) — MoE; experts across GPUs.

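For 70B on 4× A100: TP=4. For 405B on 8× H100: TP=8 (or TP=4 PP=2); at that size the weights need FP8 or lower, since 405B in fp16 is ~810 GB against 640 GB of HBM on 8× 80 GB GPUs.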
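
A quick way to sanity-check a parallelism plan is to divide the weight bytes by the number of GPUs and compare with what each GPU can hold. The sketch below assumes 80 GB GPUs and a 0.90 usable fraction, and it only looks at weights: KV cache and activations need room on top, so a "fits" here is a necessary condition, not a sufficient one.

```python
# Sketch: do the weights alone fit for a given TP x PP layout?
# Assumes weights shard evenly across tp * pp GPUs; sizes are illustrative.

def weights_per_gpu_gb(params_billion, bits, tp=1, pp=1):
    """Weight GB (1e9 bytes) held by each GPU."""
    return params_billion * bits / 8 / (tp * pp)


def weights_fit(params_billion, bits, gpu_gb=80, tp=1, pp=1, util=0.90):
    """True if the per-GPU weight shard is below the usable memory."""
    return weights_per_gpu_gb(params_billion, bits, tp, pp) < gpu_gb * util


plans = [
    ("70B  fp16 TP=4", 70, 16, 4, 1),
    ("405B fp16 TP=8", 405, 16, 8, 1),
    ("405B fp8  TP=8", 405, 8, 8, 1),
    ("405B fp8  TP=4 PP=2", 405, 8, 4, 2),
]
for name, params, bits, tp, pp in plans:
    shard = weights_per_gpu_gb(params, bits, tp, pp)
    print(f"{name}: {shard:.1f} GB per GPU, fits={weights_fit(params, bits, tp=tp, pp=pp)}")
```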

10. Throughput tuning checklist

  • max_num_batched_tokens — per-step token budget; bounds the chunked-prefill chunk size.
  • max_num_seqs — max in-flight requests.
  • gpu_memory_utilization — fraction of GPU reserved (0.85–0.92 typical).
  • kv_cache_dtype — fp16 vs fp8 (fp8 doubles KV capacity).
  • Quantised weights (int8/int4).
  • Prefix caching on.
  • Speculative decoding on if you have a good draft.

The levers interact. Tune one at a time, measure, retune. A spreadsheet of (config, throughput, p50, p99) is your friend.
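
The one-at-a-time discipline is easy to script. The sketch below generates configurations that each differ from a baseline in exactly one knob and turns each into a `vllm serve` command line. The model name is a placeholder, the baseline values are arbitrary, and flag names should be checked against `vllm serve --help` for your installed version, since they can change between releases.

```python
# Sketch: one-knob-at-a-time sweep over vLLM serving settings.
# Model name and values are placeholders; verify flags against your vLLM version.

def vllm_serve_args(model, cfg):
    """Turn a config dict into a `vllm serve` argument list."""
    args = ["vllm", "serve", model]
    for key, value in cfg.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            args.append(flag)                  # boolean switch, no value
        elif value is not False and value is not None:
            args += [flag, str(value)]
    return args


def one_at_a_time(baseline, grid):
    """Yield configs that differ from the baseline in exactly one knob."""
    for knob, values in grid.items():
        for value in values:
            if value != baseline[knob]:
                yield {**baseline, knob: value}


baseline = {
    "max_num_batched_tokens": 8192,
    "max_num_seqs": 256,
    "gpu_memory_utilization": 0.90,
    "kv_cache_dtype": "auto",
    "enable_prefix_caching": True,
}
grid = {
    "max_num_seqs": [128, 256, 512],
    "kv_cache_dtype": ["auto", "fp8"],
}
for cfg in one_at_a_time(baseline, grid):
    print(" ".join(vllm_serve_args("my-org/my-model", cfg)))
```

Run the same load test against each command and record throughput, p50 and p99 next to the config that produced them.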

11. Observability for inference

  • Token-level metrics — input tokens, output tokens, accepted-speculative-tokens.
  • Phase timings — prefill ms, decode ms/token.
  • Queue depth — pending requests.
  • KV cache occupancy — fraction full; alert at 90%.
  • Prefix-cache hit rate — should be 30–80% for chat workloads.

Standard exporters (Prometheus) ship with vLLM/TGI. Wire to Grafana, set SLO alerts.
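
Most of the numbers above are ratios of raw counters, so the alert logic is small. The sketch below uses hypothetical counter names invented for this post; they are not the metric names that vLLM or TGI export, so map them to whatever your exporter actually provides.

```python
# Sketch: derive the ratios above from raw counters and apply the thresholds.
# Counter names are hypothetical, not real vLLM/TGI metric names.

def derived_metrics(c):
    return {
        "kv_occupancy": c["kv_blocks_used"] / c["kv_blocks_total"],
        "prefix_hit_rate": c["prefix_hit_tokens"] / c["prefix_query_tokens"],
        "spec_acceptance_rate": c["spec_accepted_tokens"] / c["spec_draft_tokens"],
        "decode_ms_per_token": c["decode_ms"] / c["output_tokens"],
    }


def alerts(m, kv_threshold=0.90, min_hit_rate=0.30):
    """Return human-readable alerts for the thresholds named in the list above."""
    fired = []
    if m["kv_occupancy"] >= kv_threshold:
        fired.append("KV cache nearly full")
    if m["prefix_hit_rate"] < min_hit_rate:
        fired.append("prefix-cache hit rate below expected range")
    return fired


counters = {
    "kv_blocks_used": 900, "kv_blocks_total": 1000,
    "prefix_hit_tokens": 20, "prefix_query_tokens": 100,
    "spec_accepted_tokens": 300, "spec_draft_tokens": 500,
    "decode_ms": 50_000, "output_tokens": 2_000,
}
metrics = derived_metrics(counters)
print(metrics)
print(alerts(metrics))
```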

12. Reality check

Cost-minimised stack circa 2026:

  • vLLM with int4 AWQ weights + fp8 KV cache.
  • Prefix caching enabled.
  • Speculative decoding via Medusa or EAGLE heads.
  • Multi-LoRA for tenant isolation.
  • TP=4 on a node of 4× A100 or H100.
  • Chunked prefill for smooth tail latency.

Such a setup can serve roughly 4–8× more tokens/sec than naive fp16 vLLM: same hardware, same SLOs, ~5× cheaper per token. The math is obvious; the engineering is real work.
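
The per-token math fits in two functions. The GPU price and throughput figures below are made-up inputs for illustration, not benchmarks; substitute your own hourly rate and measured tokens/sec.

```python
# Sketch: cost per million tokens from GPU price and sustained throughput.
# Prices and throughputs are hypothetical inputs, not measurements.

def cost_per_million_tokens(gpu_hourly_usd, n_gpus, tokens_per_sec):
    """USD per 1M generated tokens for a node running at a sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1e6


def required_tokens_per_sec(gpu_hourly_usd, n_gpus, target_usd_per_million):
    """Sustained throughput needed to hit a target price per 1M tokens."""
    return gpu_hourly_usd * n_gpus / target_usd_per_million * 1e6 / 3600


# Assumed: 4 GPUs at $2/hour each; naive stack at 1,000 tok/s, tuned at 5,000 tok/s.
naive = cost_per_million_tokens(2.0, 4, 1_000)
tuned = cost_per_million_tokens(2.0, 4, 5_000)
print(f"naive: ${naive:.2f}/M tokens, tuned: ${tuned:.2f}/M tokens")
print(f"needed for $0.50/M: {required_tokens_per_sec(2.0, 4, 0.50):.0f} tok/s")
```

Since the hardware bill is fixed, cost per token is inversely proportional to throughput: every multiple gained from the levers above shows up directly as the same multiple off the price.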

Video reference

▶︎ Speculative Decoding & Quantisation Explained (Mark Saroufim)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Longest Palindromic Substring

Post-session checklist

By the end of this session you should be able to:

  • Pick the right quantisation method (int8/int4/FP8/AWQ/GPTQ) for a given GPU and workload.
  • Explain why decoding is memory-bound and how speculative decoding exploits it.
  • Configure prefix caching and predict the throughput impact for a chat workload.
  • Stand up multi-LoRA serving with vLLM/TGI.
  • Tune max_num_batched_tokens, gpu_memory_utilization, and KV-cache dtype.
  • Solve longest-palindromic-substring, the session's LeetCode exercise.
