Day 21 — LLM Serving — vLLM, Continuous Batching, KV Cache, Speculative Decoding
LLM inference economics are dominated by two things: how many tokens/sec/$ you generate and how fast the first token arrives. Modern serving stacks attack both with batching, KV-cache management, and speculative decoding.
🧠 Concept
Why it matters & the mental model.
1. The autoregressive constraint
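Generation is sequential: token N+1 depends on token N. Each decoding step is a forward pass that re-reads all the model weights plus the KV cache to produce one token per request, so at small batch sizes it is memory-bandwidth bound, not compute bound. The weights are read once per step however many requests share it, so adding concurrent requests to a forward pass is nearly free until compute or KV-cache memory becomes the limit. Hence: batch.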
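A back-of-the-envelope sketch of that claim, using a simple roofline estimate. The numbers are assumptions for illustration only (a hypothetical 7B-parameter FP16 model, 2 TB/s memory bandwidth, 312 TFLOPS), and the model ignores KV-cache reads and kernel overheads.

```python
def decode_step_ms(params: float, bytes_per_param: float, batch: int,
                   mem_bw_bytes_s: float, flops_s: float) -> dict:
    """Rough lower bound on the time of one decode step (roofline model)."""
    weight_bytes = params * bytes_per_param
    t_mem = weight_bytes / mem_bw_bytes_s      # weights are read once per step, whatever the batch
    t_compute = 2 * params * batch / flops_s   # ~2 FLOPs per parameter per generated token
    return {
        "mem_ms": t_mem * 1e3,
        "compute_ms": t_compute * 1e3,
        "step_ms": max(t_mem, t_compute) * 1e3,
    }

# Assumed, illustrative hardware and model numbers.
for batch in (1, 8, 64, 256):
    print(batch, decode_step_ms(7e9, 2, batch, 2e12, 312e12))
```

Under these assumptions the step time stays at the memory-bound floor until the batch reaches roughly 150 requests; only beyond that does compute start to dominate.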
2. Request-level vs continuous batching
- Request-level (static): collect B requests, pad to max length, run together. This kills throughput: short requests wait for the longest one, and between padding and waiting the GPU can sit idle 30-70% of the time.
- Continuous (iteration-level): at every decoding step, schedule whichever requests still have tokens to generate. New requests join immediately; finished requests free their slot.
Static wastes the gap after short requests finish; continuous reuses those slots immediately.
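A toy simulation of the difference, counting decoding steps only. It assumes a made-up workload of output lengths, a fixed number of batch slots, and ignores prefill entirely; it is a sketch of the scheduling idea, not of any real scheduler.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Refill free slots from the queue at every decoding step."""
    queue = deque(lengths)
    running = []   # remaining tokens for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        running = [r - 1 for r in running]        # one step: every active request emits a token
        running = [r for r in running if r > 0]   # finished requests free their slot
        steps += 1
    return steps

# Output tokens per request (hypothetical workload, all lengths > 0).
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))  # 200 105
```

With four slots, the static schedule pays for the 100-token request twice in sequence, while the continuous schedule overlaps the two long requests almost completely.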
3. PagedAttention (vLLM)
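The KV cache is large and grows per request. Naive allocation reserves one contiguous region per request, sized for the maximum length, which causes fragmentation: "I have 30 GB free but can't fit a 1k-token cache". vLLM instead splits KV into fixed-size pages (16 tokens each by default) mapped through a per-request block table, like OS virtual-memory paging. Result: no external fragmentation, internal waste limited to each sequence's last partly filled page, and room for several times more concurrent requests in the same memory (the vLLM authors report 2-4× higher throughput at similar latency).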
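The paging idea can be sketched as a small block-table allocator. This is a toy model under stated assumptions, not vLLM's implementation: the class and method names are hypothetical, and pages hold only bookkeeping, not tensors.

```python
class PagedKVAllocator:
    """Toy block table: maps each sequence to a list of fixed-size physical pages."""

    def __init__(self, num_pages: int, page_size: int = 16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))  # physical page ids
        self.block_tables = {}                    # seq_id -> list of physical page ids
        self.seq_lens = {}                        # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        n = self.seq_lens.get(seq_id, 0)
        if n % self.page_size == 0:               # no page yet, or the last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages; caller must queue or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_pages.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's pages to the free list."""
        self.free_pages.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Internal fragmentation: unused slots in each sequence's last page."""
        return sum(len(pages) * self.page_size - self.seq_lens[s]
                   for s, pages in self.block_tables.items())

alloc = PagedKVAllocator(num_pages=8)
for _ in range(20):
    alloc.append_token("a")   # 20 tokens -> 2 pages
for _ in range(3):
    alloc.append_token("b")   # 3 tokens -> 1 page
print(alloc.block_tables, alloc.wasted_slots())
```

Pages need not be contiguous, so any free page can serve any sequence; waste is bounded by one page per sequence.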
🛠 Deep Dive
Internals, code, architecture.
4. Quantisation
- Weights: FP16 → INT8 → INT4, via AWQ, GPTQ, or bitsandbytes. INT4 is a ≈ 4× memory cut versus FP16 (slightly less once per-group scales are counted), typically with < 1 pt MMLU loss on 7B+ models.
- KV cache: FP8 / INT8 → smaller cache → more concurrency.
- Activations: trickier, usually FP8 with FP8 GEMM on H100 / MI300.
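To see why a smaller KV dtype buys concurrency, here is a rough sizing sketch. The model shape (32 layers, 32 KV heads, head dimension 128, no grouped-query attention) and the 20 GiB cache budget are assumptions chosen for illustration.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # Factor 2: one key vector and one value vector per head, stored at every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

def max_cached_tokens(free_bytes: int, **shape) -> int:
    """How many tokens of KV cache fit in the given memory budget."""
    return free_bytes // kv_bytes_per_token(**shape)

shape = dict(layers=32, kv_heads=32, head_dim=128)  # assumed 7B-class shape
budget = 20 * 2**30                                  # assumed 20 GiB left for KV cache

for name, nbytes in (("fp16", 2), ("fp8/int8", 1)):
    per_token = kv_bytes_per_token(bytes_per_elem=nbytes, **shape)
    tokens = max_cached_tokens(budget, bytes_per_elem=nbytes, **shape)
    print(f"{name}: {per_token / 2**20} MiB/token, {tokens} tokens fit")
```

Halving the bytes per element doubles the number of tokens (and so roughly the number of concurrent requests) that fit in the same budget.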
5. Speculative decoding
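Idea: a small "draft" model proposes K tokens cheaply; the big "verifier" (target) model scores all K in one forward pass. The longest accepted prefix is emitted, and at the first rejection the verifier's own distribution supplies a replacement token, so every round yields at least one token.
- Typically 2-3× wall-clock speedup. With the standard rejection-sampling acceptance rule, the output distribution is identical to the big model's.
- Variants: EAGLE (drafts from the big model's hidden states), lookahead decoding (no draft model; candidate n-grams are generated and verified in parallel), Medusa (multiple extra decoding heads).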
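A minimal sketch of one round of the standard rejection-sampling acceptance rule, which is what keeps the output distribution unchanged. It is a toy under a strong simplifying assumption: the per-position distributions are passed in as fixed lists, whereas real models condition each distribution on the tokens drafted so far. Function names are hypothetical.

```python
import random

def sample(dist, rng):
    """Draw a token id from a list of (possibly unnormalised) weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def speculative_step(draft_dists, target_dists, rng):
    """One round of speculative sampling.

    draft_dists:  K distributions q_1..q_K from the draft model
    target_dists: K+1 distributions p_1..p_{K+1} from one verifier forward pass
    Returns the tokens emitted this round (between 1 and K+1 of them).
    """
    out = []
    for q, p in zip(draft_dists, target_dists):
        x = sample(q, rng)                          # draft proposal
        if rng.random() < min(1.0, p[x] / q[x]):    # accept with probability min(1, p/q)
            out.append(x)
            continue
        # Rejected: resample from the residual max(0, p - q) and stop this round.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    out.append(sample(target_dists[-1], rng))       # all K accepted: bonus token from p_{K+1}
    return out

rng = random.Random(0)
p = [0.2, 0.5, 0.3]   # toy verifier distribution over a 3-token vocabulary
q = [0.6, 0.3, 0.1]   # toy draft distribution
print(speculative_step([q] * 3, [p] * 4, rng))
```

The closer the draft distribution is to the verifier's, the more proposals are accepted per round, which is where the speedup comes from.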
6. Prefix / prompt caching
For agents and chat apps, the system prompt and few-shot examples repeat across requests. Cache the K/V for that shared prefix so prefill only runs on the new suffix. Anthropic and OpenAI expose this in their APIs, with cached input tokens discounted by up to roughly 90% depending on provider and model.
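One common way to find a reusable prefix is to hash the prompt in fixed-size token blocks, chaining the hash so each key identifies the entire prefix up to that block. The sketch below assumes a 16-token block and stores a placeholder where a real server would keep K/V tensors; all names are hypothetical.

```python
import hashlib

BLOCK = 16  # tokens per cached block (assumed)

def block_keys(tokens):
    """Chain-hash full blocks so each key identifies the whole prefix up to that block."""
    keys, h = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK   # only complete blocks are cacheable
    for i in range(0, full, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(h.hexdigest())
    return keys

class PrefixCache:
    def __init__(self):
        self.store = {}   # key -> placeholder for that block's K/V

    def lookup_and_fill(self, tokens):
        """Return how many leading tokens were served from cache, then cache the rest."""
        keys = block_keys(tokens)
        hit_blocks = 0
        for key in keys:
            if key not in self.store:
                break
            hit_blocks += 1
        for key in keys[hit_blocks:]:
            self.store[key] = "kv"   # a real server would store tensors here
        return hit_blocks * BLOCK

cache = PrefixCache()
system_prompt = list(range(40))                                  # stand-in token ids
print(cache.lookup_and_fill(system_prompt + [100, 101]))         # 0: cold cache
print(cache.lookup_and_fill(system_prompt + [200, 201, 202]))    # 32: two full blocks reused
```

Only whole blocks are shared, so the last few tokens of the common prefix are recomputed; that is the price of block-granular lookup.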
7. Multi-GPU strategies
- Tensor parallelism: shard each layer across GPUs (Megatron). Fast within a node (NVLink).
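- Pipeline parallelism: consecutive layers split into stages across GPUs or nodes. Pipeline bubbles (idle stages) appear unless enough micro-batches are in flight.
- Expert parallelism: MoE experts placed on different GPUs or nodes, with tokens routed to wherever their experts live.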
- Sequence parallelism: shard along sequence axis (used in long-context).
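As an illustration of the first strategy, the sketch below shards a weight matrix by output columns across simulated "devices" and checks that gathering the partial results reproduces the unsharded matmul. It uses plain Python lists and assumes the column count divides evenly; real systems do this with GPU collectives.

```python
def matmul(x, w):
    """x: [n][d], w: [d][m] -> [n][m], using plain Python lists."""
    return [[sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
            for row in x]

def column_shards(w, num_devices):
    """Split the weight matrix by output columns, one slice per device."""
    step = len(w[0]) // num_devices   # assumes the column count divides evenly
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(num_devices)]

def tensor_parallel_matmul(x, w, num_devices):
    # Each "device" multiplies the full input by its own column slice, independently.
    partials = [matmul(x, shard) for shard in column_shards(w, num_devices)]
    # All-gather: concatenate every device's output columns, row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(tensor_parallel_matmul(x, w, 2))   # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

The gather step is the communication cost, which is why this strategy wants a fast interconnect within the node.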
🚀 In Practice
Trade-offs, exercises, what to ship today.
8. Hardware reality
- A100 80GB: workhorse. 312 TFLOPS BF16.
- H100 / H200: FP8, transformer engine, much better $/token.
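- B200 / GB200: 2025 deployments. NVIDIA claims up to 30× faster LLM inference and up to 25× lower energy use for GB200 NVL72 versus the same number of H100s (vendor figures, and versus H100 rather than A100).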
- AMD MI300X: huge HBM (192 GB), gaining mindshare via ROCm + vLLM.
- Inferentia / TPU v5e: cheap-ish for compatible models.
9. Latency budget
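- TTFT (time to first token): dominated by prefill, whose compute scales with prompt length, plus any queueing delay. Cache the prefix or shorten the prompt.
- TPOT (time per output token): set by the decode step, ~30-100 ms depending on model size and load. Speculative decoding cuts it; continuous batching keeps throughput high at that latency.
- End-to-end: TTFT + (N - 1) × TPOT for N output tokens, which is ≈ TTFT + N × TPOT for long outputs.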
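The budget as a small calculator. It counts the first token inside TTFT, so N output tokens need N - 1 further decode steps; for long outputs this is practically the same as TTFT + N × TPOT. The example figures are made up.

```python
def end_to_end_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Total latency: first token at TTFT, then one TPOT per remaining token."""
    return ttft_ms + max(output_tokens - 1, 0) * tpot_ms

def decode_share(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Fraction of end-to-end latency spent after the first token."""
    total = end_to_end_ms(ttft_ms, tpot_ms, output_tokens)
    return (total - ttft_ms) / total

# Hypothetical request: 400 ms TTFT, 40 ms TPOT, 250 output tokens.
print(end_to_end_ms(400, 40, 250))            # 10360
print(round(decode_share(400, 40, 250), 3))
```

For long outputs almost all of the latency is decode time, so TPOT is the lever; for short outputs with long prompts, TTFT dominates instead.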
10. Picking a stack
- vLLM: open source default. Fast, broad model support, continuous batching, PagedAttention.
- TGI (Hugging Face): similar features, easy with HF ecosystem.
- TensorRT-LLM: NVIDIA-optimised, fastest on Hopper if you can build engines.
- SGLang: best at structured generation + cache reuse for agents.
- Hosted (Anthropic / OpenAI / Together / Fireworks): zero ops, predictable cost.
11. What to take away
"Why is your inference fast?" Strong answers: continuous batching + paged KV + quantised weights + prefix cache + speculative decode. Bonus: name the latency budget breakdown.
Resources
- 🎥 vLLM — Continuous Batching Explained
- 📖 vLLM blog — PagedAttention
- 📖 Anyscale — How continuous batching enables 23x throughput
- 📖 Speculative Decoding — Google Research blog
Practice Problem: Design In-Memory File System (Hard)