ai ml | advanced | 12m | 2026-06-19

Day 21 — LLM Serving — vLLM, Continuous Batching, KV Cache, Speculative Decoding

Inference cost and latency are the dominant operational concerns for any LLM product. vLLM-style continuous batching gives 5-20× throughput; speculative decodin…

LLM inference economics are dominated by two things: how many tokens/sec/$ you generate and how fast the first token arrives. Modern serving stacks attack both with batching, KV-cache management, and speculative decoding.

🧠 Concept

Why it matters & the mental model.

1. The autoregressive constraint

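Generation is sequential: token N+1 needs token N's output. Each decode step is a forward pass that does little arithmetic per byte moved: it re-reads the model weights and the KV cache from GPU memory, so it is memory-bandwidth bound, not compute bound. The weights are read once per step no matter how many requests share it, so adding concurrent requests to one forward pass is nearly free → batch!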
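The bandwidth argument can be made concrete with back-of-envelope arithmetic. The sketch below assumes a decode step is limited purely by bytes read from memory; the model size, per-request KV size, and bandwidth figure are illustrative assumptions, not measurements.

```python
def decode_step_ms(weight_bytes: float, kv_bytes_per_request: float,
                   batch_size: int, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if it is purely memory-bandwidth bound."""
    # Weights are read once per step; each request adds only its own KV cache.
    bytes_moved = weight_bytes + batch_size * kv_bytes_per_request
    return 1000.0 * bytes_moved / bandwidth_bytes_per_s


# Illustrative assumptions: 7B parameters in FP16 (14 GB of weights),
# 0.5 GB of KV cache per request, 2 TB/s of memory bandwidth.
WEIGHTS = 14e9
KV_PER_REQUEST = 0.5e9
BANDWIDTH = 2e12

for batch in (1, 8, 32):
    step = decode_step_ms(WEIGHTS, KV_PER_REQUEST, batch, BANDWIDTH)
    tokens_per_s = batch * 1000.0 / step
    print(f"batch={batch:2d}  step={step:5.2f} ms  throughput={tokens_per_s:7.1f} tok/s")
```

Under these assumptions the step time only doubles between batch 1 and batch 32 while throughput grows roughly fifteenfold, which is the "nearly for free" effect in numbers.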

2. Request-level vs continuous batching

Request-level (static): collect B requests, pad to max length, run together. Throughput-killer: short requests wait for the longest. Idle GPU 30-70% of the time.
Continuous (iteration-level): at every decoding step, schedule whatever requests have un-generated tokens. New requests join immediately; finished requests leave the slot.

Static wastes the gap after short requests finish; continuous reuses those slots immediately.
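A toy simulation makes the difference visible. It is a sketch under simplifying assumptions: every decode step costs the same regardless of batch occupancy, prefill is ignored, and the function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batch_steps(lengths, slots):
    """Static batching: each group of `slots` requests runs until its longest finishes."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batch_steps(lengths, slots):
    """Continuous batching: free slots are refilled from the queue at every step."""
    queue = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before this step.
        while queue and len(running) < slots:
            running.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


# Hypothetical workload: output lengths in tokens, mixing long and short requests.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print("static:    ", static_batch_steps(lengths, slots=4), "steps")
print("continuous:", continuous_batch_steps(lengths, slots=4), "steps")
```

With this workload the static scheduler needs 200 steps and the continuous one 105, because the second long request starts as soon as the first short requests free their slots.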

3. PagedAttention (vLLM)

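The KV cache is huge and per-request. Naive contiguous allocation reserves space for the maximum sequence length up front and fragments memory: "I have 30 GB free but can't fit a 1k-token cache". vLLM instead stores KV in fixed-size pages (16 tokens each by default) mapped through a per-request block table, like OS virtual-memory paging. Result: external fragmentation disappears and internal waste is capped at the last partially filled page of each sequence, so up to about 4× more concurrent requests fit in the same memory.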
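The paging idea can be sketched as a small allocator that hands out pages on demand and tracks a block table per request. This is a minimal illustration, not vLLM's implementation; the class and method names are hypothetical, and the pages hold no real tensors.

```python
class PagedKVAllocator:
    """Toy page allocator: maps each request to a list of physical page ids."""

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.block_tables = {}  # request id -> list of physical page ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.setdefault(request_id, 0)
        # Allocate a new page only when the request has none or its last page is full.
        if length == len(table) * self.page_size:
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        # Pages go straight back to the pool and can serve any other request.
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

    def wasted_slots(self):
        """Internal fragmentation: unused token slots in partially filled pages."""
        return sum(len(table) * self.page_size - self.lengths[rid]
                   for rid, table in self.block_tables.items())


alloc = PagedKVAllocator(num_pages=8)
for _ in range(20):
    alloc.append_token("a")   # 20 tokens -> 2 pages
for _ in range(3):
    alloc.append_token("b")   # 3 tokens -> 1 page
print(alloc.block_tables, "free:", len(alloc.free_pages), "wasted:", alloc.wasted_slots())
alloc.free("a")
print("free after releasing a:", len(alloc.free_pages))
```

Waste is bounded by one page per request, and the pages of a finished request are reusable immediately without any compaction.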

🛠 Deep Dive

Internals, code, architecture.

4. Quantisation

Weights: FP16 → INT8 → INT4, via AWQ, GPTQ, or bitsandbytes. INT4 is roughly a 4× memory cut versus FP16, typically with < 1 pt MMLU loss on 7B+ models.
KV cache: FP8 / INT8 → smaller cache → more concurrency.
Activations: trickier, usually FP8 with FP8 GEMM on H100 / MI300.
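The memory arithmetic behind these bullets, plus the simplest possible quantiser, can be sketched as follows. The INT8 routine is plain symmetric round-to-nearest with one scale per tensor, which is an assumption chosen for brevity; it is not AWQ or GPTQ, and real INT4 formats also store per-group scales that add a little overhead. The layer and head counts are an illustrative 7B-class shape.

```python
def weight_gb(num_params: float, bits: int) -> float:
    """Weight memory in GB, ignoring scales and zero-points."""
    return num_params * bits / 8 / 1e9


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """KV cache bytes for one token: a K and a V vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def quantize_int8(values):
    """Symmetric per-tensor INT8: map the largest magnitude to 127."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # fall back to 1.0 if all zeros
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [x * scale for x in q]


print("7B weights FP16:", weight_gb(7e9, 16), "GB")
print("7B weights INT4:", weight_gb(7e9, 4), "GB")

# Assumed shape: 32 layers, 32 KV heads, head_dim 128.
print("KV per token FP16:", kv_bytes_per_token(32, 32, 128, 2), "bytes")
print("KV per token FP8: ", kv_bytes_per_token(32, 32, 128, 1), "bytes")

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
print(q, [round(v, 4) for v in dequantize(q, scale)])
```

Halving the bytes per KV value halves the cache per token, which is where the extra concurrency comes from.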

5. Speculative decoding

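Idea: a small "draft" model proposes K tokens cheaply; the big "verifier" model scores all K positions in one forward pass. The longest accepted prefix is emitted; at the first rejection the verifier supplies its own token and the rest of the draft is discarded.

Typically 2-3× wall-clock speedup; with the standard rejection-sampling acceptance rule the output distribution is identical to the verifier's.
Variants: EAGLE (drafts from the big model's hidden states), Lookahead decoding (no draft model, n-gram lookup), Medusa (multiple decoding heads).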
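The accept/reject step can be sketched with toy probability tables standing in for the two models. This follows the commonly described rejection-sampling rule (accept a draft token with probability min(1, p/q), otherwise resample from the normalised positive part of p − q); the function names and the three-token vocabulary are assumptions for illustration.

```python
import random


def residual_sample(p, q, rng):
    """Sample from max(0, p - q), renormalised; used after a rejection."""
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Return the tokens emitted for one speculative step.

    draft_probs[i] and target_probs[i] are distributions over the vocabulary at
    draft position i; target_probs has one extra entry for the bonus token.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # q[tok] > 0 because the draft model sampled tok from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
        else:
            # First rejection: take the verifier's correction and drop the rest.
            emitted.append(residual_sample(p, q, rng))
            return emitted
    # Every draft token accepted: the verifier's pass yields one more token for free.
    bonus = target_probs[-1]
    emitted.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return emitted


rng = random.Random(0)
agree = [[0.2, 0.3, 0.5]] * 3
print(verify_draft([2, 1], agree[:2], agree, rng))  # draft and verifier agree exactly

disagree_q = [[0.0, 1.0, 0.0]]
disagree_p = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
print(verify_draft([1], disagree_q, disagree_p, rng))  # draft token has zero target probability
```

When the two models agree, every draft token is kept and one step yields K + 1 tokens; when the verifier assigns zero probability to a draft token, it is always replaced.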

6. Prefix / prompt caching

For agents and chat apps, the system prompt + few-shot examples repeat across requests. Cache the K/V for that prefix so prefill only runs on the new suffix. Anthropic and OpenAI expose this in their APIs, with cached input tokens billed at a discount (up to ~90%, depending on provider and model).
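One way to find a reusable prefix is to hash the prompt in fixed-size token blocks, chaining each block's hash with its parent's so a block only matches when everything before it matches too. The sketch below is loosely in the spirit of block-hash prefix caches; the class name, block size, and placeholder values are assumptions, and no real K/V tensors are stored.

```python
import hashlib


class PrefixCache:
    """Toy prefix cache keyed by chained hashes of token blocks."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.blocks = {}  # chained block hash -> placeholder for cached K/V

    def _block_keys(self, tokens):
        keys, parent = [], b""
        full = len(tokens) - len(tokens) % self.block_size  # only full blocks are cached
        for start in range(0, full, self.block_size):
            block = tokens[start:start + self.block_size]
            # Chaining with the parent hash ties a block to its entire prefix.
            parent = hashlib.sha256(parent + repr(block).encode()).digest()
            keys.append(parent)
        return keys

    def lookup_and_insert(self, tokens):
        """Return how many prompt tokens could reuse cached K/V, then cache the rest."""
        hit_blocks = 0
        for key in self._block_keys(tokens):
            if key in self.blocks:
                hit_blocks += 1  # with chained keys and no eviction, hits form a prefix
            else:
                self.blocks[key] = "kv-placeholder"
        return hit_blocks * self.block_size


cache = PrefixCache()
system_prompt = list(range(40))            # hypothetical shared 40-token prefix
request_1 = system_prompt + [100, 101]
request_2 = system_prompt + [200, 201, 202]
print("reused tokens, request 1:", cache.lookup_and_insert(request_1))
print("reused tokens, request 2:", cache.lookup_and_insert(request_2))
```

The first request populates the cache; the second reuses the two full blocks (32 tokens) of the shared prefix and only needs prefill for the remainder.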

7. Multi-GPU strategies

Tensor parallelism: shard each layer across GPUs (Megatron). Fast within a node (NVLink).
Pipeline parallelism: layers across nodes. Bubbles in pipeline if not careful.
Expert parallelism: MoE experts across nodes.
Sequence parallelism: shard along sequence axis (used in long-context).
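Tensor parallelism is the easiest of these to show in a few lines. The sketch below splits a linear layer's weight matrix by columns across simulated "devices" and concatenates the partial outputs, which stands in for an all-gather; it is pure Python for clarity, with hypothetical function names, and does not model communication cost.

```python
def matmul(x, w):
    """Plain matrix multiply on lists of lists: (n x k) @ (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_parallel_linear(x, w, num_shards):
    """Split w by columns, multiply on each shard, then concatenate the outputs."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    partial_outputs = []
    for s in range(num_shards):
        # Each simulated device holds only its slice of the weight matrix.
        w_shard = [row[s * width:(s + 1) * width] for row in w]
        partial_outputs.append(matmul(x, w_shard))
    # Stand-in for an all-gather: stitch the column slices back together.
    result = []
    for i in range(len(x)):
        row = []
        for out in partial_outputs:
            row.extend(out[i])
        result.append(row)
    return result


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
print(matmul(x, w))
print(column_parallel_linear(x, w, num_shards=2))
```

Both calls print the same matrix: each shard stores half the weights and does half the arithmetic, at the price of one collective per layer, which is why fast intra-node links matter.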

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Hardware reality

A100 80GB: workhorse. 312 TFLOPS BF16.
H100 / H200: FP8, transformer engine, much better $/token.
B200 / GB200: 2025 deployments; NVIDIA's headline claim is up to 30× LLM inference performance for GB200 NVL72 versus the same number of H100s.
AMD MI300X: huge HBM (192 GB), gaining mindshare via ROCm + vLLM.
Inferentia / TPU v5e: cheap-ish for compatible models.

9. Latency budget

TTFT (time to first token): dominated by prompt-length × prefill compute. Cache prefix or shorten prompt.
TPOT (time per output token): decoding-bound, ~30-100 ms depending on model size. Continuous batching and speculative decoding are the levers here.
End-to-end: TTFT + N × TPOT, where N is the number of tokens generated after the first.
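The end-to-end formula is easy to turn into a budget calculator. The numbers below are illustrative assumptions, with TPOT picked from the 30-100 ms range quoted above.

```python
def end_to_end_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency as TTFT + N x TPOT.

    Strictly, the first token is already covered by TTFT, so N counts the tokens
    after it; for long outputs the difference is negligible.
    """
    return ttft_ms + output_tokens * tpot_ms


# Assumed budget: 400 ms to first token, 40 ms per output token, 250 output tokens.
total = end_to_end_ms(ttft_ms=400, tpot_ms=40, output_tokens=250)
print(f"end-to-end: {total / 1000:.1f} s")

# Halving TPOT (for example via speculative decoding) versus halving TTFT:
print(f"half TPOT:  {end_to_end_ms(400, 20, 250) / 1000:.1f} s")
print(f"half TTFT:  {end_to_end_ms(200, 40, 250) / 1000:.1f} s")
```

For long generations the decode term dominates, so TPOT improvements move the total far more than TTFT improvements; for short answers the balance flips.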

10. Picking a stack

vLLM: open source default. Fast, broad model support, continuous batching, PagedAttention.
TGI (Hugging Face): similar features, easy with HF ecosystem.
TensorRT-LLM: NVIDIA-optimised, fastest on Hopper if you can build engines.
SGLang: best at structured generation + cache reuse for agents.
Hosted (Anthropic / OpenAI / Together / Fireworks): zero ops, predictable cost.

11. What to take away

"Why is your inference fast?" Strong answers: continuous batching + paged KV + quantised weights + prefix cache + speculative decode. Bonus: name the latency budget breakdown.

