ai ml · advanced · 12m · 2026-06-19

Day 21 — LLM Serving — vLLM, Continuous Batching, KV Cache, Speculative Decoding

Inference cost and latency are the dominant operational concerns for any LLM product. vLLM-style continuous batching gives 5-20× throughput; speculative decodin…

LLM inference economics are dominated by two things: how many tokens/sec/$ you generate and how fast the first token arrives. Modern serving stacks attack both with batching, KV-cache management, and speculative decoding.

🧠 Concept

Why it matters & the mental model.

1. The autoregressive constraint

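Generation is sequential: token N+1 depends on token N. Each decode step is a full forward pass that produces only one token per request, yet it has to stream the model weights and the request's KV cache from GPU memory → memory-bandwidth bound, not compute bound. The weights are read once per step no matter how many requests share it, so adding concurrent requests to the same forward pass is nearly free until KV-cache reads or compute become the limit → batch!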
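A back-of-the-envelope sketch of why batching is nearly free in the bandwidth-bound regime. It assumes a decode step costs exactly the time needed to read the weights once plus each sequence's KV cache once, and ignores compute, kernel overhead, and scheduling. The function names and every number in the usage lines are illustrative assumptions, not measurements.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_per_seq: float,
                        batch_size: int, mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if it is purely memory-bandwidth bound.

    Weights are read once per step regardless of batch size;
    each sequence's KV cache is read once per step.
    """
    bytes_moved = weight_bytes + batch_size * kv_bytes_per_seq
    return bytes_moved / mem_bandwidth_bytes_per_s


def tokens_per_second(weight_bytes: float, kv_bytes_per_seq: float,
                      batch_size: int, mem_bandwidth_bytes_per_s: float) -> float:
    """Aggregate throughput: one new token per sequence per step."""
    step = decode_step_seconds(weight_bytes, kv_bytes_per_seq,
                               batch_size, mem_bandwidth_bytes_per_s)
    return batch_size / step


if __name__ == "__main__":
    # Hypothetical setup: 14 GB of FP16 weights, 0.5 GB of KV per sequence,
    # 2 TB/s of memory bandwidth.
    for batch in (1, 16):
        print(batch, round(tokens_per_second(14e9, 0.5e9, batch, 2e12), 1))
```

Under these assumptions, going from 1 to 16 sequences raises the bytes moved per step by well under 2× while producing 16× the tokens, which is the whole case for batching.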

2. Request-level vs continuous batching

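  • Request-level (static): collect B requests, pad to max length, run together. Throughput-killer: short requests wait for the longest one in the batch, and their slots sit idle until it finishes. The GPU can be idle 30-70% of the time, depending on how much output lengths vary.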
  • Continuous (iteration-level): at every decoding step, schedule whatever requests have un-generated tokens. New requests join immediately; finished requests leave the slot.

Static wastes the gap after short requests finish; continuous reuses those slots immediately.
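The difference can be seen in a toy simulation that counts decode steps. This is a minimal sketch that assumes every request needs a known number of output tokens, ignores prefill, and treats a "step" as one forward pass over whatever is in the batch; the function names are made up for illustration and this is not how any particular scheduler is implemented.

```python
def static_batching_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Continuous batching: refill free slots from the queue before every step."""
    queue = list(lengths)
    running = []  # remaining output tokens for each active request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots.
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        # One decode step: every active request emits one token.
        running = [r - 1 for r in running]
        # Finished requests leave their slot immediately.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 10, 3, 9]  # output tokens per request, in arrival order
    print(static_batching_steps(lengths, batch_size=2))      # 19
    print(continuous_batching_steps(lengths, batch_size=2))  # 14
```

With two slots and 24 tokens in total, 12 steps is the floor; the continuous schedule gets close to it because the slot freed by the 2-token request is reused on the very next step.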

3. PagedAttention (vLLM)

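The KV cache is huge and per-request. Naive allocation reserves one contiguous region per request, sized for the maximum length → fragmentation → "I have 30 GB free but can't fit a 1k-token cache". vLLM stores KV in fixed-size pages (16 tokens each by default) and maps each sequence to its pages through a block table, like OS paging with virtual memory. Result: waste is limited to the last, partially filled page of each sequence, so far more requests fit in the same memory. The vLLM authors report 2-4× higher throughput than earlier serving systems at similar latency.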
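A toy allocator makes the paging idea concrete. This is a sketch of the bookkeeping only, under the assumption that a "block" is just an integer id; the class and method names are invented for illustration and are not vLLM's API, and no actual tensors are stored.

```python
class PagedKVAllocator:
    """Toy block-table allocator: fixed-size pages handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        n = self.seq_lens.get(seq_id, 0)
        # A new page is needed only when the last one is full (or none exists).
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        # Pages go straight back to the pool; they need not be contiguous.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Token slots allocated but unused (only in each last page)."""
        return sum(len(table) * self.block_size - self.seq_lens[seq]
                   for seq, table in self.block_tables.items())
```

Because any free page can serve any sequence, the only unused memory is the tail of each sequence's last page, at most `block_size - 1` token slots per request.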

🛠 Deep Dive

Internals, code, architecture.

4. Quantisation

  • Weights: FP16 → INT8 → INT4. AWQ, GPTQ, bitsandbytes. INT4 ≈ 4× memory cut versus FP16 (slightly less once scales and zero-points are stored), typically < 1 pt MMLU loss on 7B+ models.
  • KV cache: FP8 / INT8 → smaller cache → more concurrency.
  • Activations: trickier, usually FP8 with FP8 GEMM on H100 / MI300.
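The memory arithmetic behind these bullets can be sketched in a few lines. The formulas are the standard ones (parameters × bits for weights; two vectors, K and V, per layer per token for the cache); the model shape in the usage lines is a hypothetical 7B-class configuration chosen for illustration, and the weight figure ignores the small overhead of quantisation scales.

```python
def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Storage for the weights alone, ignoring scale/zero-point overhead."""
    return num_params * bits_per_weight / 8


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bits_per_value: int) -> float:
    """KV-cache bytes added by one token of one sequence."""
    # Factor 2: one K vector and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bits_per_value / 8


if __name__ == "__main__":
    # Hypothetical 7B-parameter model: 32 layers, 32 KV heads, head_dim 128.
    for bits in (16, 8, 4):
        print(f"weights @ {bits}-bit: {weight_bytes(7e9, bits) / 1e9:.1f} GB")
    for bits in (16, 8):
        per_tok = kv_bytes_per_token(32, 32, 128, bits)
        print(f"KV @ {bits}-bit: {per_tok / 1024:.0f} KiB per token")
```

Halving the KV precision halves the bytes per token, which directly doubles how many tokens of context (or how many concurrent sequences) fit in the same cache budget.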

5. Speculative decoding

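Idea: a small "draft" model proposes K tokens cheaply; the big "verifier" (target) model scores all K positions in one forward pass. The longest accepted prefix is emitted; at the first rejected position the target model supplies its own token, and the rest of the draft is discarded.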

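  • Typically 2-3× wall-clock speedup, depending on how often the draft is accepted. With the standard rejection-sampling acceptance rule, the output distribution is identical to the target model's.
  • Variants: EAGLE (drafts from the big model's hidden states), lookahead decoding (no draft model; candidates come from an n-gram pool), Medusa (multiple decoding heads).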
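The propose-then-verify loop can be sketched with toy "models" that are just functions from a context to the next token. Note the assumptions: this is the greedy variant (a draft token is accepted only if it equals the target's argmax), not the rejection-sampling rule used for sampled decoding, and the target is called in a Python loop where a real engine would score all K positions in one batched forward pass. All names are hypothetical.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One round of greedy speculative decoding; returns newly emitted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    # 2. The target checks each position (one batched pass in a real engine).
    emitted: List[int] = []
    for i in range(k):
        expected = target(context + proposal[:i])
        if proposal[i] != expected:
            # First mismatch: emit the target's token, drop the rest of the draft.
            emitted.append(expected)
            return emitted
        emitted.append(proposal[i])

    # 3. Everything accepted: the verification pass also yields a bonus token.
    emitted.append(target(context + proposal))
    return emitted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], num_tokens: int, k: int = 4) -> List[int]:
    """Generate num_tokens tokens after the prompt using speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + num_tokens]
```

Every emitted token is one the target itself would have produced greedily, so the output matches plain greedy decoding; the gain is that each target pass yields between 1 and K+1 tokens instead of exactly one.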

6. Prefix / prompt caching

For agents and chat apps, the system prompt + few-shot examples repeat across requests. Cache the K/V for that shared prefix so prefill only runs on the new suffix. Anthropic / OpenAI expose this in their APIs, with discounts of up to ~90% on cached input tokens depending on provider and model.
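A minimal sketch of block-granular prefix matching, assuming tokens are plain integers and the cached K/V is represented by a placeholder string. The class is hypothetical and deliberately naive: it keys each block by the full token prefix up to that block boundary, where a production engine would typically chain hashes instead.

```python
class PrefixCache:
    """Toy prefix cache keyed by block-aligned token prefixes."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.blocks = {}  # tuple(tokens[:end]) -> placeholder for cached K/V

    def match(self, tokens):
        """Return how many leading tokens already have cached K/V."""
        hit = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) not in self.blocks:
                break
            hit = end
        return hit

    def insert(self, tokens):
        """Record every full block of this request's prompt."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            key = tuple(tokens[:end])
            self.blocks.setdefault(key, f"kv[{end - self.block_size}:{end}]")


if __name__ == "__main__":
    cache = PrefixCache(block_size=4)
    system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
    first = system_prompt + [9, 10, 11]
    second = system_prompt + [20, 21, 22, 23, 24]
    cache.insert(first)
    reused = cache.match(second)
    print(f"reuse {reused} tokens, prefill {len(second) - reused}")
```

Keying on the whole prefix (not just the block's own tokens) matters: K/V for a token depends on everything before it, so a block is only reusable when the entire preceding context is identical.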

7. Multi-GPU strategies

  • Tensor parallelism: shard each layer across GPUs (Megatron). Fast within a node (NVLink).
  • Pipeline parallelism: split the layers into stages across GPUs or nodes. Pipeline bubbles (idle stages) appear unless enough micro-batches are kept in flight.
  • Expert parallelism: MoE experts across nodes.
  • Sequence parallelism: shard along sequence axis (used in long-context).
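Tensor parallelism is easiest to see on a single matrix multiply. The sketch below splits the weight matrix by columns, computes each shard's partial output independently (each loop iteration stands in for one GPU), and concatenates the results, which is what an all-gather does. It uses plain Python lists so it runs anywhere; the function names are illustrative, and real implementations pair column-split and row-split layers to reduce communication.

```python
def matmul(x, w):
    """x: [m][k], w: [k][n] -> [m][n], using plain Python lists."""
    return [[sum(x[i][p] * w[p][j] for p in range(len(w)))
             for j in range(len(w[0]))]
            for i in range(len(x))]


def column_parallel_matmul(x, w, num_shards):
    """Shard w by columns, multiply per shard, then concatenate the outputs."""
    n = len(w[0])
    assert n % num_shards == 0, "columns must divide evenly across shards"
    width = n // num_shards
    partials = []
    for s in range(num_shards):  # each iteration stands in for one GPU
        w_shard = [row[s * width:(s + 1) * width] for row in w]
        partials.append(matmul(x, w_shard))
    # "All-gather": stitch the per-shard column blocks back together.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    print(column_parallel_matmul(x, w, num_shards=2) == matmul(x, w))  # True
```

Each shard holds only 1/num_shards of the weights, which is why tensor parallelism cuts per-GPU memory, at the price of a collective communication step per layer.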

🚀 In Practice

Trade-offs, exercises, what to ship today.

8. Hardware reality

  • A100 80GB: workhorse. 312 TFLOPS BF16.
  • H100 / H200: FP8, transformer engine, much better $/token.
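  • B200 / GB200: 2025 deployments. NVIDIA's headline claim is up to 30× faster LLM inference for a GB200 NVL72 rack versus the same number of H100s; treat it as a best-case vendor figure.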
  • AMD MI300X: huge HBM (192 GB), gaining mindshare via ROCm + vLLM.
  • Inferentia / TPU v5e: cheap-ish for compatible models.

9. Latency budget

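  • TTFT (time to first token): dominated by queueing plus prefill compute, which grows with prompt length. Cache the prefix or shorten the prompt.
  • TPOT (time per output token): decoding-bound, ~30-100 ms depending on model size and load. Speculative decoding cuts it per request; continuous batching keeps throughput high while many requests decode together.
  • End-to-end: TTFT + N × TPOT for N output tokens.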
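The budget formula is simple enough to turn into a planning helper. This sketch uses the approximation from the list above (TTFT + N × TPOT) and inverts it to ask what TPOT a latency target allows; the function names and the numbers in the usage lines are illustrative assumptions.

```python
def end_to_end_latency_ms(ttft_ms: float, tpot_ms: float,
                          output_tokens: int) -> float:
    """Approximate request latency: TTFT + N × TPOT."""
    return ttft_ms + output_tokens * tpot_ms


def tpot_budget_ms(target_ms: float, ttft_ms: float,
                   output_tokens: int) -> float:
    """Largest TPOT that still meets an end-to-end latency target."""
    return (target_ms - ttft_ms) / output_tokens


if __name__ == "__main__":
    # Hypothetical request: 400 ms TTFT, 40 ms TPOT, 250 output tokens.
    print(end_to_end_latency_ms(400, 40, 250))  # 10400 ms in total
    # To answer in 5 s instead, each token has to arrive much faster.
    print(tpot_budget_ms(5000, 400, 250))
```

For long outputs the N × TPOT term dominates, so decode-side optimisations matter most; for short answers to long prompts, TTFT is the term to attack.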

10. Picking a stack

  • vLLM: open source default. Fast, broad model support, continuous batching, PagedAttention.
  • TGI (Hugging Face): similar features, easy with HF ecosystem.
  • TensorRT-LLM: NVIDIA-optimised, fastest on Hopper if you can build engines.
  • SGLang: best at structured generation + cache reuse for agents.
  • Hosted (Anthropic / OpenAI / Together / Fireworks): zero ops, predictable cost.

11. What to take away

"Why is your inference fast?" Strong answers: continuous batching + paged KV + quantised weights + prefix cache + speculative decode. Bonus: break the latency budget down into TTFT and TPOT.
