LLM Serving Part 1 — vLLM, KV Cache, Continuous Batching
Session 30 of the 48-session learning series.
Date: Fri, 2026-07-03 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 21 · Est. read: 2 h
Why this session matters
This is Session 30 of 48 in the LLM track. Pre-training gets the headlines; serving is what pays the bills. Understanding KV cache, continuous batching, and the memory layout of a transformer at inference time is what makes the difference between a small inference bill and a $5000 one for the same workload.
Agenda
- Why naive model.generate() is terrible for production
- The KV cache — what it stores, how big it gets, why it matters
- Continuous (in-flight) batching vs static batching
- PagedAttention — vLLM's killer idea
- Throughput vs latency — the eternal tradeoff
Pre-read (skim before the session)
- vLLM paper — Efficient Memory Management for LLM Serving with PagedAttention (2023)
- Orca — Continuous batching paper (OSDI 2022)
- HuggingFace — Text Generation Inference docs
- vLLM blog — first announcement
Deep dive
1. The inference workload — two distinct phases
[ PREFILL ]                   [ DECODE ]
input prompt (N tokens)       one token at a time
heavy compute, parallel       memory-bound, sequential
~100 ms for 1k tokens         ~30 ms per token
- Prefill: compute-bound. You process the entire prompt in parallel through every transformer layer. GPU's tensor cores are saturated.
- Decode: memory-bound. One token forward at a time; bottleneck is reading the KV cache and weights from HBM, not compute.
These are different optimisation problems. Serving systems optimise both.
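The memory-bound nature of decode can be checked with a back-of-envelope floor: each decode step has to stream every weight byte from HBM at least once. The sketch below is only an estimate under stated assumptions. The function name is hypothetical, the ~2 TB/s per-GPU bandwidth is an assumed round figure for an A100 80GB, and KV cache reads and inter-GPU communication are ignored.

```python
def decode_step_floor_ms(n_params: float, dtype_bytes: int,
                         hbm_bytes_per_s: float, n_gpus: int = 1) -> float:
    """Lower bound on one decode step: all weights are read once from HBM.

    Assumes weights are sharded evenly across n_gpus and that reads
    on different GPUs happen in parallel.
    """
    weight_bytes = n_params * dtype_bytes
    seconds = weight_bytes / (hbm_bytes_per_s * n_gpus)
    return seconds * 1e3


# Assumed figures: 70e9 parameters, fp16 (2 bytes), ~2e12 B/s per GPU, 4 GPUs.
floor_ms = decode_step_floor_ms(70e9, 2, 2e12, n_gpus=4)
print(f"{floor_ms:.1f} ms per decode step")  # 17.5 ms per decode step
```

The same weight read serves every request in the batch, which is why adding requests to a decode step is close to free until compute becomes the limit.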
2. The KV cache — the silent budget eater
In a transformer, every attention head needs K and V tensors for every token in the context. To avoid recomputing them every step, we cache them.
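Size per token, per layer (standard multi-head attention, where K and V each hold d_model values):
2 * d_model * dtype_bytes
With grouped-query attention (GQA) only the KV heads are cached, so d_model is replaced by n_kv_heads * head_dim.
For Llama-3-70B (80 layers, hidden 8192, 64 query heads, fp16):
- Per token KV cache, if every head kept its own K and V: 2 * 80 * 8192 * 2 = ~2.5 MB.
- 4k context: ~10 GB just for one request's KV cache.
- 128k context: ~320 GB. Yes, GB. Just KV.
- Llama-3-70B actually uses GQA with 8 KV heads, which cuts each figure by 8x: ~320 KB per token, ~1.25 GB at 4k, ~40 GB at 128k.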
This is why long context is expensive even at low compute cost.
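The sizing arithmetic is easy to wrap in a helper. This is a minimal sketch: the function name is hypothetical, head_dim = 8192 / 64 = 128 follows from the figures above, and the 8-KV-head variant is an assumption taken from the published Llama-3-70B configuration rather than from the text.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   dtype_bytes: int, n_tokens: int = 1) -> int:
    """KV cache size in bytes; the leading 2 covers one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens


# Full multi-head attention: 64 KV heads * 128 dims = hidden size 8192.
mha_per_token = kv_cache_bytes(80, 64, 128, 2)
mha_4k = kv_cache_bytes(80, 64, 128, 2, n_tokens=4096)

# Grouped-query attention with 8 KV heads (assumed from the model config).
gqa_4k = kv_cache_bytes(80, 8, 128, 2, n_tokens=4096)

print(mha_per_token)   # 2621440 bytes, i.e. 2.5 MiB
print(mha_4k / GIB)    # 10.0
print(gqa_4k / GIB)    # 1.25
```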
3. Why naive batching breaks down
Static batching = wait for N requests, run them together as a batch.
Problems:
- Padding — variable-length prompts pad to max length; lots of wasted GPU.
- Tail-blocking — fastest request waits for slowest. 16-token reply blocked by 512-token reply.
- Cold queue — low traffic = small batches = poor GPU utilisation.
Throughput collapses for typical LLM workloads.
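The padding problem can be quantified directly. The sketch below uses a hypothetical helper and made-up sequence lengths to compute the fraction of token slots in a padded batch that hold no real token.

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of slots wasted when every sequence is padded to the longest one."""
    padded_slots = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_slots


# Illustrative batch: one long sequence forces the other three to pad up to 512.
print(padding_waste([16, 64, 128, 512]))  # 0.6484375
```

In this made-up batch almost two thirds of the work is spent on padding, and the three short requests still cannot return until the 512-token one is done.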
4. Continuous (in-flight) batching
Orca / vLLM insight: at decode time, every request just needs one more token. So:
- Maintain a pool of active requests.
- Each decode step: gather all active requests, run them together, append one token each.
- When a request finishes, evict it; admit a new one immediately.
- New prefill requests interleave with ongoing decodes.
Result: GPU never idles waiting for the slowest request. Throughput often 4–10x over static batching.
Time →
req A: P D D D D D D D
req B:   P D D D D D D D D
req C:                   P D D D D D D D
                         ↑ A finishes, C admitted
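A toy simulation makes the scheduling difference concrete. This is a sketch under simplifying assumptions, not how Orca or vLLM are implemented: prefill is ignored, every decode step costs the same, and the function names and workload are hypothetical.

```python
from collections import deque


def static_steps(output_lens: list[int], slots: int) -> int:
    """Decode steps when requests run in fixed groups; each group waits for its slowest member."""
    total = 0
    for i in range(0, len(output_lens), slots):
        total += max(output_lens[i:i + slots])
    return total


def continuous_steps(output_lens: list[int], slots: int) -> int:
    """Decode steps when a finished request is replaced before the next step."""
    queue = deque(output_lens)
    active: list[int] = []  # tokens still to generate, one entry per active request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        # One decode step: every active request emits a token; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps


workload = [16, 512, 16, 512, 16, 16, 16, 16]  # output lengths in tokens
print(static_steps(workload, slots=2))      # 1056
print(continuous_steps(workload, slots=2))  # 560
```

With only two slots the gain here is under 2x; larger batches and more skewed length distributions widen the gap.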
5. PagedAttention — vLLM's contribution
Problem: KV cache for each request is variable-size, grows over time, lives next to other requests' caches. Contiguous allocation = massive fragmentation as requests come and go.
Solution: borrow virtual memory paging from OS.
- Divide KV cache into fixed-size blocks (e.g. 16 tokens).
- Each request has a block table mapping logical positions → physical blocks.
- Blocks allocated/freed on demand; no external fragmentation, and internal waste is limited to each request's last, partially filled block.
Bonus: blocks can be shared across requests (prefix caching, beam search shares blocks).
Outcome: 2–4x more requests fit in same GPU memory; throughput up correspondingly.
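The block-table idea can be sketched in a few lines. This is a toy illustration only: the class and method names are hypothetical and do not mirror vLLM's block manager, and no tensors are stored, just the bookkeeping.

```python
class PagedKVAllocator:
    """Toy allocator: maps each request's logical token positions to physical blocks."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))      # ids of unused physical blocks
        self.tables: dict[str, list[int]] = {}   # request id -> block table
        self.lengths: dict[str, int] = {}        # request id -> tokens stored

    def append_token(self, req_id: str) -> None:
        n = self.lengths.get(req_id, 0)
        table = self.tables.setdefault(req_id, [])
        if n % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[req_id] = n + 1

    def lookup(self, req_id: str, pos: int) -> tuple[int, int]:
        """Return (physical block id, offset inside the block) for a token position."""
        return self.tables[req_id][pos // self.block_size], pos % self.block_size

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(20):
    alloc.append_token("A")
print(len(alloc.tables["A"]))  # 2 blocks cover 20 tokens
alloc.release("A")
print(len(alloc.free))         # 4, all blocks reusable again
```

Because any free block can serve any request, a released block is immediately usable by a newcomer regardless of where it sits in memory.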
6. Other serving systems
- TGI (HuggingFace Text Generation Inference) — production-ready, model-zoo wide, simple to deploy.
- TensorRT-LLM (NVIDIA) — fastest on NVIDIA hardware, painful build.
- DeepSpeed-Inference (Microsoft) — good for Mixture-of-Experts.
- SGLang — newer; structured outputs, prefix caching, often beats vLLM on certain workloads.
- MLX-LM / llama.cpp — local/edge inference; runs on Apple Silicon and CPU.
For most server-side production: vLLM or TGI. For best raw NVIDIA perf: TensorRT-LLM.
7. Prefix caching
Common scenario: every request shares a 2 KB system prompt. Recomputing prefill for that on every request wastes 100 ms each.
Prefix caching: hash the prefix tokens, cache the resulting KV blocks, reuse across requests. vLLM and SGLang support this automatically when prefixes match.
Big win for: agent workloads (long system prompts), few-shot prompting, RAG (the same retrieved chunks reused).
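One way to picture the hashing step is a chained hash per block, so that a block's hash identifies the entire prefix up to and including it. The sketch below illustrates that idea under assumptions; it is not the scheme either engine uses, and the function names, the 16-token block size, and the token ids are all hypothetical.

```python
import hashlib

BLOCK = 16  # assumed block size in tokens


def block_hashes(token_ids: list[int], block_size: int = BLOCK) -> list[bytes]:
    """Chain-hash each full block; a partial trailing block is not cacheable."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for i in range(0, full, block_size):
        chunk = ",".join(map(str, token_ids[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


def cached_prefix_tokens(token_ids: list[int], cache: set[bytes],
                         block_size: int = BLOCK) -> int:
    """Count leading tokens whose blocks are already cached, then register this prompt."""
    hashes = block_hashes(token_ids, block_size)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits * block_size


system_prompt = list(range(40))   # stand-in token ids for a shared system prompt
cache: set[bytes] = set()
print(cached_prefix_tokens(system_prompt + [1000, 1001], cache))  # 0, cold cache
print(cached_prefix_tokens(system_prompt + [2000] * 10, cache))   # 32, two full blocks reused
```

Only whole blocks are shared, so 32 of the 40 shared tokens hit the cache; the remaining 8 sit in a block whose contents differ between the two requests.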
8. Throughput vs latency
Four metrics you can set service-level objectives on:
- TTFT (time to first token) — user-perceived "did it start?"
- TPOT (time per output token) — user-perceived speed of streaming.
- End-to-end latency — total time.
- Throughput — total tokens/sec across all users.
You can't max all of them. Choices:
- Bigger batches → higher throughput, worse latency.
- Smaller batches → better latency, worse $/token.
- Speculative decoding (S34) → better TPOT, costs draft model time.
- Chunked prefill → smooth TTFT, slight throughput loss.
Define your SLO first. Tune second.
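Before tuning, it helps to measure these metrics per request. A minimal sketch, assuming you log the arrival time and the emission time of each output token in seconds; the function name and the timestamps are hypothetical.

```python
def latency_metrics(t_arrival: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, TPOT and end-to-end latency for one request."""
    ttft = token_times[0] - t_arrival
    n = len(token_times)
    # TPOT averages the gaps between consecutive output tokens, excluding the first.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": token_times[-1] - t_arrival}


# Illustrative request: first token after 100 ms, then one token every 30 ms.
m = latency_metrics(0.0, [0.10, 0.13, 0.16, 0.19])
print({k: round(v, 3) for k, v in m.items()})  # {'ttft': 0.1, 'tpot': 0.03, 'e2e': 0.19}
```

Throughput is then the total number of output tokens across all requests divided by the wall-clock window, which is why it can rise while per-request TPOT gets worse.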
9. Tensor + pipeline parallelism
A 70B model doesn't fit on one 80 GB GPU at fp16: the weights alone are ~140 GB. Options:
- Tensor parallelism (TP) — split each layer's matrices across GPUs; activations all-reduce after each layer. Low latency, high bandwidth needs.
- Pipeline parallelism (PP) — different layers on different GPUs; activations passed forward. Higher latency, less bandwidth.
- Expert parallelism (EP) — for MoE; experts on different GPUs.
For 8x A100: TP=8 typically fastest for inference.
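A quick way to sanity-check a parallelism plan is to compute the weight memory each GPU must hold. This is a rough sketch with a hypothetical helper: it assumes weights shard evenly across TP and PP ranks, and it ignores activations, KV cache, and framework overhead.

```python
def weights_gb_per_gpu(n_params: float, dtype_bytes: int,
                       tp: int = 1, pp: int = 1) -> float:
    """Weight memory per GPU in GB, assuming an even split across tp * pp devices."""
    return n_params * dtype_bytes / (tp * pp) / 1e9


print(weights_gb_per_gpu(70e9, 2))        # 140.0, too large for one 80 GB GPU
print(weights_gb_per_gpu(70e9, 2, tp=2))  # 70.0, fits but leaves little room for KV cache
print(weights_gb_per_gpu(70e9, 2, tp=8))  # 17.5, most of each GPU is left for KV cache
```

The memory left over after weights is what bounds the KV cache, and therefore the batch size, on each GPU.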
10. Quantisation (preview of S34)
fp16 → int8 → int4 shrinks model + KV cache. Throughput up, quality usually within 1% on standard benchmarks. Most production deployments are int8 or int4 by 2026. Detailed in S34.
11. Cost math (do this in an interview)
Llama-3-70B fp16 on 4× A100 80GB ($16/GPU-hr cloud):
- Cost: ~$64/hr.
- Realistic throughput with vLLM + continuous batching: ~3000 output tokens/sec.
- Cost per 1M tokens: $64 / (3000 * 3600 / 1e6) = ~$5.93.
Compare OpenAI GPT-4o-mini: ~$0.60/1M out. Why? Better hardware, better software stack, scale, quantisation, batching efficiency.
The gap is closing fast — but "self-host LLM" is rarely cheaper than API at low traffic.
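The same arithmetic as a reusable function, so you can vary price and throughput. The function name is hypothetical; the 300 tokens/sec case is an illustrative low-traffic scenario, not a measurement.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, n_gpus: int,
                            tokens_per_sec: float) -> float:
    """USD per 1M output tokens for a cluster billed by the GPU-hour."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1e6


# Figures from the example above: 4 GPUs at $16/GPU-hr, 3000 output tokens/sec.
print(round(cost_per_million_tokens(16, 4, 3000), 2))  # 5.93

# Same cluster at a tenth of the traffic: the hourly bill is unchanged.
print(round(cost_per_million_tokens(16, 4, 300), 2))   # 59.26
```

The second line shows why low traffic hurts self-hosting: you pay for the GPUs whether or not they are busy.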
12. Reality check
Decision tree for serving:
- < 1M tokens/day, latency permissive → API (OpenAI, Anthropic, etc.).
- 1–100M tokens/day, batchable → API with caching + prompt minimisation.
- 100M+ tokens/day with stable load → self-host with vLLM/TGI on rented GPUs.
- Privacy/regulatory hard requirement → self-host regardless.
Self-hosting eats engineering time. Budget 1 FTE for serving infra at any meaningful scale.
Reading material
- vLLM — Efficient Memory Management (paper)
- Orca — Continuous Batching for Transformer-based Generative Inference
- HuggingFace TGI docs
- Anyscale — Continuous batching blog
In-depth research material
- SGLang — Structured generation language
- vLLM source code — block manager
- NVIDIA — TensorRT-LLM examples
- Tim Dettmers — Inference math
Video reference
▶︎ vLLM: Easy, Fast, and Cheap LLM Serving (CMU)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Bounded Blocking Queue
- Link: https://leetcode.com/problems/design-bounded-blocking-queue/
- Difficulty: Medium
- Why this problem: Producer-consumer with bounded buffer — same shape as the inference request queue feeding a continuous-batching scheduler.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why prefill is compute-bound and decode is memory-bound.
- Compute the KV cache size for a given model and context length.
- Compare static vs continuous batching with a timing diagram.
- Describe how PagedAttention avoids KV cache fragmentation.
- List 3 SLOs (TTFT, TPOT, throughput) and explain how batch size trades them off.
- Solve design-bounded-blocking-queue — semaphore-based producer-consumer.