ai ml · intermediate · 12m · 2026-06-09

LLM Serving Part 1 — vLLM, KV Cache, Continuous Batching

Session 30 of the 48-session learning series.

Date: Fri, 2026-07-03 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 21 · Est. read: 2 h

Why this session matters

This is Session 30 of 48 in the LLM track. Pre-training gets the headlines; serving is what pays the bills. Understanding KV cache, continuous batching, and the memory layout of a transformer at inference time is what makes the difference between a $50 and a $5000 inference bill for the same workload.

Agenda

Why naive model.generate() is terrible for production
The KV cache — what it stores, how big it gets, why it matters
Continuous (in-flight) batching vs static batching
PagedAttention — vLLM's killer idea
Throughput vs latency — the eternal tradeoff

Pre-read (skim before the session)

Deep dive

1. The inference workload — two distinct phases

[ PREFILL ]                    [ DECODE ]
input prompt (N tokens)        one token at a time
heavy compute, parallel        memory-bound, sequential
~100 ms for 1k tokens          ~30 ms per token

Prefill: compute-bound. You process the entire prompt in parallel through every transformer layer. GPU's tensor cores are saturated.
Decode: memory-bound. One token forward at a time; bottleneck is reading the KV cache and weights from HBM, not compute.

These are different optimisation problems. Serving systems optimise both.
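One rough way to see why decode is memory-bound: every decode step has to stream all the weights (plus the live KV cache) out of HBM once, so bytes divided by bandwidth gives a floor on step time. The sketch below assumes 70e9 fp16 parameters and roughly 2 TB/s of HBM bandwidth per GPU; both are illustrative round numbers rather than measurements, and the function name is hypothetical.

```python
def decode_step_floor_ms(weight_bytes: float, kv_bytes: float,
                         hbm_bytes_per_s: float, n_gpus: int = 1) -> float:
    """Lower bound on one decode step: bytes to read / aggregate HBM bandwidth."""
    total_bandwidth = hbm_bytes_per_s * n_gpus
    return (weight_bytes + kv_bytes) / total_bandwidth * 1e3

# Assumed: 70e9 params * 2 bytes (fp16), 4 GPUs at ~2e12 B/s each, empty KV cache.
floor_ms = decode_step_floor_ms(70e9 * 2, 0.0, 2e12, n_gpus=4)
print(f"bandwidth floor per decode step: {floor_ms:.1f} ms")
```

The weights are read once per step regardless of how many requests share that step, which is why batching decode requests raises tokens/sec almost for free until KV reads or compute start to dominate.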

2. The KV cache — the silent budget eater

In a transformer, every attention head needs K and V tensors for every token in the context. To avoid recomputing them every step, we cache them.

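Size per token, per layer, with standard multi-head attention:

2 * d_model * dtype_bytes

For a Llama-3-70B-sized model (80 layers, hidden 8192, 64 heads, fp16), if every head keeps its own K and V:

Per token KV cache: 2 * 80 * 8192 * 2 bytes = ~2.5 MB
4k context: ~10 GB just for one request's KV cache.
128k context: ~320 GB. Yes, GB. Just KV.

Llama-3-70B itself uses grouped-query attention with 8 KV heads of dimension 128, so d_model in the formula becomes 8 * 128 = 1024 and the real figures are 8x smaller: ~320 KB per token, ~1.25 GB at 4k, ~40 GB at 128k.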

This is why long context is expensive even at low compute cost.
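The arithmetic above is easy to wrap in a helper. This is a minimal sketch; the function name is hypothetical, and the `n_kv_heads` parameter is an addition not in the formula above: with full multi-head attention it equals the head count (64 heads of dimension 8192 / 64 = 128 here), while grouped-query attention models cache fewer K/V heads.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, dtype_bytes: int = 2) -> int:
    """Bytes of K and V cached for n_tokens across all layers."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * dtype_bytes  # K and V
    return per_token_per_layer * n_layers * n_tokens

MIB, GIB = 1024 ** 2, 1024 ** 3

# Worked example from the text: 80 layers, 64 heads of dim 128 (hidden 8192), fp16.
per_token = kv_cache_bytes(80, 64, 128, n_tokens=1)
at_4k = kv_cache_bytes(80, 64, 128, n_tokens=4096)
at_128k = kv_cache_bytes(80, 64, 128, n_tokens=128 * 1024)
print(per_token / MIB, at_4k / GIB, at_128k / GIB)  # prints 2.5 10.0 320.0

# Assumption for comparison: grouped-query attention with 8 KV heads.
gqa_4k = kv_cache_bytes(80, 8, 128, n_tokens=4096)
print(gqa_4k / GIB)  # prints 1.25
```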

3. Why naive batching breaks down

Static batching = wait for N requests, run them together as a batch.

Problems:

Padding — variable-length prompts pad to max length; lots of wasted GPU.
Tail-blocking — fastest request waits for slowest. 16-token reply blocked by 512-token reply.
Cold queue — low traffic = small batches = poor GPU utilisation.

Throughput collapses for typical LLM workloads.
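To put a number on the padding problem, the sketch below computes what fraction of a padded batch is wasted. The batch lengths are made up for illustration and the function name is hypothetical.

```python
def padding_waste(lengths: list[int]) -> float:
    """Fraction of token slots in a padded batch that hold padding."""
    padded_slots = max(lengths) * len(lengths)
    return 1 - sum(lengths) / padded_slots

# Hypothetical static batch: three short sequences stuck with one long one.
waste = padding_waste([16, 32, 64, 512])
print(f"{waste:.0%} of the batch is padding")
```

With these lengths roughly 70% of the slots are padding, and the three short requests still cannot return until the 512-token one is done.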

4. Continuous (in-flight) batching

Orca / vLLM insight: at decode time, every request just needs one more token. So:

Maintain a pool of active requests.
Each decode step: gather all active requests, run them together, append one token each.
When a request finishes, evict it; admit a new one immediately.
New prefill requests interleave with ongoing decodes.

Result: GPU never idles waiting for the slowest request. Throughput often 4–10x over static batching.

Time →
req A: P D D D D D D D
req B:     P D D D D D D D D
req C:                 P D D D D D D D
                       ↑ A finishes, C admitted
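The scheduling loop can be sketched as a toy simulation that counts decode steps under both policies. It is a simplification under stated assumptions: all requests are queued at time zero, prefill cost is ignored, every request needs at least one token, and the function names are hypothetical rather than taken from Orca or vLLM.

```python
from collections import deque

def static_batching_steps(decode_lens: list[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(decode_lens), batch_size):
        steps += max(decode_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(decode_lens: list[int], batch_size: int) -> int:
    """Finished requests leave at once and queued ones take the free slot."""
    waiting = deque(decode_lens)
    active: list[int] = []  # tokens still to generate per in-flight request
    steps = 0
    while waiting or active:
        # Admit queued requests into any free slots before the step runs.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        steps += 1
        # One decode step: every active request emits one token; done ones are evicted.
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

# Made-up workload mixing 16-token and 512-token replies.
lens = [16, 512, 16, 16, 512, 16, 16, 16]
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
print(static_steps, continuous_steps)  # prints 1056 560
```

The gain here is smaller than the 4–10x quoted above because the toy uses only two slots and a single workload; the point is the mechanism, not the magnitude.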

5. PagedAttention — vLLM's contribution

Problem: KV cache for each request is variable-size, grows over time, lives next to other requests' caches. Contiguous allocation = massive fragmentation as requests come and go.

Solution: borrow virtual memory paging from OS.

Divide KV cache into fixed-size blocks (e.g. 16 tokens).
Each request has a block table mapping logical positions → physical blocks.
Blocks allocated/freed on demand; no external fragmentation, and internal waste is limited to the last, partly filled block of each request.

Bonus: blocks can be shared across requests (prefix caching, beam search shares blocks).

Outcome: 2–4x more requests fit in same GPU memory; throughput up correspondingly.
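The block-table idea can be sketched with a toy allocator. This only tracks which physical block holds which logical position; it stores no tensors, and the class and method names are hypothetical, not vLLM's own.

```python
BLOCK_SIZE = 16  # tokens per KV block, matching the example size above

class BlockAllocator:
    """Toy pool of fixed-size physical KV blocks with per-request block tables."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical block ids
        self.lengths: dict[str, int] = {}             # request id -> tokens stored

    def append_token(self, req: str) -> None:
        """Reserve space for one more token, taking a new block only when needed."""
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:  # first token, or the current block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(req, []).append(self.free.pop())
        self.lengths[req] = n + 1

    def lookup(self, req: str, pos: int) -> tuple[int, int]:
        """Translate a logical token position to (physical block, offset)."""
        return self.block_tables[req][pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, req: str) -> None:
        """Return all blocks of a finished request to the pool."""
        self.free.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = BlockAllocator(num_blocks=4)
for _ in range(17):
    alloc.append_token("A")  # 17 tokens need 2 blocks
for _ in range(32):
    alloc.append_token("B")  # 32 tokens need 2 blocks; the pool is now empty
alloc.release("A")           # A's blocks become reusable immediately
for _ in range(20):
    alloc.append_token("C")  # C fits in the blocks A gave back
```

Because any free block can serve any request, C does not need its two blocks to be adjacent, which is the property that removes the fragmentation problem described above.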

6. Other serving systems

TGI (HuggingFace Text Generation Inference) — production-ready, model-zoo wide, simple to deploy.
TensorRT-LLM (NVIDIA) — fastest on NVIDIA hardware, painful build.
DeepSpeed-Inference (Microsoft) — good for Mixture-of-Experts.
SGLang — newer; structured outputs, prefix caching, often beats vLLM on certain workloads.
MLX-LM / llama.cpp — local/edge inference; runs on Apple Silicon and CPU.

For most server-side production: vLLM or TGI. For best raw NVIDIA perf: TensorRT-LLM.

7. Prefix caching

Common scenario: every request shares a 2 KB system prompt. Recomputing prefill for that on every request wastes 100 ms each.

Prefix caching: hash the prefix tokens, cache the resulting KV blocks, and reuse them across requests. vLLM and SGLang support this automatically when prefixes match.

Big win for: agent workloads (long system prompts), few-shot prompting, RAG (the same retrieved chunks reused).
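A minimal sketch of block-level prefix hashing follows. It assumes each block's key is a running hash over all tokens up to and including that block, so two prompts share a key only when their whole prefix matches. The names and the tiny block size are illustrative and do not describe the actual vLLM or SGLang implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size for readability

def block_keys(token_ids: list[int]) -> list[str]:
    """One key per full block; each key covers the whole prefix up to that block."""
    keys: list[str] = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        running.update(",".join(map(str, block)).encode() + b";")
        keys.append(running.hexdigest())
    return keys

class PrefixCache:
    def __init__(self) -> None:
        self.cached: set[str] = set()  # stand-in for a key -> KV block mapping

    def admit(self, token_ids: list[int]) -> int:
        """Return how many prompt tokens can skip prefill, then cache this prompt."""
        keys = block_keys(token_ids)
        hits = 0
        for key in keys:
            if key not in self.cached:
                break
            hits += 1
        self.cached.update(keys)
        return hits * BLOCK_SIZE

cache = PrefixCache()
system_prompt = list(range(10))                         # shared 10-token prefix
first = cache.admit(system_prompt + [100, 101])         # cold cache
second = cache.admit(system_prompt + [200, 201, 202])   # reuses two full blocks
print(first, second)  # prints 0 8
```

Only whole blocks are reused: the third block mixes shared and request-specific tokens, so it is recomputed even though its first two tokens match.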

8. Throughput vs latency

Four metrics you can set service-level objectives on:

TTFT (time to first token) — user-perceived "did it start?"
TPOT (time per output token) — user-perceived speed of streaming.
End-to-end latency — total time.
Throughput — total tokens/sec across all users.

You can't max all of them. Choices:

Bigger batches → higher throughput, worse latency.
Smaller batches → better latency, worse $/token.
Speculative decoding (S34) → better TPOT, costs draft model time.
Chunked prefill → smoother TPOT (long prefills no longer stall in-flight decodes), slightly worse TTFT for the chunked request.

Define your SLO first. Tune second.
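The metrics above can be computed from per-token timestamps. This is a small sketch with hypothetical function names; the sample timings reuse the rough figures from section 1 (about 100 ms to the first token, then about 30 ms per token).

```python
def request_metrics(arrival_s: float, token_times_s: list[float]) -> dict[str, float]:
    """Per-request latency metrics from the time each output token was emitted."""
    n = len(token_times_s)
    ttft = token_times_s[0] - arrival_s
    # TPOT averages the gaps between consecutive tokens, so the first token is excluded.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": token_times_s[-1] - arrival_s}

def throughput_tok_per_s(all_token_times_s: list[list[float]], window_s: float) -> float:
    """Aggregate output tokens per second over an observation window."""
    return sum(len(times) for times in all_token_times_s) / window_s

# Illustrative request: first token after 100 ms, then one token every 30 ms.
times = [0.10, 0.13, 0.16, 0.19]
m = request_metrics(arrival_s=0.0, token_times_s=times)
tput = throughput_tok_per_s([times, times], window_s=0.2)
print(m, tput)
```

In practice SLOs are usually stated on percentiles (for example p95 TTFT) across many requests rather than on a single request.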

9. Tensor + pipeline parallelism

A 70B model doesn't fit on one 80 GB GPU at fp16: the weights alone are ~140 GB. Options:

Tensor parallelism (TP) — split each layer's matrices across GPUs; activations all-reduce after each layer. Low latency, high bandwidth needs.
Pipeline parallelism (PP) — different layers on different GPUs; activations passed forward. Higher latency, less bandwidth.
Expert parallelism (EP) — for MoE; experts on different GPUs.

For 8x A100: TP=8 typically fastest for inference.
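To make the tensor-parallel all-reduce concrete, here is a pure-Python sketch of one way to split a single matrix multiply: each simulated GPU holds a slice of the weight rows and the matching slice of the input, computes a partial output, and the partials are summed. Real frameworks combine row and column splits and run the sum as a collective over NVLink; the helper names here are hypothetical.

```python
def matvec(x: list[float], w: list[list[float]]) -> list[float]:
    """y = x @ w for a row vector x (length d_in) and w of shape d_in x d_out."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]

def row_parallel_matvec(x: list[float], w: list[list[float]], n_shards: int) -> list[float]:
    """Split the input features and weight rows across shards, then all-reduce."""
    shard = len(x) // n_shards  # assumes d_in is divisible by n_shards
    partials = [
        matvec(x[r * shard:(r + 1) * shard], w[r * shard:(r + 1) * shard])
        for r in range(n_shards)
    ]  # each entry is what one GPU would compute locally
    # All-reduce step: element-wise sum of the partial outputs.
    return [sum(vals) for vals in zip(*partials)]

x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [2, 1], [1, 3]]
full = matvec(x, w)
sharded = row_parallel_matvec(x, w, n_shards=2)
print(full, sharded)  # prints [11, 17] [11, 17]
```

Each shard stores only its slice of the weights, which is what lets the model fit; the price is the communication needed for the sum after every sharded layer.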

10. Quantisation (preview of S34)

fp16 → int8 → int4 shrinks model + KV cache. Throughput up, quality usually within 1% on standard benchmarks. Most production deployments are int8 or int4 by 2026. Detailed in S34.
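A quick back-of-envelope for the weight side of that shrinkage, assuming 70e9 parameters and ignoring the small overhead of quantisation scales and zero-points (the helper name is hypothetical):

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(n_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (1e9 bytes) for a given precision."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

sizes = {dtype: weight_gb(70e9, dtype) for dtype in BYTES_PER_PARAM}
print(sizes)  # prints {'fp16': 140.0, 'int8': 70.0, 'int4': 35.0}
```

At int8 the weights alone nearly fill an 80 GB GPU and leave little room for KV cache, while int4 leaves more than half the memory free for it.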

11. Cost math (do this in an interview)

Llama-3-70B fp16 on 4× A100 80GB ($16/GPU-hr cloud):

Cost: ~$64/hr.
Realistic throughput with vLLM + continuous batching: ~3000 output tokens/sec.
Cost per 1M tokens: $64 / (3000 * 3600 / 1e6) = ~$5.93.

Compare OpenAI GPT-4o-mini: ~$0.60/1M out. Why? Better hardware, better software stack, scale, quantisation, batching efficiency.

The gap is closing fast — but "self-host LLM" is rarely cheaper than API at low traffic.
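The cost arithmetic above, as a reusable sketch. The `utilisation` parameter is an addition for illustration (the fraction of each hour the GPUs actually spend serving at the quoted throughput), and the function name is hypothetical.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, n_gpus: int,
                            tokens_per_s: float, utilisation: float = 1.0) -> float:
    """USD per 1M output tokens for a rented GPU cluster."""
    tokens_per_hour = tokens_per_s * 3600 * utilisation
    return gpu_hourly_usd * n_gpus / (tokens_per_hour / 1e6)

# Figures from the text: 4 GPUs at $16/GPU-hr, ~3000 output tokens/sec.
busy = cost_per_million_tokens(16, 4, 3000)
# Same cluster, but traffic only keeps it busy 10% of the time (made-up figure).
mostly_idle = cost_per_million_tokens(16, 4, 3000, utilisation=0.10)
print(f"${busy:.2f} vs ${mostly_idle:.2f} per 1M tokens")  # prints $5.93 vs $59.26 per 1M tokens
```

The idle case shows why self-hosting struggles at low traffic: the GPUs bill by the hour whether or not requests arrive.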

12. Reality check

Decision tree for serving:

< 1M tokens/day, latency permissive → API (OpenAI, Anthropic, etc.).
1–100M tokens/day, batchable → API with caching + prompt minimisation.
100M+ tokens/day with stable load → self-host with vLLM/TGI on rented GPUs.
Privacy/regulatory hard requirement → self-host regardless.

Self-hosting eats engineering time. Budget 1 FTE for serving infra at any meaningful scale.

Link: https://leetcode.com/problems/design-bounded-blocking-queue/
Difficulty: Medium
Why this problem: Producer-consumer with bounded buffer — same shape as the inference request queue feeding a continuous-batching scheduler.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why prefill is compute-bound and decode is memory-bound.
Compute the KV cache size for a given model and context length.
Compare static vs continuous batching with a timing diagram.
Describe how PagedAttention avoids KV cache fragmentation.
List 3 SLOs (TTFT, TPOT, throughput) and explain how batch size trades them off.
Solve design-bounded-blocking-queue — semaphore-based producer-consumer.

