ai ml · intermediate · 12 m · 2026-06-09

LLM Serving Part 1 — vLLM, KV Cache, Continuous Batching

Session 30 of the 48-session learning series.

Date: Fri, 2026-07-03 · Time: 18:00–20:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 21 · Est. read: 2 h

Why this session matters

This is Session 30 of 48 in the LLM track. Pre-training gets the headlines; serving is what pays the bills. Understanding KV cache, continuous batching, and the memory layout of a transformer at inference time is what makes the difference between a $50 and a $5000 inference bill for the same workload.

Agenda

  • Why naive model.generate() is terrible for production
  • The KV cache — what it stores, how big it gets, why it matters
  • Continuous (in-flight) batching vs static batching
  • PagedAttention — vLLM's killer idea
  • Throughput vs latency — the eternal tradeoff

Pre-read (skim before the session)

Deep dive

1. The inference workload — two distinct phases

[ PREFILL ]                    [ DECODE ]
input prompt (N tokens)        one token at a time
heavy compute, parallel        memory-bound, sequential
~100 ms for 1k tokens          ~30 ms per token
  • Prefill: compute-bound. You process the entire prompt in parallel through every transformer layer, so the GPU's tensor cores are saturated.
  • Decode: memory-bound. Each step pushes only one new token per request through the model; the bottleneck is reading the weights and the KV cache from HBM, not compute.

These are different optimisation problems. Serving systems optimise both.
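
To see why decode is memory-bound, a rough floor on step time is "bytes that must be read from HBM, divided by HBM bandwidth". The sketch below is back-of-envelope only: the 2 TB/s per-GPU bandwidth and the 4-GPU setup are assumed round numbers, not measurements, and KV cache reads are ignored.

```python
def decode_step_floor_ms(bytes_read_per_step: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step if memory traffic is the only limit."""
    return bytes_read_per_step / hbm_bytes_per_s * 1e3

# Assumed, illustrative numbers (not measurements):
weights_bytes = 70e9 * 2      # 70B parameters at 2 bytes each (fp16)
aggregate_bw = 4 * 2e12       # 4 GPUs at roughly 2 TB/s of HBM bandwidth each

floor_ms = decode_step_floor_ms(weights_bytes, aggregate_bw)
print(f"memory-bound floor per decode step: {floor_ms:.1f} ms")
```

Under these assumptions the floor is about 17.5 ms per step regardless of batch size, which is why batching more requests into each step is nearly free until compute or KV cache reads take over.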

2. The KV cache — the silent budget eater

In a transformer, every attention head needs K and V tensors for every token in the context. To avoid recomputing them every step, we cache them.

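Size per token, per layer (standard multi-head attention, where K and V each have width d_model):

2 * d_model * dtype_bytes

With grouped-query attention (GQA), replace d_model with n_kv_heads * head_dim.

For Llama-3-70B (80 layers, hidden 8192, 64 heads, fp16), treated as plain multi-head attention:

  • Per token KV cache: 2 * 80 * 8192 * 2 bytes = ~2.5 MB
  • 4k context: ~10 GB just for one request's KV cache.
  • 128k context: ~320 GB. Yes, GB. Just KV.

Llama-3-70B actually uses GQA with 8 KV heads (head dim 128), which cuts these figures by 8x: ~320 KB per token, ~1.25 GB at 4k, ~40 GB at 128k. Still large, and it is per request.

This is why long context is expensive even at low compute cost.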
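
The sizing formula can be wrapped in a small helper. This is a minimal sketch: the function name and parameters are made up for illustration, and the 8-KV-head variant assumes the published Llama-3-70B config (GQA, head dim 128). Sizes are reported in binary units (1 GiB = 1024^3 bytes), which is what the round figures above correspond to.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, dtype_bytes, n_tokens=1):
    """KV cache size in bytes for one request."""
    # Leading 2 = one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

GIB = 1024 ** 3

# Multi-head attention assumption: 64 KV heads * 128 = hidden size 8192.
mha_4k = kv_cache_bytes(80, 64, 128, 2, n_tokens=4096)
# Assumed GQA config: 8 KV heads of dim 128.
gqa_4k = kv_cache_bytes(80, 8, 128, 2, n_tokens=4096)

print(f"MHA, 4k context: {mha_4k / GIB:.2f} GiB")
print(f"GQA, 4k context: {gqa_4k / GIB:.2f} GiB")
```

Multiply by the number of concurrent requests to get the real budget: KV cache, not weights, is usually what caps batch size.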

3. Why naive batching breaks down

Static batching = wait for N requests, run them together as a batch.

Problems:

  • Padding — variable-length prompts pad to max length; lots of wasted GPU.
  • Tail-blocking — fastest request waits for slowest. 16-token reply blocked by 512-token reply.
  • Cold queue — low traffic = small batches = poor GPU utilisation.

Throughput collapses for typical LLM workloads.

4. Continuous (in-flight) batching

Orca / vLLM insight: at decode time, every request just needs one more token. So:

  • Maintain a pool of active requests.
  • Each decode step: gather all active requests, run them together, append one token each.
  • When a request finishes, evict it; admit a new one immediately.
  • New prefill requests interleave with ongoing decodes.

Result: GPU never idles waiting for the slowest request. Throughput often 4–10x over static batching.

Time →
req A: P D D D D D D D
req B:     P D D D D D D D D
req C:                 P D D D D D D D
                       ↑ A finishes, C admitted
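
The scheduling loop above can be compared with static batching in a toy simulation. This sketch counts decode steps only: it ignores prefill and assumes every step costs the same, and both function names are hypothetical.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lens)
    active = []  # tokens still to generate, one entry per in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())   # admit new requests
        steps += 1                             # one decode step for the whole pool
        active = [r - 1 for r in active if r > 1]  # evict finished requests
    return steps

lens = [16, 512, 16, 512]  # short and long replies interleaved
print(static_batching_steps(lens, batch_size=2))      # 1024
print(continuous_batching_steps(lens, batch_size=2))  # 544
```

With two slots, the short replies no longer wait behind the long ones, so the same work finishes in roughly half the steps in this example.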

5. PagedAttention — vLLM's contribution

Problem: KV cache for each request is variable-size, grows over time, lives next to other requests' caches. Contiguous allocation = massive fragmentation as requests come and go.

Solution: borrow virtual-memory paging from operating systems.

  • Divide KV cache into fixed-size blocks (e.g. 16 tokens).
  • Each request has a block table mapping logical positions → physical blocks.
  • Blocks allocated/freed on demand; no external fragmentation, and the only waste is the unfilled tail of each request's last block.

Bonus: blocks can be shared across requests (prefix caching, beam search shares blocks).

Outcome: 2–4x more requests fit in same GPU memory; throughput up correspondingly.
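
The block-table idea can be sketched in a few lines. This is a toy allocator for illustration, not vLLM's implementation: the class and method names are hypothetical, and it only tracks which physical block holds which token position.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, req_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.num_tokens.get(req_id, 0)
        if n == len(table) * self.block_size:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real scheduler would preempt")
            table.append(self.free_blocks.pop())
        self.num_tokens[req_id] = n + 1

    def physical_slot(self, req_id, pos):
        """Map a logical token position to (physical block, offset in block)."""
        block = self.block_tables[req_id][pos // self.block_size]
        return block, pos % self.block_size

    def free(self, req_id):
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)

alloc = PagedKVAllocator(num_blocks=4)
for _ in range(17):              # 17 tokens need two 16-token blocks
    alloc.append_token("A")
print(alloc.block_tables["A"], alloc.physical_slot("A", 16))
alloc.free("A")
```

Because any free block can serve any request, memory freed by a finished request is immediately usable by a new one, whatever its length.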

6. Other serving systems

  • TGI (HuggingFace Text Generation Inference) — production-ready, model-zoo wide, simple to deploy.
  • TensorRT-LLM (NVIDIA) — fastest on NVIDIA hardware, painful build.
  • DeepSpeed-Inference (Microsoft) — good for Mixture-of-Experts.
  • SGLang — newer; structured outputs, prefix caching, often beats vLLM on certain workloads.
  • MLX-LM / llama.cpp — local/edge inference; runs on Apple Silicon and CPU.

For most server-side production: vLLM or TGI. For best raw NVIDIA perf: TensorRT-LLM.

7. Prefix caching

Common scenario: every request shares a 2 KB system prompt. Recomputing prefill for that on every request wastes 100 ms each.

Prefix caching: hash the prefix tokens, cache the resulting KV blocks, and reuse them across requests. vLLM and SGLang support this and reuse blocks automatically when prefixes match.

Big win for: agent workloads (long system prompts), few-shot prompting, RAG (the same retrieved chunks reused).
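
One way to picture the hashing step: chain a hash over fixed-size token blocks so that each key identifies the entire prefix up to that block. The sketch below is an assumption-laden toy (hypothetical names, SHA-256 chosen arbitrarily, no eviction, a set standing in for real KV blocks), not how any particular engine implements it.

```python
import hashlib

def block_keys(token_ids, block_size=16):
    """Chained hashes, one per full block; each key covers the whole prefix so far."""
    keys, parent = [], b""
    n_full = len(token_ids) // block_size * block_size
    for start in range(0, n_full, block_size):
        chunk = ",".join(str(t) for t in token_ids[start:start + block_size]).encode()
        parent = hashlib.sha256(parent + b"|" + chunk).digest()
        keys.append(parent)
    return keys

class PrefixCache:
    def __init__(self, block_size=16):
        self.block_size = block_size
        self.cached = set()  # keys of blocks whose KV we pretend to hold

    def prefill(self, token_ids):
        """Return how many prompt tokens could skip prefill thanks to cached blocks."""
        reused = 0
        for key in block_keys(token_ids, self.block_size):
            if key in self.cached:
                reused += self.block_size
            else:
                self.cached.add(key)
        return reused

system_prompt = list(range(40))  # stand-in token ids for a shared system prompt
cache = PrefixCache()
first = cache.prefill(system_prompt + [100, 101])
second = cache.prefill(system_prompt + [200, 201, 202])
print(first, second)  # 0 32
```

Only full blocks are shared, so the second request reuses 32 of the 40 shared tokens and recomputes the rest.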

8. Throughput vs latency

Four metrics you can set service-level objectives on:

  • TTFT (time to first token) — user-perceived "did it start?"
  • TPOT (time per output token) — user-perceived speed of streaming.
  • End-to-end latency — total time.
  • Throughput — total tokens/sec across all users.

You can't max all of them. Choices:

  • Bigger batches → higher throughput, worse latency.
  • Smaller batches → better latency, worse $/token.
  • Speculative decoding (S34) → better TPOT, costs draft model time.
  • Chunked prefill → smooth TTFT, slight throughput loss.

Define your SLO first. Tune second.
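
The per-request metrics listed above fall straight out of token timestamps. A minimal sketch, assuming you log when the request was sent and when each output token arrived; the function names and dict keys are hypothetical.

```python
def latency_metrics(request_sent_s, token_times_s):
    """Per-request metrics from wall-clock timestamps in seconds."""
    n = len(token_times_s)
    ttft = token_times_s[0] - request_sent_s
    e2e = token_times_s[-1] - request_sent_s
    # TPOT averages the gaps between output tokens; undefined for a single token.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else None
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}

def throughput_tokens_per_s(total_output_tokens, window_s):
    """Aggregate throughput across all users over a measurement window."""
    return total_output_tokens / window_s

m = latency_metrics(0.0, [0.10, 0.13, 0.16, 0.19])
print(m)
```

Track percentiles (p50, p95, p99) of TTFT and TPOT rather than means; batch-size changes usually show up in the tail first.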

9. Tensor + pipeline parallelism

A 70B model doesn't fit on one 80 GB GPU at fp16: the weights alone are about 140 GB. Options:

  • Tensor parallelism (TP) — split each layer's matrices across GPUs; activations all-reduce after each layer. Low latency, high bandwidth needs.
  • Pipeline parallelism (PP) — different layers on different GPUs; activations passed forward. Higher latency, less bandwidth.
  • Expert parallelism (EP) — for MoE; experts on different GPUs.

For 8x A100: TP=8 typically fastest for inference.
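
A quick way to sanity-check a parallelism plan is to compute the weight footprint per GPU. This sketch assumes weights are sharded evenly across GPUs and ignores KV cache, activations and framework overhead, all of which need headroom on top.

```python
def weight_gb_per_gpu(n_params, dtype_bytes, n_gpus):
    """Weight memory per GPU in GB (1e9 bytes), assuming an even split."""
    return n_params * dtype_bytes / n_gpus / 1e9

for n_gpus in (1, 2, 4, 8):
    gb = weight_gb_per_gpu(70e9, 2, n_gpus)  # 70B parameters, fp16
    print(f"{n_gpus} GPU(s): {gb:.1f} GB of weights each")
```

At fp16, two 80 GB GPUs hold the weights but leave little room for KV cache, which is one reason 4 or 8 GPUs are common for a 70B model.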

10. Quantisation (preview of S34)

fp16 → int8 → int4 shrinks model + KV cache. Throughput up, quality usually within 1% on standard benchmarks. Most production deployments are int8 or int4 by 2026. Detailed in S34.

11. Cost math (do this in an interview)

Llama-3-70B fp16 on 4× A100 80GB ($16/GPU-hr cloud):

  • Cost: ~$64/hr.
  • Realistic throughput with vLLM + continuous batching: ~3000 output tokens/sec.
  • Cost per 1M tokens: $64 / (3000 * 3600 / 1e6) = ~$5.93.

Compare OpenAI GPT-4o-mini: ~$0.60/1M out. Why? Better hardware, better software stack, scale, quantisation, batching efficiency.

The gap is closing fast — but "self-host LLM" is rarely cheaper than API at low traffic.
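
The arithmetic above generalises to any GPU count, price and throughput. A minimal sketch with a hypothetical helper name; the inputs are the same illustrative figures used in this section, not benchmarks.

```python
def cost_per_million_tokens(gpu_count, usd_per_gpu_hour, tokens_per_second):
    """USD per 1M output tokens at a sustained throughput."""
    usd_per_hour = gpu_count * usd_per_gpu_hour
    million_tokens_per_hour = tokens_per_second * 3600 / 1e6
    return usd_per_hour / million_tokens_per_hour

print(f"${cost_per_million_tokens(4, 16, 3000):.2f} per 1M tokens")  # $5.93
```

The formula assumes the GPUs are busy the whole hour. At 25% utilisation the effective cost per token is four times higher, which is why low-traffic self-hosting rarely pays off.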

12. Reality check

Decision tree for serving:

  • < 1M tokens/day, latency permissive → API (OpenAI, Anthropic, etc.).
  • 1–100M tokens/day, batchable → API with caching + prompt minimisation.
  • 100M+ tokens/day with stable load → self-host with vLLM/TGI on rented GPUs.
  • Privacy/regulatory hard requirement → self-host regardless.

Self-hosting eats engineering time. Budget 1 FTE for serving infra at any meaningful scale.

Reading material

In-depth research material

Video reference

▶︎ vLLM: Easy, Fast, and Cheap LLM Serving (CMU)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Design Bounded Blocking Queue
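
A minimal semaphore-based sketch of the producer-consumer design this problem asks for. Method names follow the usual problem statement (enqueue, dequeue, size); treat it as one possible solution under the assumption that capacity is at least 1.

```python
import threading
from collections import deque

class BoundedBlockingQueue:
    def __init__(self, capacity: int):
        self._items = deque()
        self._lock = threading.Lock()                     # protects the deque
        self._free_slots = threading.Semaphore(capacity)  # producers wait on this
        self._filled_slots = threading.Semaphore(0)       # consumers wait on this

    def enqueue(self, element) -> None:
        self._free_slots.acquire()       # block while the queue is full
        with self._lock:
            self._items.append(element)
        self._filled_slots.release()     # wake one waiting consumer

    def dequeue(self):
        self._filled_slots.acquire()     # block while the queue is empty
        with self._lock:
            element = self._items.popleft()
        self._free_slots.release()       # wake one waiting producer
        return element

    def size(self) -> int:
        with self._lock:
            return len(self._items)

q = BoundedBlockingQueue(2)
producer = threading.Thread(target=lambda: [q.enqueue(i) for i in range(10)])
producer.start()
received = [q.dequeue() for _ in range(10)]
producer.join()
print(received)
```

The connection to serving: an admission queue in front of an inference engine is exactly this pattern, with the capacity acting as backpressure when the GPU pool is saturated.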

Post-session checklist

By the end of this session you should be able to:

  • Explain why prefill is compute-bound and decode is memory-bound.
  • Compute the KV cache size for a given model and context length.
  • Compare static vs continuous batching with a timing diagram.
  • Describe how PagedAttention avoids KV cache fragmentation.
  • List 3 SLOs (TTFT, TPOT, throughput) and explain how batch size trades them off.
  • Solve design-bounded-blocking-queue — semaphore-based producer-consumer.
