Day 16 — LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect
If you can't measure it, you can't ship it. Modern LLM eval is its own discipline — task-specific benchmarks, golden sets, LLM judges with rubrics, and slice-le…
Evaluation is the single biggest gap between "demo" and "product". A good eval harness gives you a confident "this change is +3.2 pts on hard-multi-hop and -0.5 pts on chit-chat" instead of "feels better".
🧠 Concept
Why it matters & the mental model.
1. Three eval regimes
- Reference-based (gold answer exists): exact match, F1, BLEU, ROUGE, BERTScore. Cheap, brittle on open-ended.
- Reference-free with judge: LLM-as-judge scores response against a rubric (helpfulness, faithfulness, safety). Cheap to scale.
- Human eval: 50-200 examples scored by humans, often pairwise (A vs B). Slow, the ground truth that calibrates everything else.
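To make the reference-based regime concrete, here is a minimal sketch of exact match and token-level F1. The normalization rule (lowercase, strip punctuation) and the function names are assumptions chosen for illustration, not a standard from any particular benchmark.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A correct but wordy answer such as "The capital is Paris" against the gold answer "Paris" fails exact match while still earning partial F1 credit, which is the brittleness on open-ended output noted above.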
2. Build a golden set first
50-200 examples is enough to start. Stratify by:
- Difficulty (easy / medium / hard).
- Slice (task type, domain, length, language).
- Failure modes you've seen in production.
Without slicing, the aggregate score averages wins on easy cases against losses on the cases that matter, so a regression on a critical slice can hide behind a healthy overall number.
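A small sketch of slice-level reporting, assuming each eval result is a dict with hypothetical keys "slice" and "score" (the slice names are borrowed from the example at the top of these notes):

```python
from collections import defaultdict


def score_by_slice(results: list[dict]) -> dict[str, float]:
    """Mean score per slice; each result needs 'slice' and 'score' keys."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in results:
        buckets[row["slice"]].append(row["score"])
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


def overall_score(results: list[dict]) -> float:
    """Unsliced mean, shown only to contrast with the per-slice view."""
    return sum(row["score"] for row in results) / len(results)


# Illustrative data: three easy wins mask one hard failure.
results = [
    {"slice": "chit-chat", "score": 1.0},
    {"slice": "chit-chat", "score": 1.0},
    {"slice": "chit-chat", "score": 1.0},
    {"slice": "hard-multi-hop", "score": 0.0},
]
```

On this toy data the overall mean is 0.75, while the per-slice view shows chit-chat at 1.0 and hard-multi-hop at 0.0.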
3. LLM-as-judge — make it reliable
Three failure modes:
- Position bias (prefers first or second). Mitigate: swap order, average.
- Verbosity bias (prefers longer). Mitigate: make the rubric explicit about conciseness.
- Self-preference (model prefers its own outputs). Mitigate: use a different model as judge; calibrate against humans.
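The "swap order, average" mitigation for position bias can be sketched as follows. The judge here is a hypothetical callable that takes (question, first_answer, second_answer) and returns "first", "second", or "tie"; in a real harness it would wrap an LLM call, which is left out of this sketch.

```python
from typing import Callable

# Hypothetical judge signature: (question, first, second) -> verdict string.
Judge = Callable[[str, str, str], str]

POINTS = {"first": 1.0, "tie": 0.5, "second": 0.0}


def debiased_score(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Average A's score over both presentation orders.

    Returns a value in [0, 1]: 1.0 means A wins in both orders,
    0.5 means no consistent preference, 0.0 means B wins in both orders.
    """
    # Pass 1: A is shown first, so "first" is a point for A.
    a_shown_first = POINTS[judge(question, answer_a, answer_b)]
    # Pass 2: B is shown first, so "first" is a point for B; invert it.
    a_shown_second = 1.0 - POINTS[judge(question, answer_b, answer_a)]
    return (a_shown_first + a_shown_second) / 2
```

A judge that always picks whichever answer comes first ends up at 0.5, so pure position bias cancels out instead of being counted as a win.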
Rubric structure: 1-5 scale, explicit criteria, examples of each score, JSON output. Calibrate on 20 human-labelled examples; if the judge agrees with the human labels on fewer than 70% of them, refine the rubric.
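The calibration check can be sketched as a simple agreement rate between judge and human scores on the same examples. What counts as "agreement" is an assumption here: exact score match by default, with an optional tolerance for treating near-misses on the 1-5 scale as agreement.

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int], tolerance: int = 0) -> float:
    """Fraction of examples where judge and human scores differ by at most `tolerance`."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


def needs_rubric_refinement(judge_scores: list[int], human_scores: list[int],
                            threshold: float = 0.70) -> bool:
    """True when agreement falls below the 70% bar suggested above."""
    return agreement_rate(judge_scores, human_scores) < threshold
```

With 20 labelled examples, each disagreement moves the rate by 5 points, so the 70% bar corresponds to at most 6 disagreements.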
🛠 Deep Dive
Internals, code, architecture.
4. RAG-specific metrics (RAGAS)
- Faithfulness: every claim in the answer is supported by the retrieved context.
- Answer relevance: answer addresses the question.
- Context precision: ranking quality of retrieved chunks.
- Context recall: did retrieval surface all the chunks needed to answer? (Requires ground truth.)
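The arithmetic behind three of these metrics can be sketched in plain Python. This is not the RAGAS library API: the claim extraction and support checks that an LLM would normally perform are assumed to have happened already and arrive as booleans, and the chunk-ID form of recall is a simplification that follows the description above.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision(relevant_at_rank: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    Rewards rankings that put relevant chunks near the top.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant_at_rank, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def context_recall(retrieved_ids: set[str], needed_ids: set[str]) -> float:
    """Share of ground-truth chunks that retrieval actually surfaced."""
    if not needed_ids:
        return 1.0
    return len(retrieved_ids & needed_ids) / len(needed_ids)
```

For example, a ranking with relevant chunks at positions 1 and 3 scores (1/1 + 2/3) / 2, roughly 0.83, while the same two chunks at positions 2 and 3 score lower because the top slot is wasted.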
5. Agent-specific
- Task success rate: does final state match expected?
- Trajectory quality: did it take the right steps?
- Tool-call correctness: right tool, right args, on time.
- Cost / steps.
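A minimal sketch of two of these checks, assuming a hypothetical trace format in which each tool call is a dict like {"name": ..., "args": {...}} and agent state is a flat dict. Real agent frameworks log richer traces; the strict position-by-position comparison is one possible scoring choice, not the only one.

```python
def task_success(final_state: dict, expected_state: dict) -> bool:
    """True when every expected key is present in the final state with the expected value."""
    return all(final_state.get(key) == value for key, value in expected_state.items())


def tool_call_accuracy(expected: list[dict], actual: list[dict]) -> float:
    """Share of positions where the actual call equals the expected call.

    A call matches only if both name and args are equal. Extra or missing
    calls count as misses because the denominator is the longer trace.
    """
    if not expected and not actual:
        return 1.0
    matches = sum(exp == act for exp, act in zip(expected, actual))
    return matches / max(len(expected), len(actual))


# Illustrative traces: right first call, wrong args on the second, one extra call.
expected_calls = [
    {"name": "search", "args": {"query": "refund policy"}},
    {"name": "send_email", "args": {"to": "user@example.com"}},
]
actual_calls = [
    {"name": "search", "args": {"query": "refund policy"}},
    {"name": "send_email", "args": {"to": "admin@example.com"}},
    {"name": "search", "args": {"query": "refund policy"}},
]
```

Here tool_call_accuracy(expected_calls, actual_calls) is 1/3: one exact match out of three positions, with the redundant third call also adding to cost and step count.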
6. Public benchmarks worth knowing
- MMLU / MMLU-Pro: general knowledge / reasoning.
- GSM8K / MATH / AIME: math.
- HumanEval / MBPP / SWE-bench: coding.
- MTEB: embeddings.
- MT-Bench / Arena: chat preference.
Use them for model selection, never for product readiness: your task isn't on the leaderboard.
🚀 In Practice
Trade-offs, exercises, what to ship today.
7. Online / production eval
- Implicit signals: thumbs up/down, copy-rate, re-prompt rate, abandonment.
- Online A/B: route 50/50, measure business metric (conversion, time-to-resolution).
- Shadow eval: run new model on prod traces offline, judge, compare.
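For the 50/50 A/B route, one common approach is deterministic hash-based assignment, so the same user always lands in the same arm. The sketch below assumes string user IDs; the experiment name "model-v2" and the arm labels are hypothetical.

```python
import hashlib


def assign_arm(user_id: str, experiment: str = "model-v2", treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hashing the experiment name together with the user ID keeps assignments
    stable within an experiment and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"
```

Because assignment depends only on the IDs, no lookup table is needed, and a user who re-prompts or returns later sees the same model, which keeps implicit signals such as re-prompt rate attributable to one arm.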
8. Regression suite in CI
Treat eval like tests. On every PR:
- Run 200-example golden set.
- Pairwise judge new vs prod.
- Fail PR if win rate < 50% on any critical slice.
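The gate in the last step might look like the sketch below. It assumes the pairwise judge has already produced one (slice, outcome) pair per golden-set example, with outcome being "win", "tie", or "loss" for the new model against prod. Counting a tie as half a win and failing a critical slice that has no examples are both assumptions of this sketch.

```python
from collections import defaultdict

OUTCOME_POINTS = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def failing_slices(verdicts: list[tuple[str, str]], critical_slices: list[str],
                   min_win_rate: float = 0.5) -> dict[str, float]:
    """Return {slice: win_rate} for each critical slice below the threshold."""
    points: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for slice_name, outcome in verdicts:
        points[slice_name] += OUTCOME_POINTS[outcome]
        counts[slice_name] += 1

    failing = {}
    for slice_name in critical_slices:
        n = counts.get(slice_name, 0)
        # A critical slice with no coverage is treated as a failure.
        win_rate = points.get(slice_name, 0.0) / n if n else 0.0
        if win_rate < min_win_rate:
            failing[slice_name] = win_rate
    return failing
```

In a CI job, a non-empty result would translate into a non-zero exit code, for example `sys.exit(1)` after printing the failing slices.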
9. The eval flywheel
Production traces → label failures → add to golden set → fix → re-eval. Once the obvious gains are made, most further wins come from this loop.
10. What to take away
"How do you know your LLM app is working?" Strong answers: stratified golden set, mix of metric types, judge with calibration, slice reporting, regression in CI, online experiment as final word.
Resources
- 🎥 DeepLearning.AI — Evaluating and Debugging LLM Applications
- 📖 RAGAS docs
- 📖 Anthropic — Building evals
- 📖 Inspect AI (UK AISI)
Practice Problem: Word Search (Medium)