ai ml | advanced | 12m | 2026-06-14

Day 16 — LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect

If you can't measure it, you can't ship it. Modern LLM eval is its own discipline — task-specific benchmarks, golden sets, LLM judges with rubrics, and slice-le…

Evaluation is the single biggest gap between "demo" and "product". A good eval harness gives you a confident "this change is +3.2 pts on hard-multi-hop and -0.5 pts on chit-chat" instead of "feels better".

🧠 Concept

Why it matters & the mental model.

1. Three eval regimes

Reference-based (gold answer exists): exact match, F1, BLEU, ROUGE, BERTScore. Cheap, but brittle on open-ended outputs.
Reference-free with judge: LLM-as-judge scores the response against a rubric (helpfulness, faithfulness, safety). Cheap to scale.
Human eval: 50-200 examples scored by humans, often pairwise (A vs B). Slow, but it is the ground truth that calibrates everything else.
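To make the "cheap, brittle" trade-off of reference-based metrics concrete, here is a minimal sketch of exact match and token-level F1. The normalization (lowercase, strip punctuation) is an assumption chosen for illustration; real benchmarks each define their own.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, gold: str) -> bool:
    """True only if the normalized token sequences are identical."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# A correct but wordy answer fails exact match and gets a low F1.
print(exact_match("The capital is Paris", "Paris"))
print(token_f1("The capital is Paris", "Paris"))
```

The wordy answer is right, yet exact match rejects it and F1 penalizes the extra tokens, which is why these metrics struggle on open-ended outputs.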

2. Build a golden set first

50-200 examples is enough to start. Stratify by:

Difficulty (easy / medium / hard).
Slice (task type, domain, length, language).
Failure modes you've seen in production.

Without slicing, the aggregate score averages wins on easy cases against losses on the cases that matter, so a regression on a critical slice can hide behind a higher overall number.
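A small sketch of why slicing matters. The data below is made up for illustration: each result is a hypothetical `(slice_name, score)` pair with score 1 for pass and 0 for fail.

```python
from collections import defaultdict


def overall(results):
    """Mean score across all examples, ignoring slices."""
    scores = [score for _, score in results]
    return sum(scores) / len(scores)


def score_by_slice(results):
    """Mean score per slice; results is an iterable of (slice_name, score)."""
    buckets = defaultdict(list)
    for slice_name, score in results:
        buckets[slice_name].append(score)
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}


# Illustrative data only: 8 easy and 2 hard examples per run.
baseline = [("easy", 1)] * 6 + [("easy", 0)] * 2 + [("hard", 1)] * 2
candidate = [("easy", 1)] * 8 + [("hard", 1), ("hard", 0)]

print(overall(baseline), overall(candidate))
print(score_by_slice(baseline))
print(score_by_slice(candidate))
```

In this toy data the candidate looks better overall while its hard-slice score is cut in half, which the aggregate alone would never show.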

3. LLM-as-judge — make it reliable

Three failure modes:

Position bias (prefers first or second). Mitigate: swap order, average.
Verbosity bias (prefers longer). Mitigate: rubric explicit on conciseness.
Self-preference (model prefers its own outputs). Mitigate: use a different model as judge; calibrate against humans.
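The "swap order, average" mitigation for position bias can be sketched as follows. The judge signature is an assumption: any callable that takes `(question, first_answer, second_answer)` and returns `"first"`, `"second"`, or `"tie"`. The two stub judges stand in for a real model call.

```python
from typing import Callable

# Hypothetical judge interface: returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Score for answer A in [0, 1], averaged over both presentation orders."""

    def score_for_a(verdict: str, a_is_first: bool) -> float:
        if verdict == "tie":
            return 0.5
        # A wins when the judge picked the position A was shown in.
        return 1.0 if (verdict == "first") == a_is_first else 0.0

    forward = score_for_a(judge(question, answer_a, answer_b), a_is_first=True)
    swapped = score_for_a(judge(question, answer_b, answer_a), a_is_first=False)
    return (forward + swapped) / 2


def position_biased_judge(question: str, first: str, second: str) -> str:
    """Stub: always prefers whatever is shown first."""
    return "first"


def content_judge(question: str, first: str, second: str) -> str:
    """Stub: prefers the answer that mentions Paris, regardless of position."""
    return "first" if "Paris" in first else "second"


q = "What is the capital of France?"
print(debiased_preference(position_biased_judge, q, "Paris", "Lyon"))
print(debiased_preference(content_judge, q, "Paris", "Lyon"))
```

A purely position-driven judge collapses to 0.5 (no signal) after swapping, while a content-driven judge keeps its verdict, so the averaged score separates the two cases.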

Rubric structure: 1-5 scale, explicit criteria, examples of each score, JSON output. Calibrate on 20 human-labelled examples; if the judge agrees with the human labels on fewer than 70% of them, refine the rubric and re-check.
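A minimal sketch of the calibration check, under two assumptions that the text does not specify: the judge replies with JSON containing an integer `score` field, and "agreement" means an exact score match (the `tolerance` parameter relaxes this to within-N). The scores below are invented for illustration.

```python
import json


def parse_judge_score(raw: str) -> int:
    """Parse a judge reply such as {"score": 4, "reason": "..."} and validate the 1-5 scale."""
    score = json.loads(raw)["score"]
    if not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score outside the 1-5 rubric scale: {score!r}")
    return score


def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Fraction of examples where judge and human differ by at most `tolerance` points."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge and human score lists must have the same length")
    agree = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return agree / len(judge_scores)


# Illustrative labels only (10 shown here; the text suggests calibrating on 20).
judge = [5, 4, 3, 2, 1, 4, 4, 3, 5, 2]
human = [5, 4, 3, 3, 1, 4, 5, 3, 5, 1]

rate = agreement_rate(judge, human)
needs_rubric_refinement = rate < 0.70
print(rate, needs_rubric_refinement)
```

Exact match on a 5-point scale is a strict definition of agreement; whichever definition you choose, keep it fixed across rubric iterations so the numbers stay comparable.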

🛠 Deep Dive

Internals, code, architecture.

4. RAG-specific metrics (RAGAS)

Faithfulness: share of claims in the answer that are supported by the retrieved context.
Answer relevance: answer addresses the question.
Context precision: ranking quality of retrieved chunks.
Context recall: did retrieval surface all needed chunks (needs ground truth).
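The arithmetic behind three of these metrics can be sketched without any library. This is not the RAGAS API: the function names and inputs are hypothetical, and the per-claim and per-chunk verdicts are assumed to be already available (from an LLM judge or human labels).

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_relevant: list[bool]) -> float:
    """Mean precision@k over the ranks k that hold a relevant chunk.

    Rewards rankings that put relevant chunks near the top.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def context_recall(needed: set[str], retrieved: set[str]) -> float:
    """Share of ground-truth chunk ids that retrieval surfaced."""
    if not needed:
        return 1.0
    return len(needed & retrieved) / len(needed)


# Same two relevant chunks, different ranking: precision drops when they rank lower.
print(context_precision([True, False, True]))
print(context_precision([False, True, True]))
print(faithfulness([True, True, False, True]))
print(context_recall({"c1", "c2", "c3"}, {"c1", "c3", "c9"}))
```

Context recall is the only one of the three that needs ground-truth chunk ids, which matches the "needs ground truth" note above.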

5. Agent-specific

Task success rate: does final state match expected?
Trajectory quality: did it take the right steps?
Tool-call correctness: right tool, right args, on time.
Cost / steps.
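One way to score a single agent run on these dimensions is sketched below. The record shapes (`{"tool": ..., "args": ...}` dicts, a dict for final state) and the strict position-by-position matching are assumptions for illustration; a looser trajectory metric may suit agents with several valid paths.

```python
def tool_call_accuracy(expected: list[dict], actual: list[dict]) -> float:
    """Position-by-position matches on tool name and args, divided by the longer sequence.

    Dividing by the longer sequence penalizes both missing and extra calls.
    """
    if not expected and not actual:
        return 1.0
    matches = sum(
        1
        for e, a in zip(expected, actual)
        if e["tool"] == a["tool"] and e["args"] == a["args"]
    )
    return matches / max(len(expected), len(actual))


def evaluate_run(expected_state, final_state, expected_calls, actual_calls, cost_usd):
    """Collect task success, tool-call correctness, step count, and cost for one run."""
    return {
        "success": final_state == expected_state,
        "tool_call_accuracy": tool_call_accuracy(expected_calls, actual_calls),
        "steps": len(actual_calls),
        "extra_steps": max(0, len(actual_calls) - len(expected_calls)),
        "cost_usd": cost_usd,
    }


# Hypothetical run: correct outcome, but one redundant tool call.
expected_calls = [
    {"tool": "search_orders", "args": {"customer_id": 42}},
    {"tool": "refund", "args": {"order_id": 7}},
]
actual_calls = expected_calls + [{"tool": "search_orders", "args": {"customer_id": 42}}]

report = evaluate_run(
    expected_state={"order_7": "refunded"},
    final_state={"order_7": "refunded"},
    expected_calls=expected_calls,
    actual_calls=actual_calls,
    cost_usd=0.03,
)
print(report)
```

Here the task succeeds, yet the report still surfaces the wasted step, which is the kind of signal that task success rate alone hides.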

6. Public benchmarks worth knowing

MMLU / MMLU-Pro: general knowledge / reasoning.
GSM8K / MATH / AIME: math.
HumanEval / MBPP / SWE-bench: coding.
MTEB: embeddings.
MT-Bench / Arena: chat preference.

Use them for model selection, never for product readiness: your task isn't on the leaderboard.

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Online / production eval

Implicit signals: thumbs up/down, copy-rate, re-prompt rate, abandonment.
Online A/B: route 50/50, measure business metric (conversion, time-to-resolution).
Shadow eval: run new model on prod traces offline, judge, compare.
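For the online A/B case, a minimal sketch of a sticky 50/50 split and a significance check on a binary business metric. Hash-based assignment and the pooled two-proportion z-test are choices made here for illustration, not something the text prescribes; the `salt` value and the counts are made up.

```python
import hashlib
import math


def assign_arm(user_id: str, salt: str = "exp-001") -> str:
    """Deterministic ~50/50 split: the same user always lands in the same arm."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return "candidate" if digest[0] % 2 else "control"


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for rate_b - rate_a using the pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both arms all-success or all-failure: no measurable difference
    return (p_b - p_a) / se


# Illustrative counts only: conversions out of sessions per arm.
z = two_proportion_z(success_a=400, n_a=1000, success_b=450, n_b=1000)
print(assign_arm("user-123"), round(z, 2))
```

Under the usual two-sided 5% convention, |z| above roughly 1.96 counts as significant. Changing the salt per experiment reshuffles users so that arms from earlier experiments do not carry over.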

8. Regression suite in CI

Treat eval like tests. On every PR:

Run 200-example golden set.
Pairwise judge new vs prod.
Fail PR if win rate < 50% on any critical slice.
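The gate in the last step can be sketched as a small function a CI job calls after the pairwise judge has run. The verdict format, the slice names, and counting a tie as half a win are assumptions; the 50% threshold comes from the rule above.

```python
from collections import defaultdict


def win_rates(verdicts):
    """Per-slice win rate of the new model; verdicts are (slice_name, outcome) pairs.

    outcome is "new", "prod", or "tie"; a tie counts as half a win.
    """
    points = defaultdict(float)
    counts = defaultdict(int)
    for slice_name, outcome in verdicts:
        counts[slice_name] += 1
        points[slice_name] += {"new": 1.0, "tie": 0.5, "prod": 0.0}[outcome]
    return {name: points[name] / counts[name] for name in counts}


def failing_slices(verdicts, critical, threshold=0.5):
    """Critical slices below the threshold; a slice with no examples also fails."""
    rates = win_rates(verdicts)
    return sorted(name for name in critical if rates.get(name, 0.0) < threshold)


def gate(verdicts, critical) -> int:
    """Process exit code for CI: 0 passes the PR, 1 fails it."""
    return 1 if failing_slices(verdicts, critical) else 0


# Hypothetical judge output for one PR.
verdicts = (
    [("chit-chat", "new")] * 3
    + [("chit-chat", "prod")]
    + [("hard-multi-hop", "new"), ("hard-multi-hop", "prod"),
       ("hard-multi-hop", "prod"), ("hard-multi-hop", "tie")]
)
critical = {"chit-chat", "hard-multi-hop"}

print(win_rates(verdicts))
print(failing_slices(verdicts, critical), gate(verdicts, critical))
```

In a CI script the last line would typically become `raise SystemExit(gate(verdicts, critical))`, so a failing critical slice blocks the merge.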

🎥 DeepLearning.AI — Evaluating and Debugging LLM Applications
📖 RAGAS docs
📖 Anthropic — Building evals
📖 Inspect AI (UK AISI)
