ai ml · advanced · 12m · 2026-06-14

Day 16 — LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect

If you can't measure it, you can't ship it. Modern LLM eval is its own discipline — task-specific benchmarks, golden sets, LLM judges with rubrics, and slice-le…

Evaluation is the single biggest gap between "demo" and "product". A good eval harness gives you a confident "this change is +3.2 pts on hard-multi-hop and -0.5 pts on chit-chat" instead of "feels better".

🧠 Concept

Why it matters & the mental model.

1. Three eval regimes

  • Reference-based (gold answer exists): exact match, F1, BLEU, ROUGE, BERTScore. Cheap, but brittle on open-ended tasks.
  • Reference-free with judge: LLM-as-judge scores the response against a rubric (helpfulness, faithfulness, safety). Cheap to scale.
  • Human eval: 50-200 examples scored by humans, often pairwise (A vs B). Slow, but it is the ground truth that calibrates everything else.
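
To make the reference-based regime concrete, here is a minimal sketch of exact match and token-level F1. The normalization rules (lowercasing, stripping punctuation and English articles) are an assumption borrowed from common QA scoring practice, not something this post prescribes.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

The brittleness shows up quickly: `exact_match("Paris, France", "Paris")` is 0.0 even though the answer is correct, while `token_f1` gives partial credit.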

2. Build a golden set first

50-200 examples is enough to start. Stratify by:

  • Difficulty (easy / medium / hard).
  • Slice (task type, domain, length, language).
  • Failure modes you've seen in production.

Without slicing, the aggregate score averages wins on easy cases against losses on the cases that matter, so regressions on the hard slices stay hidden.
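
A small sketch of slice-level reporting, assuming each golden-set result is a dict with a `score` and some slice labels. The field names and the four toy rows are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def score_by_slice(results: list[dict], key: str) -> dict[str, float]:
    """Group per-example scores by a slice label and average each group."""
    buckets = defaultdict(list)
    for row in results:
        buckets[row[key]].append(row["score"])
    return {name: mean(scores) for name, scores in buckets.items()}


# Toy data: three easy wins hide one hard failure in the aggregate.
results = [
    {"difficulty": "easy", "score": 1.0},
    {"difficulty": "easy", "score": 1.0},
    {"difficulty": "easy", "score": 1.0},
    {"difficulty": "hard", "score": 0.0},
]

overall = mean(row["score"] for row in results)
per_slice = score_by_slice(results, "difficulty")
```

Here `overall` looks healthy at 0.75, but `per_slice` shows the hard slice at 0.0.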

3. LLM-as-judge — make it reliable

Three failure modes:

  • Position bias (prefers first or second). Mitigate: swap order, average.
  • Verbosity bias (prefers longer). Mitigate: rubric explicit on conciseness.
  • Self-preference (model prefers its own outputs). Mitigate: use a different model as judge; calibrate against humans.
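
The swap-and-average mitigation for position bias can be sketched as below. The `Judge` callable and its `"first"` / `"second"` / `"tie"` return values are a hypothetical interface; in practice it would wrap whatever model call you use.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> float:
    """Score for answer A in [0, 1], averaged over both presentation orders."""

    def score_for_a(verdict: str, a_is_first: bool) -> float:
        if verdict == "tie":
            return 0.5
        a_won = (verdict == "first") == a_is_first
        return 1.0 if a_won else 0.0

    forward = score_for_a(judge(question, answer_a, answer_b), a_is_first=True)
    swapped = score_for_a(judge(question, answer_b, answer_a), a_is_first=False)
    return (forward + swapped) / 2


def always_first(question: str, first: str, second: str) -> str:
    """A maximally position-biased stub judge, for demonstration only."""
    return "first"
```

With the `always_first` stub, the averaged score is 0.5 for any pair, so pure position bias cancels out instead of producing a false winner.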

Rubric structure: 1-5 scale, explicit criteria, examples of each score, JSON output. Calibrate on 20 human-labelled examples; if the judge agrees with the humans on fewer than 70% of them, refine the rubric.
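
A minimal sketch of the calibration check, assuming "agreement" means the judge gives exactly the same 1-5 score as the human. That definition is an assumption; a tolerance of one point or a chance-corrected statistic such as Cohen's kappa are reasonable alternatives.

```python
def agreement_rate(judge_scores: list[int], human_scores: list[int], tolerance: int = 0) -> float:
    """Fraction of examples where judge and human scores differ by at most `tolerance`."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


def needs_rubric_revision(judge_scores: list[int], human_scores: list[int],
                          threshold: float = 0.70) -> bool:
    """True when agreement falls below the threshold from the text (70%)."""
    return agreement_rate(judge_scores, human_scores) < threshold


# Toy labels, made up for illustration.
human = [5, 4, 3, 2, 1, 5, 4, 3, 2, 1]
judge = [5, 4, 3, 2, 2, 5, 4, 4, 2, 1]
```

On these toy labels the judge matches the humans on 8 of 10 examples, so `needs_rubric_revision(judge, human)` is `False`.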

🛠 Deep Dive

Internals, code, architecture.

4. RAG-specific metrics (RAGAS)

  • Faithfulness: are the claims in the answer supported by the retrieved context?
  • Answer relevance: does the answer address the question?
  • Context precision: ranking quality of the retrieved chunks (are relevant chunks ranked near the top?).
  • Context recall: did retrieval surface all the needed chunks? (Needs ground truth.)
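
The arithmetic behind three of these metrics can be sketched by hand. This does not call the RAGAS library, and the definitions are simplified assumptions: the per-claim and per-chunk verdicts are taken as given (in practice an LLM or a labeller produces them), and the library's exact formulas may differ in detail. Answer relevance is omitted because it needs an embedding model or a judge.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_recall(retrieved_ids: list[str], needed_ids: list[str]) -> float:
    """Share of ground-truth chunks that retrieval actually surfaced."""
    needed = set(needed_ids)
    if not needed:
        return 1.0
    return len(needed & set(retrieved_ids)) / len(needed)


def context_precision(relevance_by_rank: list[bool]) -> float:
    """Average precision over ranks: rewards relevant chunks near the top."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(relevance_by_rank, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

For example, the same two relevant chunks score higher when ranked `[True, False, True]` than when ranked `[False, True, True]`, which is the "ranking quality" part of context precision.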

5. Agent-specific

  • Task success rate: does final state match expected?
  • Trajectory quality: did it take the right steps?
  • Tool-call correctness: right tool, right args, on time.
  • Cost / steps.
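
One way to score a single agent run against these criteria is sketched below. The `ToolCall` type, the `call` helper, and the position-by-position matching rule are all illustrative assumptions; real trajectories often need looser matching (order-insensitive, or tolerant of extra steps).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: tuple  # sorted (key, value) pairs so calls compare by value


def call(name: str, **args) -> ToolCall:
    """Convenience constructor that normalizes argument order."""
    return ToolCall(name, tuple(sorted(args.items())))


def tool_call_accuracy(expected: list[ToolCall], actual: list[ToolCall]) -> float:
    """Fraction of expected calls matched at the same position (tool and args)."""
    if not expected:
        return 1.0
    matched = sum(e == a for e, a in zip(expected, actual))
    return matched / len(expected)


def evaluate_run(expected_calls, actual_calls, expected_state, final_state) -> dict:
    """Combine task success, tool-call correctness, and step count for one run."""
    return {
        "task_success": final_state == expected_state,
        "tool_call_accuracy": tool_call_accuracy(expected_calls, actual_calls),
        "steps": len(actual_calls),
    }
```

A run can succeed on the final state while still wasting steps or calling a tool with wrong arguments along the way, which is why the three numbers are reported separately.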

6. Public benchmarks worth knowing

  • MMLU / MMLU-Pro: general knowledge / reasoning.
  • GSM8K / MATH / AIME: math.
  • HumanEval / MBPP / SWE-bench: coding.
  • MTEB: embeddings.
  • MT-Bench / Arena: chat preference.

Use them for model selection, never for product readiness: your task isn't on the leaderboard.

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Online / production eval

  • Implicit signals: thumbs up/down, copy-rate, re-prompt rate, abandonment.
  • Online A/B: route 50/50, measure business metric (conversion, time-to-resolution).
  • Shadow eval: run new model on prod traces offline, judge, compare.
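
For the 50/50 routing in an online A/B test, a common approach is deterministic hash bucketing, so the same user always lands in the same arm. The sketch below assumes string user IDs and an experiment name used as a salt; both names are hypothetical.

```python
import hashlib


def assign_arm(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # First 8 bytes as an integer, scaled to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"
```

Salting with the experiment name keeps assignments independent across experiments, so a user in the treatment arm of one test is not automatically in the treatment arm of the next.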

8. Regression suite in CI

Treat eval like tests. On every PR:

  1. Run 200-example golden set.
  2. Pairwise judge new vs prod.
  3. Fail PR if win rate < 50% on any critical slice.
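
The gate in step 3 can be sketched as follows. Counting a tie as half a win and treating a critical slice with no examples as failing are assumptions; the slice names in the usage note are invented.

```python
from collections import defaultdict

OUTCOME_POINTS = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rate_by_slice(verdicts: list[tuple[str, str]]) -> dict[str, float]:
    """verdicts holds (slice_name, outcome) pairs for new vs prod."""
    points = defaultdict(list)
    for slice_name, outcome in verdicts:
        points[slice_name].append(OUTCOME_POINTS[outcome])
    return {name: sum(p) / len(p) for name, p in points.items()}


def failing_slices(verdicts: list[tuple[str, str]], critical_slices: set[str],
                   min_win_rate: float = 0.5) -> list[str]:
    """Critical slices whose win rate is below the threshold (missing counts as failing)."""
    rates = win_rate_by_slice(verdicts)
    return sorted(s for s in critical_slices if rates.get(s, 0.0) < min_win_rate)
```

A CI script would call `failing_slices(verdicts, {"multi_hop", "chit_chat"})` and exit non-zero if the returned list is not empty, printing the slice names so the PR author knows where the regression is.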

9. The eval flywheel

Production traces → label failures → add to golden set → fix → re-eval. Once the obvious gains are taken, this loop is where most of the wins come from.

10. What to take away

"How do you know your LLM app is working?" Strong answers: stratified golden set, mix of metric types, judge with calibration, slice reporting, regression in CI, online experiment as final word.
