ai ml · intermediate · 12m · 2026-06-09

LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect

Session 25 of the 48-session learning series.

Date: Sun, 2026-06-28 · Time: 14:30–16:30 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 16 · Est. read: 2 h

Why this session matters

Evaluation is the unsexy half of LLM work, and the half that decides whether your shiny demo survives contact with real users. You can't ship what you can't measure.

Agenda

  • Why evaluating LLMs is genuinely harder than classical ML
  • Static benchmarks — MMLU, HumanEval, GSM8K — and their limits
  • LLM-as-judge — pairwise, score-based, the bias gotchas
  • RAGAS, TruLens — RAG-specific eval frameworks
  • Inspect, OpenAI Evals — building your own eval harness

Deep dive

1. Why LLM eval is hard

Classical ML: one label, one prediction, clean metric (accuracy, AUC). LLMs:

  • Open-ended output — no single right answer. "Summarise this" has 1000 acceptable answers.
  • Multi-dimensional quality — factuality, fluency, safety, tone, helpfulness; can trade off.
  • Reference-free settings — you often don't have a ground truth ("write a marketing email").
  • Long-tail failures — the model is fine 99% of the time and catastrophically wrong on edge cases.
  • Contamination — test set leaked into pre-training corpus → benchmark inflated.

You can't compute "accuracy" against a list of strings and call it a day.
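
To make that concrete, here is a small sketch (the example strings are made up for illustration) in which exact-match accuracy marks a perfectly acceptable paraphrase as wrong.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after light normalisation."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(predictions)


# Illustrative data only: a human reader would accept the prediction.
references = ["Paris is the capital of France."]
predictions = ["The capital of France is Paris."]

score = exact_match_accuracy(predictions, references)
print(score)  # 0.0, even though the answer is correct
```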

2. The 4 layers of eval

[ Unit-level ]      → does the prompt produce well-formed JSON?
[ Behavioural ]     → does it refuse the jailbreak? answer the maths question?
[ Capability ]      → MMLU, HumanEval, MATH on held-out data
[ User-outcome ]    → did the user accept the suggestion? task success?

The bottom two cost the most and matter the most. Most teams over-invest in capability benchmarks and under-invest in user-outcome metrics.
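
The unit-level layer is the cheapest to automate. A minimal sketch of a "is this well-formed JSON?" check follows; the function name and the idea of a required-keys set are assumptions for illustration, not part of any framework.

```python
import json


def is_well_formed(raw_output: str, required_keys: set[str]) -> bool:
    """True if raw_output parses as a JSON object that contains every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    # A bare list or number is valid JSON but not the object we asked for.
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())
```

A check like this can run on every prompt change without calling a judge model at all.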

3. Static benchmarks — useful, but...

Benchmark  | Tests                           | Gotcha
MMLU       | 57-subject multiple choice      | Contamination; rote knowledge
HumanEval  | Python function from docstring  | Tiny (164 tasks); plateaued
MATH       | Competition maths               | Reasoning + arithmetic mix
GSM8K      | Grade-school word problems      | Largely solved; check GSM-Symbolic
MT-Bench   | Multi-turn chat, LLM-judged     | Judge bias; small
HELM       | Broad suite                     | Heavy, dated; good audit trail
BBH        | "Hard" sub-tasks of BIG-Bench   | Mixed quality
ARC-AGI    | Visual puzzles                  | The reasoning bar; expensive to run

Rule: use benchmarks to exclude models, not to pick them. If your candidate scores below 60% on MMLU, drop it. Above some threshold, benchmarks stop correlating with what you actually need.
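
The exclusion rule can be sketched as a simple floor filter. The model names and scores below are invented for illustration, and the 60% MMLU floor is just the example threshold from the rule.

```python
def passes_floor(scores: dict[str, float], floors: dict[str, float]) -> bool:
    """A model survives only if it meets every floor; a missing score counts as a fail."""
    return all(
        scores.get(name, float("-inf")) >= floor
        for name, floor in floors.items()
    )


# Hypothetical candidates and scores, not real benchmark results.
candidates = {
    "model-a": {"mmlu": 0.72, "gsm8k": 0.88},
    "model-b": {"mmlu": 0.55, "gsm8k": 0.91},
}
floors = {"mmlu": 0.60}

shortlist = sorted(name for name, s in candidates.items() if passes_floor(s, floors))
print(shortlist)  # ['model-a']
```

Note that the filter only produces a shortlist; it deliberately does not rank the survivors.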

4. LLM-as-Judge — the workhorse

Use a strong model (GPT-4-class) to evaluate outputs of another model. Three flavours:

  • Pairwise — show judge two outputs A and B, ask which is better. Most reliable.
  • Scalar — rate output 1–10 on dimension X. Easy but noisy.
  • Rubric-based — multi-criteria scoring against a written rubric. Best for production.

Pairwise example:

You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {"winner": "A" | "B" | "tie", "reason": "..."}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}
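
A sketch of filling in that prompt and parsing the verdict. Because the prompt contains literal JSON braces, str.format would raise a KeyError on it, so this version swaps the {placeholders} for string.Template's $placeholders. The helper names are assumptions, and treating an unparseable reply as a tie is one design choice among several (raising an error is another).

```python
import json
from string import Template

# Same wording as the prompt above; $placeholders do not clash with the JSON braces.
JUDGE_TEMPLATE = Template(
    "You are a judge. Compare two assistant responses to the same prompt.\n"
    "Decide which is better on: helpfulness, accuracy, conciseness.\n"
    'Output JSON: {"winner": "A" | "B" | "tie", "reason": "..."}\n'
    "Prompt: $user_prompt\n"
    "Response A: $output_a\n"
    "Response B: $output_b"
)


def build_judge_prompt(user_prompt: str, output_a: str, output_b: str) -> str:
    """Fill the judge template; substituted values are inserted verbatim."""
    return JUDGE_TEMPLATE.substitute(
        user_prompt=user_prompt, output_a=output_a, output_b=output_b
    )


def parse_verdict(raw: str) -> str:
    """Return 'A', 'B', or 'tie'; anything unparseable is treated as a tie."""
    try:
        winner = json.loads(raw).get("winner")
    except (json.JSONDecodeError, AttributeError):
        return "tie"
    return winner if winner in ("A", "B", "tie") else "tie"
```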

5. Judge biases (this will burn you)

  • Position bias — first option wins more often. Mitigation: swap order, average.
  • Length bias — longer answers preferred even when worse. Mitigation: length-controlled rubric.
  • Self-preference — model judges its own family higher. Mitigation: use a different model family as judge.
  • Verbosity bias — flowery > terse. Mitigation: explicit rubric criterion.
  • Authority bias — confident wrong > tentative correct. Mitigation: factuality sub-score.

Validate the judge against ~200 human-labelled pairs before you trust it for thousands of evals.
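
A minimal sketch of the position-bias mitigation (swap order, average) plus a plain agreement rate for the human-label validation. The judge here is a hypothetical callable that returns "A" when the first response shown wins; plug in whatever client you actually use.

```python
from typing import Callable

# Hypothetical signature: (prompt, first_response, second_response) -> "A" | "B" | "tie",
# where "A" means the response shown first won.
Judge = Callable[[str, str, str], str]


def judge_with_swap(judge: Judge, prompt: str, out_a: str, out_b: str) -> str:
    """Judge in both orders and average the two verdicts from out_a's point of view."""
    to_score = {"A": 1.0, "B": -1.0}  # ties and unknown labels score 0.0
    forward = to_score.get(judge(prompt, out_a, out_b), 0.0)
    # In the swapped run out_b is shown first, so negate to get back to out_a's view.
    backward = -to_score.get(judge(prompt, out_b, out_a), 0.0)
    mean = (forward + backward) / 2
    if mean > 0:
        return "A"
    if mean < 0:
        return "B"
    return "tie"


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge label matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```

A judge that always picks whatever is shown first averages out to a tie under this scheme, which is exactly the behaviour you want from the mitigation.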

6. RAGAS — RAG-specific metrics

For retrieval-augmented systems:

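Metric              | What it measures
faithfulness        | Are all claims in the answer supported by the context?
answer_relevancy    | Does the answer actually address the question?
context_precision   | Are the retrieved chunks actually useful?
context_recall      | Did retrieval find all the needed info?
answer_correctness  | Does the answer match ground truth (when available)?

# Assumes eval_dataset is already built (e.g. a Hugging Face datasets.Dataset with
# question, answer and contexts columns; exact column names depend on the ragas version)
# and that an API key for the judge LLM is configured.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy],
)
print(result)  # aggregate score per metric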

These give you a numeric, comparable score per retrieval/generation strategy. Crucial for the eval loop in RAG (see S13).
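
As rough intuition for what faithfulness measures, think of it as supported claims divided by total claims. The sketch below is not the RAGAS implementation: as far as I know RAGAS uses an LLM to extract and verify the claims, whereas here the claims are passed in pre-split and a naive substring match stands in for the checker. All names are hypothetical.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Supported claims divided by total claims; 0.0 when there are no claims."""
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, context) for claim in claims)
    return supported / len(claims)


def naive_support(claim: str, context: str) -> bool:
    # Stand-in only: a real checker would use an LLM or an NLI model.
    return claim.lower() in context.lower()


context = "The Eiffel Tower is in Paris. It was completed in 1889."
claims = ["the eiffel tower is in paris", "it was completed in 1900"]
print(faithfulness_score(claims, context, naive_support))  # 0.5
```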

7. TruLens, DeepEval, Inspect

  • TruLens — instrumented runtime; collects traces; defines "feedback functions" (RAGAS-ish).
  • DeepEval — pytest-style assertions for LLM output: assert_test() with metrics such as AnswerRelevancyMetric and HallucinationMetric.
  • Inspect (UK AISI) — capability + safety eval framework, used for frontier-model red-teaming. Plugin model.
  • OpenAI Evals — open-source harness, plug-in eval types.
  • Promptfoo — YAML-driven A/B prompt eval; great for prompt regression.

8. Building your own eval set

Don't skip this. A 100-example, hand-crafted eval set tailored to your product beats any public benchmark.

Process:

  1. Mine real user prompts (anonymise!).
  2. Bucket by intent (10–15 buckets is enough).
  3. For each bucket, write 5–10 gold-standard answers (or rubrics).
  4. Add adversarial examples — your top 20 known failure modes.
  5. Version the set; checksum it; track it like a dataset, not a doc.

Re-run the whole set on every model/prompt change. Track scores in MLflow.
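
One way to "checksum it" is a SHA-256 over a canonical JSON encoding, stored next to the version number. This is a sketch under assumptions: the example record fields and the manifest layout are invented for illustration.

```python
import hashlib
import json


def eval_set_checksum(examples: list[dict]) -> str:
    """SHA-256 over canonical JSON, so key order and whitespace do not change the hash."""
    canonical = json.dumps(
        examples, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical example record and manifest.
examples = [
    {
        "id": "refund-001",
        "bucket": "refunds",
        "prompt": "How do I get a refund?",
        "rubric": "Mentions the 30-day window.",
    },
]
manifest = {
    "version": "1.0.0",
    "n_examples": len(examples),
    "sha256": eval_set_checksum(examples),
}
```

The order of examples in the list still affects the hash, so sort by id before hashing if reordering should not count as a change.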

9. Online vs offline eval

  • Offline — fixed dataset, runs on every release. Fast, repeatable, but stale.
  • Online — production traffic, label by user reaction (👍/👎, accept-rate, edit-distance to final answer). Slow, biased by current cohort, but real.

Both. Offline catches regressions before ship; online catches reality drift.
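
The online signals can be computed from logged events with the standard library alone. In this sketch the event shape (a boolean "accepted" field) is an assumption, and difflib's similarity ratio is used as a stand-in for a true edit distance.

```python
from difflib import SequenceMatcher


def accept_rate(events: list[dict]) -> float:
    """Share of suggestions the user accepted; each event has a boolean 'accepted'."""
    if not events:
        return 0.0
    return sum(bool(e["accepted"]) for e in events) / len(events)


def edit_similarity(model_output: str, final_text: str) -> float:
    """1.0 means the user kept the output unchanged; lower means heavier edits."""
    return SequenceMatcher(None, model_output, final_text).ratio()
```

A falling accept rate or a falling edit similarity on stable traffic is a hint that reality has drifted away from the offline set.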

10. The eval flywheel

[ Real user fails ] → [ Add to eval set ] → [ Run all candidates ] → [ Pick winner ] → [ Ship ]
        ▲                                                                                   │
        └───────────────────────────────────────────────────────────────────────────────────┘

Every bug becomes a permanent test. After 6 months you have 2000 examples that capture exactly what your product needs. That set is your moat.
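
The "add to eval set" step can be as small as the sketch below. The in-memory dict, the field names and the normalise-then-hash dedup rule are all assumptions; a real set would live in a versioned file.

```python
import hashlib


def add_failure(eval_set: dict[str, dict], prompt: str, expected: str, source: str) -> bool:
    """Add a failing production case as a permanent test; False if it is already there."""
    # Dedupe on a normalised prompt so the same bug is not added twice.
    key = hashlib.sha256(prompt.strip().lower().encode("utf-8")).hexdigest()
    if key in eval_set:
        return False
    eval_set[key] = {"prompt": prompt, "expected": expected, "source": source}
    return True
```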

11. Cost-aware eval

LLM-as-judge is expensive. Budgeting tips:

  • Sample, don't run-all-on-everything.
  • Cache judge results keyed on (prompt, output_a, output_b, judge_model).
  • Use a cheaper "screening" judge → escalate ambiguous cases to a stronger judge.
  • Score per-dollar: improvements aren't free; track quality_delta / cost_delta.
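
Caching and screening-then-escalating can be sketched together. This assumes a hypothetical judge callable that returns a (winner, confidence) pair; real judges do not report confidence out of the box, so you would need to derive it (for example from a self-reported score or from disagreement across order swaps).

```python
import hashlib
import json
from typing import Callable

# Hypothetical judge: (prompt, output_a, output_b) -> (winner, confidence in [0, 1]).
Judge = Callable[[str, str, str], tuple[str, float]]


def cache_key(prompt: str, output_a: str, output_b: str, judge_model: str) -> str:
    """Stable key over (prompt, output_a, output_b, judge_model)."""
    payload = json.dumps([prompt, output_a, output_b, judge_model], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached(judge: Judge, judge_model: str, cache: dict[str, tuple[str, float]]) -> Judge:
    """Wrap a judge so repeated identical calls are served from the cache."""
    def wrapper(prompt: str, output_a: str, output_b: str) -> tuple[str, float]:
        key = cache_key(prompt, output_a, output_b, judge_model)
        if key not in cache:
            cache[key] = judge(prompt, output_a, output_b)
        return cache[key]
    return wrapper


def cascade(cheap: Judge, strong: Judge, min_confidence: float = 0.8) -> Callable[[str, str, str], str]:
    """Ask the cheap judge first; escalate to the strong judge when it is unsure."""
    def run(prompt: str, output_a: str, output_b: str) -> str:
        winner, confidence = cheap(prompt, output_a, output_b)
        if confidence >= min_confidence:
            return winner
        return strong(prompt, output_a, output_b)[0]
    return run
```

The 0.8 threshold is arbitrary; tune it against your human-labelled pairs so that escalation actually buys accuracy.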

12. Reality check

A pragmatic minimum eval stack for a startup:

  • 100 hand-crafted prompts with rubrics (versioned in git).
  • A CI job that runs them on every PR, using promptfoo or a homemade pytest harness.
  • Pairwise LLM-judge with order-swap, against GPT-4o.
  • Production logging of (input, output, model_version, user_feedback).
  • Weekly review of bottom-decile responses → add to eval set.

This is enough. Buy the platform when you outgrow it, not before.
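
The weekly bottom-decile review needs nothing more than a sort over the production log. In this sketch the "score" field is an assumption (it could be a judge score or a mapped user_feedback value).

```python
def bottom_decile(records: list[dict], score_key: str = "score") -> list[dict]:
    """Return the lowest-scoring 10% of records (at least one when any exist)."""
    if not records:
        return []
    n = max(1, len(records) // 10)
    return sorted(records, key=lambda r: r[score_key])[:n]
```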

Video reference

▶︎ LLM Evaluation Explained (Hamel Husain)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Evaluate Division

  • Link: https://leetcode.com/problems/evaluate-division/
  • Difficulty: Medium
  • Why this problem: Build a graph of equations and answer arbitrary queries, much like evaluating one model output against a chain of judge criteria.
  • Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

  • Explain why open-ended LLM output makes eval qualitatively harder than classical ML.
  • List the 4 layers of eval (unit, behavioural, capability, user-outcome).
  • Design a pairwise LLM-judge prompt with bias mitigations.
  • Pick the right RAGAS metric (faithfulness vs context_precision vs answer_relevancy) for a given failure.
  • Sketch the 5-step "eval flywheel" loop.
  • Solve evaluate-division — graph traversal of weighted edges; mirrors chained eval criteria.
