ai · ml · intermediate · 12m · 2026-06-09

LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect

Session 25 of the 48-session learning series.

Date: Sun, 2026-06-28 · Time: 14:30–16:30 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 16 · Est. read: 2 h

Why this session matters

This is Session 25 of 48 in the LLM track. Evaluation is the unsexy half of LLM work — and the half that decides whether your shiny demo survives contact with real users. You can't ship what you can't measure.

Agenda

Why evaluating LLMs is genuinely harder than classical ML
Static benchmarks — MMLU, HumanEval, GSM8K — and their limits
LLM-as-judge — pairwise, score-based, the bias gotchas
RAGAS, TruLens — RAG-specific eval frameworks
Inspect, OpenAI Evals — building your own eval harness

Deep dive

1. Why LLM eval is hard

Classical ML: one label, one prediction, clean metric (accuracy, AUC). LLMs:

Open-ended output — no single right answer. "Summarise this" has 1000 acceptable answers.
Multi-dimensional quality — factuality, fluency, safety, tone, helpfulness; can trade off.
Reference-free settings — you often don't have a ground truth ("write a marketing email").
Long-tail failures — the model is fine 99% of the time and catastrophically wrong on edge cases.
Contamination — test set leaked into pre-training corpus → benchmark inflated.

You can't compute "accuracy" against a list of strings and call it a day.
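To make the point concrete, here is a minimal sketch of what goes wrong with string matching. The function name and the example sentences are made up for illustration; nothing here comes from a real benchmark.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference (case and outer whitespace ignored)."""
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Illustrative example: the prediction is correct but worded differently.
references = ["Paris is the capital of France."]
predictions = ["The capital of France is Paris."]

score = exact_match_accuracy(predictions, references)
print(score)  # a perfectly acceptable answer still counts as a miss
```

The metric is doing exactly what it was told; it is the wrong tool for open-ended output.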

2. The 4 layers of eval

[ Unit-level ]      → does the prompt produce well-formed JSON?
[ Behavioural ]     → does it refuse the jailbreak? answer the maths question?
[ Capability ]      → MMLU, HumanEval, MATH on held-out data
[ User-outcome ]    → did the user accept the suggestion? task success?

The bottom two cost the most and matter the most. Most teams over-invest in capability benchmarks and under-invest in user-outcome metrics.
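The unit-level layer is the cheapest one to automate. A minimal sketch of a well-formedness check, assuming the prompt is supposed to return a JSON object; the required keys ("answer", "sources") are a hypothetical schema, not something defined in this session.

```python
import json

# Hypothetical schema for illustration only.
REQUIRED_KEYS = {"answer", "sources"}


def is_well_formed(raw, required_keys=REQUIRED_KEYS):
    """True if raw parses as a JSON object that contains every required key."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


print(is_well_formed('{"answer": "42", "sources": []}'))
print(is_well_formed('{"answer": "42"'))  # truncated output
```

A check like this runs in milliseconds, so it can gate every prompt change before any of the slower layers run.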

3. Static benchmarks — useful, but...

Benchmark    Tests                             Gotcha
MMLU         57-subject multiple choice        Contamination; rote knowledge
HumanEval    Python function from docstring    Tiny (164 tasks); plateaued
MATH         Competition maths                 Reasoning + arithmetic mix
GSM8K        Grade-school word problems        Largely solved; check `GSM-Symbolic`
MT-Bench     Multi-turn chat, LLM-judged       Judge bias; small
HELM         Broad suite                       Heavy, dated; good audit trail
BBH          "Hard" sub-tasks of BIG-Bench     Mixed quality
ARC-AGI      Visual puzzles                    The reasoning bar; expensive to run

Rule: use benchmarks to exclude models, not to pick them. If your candidate scores below 60% on MMLU, drop it. Above some threshold, benchmarks stop correlating with what you actually need.
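The exclusion rule can be sketched as a simple floor filter. The 60% MMLU floor is the one quoted above; the model names and scores are placeholders invented for the example, not real results.

```python
# Floors below which a candidate is dropped (the MMLU value is from the rule above).
MIN_SCORES = {"mmlu": 0.60}


def shortlist(candidates, min_scores):
    """Keep candidates that clear every floor; a missing score counts as a fail."""
    kept = []
    for name, scores in candidates.items():
        if all(scores.get(bench, 0.0) >= floor for bench, floor in min_scores.items()):
            kept.append(name)
    return kept


# Placeholder numbers for illustration only.
candidates = {
    "model-a": {"mmlu": 0.71},
    "model-b": {"mmlu": 0.55},
    "model-c": {"mmlu": 0.83},
}
survivors = shortlist(candidates, MIN_SCORES)
print(survivors)
```

Note that the filter returns a set of survivors, not a ranking. Choosing among them is the job of your own eval set.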

4. LLM-as-Judge — the workhorse

Use a strong model (GPT-4-class) to evaluate outputs of another model. Three flavours:

Pairwise — show judge two outputs A and B, ask which is better. Most reliable.
Scalar — rate output 1–10 on dimension X. Easy but noisy.
Rubric-based — multi-criteria scoring against a written rubric. Best for production.

Pairwise example:

You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {"winner": "A" | "B" | "tie", "reason": "..."}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}
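A minimal sketch of wrapping that prompt in code: fill the template, then parse the judge's reply defensively. The helper names (build_judge_prompt, parse_verdict) are hypothetical, and the actual call to the judge model is left out because it depends on your provider.

```python
import json

# Same prompt as above; the doubled braces keep the literal JSON example intact.
JUDGE_TEMPLATE = """You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {{"winner": "A" | "B" | "tie", "reason": "..."}}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}"""


def build_judge_prompt(user_prompt, output_a, output_b):
    return JUDGE_TEMPLATE.format(
        user_prompt=user_prompt, output_a=output_a, output_b=output_b
    )


def parse_verdict(raw):
    """Return 'A', 'B' or 'tie'; raise ValueError on anything else."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {raw!r}") from exc
    winner = verdict.get("winner") if isinstance(verdict, dict) else None
    if winner not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected winner: {winner!r}")
    return winner


print(build_judge_prompt("What is 2+2?", "4", "Four, probably."))
```

Failing loudly on a malformed verdict matters: silently treating garbage as a tie hides judge failures inside your scores.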

5. Judge biases (this will burn you)

Position bias — first option wins more often. Mitigation: swap order, average.
Length bias — longer answers preferred even when worse. Mitigation: length-controlled rubric.
Self-preference — model judges its own family higher. Mitigation: use a different model family as judge.
Verbosity bias — flowery > terse. Mitigation: explicit rubric criterion.
Authority bias — confident wrong > tentative correct. Mitigation: factuality sub-score.
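The position-bias mitigation (swap order, average) can be sketched as below. It assumes a judge callable that takes (prompt, first, second) and returns "A" when the first slot wins, "B" when the second wins, or "tie"; that signature and the two toy judges are assumptions made for the example.

```python
# Score from the point of view of the original output_a.
SCORE_FOR_A = {"A": 1.0, "tie": 0.5, "B": 0.0}
# After swapping the slots, a win for "A" is really a win for the original B.
UNSWAP = {"A": "B", "B": "A", "tie": "tie"}


def swapped_score(judge, user_prompt, output_a, output_b):
    """Average the verdicts from both orderings; 1.0 means output_a wins twice."""
    forward = judge(user_prompt, output_a, output_b)
    backward = UNSWAP[judge(user_prompt, output_b, output_a)]
    return (SCORE_FOR_A[forward] + SCORE_FOR_A[backward]) / 2


def first_slot_judge(user_prompt, first, second):
    # Toy judge with pure position bias: whatever is shown first wins.
    return "A"


def content_judge(user_prompt, first, second):
    # Toy judge that looks at content only.
    return "A" if "42" in first else "B"


print(swapped_score(first_slot_judge, "q", "x", "y"))
print(swapped_score(content_judge, "q", "42", "no idea"))
```

A purely position-biased judge collapses to 0.5, which is the behaviour you want: the bias cancels out instead of crowning a winner.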

Validate the judge against ~200 human-labelled pairs before you trust it for thousands of evals.
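One way to run that validation is to compare judge verdicts with human labels on the same pairs. The sketch below reports raw agreement plus Cohen's kappa; kappa is an addition here (a standard chance-corrected agreement statistic), not something the session prescribes, and the four labels are toy data.

```python
from collections import Counter


def agreement_rate(judge_labels, human_labels):
    """Fraction of pairs where judge and human picked the same label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for what two raters would agree on by chance."""
    n = len(human_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    expected = sum(judge_freq[label] * human_freq[label] for label in judge_freq) / (n * n)
    if expected == 1.0:  # both raters always give the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels for illustration; a real check would use the ~200 human-labelled pairs.
human = ["A", "A", "B", "B"]
judge = ["A", "A", "B", "A"]
print(agreement_rate(judge, human), cohens_kappa(judge, human))
```

Raw agreement looks flattering when one label dominates, which is why a chance-corrected number is worth reporting next to it.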

6. RAGAS — RAG-specific metrics

For retrieval-augmented systems:

Metric                  What it measures
`faithfulness`          Are all claims in answer supported by context?
`answer_relevancy`      Does the answer actually address the question?
`context_precision`     Are retrieved chunks actually useful?
`context_recall`        Did retrieval find all needed info?
`answer_correctness`    Does answer match ground truth (when available)?

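# eval_dataset: a Hugging Face datasets.Dataset. In ragas 0.1-style releases it
# carries "question", "answer" and "contexts" columns (names vary by version).
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy],
)
print(result)  # aggregate score per metric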

These give you a numeric, comparable score per retrieval/generation strategy. Crucial for the eval loop in RAG (see S13).

7. TruLens, DeepEval, Inspect

TruLens — instrumented runtime; collects traces; defines "feedback functions" (RAGAS-ish).
DeepEval — pytest-style assertions for LLM output: assert_test(test_case, [metric]) with metrics such as AnswerRelevancyMetric and HallucinationMetric.
Inspect (UK AISI) — capability + safety eval framework, used for frontier-model red-teaming. Plugin model.
OpenAI Evals — open-source harness, plug-in eval types.
Promptfoo — YAML-driven A/B prompt eval; great for prompt regression.

8. Building your own eval set

Don't skip this. A 100-example, hand-crafted eval set tailored to your product beats any public benchmark.

Process:

Mine real user prompts (anonymise!).
Bucket by intent (10–15 buckets is enough).
For each bucket, write 5–10 gold-standard answers (or rubrics).
Add adversarial examples — your top 20 known failure modes.
Version the set; checksum it; track it like a dataset, not a doc.

Re-run the whole set on every model/prompt change. Track scores in MLflow.
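The "version the set; checksum it" step can be as small as the sketch below: hash a canonical JSON serialisation so that any edit to any example changes the fingerprint. The example records and field names are hypothetical.

```python
import hashlib
import json


def eval_set_checksum(examples):
    """SHA-256 over a canonical JSON dump of the examples (list order matters)."""
    canonical = json.dumps(
        examples, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical records for illustration.
examples = [
    {"id": "refund-001", "bucket": "refunds", "prompt": "Where is my refund?"},
    {"id": "login-001", "bucket": "login", "prompt": "I can't log in."},
]
print(eval_set_checksum(examples))
```

Log the checksum next to every score. Two runs are only comparable if they ran against the same fingerprint.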

9. Online vs offline eval

Offline — fixed dataset, runs on every release. Fast, repeatable, but stale.
Online — production traffic, label by user reaction (👍/👎, accept-rate, edit-distance to final answer). Slow, biased by current cohort, but real.

Both. Offline catches regressions before ship; online catches reality drift.
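Two of the online signals mentioned above can be computed from production logs with the standard library alone. This is a sketch under assumptions: the event shape ({"accepted": bool}) is hypothetical, and difflib's similarity ratio stands in for a true edit distance.

```python
from difflib import SequenceMatcher


def accept_rate(events):
    """Share of logged suggestions the user accepted."""
    return sum(1 for e in events if e["accepted"]) / len(events)


def edit_fraction(model_output, final_text):
    """0.0 = user kept the output verbatim, 1.0 = nothing in common.

    Uses difflib's similarity ratio as a stand-in for edit distance.
    """
    return 1.0 - SequenceMatcher(None, model_output, final_text).ratio()


# Hypothetical log records for illustration.
events = [{"accepted": True}, {"accepted": False}, {"accepted": True}, {"accepted": True}]
print(accept_rate(events))
print(edit_fraction("Thanks for reaching out.", "Thanks for reaching out!"))
```

An accepted suggestion that the user then rewrites heavily is a weak win, so the two numbers are best read together.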

10. The eval flywheel

[ Real user fails ] → [ Add to eval set ] → [ Run all candidates ] → [ Pick winner ] → [ Ship ]
        ▲                                                                                   │
        └───────────────────────────────────────────────────────────────────────────────────┘

Every bug becomes a permanent test. After 6 months you have 2000 examples that capture exactly what your product needs. That set is your moat.
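One turn of the loop can be sketched in a few lines. Everything here is a toy: the candidates are plain callables from prompt to answer, and exact match stands in for whatever rubric or judge you actually use.

```python
def flywheel_step(eval_set, new_failures, candidates, score):
    """One turn of the loop: grow the set, score every candidate, pick the best."""
    # Every new failure becomes a permanent test (exact duplicates are skipped).
    grown = eval_set + [ex for ex in new_failures if ex not in eval_set]
    results = {
        name: sum(score(model, ex) for ex in grown) / len(grown)
        for name, model in candidates.items()
    }
    winner = max(results, key=results.get)
    return grown, results, winner


def exact_match(model, example):
    # Toy scorer; a real one would be a rubric or an LLM judge.
    return 1.0 if model(example["prompt"]) == example["expected"] else 0.0


eval_set = [{"prompt": "2+2", "expected": "4"}]
new_failures = [{"prompt": "3+3", "expected": "6"}]  # a failure seen in production
lookup = {"2+2": "4", "3+3": "6"}
candidates = {
    "v1": lambda prompt: "4",                     # current version, fails the new case
    "v2": lambda prompt: lookup.get(prompt, ""),  # candidate fix
}
grown, results, winner = flywheel_step(eval_set, new_failures, candidates, exact_match)
print(results, winner)
```

The important property is that the set only grows: a fixed bug stays in the set as a regression test.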

11. Cost-aware eval

LLM-as-judge is expensive. Budgeting tips:

Sample, don't run-all-on-everything.
Cache judge results keyed on (prompt, output_a, output_b, judge_model).
Use a cheaper "screening" judge → escalate ambiguous cases to a stronger judge.
Score per-dollar: improvements aren't free; track quality_delta / cost_delta.
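The caching and screening tips combine naturally. In the sketch below both judges are assumed to return a (verdict, confidence) pair; that return shape, the 0.8 threshold, and the in-memory dict cache are all assumptions for illustration, and a real setup would persist the cache.

```python
def make_cached_judge(judge_fn, judge_model):
    """Wrap a judge so repeated (prompt, output_a, output_b, judge_model) calls are free."""
    cache = {}

    def cached(prompt, output_a, output_b):
        key = (prompt, output_a, output_b, judge_model)
        if key not in cache:
            cache[key] = judge_fn(prompt, output_a, output_b)
        return cache[key]

    return cached


def cascade(cheap_judge, strong_judge, prompt, output_a, output_b, min_confidence=0.8):
    """Use the cheap judge first; escalate only when it is unsure."""
    verdict, confidence = cheap_judge(prompt, output_a, output_b)
    if confidence >= min_confidence:
        return verdict, "cheap"
    verdict, _ = strong_judge(prompt, output_a, output_b)
    return verdict, "strong"


# Toy judges for illustration; real ones would call a model.
def toy_cheap(prompt, output_a, output_b):
    return ("A", 0.95) if len(output_a) != len(output_b) else ("tie", 0.4)


def toy_strong(prompt, output_a, output_b):
    return ("B", 0.99)


cheap = make_cached_judge(toy_cheap, "cheap-model")
print(cascade(cheap, toy_strong, "q", "long answer", "short"))
print(cascade(cheap, toy_strong, "q", "same", "size"))
```

Including judge_model in the cache key means swapping judges never serves stale verdicts.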

12. Reality check

A pragmatic minimum eval stack for a startup:

100 hand-crafted prompts with rubrics (versioned in git).
A pytest job that runs them on every PR, using promptfoo or a homemade harness.
Pairwise LLM-judge with order-swap, against GPT-4o.
Production logging of (input, output, model_version, user_feedback).
Weekly review of bottom-decile responses → add to eval set.

This is enough. Buy the platform when you outgrow it, not before.

Link: https://leetcode.com/problems/evaluate-division/
Difficulty: Medium
Why this problem: Build a graph of equations and answer arbitrary queries, which has the same shape as evaluating one model output against a chain of judge criteria.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why open-ended LLM output makes eval qualitatively harder than classical ML.
List the 4 layers of eval (unit, behavioural, capability, user-outcome).
Design a pairwise LLM-judge prompt with bias mitigations.
Pick the right RAGAS metric (faithfulness vs context_precision vs answer_relevancy) for a given failure.
Sketch the 5-step "eval flywheel" loop.
Solve evaluate-division — graph traversal of weighted edges; mirrors chained eval criteria.
