LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect
Session 25 of the 48-session learning series.
Date: Sun, 2026-06-28 · Time: 14:30–16:30 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 16 · Est. read: 2 h
Why this session matters
This is Session 25 of 48 in the LLM track. Evaluation is the unsexy half of LLM work — and the half that decides whether your shiny demo survives contact with real users. You can't ship what you can't measure.
Agenda
- Why evaluating LLMs is genuinely harder than classical ML
- Static benchmarks — MMLU, HumanEval, GSM8K — and their limits
- LLM-as-judge — pairwise, score-based, the bias gotchas
- RAGAS, TruLens — RAG-specific eval frameworks
- Inspect, OpenAI Evals — building your own eval harness
Pre-read (skim before the session)
- Anthropic — Challenges in evaluating AI systems
- RAGAS docs — Metrics
- Judging LLM-as-a-Judge (Zheng et al., 2023)
- Inspect AI — UK AISI eval framework
Deep dive
1. Why LLM eval is hard
Classical ML: one label, one prediction, clean metric (accuracy, AUC). LLMs:
- Open-ended output — no single right answer. "Summarise this" has 1000 acceptable answers.
- Multi-dimensional quality — factuality, fluency, safety, tone, helpfulness; can trade off.
- Reference-free settings — you often don't have a ground truth ("write a marketing email").
- Long-tail failures — the model is fine 99% of the time and catastrophically wrong on edge cases.
- Contamination — test set leaked into pre-training corpus → benchmark inflated.
You can't compute "accuracy" against a list of strings and call it a day.
2. The 4 layers of eval
[ Unit-level ] → does the prompt produce well-formed JSON?
[ Behavioural ] → does it refuse the jailbreak? answer the maths question?
[ Capability ] → MMLU, HumanEval, MATH on held-out data
[ User-outcome ] → did the user accept the suggestion? task success?
The bottom two cost the most and matter the most. Most teams over-invest in capability benchmarks and under-invest in user-outcome metrics.
3. Static benchmarks — useful, but...
| Benchmark | Tests | Gotcha |
|---|---|---|
| MMLU | 57-subject multiple choice | Contamination; rote knowledge |
| HumanEval | Python function from docstring | Tiny (164 tasks); plateaued |
| MATH | Competition maths | Reasoning + arithmetic mix |
| GSM8K | Grade-school word problems | Largely solved; check GSM-Symbolic |
| MT-Bench | Multi-turn chat, LLM-judged | Judge bias; small |
| HELM | Broad suite | Heavy, dated; good audit trail |
| BBH | "Hard" sub-tasks of BIG-Bench | Mixed quality |
| ARC-AGI | Visual puzzles | The reasoning bar; expensive to run |
Rule: use benchmarks to exclude models, not to pick them. If MMLU is < 60% your candidate, drop it. Above some threshold, benchmarks stop correlating with what you actually need.
4. LLM-as-Judge — the workhorse
Use a strong model (GPT-4-class) to evaluate outputs of another model. Three flavours:
- Pairwise — show judge two outputs A and B, ask which is better. Most reliable.
- Scalar — rate output 1–10 on dimension X. Easy but noisy.
- Rubric-based — multi-criteria scoring against a written rubric. Best for production.
Pairwise example:
You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {"winner": "A" | "B" | "tie", "reason": "..."}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}
5. Judge biases (this will burn you)
- Position bias — first option wins more often. Mitigation: swap order, average.
- Length bias — longer answers preferred even when worse. Mitigation: length-controlled rubric.
- Self-preference — model judges its own family higher. Mitigation: use a different model family as judge.
- Verbosity bias — flowery > terse. Mitigation: explicit rubric criterion.
- Authority bias — confident wrong > tentative correct. Mitigation: factuality sub-score.
Validate the judge against ~200 human-labelled pairs before you trust it for thousands of evals.
6. RAGAS — RAG-specific metrics
For retrieval-augmented systems:
| Metric | What it measures |
|---|---|
faithfulness | Are all claims in answer supported by context? |
answer_relevancy | Does the answer actually address the question? |
context_precision | Are retrieved chunks actually useful? |
context_recall | Did retrieval find all needed info? |
answer_correctness | Does answer match ground truth (when available)? |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
result = evaluate(
dataset=eval_dataset,
metrics=[faithfulness, answer_relevancy],
)
These give you a numeric, comparable score per retrieval/generation strategy. Crucial for the eval loop in RAG (see S13).
7. TruLens, DeepEval, Inspect
- TruLens — instrumented runtime; collects traces; defines "feedback functions" (RAGAS-ish).
- DeepEval — pytest-style assertions for LLM output.
assert_relevance(),assert_no_hallucination(). - Inspect (UK AISI) — capability + safety eval framework, used for frontier-model red-teaming. Plugin model.
- OpenAI Evals — open-source harness, plug-in eval types.
- Promptfoo — YAML-driven A/B prompt eval; great for prompt regression.
8. Building your own eval set
Don't skip this. A 100-example, hand-crafted eval set tailored to your product beats any public benchmark.
Process:
- Mine real user prompts (anonymise!).
- Bucket by intent (10–15 buckets is enough).
- For each bucket, write 5–10 gold-standard answers (or rubrics).
- Add adversarial examples — your top 20 known failure modes.
- Version the set; checksum it; track it like a dataset, not a doc.
Re-run the whole set on every model/prompt change. Track scores in MLflow.
9. Online vs offline eval
- Offline — fixed dataset, runs on every release. Fast, repeatable, but stale.
- Online — production traffic, label by user reaction (👍/👎, accept-rate, edit-distance to final answer). Slow, biased by current cohort, but real.
Both. Offline catches regressions before ship; online catches reality drift.
10. The eval flywheel
[ Real user fails ] → [ Add to eval set ] → [ Run all candidates ] → [ Pick winner ] → [ Ship ]
▲ │
└───────────────────────────────────────────────────────────────────────────────────┘
Every bug becomes a permanent test. After 6 months you have 2000 examples that capture exactly what your product needs. That set is your moat.
11. Cost-aware eval
LLM-as-judge is expensive. Budgeting tips:
- Sample, don't run-all-on-everything.
- Cache judge results keyed on
(prompt, output_a, output_b, judge_model). - Use a cheaper "screening" judge → escalate ambiguous cases to a stronger judge.
- Score per-dollar: improvements aren't free; track
quality_delta / cost_delta.
12. Reality check
A pragmatic minimum eval stack for a startup:
- 100 hand-crafted prompts with rubrics (versioned in git).
- A pytest job that runs them on every PR using
promptfooor homemade. - Pairwise LLM-judge with order-swap, against GPT-4o.
- Production logging of (input, output, model_version, user_feedback).
- Weekly review of bottom-decile responses → add to eval set.
This is enough. Buy the platform when you outgrow it, not before.
Reading material
- Judging LLM-as-a-Judge (Zheng et al., 2023)
- Anthropic — Challenges in evaluating AI systems
- RAGAS — Metrics docs
- Eugene Yan — Eval-driven development
In-depth research material
Video reference
▶︎ LLM Evaluation Explained (Hamel Husain)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Evaluate Division
- Link: https://leetcode.com/problems/evaluate-division/
- Difficulty: Medium
- Why this problem: Build a graph of equations and answer arbitrary queries — exact shape of evaluating one model output against a chain of judge criteria.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why open-ended LLM output makes eval qualitatively harder than classical ML.
- List the 4 layers of eval (unit, behavioural, capability, user-outcome).
- Design a pairwise LLM-judge prompt with bias mitigations.
- Pick the right RAGAS metric (faithfulness vs context_precision vs answer_relevancy) for a given failure.
- Sketch the 5-step "eval flywheel" loop.
- Solve
evaluate-division— graph traversal of weighted edges; mirrors chained eval criteria.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.