ai mlintermediate 12m2026-06-09

Capstone — Building a Production AI Agent End-to-End

Session 48 of the 48-session learning series.

Why this session matters

This is the final session. Everything from the previous 47 — transformers, RAG, evaluation, serving, MLOps, system design, OOP, data engineering — converges in one architecture: a real, production AI agent. The kind you'd build for a customer, defend at an investor meeting, and explain to a regulator. Capstone time.

Agenda

Reference architecture — every layer named, every decision justified
Build order — what you ship in week 1, month 1, quarter 1
Cost, latency, reliability budgets — and the trade matrix
Evaluation, safety, observability — the production hygiene
The 6-month roadmap — what you build next

Pre-read (skim before the session)

All 47 prior session decks. Skim Agenda + Reality-check sections.
Anthropic — Building effective agents
Eugene Yan — Patterns for building LLM-based systems
Lilian Weng — LLM-powered autonomous agents

Deep dive

1. Reference architecture

A production AI agent (customer-support example) has 9 layers:

[ UI / API ]                ← chat widget, voice, embed
       │
[ Gateway ]                 ← auth, rate limit, request log
       │
[ Orchestrator (Agent loop) ] ← plan → tool → reflect → respond
       │
       ├── [ LLM Provider ]      ← Claude / GPT / OS model on vLLM
       │
       ├── [ RAG Layer ]         ← vector + keyword search; rerank
       │
       ├── [ Tools / APIs ]      ← CRM, calendar, knowledge base
       │
       ├── [ Memory ]            ← session + persistent user memory
       │
       ├── [ Safety filters ]    ← input + output moderation
       │
       └── [ Eval / Monitoring ] ← every turn logged + scored

Each box is from one of the 47 prior sessions. The art is gluing them.

2. The build order

Week 1: prototype.

Pick a base model (API).
Hardcode system prompt.
Wire 2–3 simple tools (search, get-record).
Demo to a real user. Get feedback.

Month 1: alpha.

RAG layer over the actual knowledge corpus.
Eval set of 50 hand-crafted prompts.
Logging of every turn.
Basic safety filter (moderation API).
A staging environment with feature-flag rollout.

Quarter 1: production.

Multi-tenant isolation.
Full observability (S46) — latency, cost, quality SLOs.
Auto-eval per release (S25).
Prompt versioning + A/B testing (S42).
Red-team suite running per release (S45).
Incident response process.

Quarter 2+: scale.

Self-hosted inference if cost demands (S30, S34).
Fine-tuned domain model (S33).
Feedback flywheel: thumbs-down → DPO data → next fine-tune.
More tools, more sources, more languages.

3. The agent loop

loop:
    user_msg = get_user_message()
    plan = model.plan(state, user_msg, available_tools)
    if plan == REPLY:
        out = model.generate_reply(state)
        yield out
    elif plan == CALL_TOOL:
        tool_result = call_tool(plan.tool, plan.args)
        state.add(tool_result)
        # back to top
    if turn_count > MAX or budget_exceeded:
        yield "I'm stuck; let me get a human."
        break

Things to add:

Step budget (cost cap).
Hard timeout.
Loop detection.
Confirmation gating on irreversible actions.

4. Tool design

For each tool:

Clear schema (name, params, types, description, examples).
Whitelisted (don't expose eval).
Idempotent where possible.
Rate-limited per user.
Logged in full (request + response).
"Dangerous" tag for human confirmation.

Most agent failures are bad tool definitions, not bad LLMs.

5. Cost model

For a customer-support agent at 1M conversations/month:

avg conversation = 5 turns × 1500 input tokens + 300 output tokens
                 = 7500 in + 1500 out per convo

monthly tokens = 1M × (7500 in + 1500 out) = 7.5B in + 1.5B out

@ $3/1M in + $15/1M out (Claude-class)
  = $22,500/mo + $22,500/mo = $45,000/mo

With prompt caching (50% cache hit): ~$25,000/mo
With self-hosted int4: ~$10,000/mo if you have the engineering bandwidth

Show the math to non-engineering stakeholders early. Cost is everyone's problem.

6. Latency budget

End-to-end target: < 3 s perceived (streaming).

Components:

Retrieval: 100 ms
Reranker: 50 ms
LLM TTFT: 500 ms
LLM generation: streaming, 30 tokens/s
Tool calls (1–2): 300 ms each
Safety filter: 50 ms

Stack carefully. Parallelise where possible (retrieve while planning).

7. Reliability budget

Per dependency:

Model API (99.9% SLA from provider). Have a fallback model.
Vector DB (99.95%). Cache common queries.
CRM (varies). Soft-fail on retrieval errors.
Internal queue (99.99%). Persistent retry.

Composite SLO: take the product of per-dependency SLOs. For 6 dependencies at 99.9%, total = 99.4%. Manage user expectations accordingly.

8. Eval as a discipline

100+ hand-crafted prompts per skill (S25).
LLM-as-judge with bias mitigations.
Per-tool unit tests.
Per-prompt regression test on every change.
A/B test of any user-visible change.
Production: thumbs-up/-down + every-N-th-turn human review.

Build the eval pipeline before the agent's complex. It compounds your iteration speed.

9. Safety in depth

(All from S45):

Input filter for known jailbreaks.
Wrapped delimiters for user + retrieved content.
Tool gating with confirmation on writes.
Output filter for PII + secrets.
Red-team suite weekly.
Incident response process.

10. The data flywheel

Production interactions are gold:

Log every turn with model version, prompt version, retrieved chunks, tool calls, latency, cost, feedback.
Mine thumbs-down for fine-tune data.
Mine thumbs-up for eval positives.
Mine novel queries for RAG corpus gaps.

The flywheel is what makes agents that "get better over time" rather than rotting.

11. The 6-month roadmap

For the typical AI agent product:

M1: alpha (50 internal users).
M2: beta (500 friendly customers).
M3: GA — production SLOs met.
M4: expand tools (more API integrations).
M5: multi-language; multi-tenant.
M6: optimised serving (self-host or distilled).

In parallel, every month:

Eval set grows by 20%.
Red-team set grows by 10%.
Cost-per-conversation declines by 10–20%.
User satisfaction ticks up.

12. The handover document

When you leave this project (job change, promotion, project complete), the next engineer needs:

Architecture diagram (data flow, dependencies, costs).
Runbooks per alert.
Eval suite + how to run it.
Prompt versions + change log.
Tool definitions + risk assessment.
Postmortem archive.
Top-3 known issues.

A good handover = your successor productive in week 1, not month 3.

13. Final reality check

Most AI agents in production right now are:

Built in 2 weeks for the demo.
Stalled at "works for the founder, breaks for users" for 6 months.
Replaced with v2 after the team learns what they wished they'd known.

You're now in a position to write v1 properly. That's the difference this curriculum was meant to make.

14. What's next after 48 sessions

Pick one of these tracks and go deeper. Ship a side project.
Teach what you learned — blog, talk, podcast.
Mentor someone earlier on the path.
Repeat this kind of structured study every year. The field moves; you must too.

Congratulations on finishing the 48. Now do the work.

Reading material

Books:

AI Engineering — Chip Huyen (the canonical book for production LLM systems; the chapter on agents in particular)
Designing Machine Learning Systems — Chip Huyen (the foundation book for ML in production; required prior reading)
Building LLM-powered Applications — Valentina Alto (practitioner walkthrough of RAG, agents, evals)
Site Reliability Engineering + SRE Workbook — Beyer et al. (free Google books; the reliability foundation an agent rests on)
Generative AI with LangChain — Ben Auffarth (the most-used production-pattern book for orchestration frameworks)
All 47 prior session notes — re-read your own MD files; this is your strongest reference.

Papers:

ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al. 2022 — the canonical paper that started production agent design.
Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al. (Meta) 2023 — the canonical tool-use paper.
Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al. 2023 — the canonical self-reflection / iterative-improvement agent.
Voyager: An Open-Ended Embodied Agent with Large Language Models — Wang et al. (NVIDIA) 2023 — the canonical Minecraft agent that pioneered skill libraries + curriculum.
A Survey on Large Language Model based Autonomous Agents — Wang et al. 2023 — the canonical survey of the agent field.
Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic) 2022 — the safety foundation for an agent that talks to users.

Official docs:

Anthropic — Building effective agents — the canonical 2024 essay from Anthropic; required reading.
OpenAI — Building agents guide — OpenAI's own production-agent guide.
LangChain — Agent documentation — the canonical agent framework (whether or not you use it).
LlamaIndex — Agents — the alternative orchestration framework's agent docs.
Pydantic AI — the modern type-safe agent framework from the Pydantic team.
DSPy — agents — Stanford's prompt-as-code framework with optimizer-tuned agents.
Vercel AI SDK — Agents — the canonical TS/JS agent framework.
Promptfoo + LangSmith — Evals + LangSmith eval docs — the canonical eval-runner pair.

Blog posts:

Eugene Yan — Patterns for LLM-based systems — the canonical production-pattern playbook for everything you'll build.
Hamel Husain — Your AI Product Needs Evals — required reading on tying agent behaviour to a regression suite.
Chip Huyen — Building LLM applications for production — the canonical essay that defined the engineering vocabulary.
Simon Willison — LLM tag — the running practitioner archive; required reading.
Anthropic — Engineering blog — Anthropic's own agent-building posts; required reading.
Latent Space — Agent posts — long-running practitioner podcast + essays on agents in production.

In-depth research material

LangChain — github.com/langchain-ai/langchain — ~90k ★, the most-used Python agent framework.
LlamaIndex — github.com/run-llama/llama_index — ~36k ★, the alternative agent + RAG framework.
Pydantic AI — github.com/pydantic/pydantic-ai — ~4k ★, the modern type-safe agent framework.
Vercel AI SDK — github.com/vercel/ai — ~10k ★, the canonical TS/JS agent framework.
DSPy — github.com/stanfordnlp/dspy — ~17k ★, Stanford's prompt-as-code framework.
Anthropic Cookbook — github.com/anthropics/anthropic-cookbook — production agent recipes from Anthropic.
OpenAI Cookbook — github.com/openai/openai-cookbook — ~58k ★, OpenAI's recipe collection.
Promptfoo — github.com/promptfoo/promptfoo — ~5k ★, the canonical eval-runner.
Langfuse — github.com/langfuse/langfuse — ~6k ★, OSS LLM tracing + eval.
Helicone — github.com/Helicone/helicone — OSS LLM observability.
Auto-GPT — github.com/Significant-Gravitas/AutoGPT — ~170k ★, the famous early agent project; useful as design reference + cautionary tale.
CrewAI — github.com/crewAIInc/crewAI — ~26k ★, multi-agent orchestration framework.
smolagents — github.com/huggingface/smolagents — Hugging Face's minimal-agent framework; clean reference implementation.
Latent Space — Long-Context Agents podcast archive — running practitioner archive.

Videos

Building Production AI Agents — Anthropic (Mike Krieger, Erik Schluntz) — Anthropic engineering · 1 h 02 min — the canonical 2024 talk pairing with the "Building effective agents" essay.
State of GPT — Andrej Karpathy (Microsoft Build 2023) — Andrej Karpathy · 42 min — the canonical end-to-end framing of how the LLM stack works (required pre-capstone refresher).
Your AI Product Needs Evals — Hamel Husain — Hamel Husain · 1 h 14 min — the production talk on eval-first agent development.
Building LLM Applications — Chip Huyen (Stanford CS329S) — Chip Huyen · 1 h 18 min — the lecture version of her canonical essay.
Designing Agents that Actually Work — Eugene Yan + Hamel Husain (Mastering LLMs workshop) — Eugene Yan, Hamel Husain · 1 h 32 min — practitioner workshop on patterns that survive production.

LeetCode — Design In Memory File System

Link: https://leetcode.com/problems/design-in-memory-file-system/
Difficulty: Hard
Why this problem: A capstone problem too — multiple objects, hierarchies, mutations, persistence. Same shape as building a production system: many pieces, one consistent state. (We saw it in S24; come back to it as a final exam.)
Time-box: 45 minutes. This time, don't look at the editorial. You've earned the right to try.

Post-session checklist

By the end of this session — and this 48-session series — you should be able to:

Draw the 9-layer reference architecture of a production AI agent from memory.
Plan a 1-week / 1-month / 1-quarter build order for a new agent product.
Compute the cost of a conversation given token budgets and provider pricing.
Stack timeouts, retries, circuit breakers, bulkheads for each external dependency.
Wire evaluation, safety, observability into the release process.
Run a postmortem, file action items, close the loop.
Pick the right tools/architecture from each prior session, with justification.
Solve design-in-memory-file-system once more — a fitting capstone for the series.

🎓 You've finished the 48 sessions. Now build something.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Production Error Handling — Retries, Circuit Breakers, Timeouts, Bulkheads

The 48-Session Learning Series — A Planning Guide