Capstone — Building a Production AI Agent End-to-End
Session 48 of the 48-session learning series.
Why this session matters
This is the final session. Everything from the previous 47 — transformers, RAG, evaluation, serving, MLOps, system design, OOP, data engineering — converges in one architecture: a real, production AI agent. The kind you'd build for a customer, defend at an investor meeting, and explain to a regulator. Capstone time.
Agenda
- Reference architecture — every layer named, every decision justified
- Build order — what you ship in week 1, month 1, quarter 1
- Cost, latency, reliability budgets — and the trade matrix
- Evaluation, safety, observability — the production hygiene
- The 6-month roadmap — what you build next
Pre-read (skim before the session)
- All 47 prior session decks. Skim Agenda + Reality-check sections.
- Anthropic — Building effective agents
- Eugene Yan — Patterns for building LLM-based systems
- Lilian Weng — LLM-powered autonomous agents
Deep dive
1. Reference architecture
A production AI agent (customer-support example) has 9 layers:
[ UI / API ] ← chat widget, voice, embed
│
[ Gateway ] ← auth, rate limit, request log
│
[ Orchestrator (Agent loop) ] ← plan → tool → reflect → respond
│
├── [ LLM Provider ] ← Claude / GPT / OS model on vLLM
│
├── [ RAG Layer ] ← vector + keyword search; rerank
│
├── [ Tools / APIs ] ← CRM, calendar, knowledge base
│
├── [ Memory ] ← session + persistent user memory
│
├── [ Safety filters ] ← input + output moderation
│
└── [ Eval / Monitoring ] ← every turn logged + scored
Each box is from one of the 47 prior sessions. The art is gluing them.
2. The build order
Week 1: prototype.
- Pick a base model (API).
- Hardcode system prompt.
- Wire 2–3 simple tools (search, get-record).
- Demo to a real user. Get feedback.
Month 1: alpha.
- RAG layer over the actual knowledge corpus.
- Eval set of 50 hand-crafted prompts.
- Logging of every turn.
- Basic safety filter (moderation API).
- A staging environment with feature-flag rollout.
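A feature-flag rollout can be as small as a deterministic hash bucket. The sketch below is one common approach, not something this series prescribes; `in_rollout` and the flag name are hypothetical.

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` buckets.

    Hashing flag + user id keeps the decision stable across requests,
    and raising `percent` only ever adds users, never removes them.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # bucket in 0..99
    return bucket < percent

# Hypothetical usage: send 10% of users to the new prompt version.
use_new_prompt = in_rollout("prompt-v2", "user-1234", 10)
```

Start at a few percent in staging, watch the eval and latency numbers, then widen.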
Quarter 1: production.
- Multi-tenant isolation.
- Full observability (S46) — latency, cost, quality SLOs.
- Auto-eval per release (S25).
- Prompt versioning + A/B testing (S42).
- Red-team suite running per release (S45).
- Incident response process.
Quarter 2+: scale.
- Self-hosted inference if cost demands (S30, S34).
- Fine-tuned domain model (S33).
- Feedback flywheel: thumbs-down → DPO data → next fine-tune.
- More tools, more sources, more languages.
3. The agent loop
def handle_message(state, user_msg):
    # One user message in, one reply out; several tool calls may happen in between.
    state.add(user_msg)
    for step in range(MAX_STEPS):
        if budget_exceeded(state):
            break
        plan = model.plan(state, available_tools)
        if plan.action == REPLY:
            return model.generate_reply(state)
        if plan.action == CALL_TOOL:
            tool_result = call_tool(plan.tool, plan.args)
            state.add(tool_result)   # feed the result back, then plan again
    return "I'm stuck; let me get a human."
Things to add:
- Step budget (cost cap).
- Hard timeout.
- Loop detection.
- Confirmation gating on irreversible actions.
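The four guards above fit in one loop. This is a minimal sketch under assumptions: `plan_fn` stands in for the model call, `confirm` for whatever asks the human, and `Plan` is a hypothetical shape for the model's decision.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Plan:
    kind: str                                   # "reply" or "call_tool"
    tool: str = ""
    args: dict = field(default_factory=dict)
    text: str = ""

ESCALATE = "I'm stuck; let me get a human."

def run_turn(plan_fn, tools, dangerous, confirm, max_steps=8, timeout_s=30.0):
    """Run one user turn and return the reply, or escalate to a human."""
    history, seen = [], set()
    deadline = time.monotonic() + timeout_s
    for _ in range(max_steps):                              # step budget
        if time.monotonic() > deadline:                     # hard timeout
            break
        plan = plan_fn(history)
        if plan.kind == "reply":
            return plan.text
        call_key = (plan.tool, repr(sorted(plan.args.items())))
        if call_key in seen:                                # loop detection
            break
        seen.add(call_key)
        if plan.tool in dangerous and not confirm(plan):    # confirmation gate
            history.append((plan.tool, "denied by user"))
            continue
        history.append((plan.tool, tools[plan.tool](**plan.args)))
    return ESCALATE
```

Loop detection here is the crudest version that works: the same tool with the same arguments twice means the agent is going in circles. A cost cap in dollars would slot in next to the timeout check.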
4. Tool design
For each tool:
- Clear schema (name, params, types, description, examples).
- Whitelisted (don't expose eval).
- Idempotent where possible.
- Rate-limited per user.
- Logged in full (request + response).
- "Dangerous" tag for human confirmation.
Most agent failures are bad tool definitions, not bad LLMs.
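A tool definition that follows this checklist might look like the sketch below. `ToolSpec` and `ToolRegistry` are hypothetical names, not a real framework's API, and per-user rate limiting is left out to keep it short.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    params: dict[str, type]          # parameter name -> expected Python type
    handler: Callable[..., Any]
    dangerous: bool = False          # True means: ask a human before running

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}
        self.audit_log: list[dict] = []

    def register(self, spec: ToolSpec) -> None:
        self._tools[spec.name] = spec

    def call(self, name: str, args: dict, confirmed: bool = False) -> Any:
        spec = self._tools.get(name)
        if spec is None:                         # allowlist: unknown tools never run
            raise PermissionError(f"tool not allowlisted: {name}")
        if set(args) != set(spec.params):
            raise ValueError(f"expected params {sorted(spec.params)}, got {sorted(args)}")
        for key, expected in spec.params.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        if spec.dangerous and not confirmed:
            raise PermissionError(f"{name} needs human confirmation")
        result = spec.handler(**args)
        self.audit_log.append({"tool": name, "args": args, "result": result})
        return result
```

Validating arguments before the handler runs turns a vague model mistake into a precise error message you can feed back into the loop.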
5. Cost model
For a customer-support agent at 1M conversations/month:
avg conversation = 5 turns × (1500 input tokens + 300 output tokens)
= 7500 in + 1500 out per convo
monthly tokens = 1M × (7500 in + 1500 out) = 7.5B in + 1.5B out
@ $3/1M in + $15/1M out (Claude-class)
= $22,500/mo + $22,500/mo = $45,000/mo
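With prompt caching (50% of input tokens read from cache, assuming cached reads bill at roughly 10% of the input price): ~$35,000/mo. Caching only discounts input, so the $22,500 output bill is the floor.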
With self-hosted int4: ~$10,000/mo if you have the engineering bandwidth
Show the math to non-engineering stakeholders early. Cost is everyone's problem.
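The base calculation above is easy to keep in a script so stakeholders can change the inputs themselves. A minimal sketch; `monthly_cost_usd` is a hypothetical helper and the prices are the ones quoted in this section.

```python
def monthly_cost_usd(conversations, turns, in_tokens_per_turn, out_tokens_per_turn,
                     in_price_per_m, out_price_per_m):
    """Return (input cost, output cost, total) in USD per month."""
    in_tokens = conversations * turns * in_tokens_per_turn
    out_tokens = conversations * turns * out_tokens_per_turn
    in_cost = in_tokens / 1_000_000 * in_price_per_m
    out_cost = out_tokens / 1_000_000 * out_price_per_m
    return in_cost, out_cost, in_cost + out_cost

in_cost, out_cost, total = monthly_cost_usd(
    conversations=1_000_000, turns=5,
    in_tokens_per_turn=1500, out_tokens_per_turn=300,
    in_price_per_m=3.0, out_price_per_m=15.0,
)
print(f"input ${in_cost:,.0f} + output ${out_cost:,.0f} = ${total:,.0f}/mo")
# prints: input $22,500 + output $22,500 = $45,000/mo
print(f"per conversation: ${total / 1_000_000:.3f}")
# prints: per conversation: $0.045
```

Output tokens are a fifth of the input volume but half the bill, which is why shorter replies are often the cheapest optimisation.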
6. Latency budget
End-to-end target: < 3 s perceived (streaming).
Components:
- Retrieval: 100 ms
- Reranker: 50 ms
- LLM TTFT: 500 ms
- LLM generation: streaming, 30 tokens/s
- Tool calls (1–2): 300 ms each
- Safety filter: 50 ms
Stack carefully. Parallelise where possible (retrieve while planning).
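Adding the components up shows how much of the 3 s target is spent before the first token streams. The sketch below uses the numbers above. The overlap it models (retrieval and reranking running concurrently with the tool calls) is an assumption chosen for illustration; your orchestrator may allow a different overlap.

```python
BUDGET_MS = {
    "retrieval": 100,
    "reranker": 50,
    "llm_ttft": 500,
    "tool_call": 300,       # per call
    "safety_filter": 50,
}
TARGET_MS = 3000

def sequential_ms(tool_calls: int) -> int:
    """Everything runs one after another."""
    fixed = sum(ms for name, ms in BUDGET_MS.items() if name != "tool_call")
    return fixed + tool_calls * BUDGET_MS["tool_call"]

def overlapped_ms(tool_calls: int) -> int:
    """Assumed overlap: the RAG path runs while the tool calls are in flight."""
    rag = BUDGET_MS["retrieval"] + BUDGET_MS["reranker"]
    tools = tool_calls * BUDGET_MS["tool_call"]
    return max(rag, tools) + BUDGET_MS["llm_ttft"] + BUDGET_MS["safety_filter"]

print(sequential_ms(2), overlapped_ms(2), TARGET_MS - sequential_ms(2))
# prints: 1300 1150 1700
```

With two tool calls the tools dominate the path, so trimming a tool call buys more than tuning the reranker.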
7. Reliability budget
Per dependency:
- Model API (99.9% SLA from provider). Have a fallback model.
- Vector DB (99.95%). Cache common queries.
- CRM (varies). Soft-fail on retrieval errors.
- Internal queue (99.99%). Persistent retry.
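"Have a fallback" and "soft-fail" are each a few lines. A minimal sketch, assuming `primary` and `fallback` are zero-argument callables that wrap your model clients; both helper names are hypothetical.

```python
import time

def call_with_fallback(primary, fallback, retries=2, base_delay_s=0.2, sleep=time.sleep):
    """Try the primary with exponential backoff, then switch to the fallback."""
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            if attempt < retries:
                sleep(base_delay_s * 2 ** attempt)   # 0.2 s, 0.4 s, ...
    return fallback()

def soft_fail(fetch, default):
    """Return `default` instead of failing the whole turn when a lookup breaks."""
    try:
        return fetch()
    except Exception:
        return default
```

In production you would catch only the error types you expect (timeouts, 5xx responses) and log every fallback, so a silent degradation still shows up on a dashboard.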
Composite SLO: if every dependency sits on the request path and fails independently, multiply the per-dependency SLOs. For 6 dependencies at 99.9% each, total = 0.999^6 ≈ 99.4%. Manage user expectations accordingly.
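The product rule is worth checking numerically, and converting to downtime makes it concrete. A minimal sketch, assuming serial dependencies with independent failures and a 30-day month.

```python
import math

def composite_availability(slos):
    """Availability of a request path that needs every dependency to be up."""
    return math.prod(slos)

def downtime_minutes_per_month(availability, days=30):
    return (1 - availability) * days * 24 * 60

six_nines_three = composite_availability([0.999] * 6)
print(f"{six_nines_three:.2%}")                                  # prints: 99.40%
print(f"{downtime_minutes_per_month(six_nines_three):.0f} min")  # prints: 259 min
```

Fallbacks and caches break the "every dependency must be up" assumption in your favour, which is the whole point of adding them.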
8. Eval as a discipline
- 100+ hand-crafted prompts per skill (S25).
- LLM-as-judge with bias mitigations.
- Per-tool unit tests.
- Per-prompt regression test on every change.
- A/B test of any user-visible change.
- Production: thumbs-up/thumbs-down feedback + human review of every Nth turn.
Build the eval pipeline before the agent gets complex. It compounds your iteration speed.
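A per-prompt regression suite does not need a framework to start. The sketch below uses plain substring checks as a stand-in for real scoring; `EvalCase` and `run_suite` are hypothetical names, and `agent` is any function from prompt to reply.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    must_contain: list = field(default_factory=list)
    must_not_contain: list = field(default_factory=list)

def run_suite(agent: Callable[[str], str], cases: list):
    """Return (pass rate, list of (prompt, missing, leaked) for each failure)."""
    failures = []
    for case in cases:
        reply = agent(case.prompt).lower()
        missing = [s for s in case.must_contain if s.lower() not in reply]
        leaked = [s for s in case.must_not_contain if s.lower() in reply]
        if missing or leaked:
            failures.append((case.prompt, missing, leaked))
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures
```

Run it in CI on every prompt or model change and fail the build when the pass rate drops below the last release.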
9. Safety in depth
(All from S45):
- Input filter for known jailbreaks.
- Wrapped delimiters for user + retrieved content.
- Tool gating with confirmation on writes.
- Output filter for PII + secrets.
- Red-team suite weekly.
- Incident response process.
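Two of these layers are small enough to sketch: wrapping untrusted text in delimiters and redacting the output. The tag names and the `sk-` key pattern are hypothetical examples; a real filter needs patterns for the secrets and PII your system actually handles.

```python
import re

def wrap_untrusted(label: str, text: str) -> str:
    """Fence user or retrieved text so the prompt can say 'treat this as data'."""
    cleaned = text.replace(f"</{label}>", "")   # content cannot close its own fence
    return f"<{label}>\n{cleaned}\n</{label}>"

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")   # illustrative key format only

def redact(text: str) -> str:
    """Mask emails and key-shaped strings before the reply leaves the system."""
    text = EMAIL.sub("[EMAIL]", text)
    return API_KEY.sub("[SECRET]", text)

print(redact("Mail bob@example.com, key sk-abcdefghijklmnop1234"))
# prints: Mail [EMAIL], key [SECRET]
```

Delimiters reduce prompt-injection risk but do not remove it, which is why tool gating sits behind them.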
10. The data flywheel
Production interactions are gold:
- Log every turn with model version, prompt version, retrieved chunks, tool calls, latency, cost, feedback.
- Mine thumbs-down for fine-tune data.
- Mine thumbs-up for eval positives.
- Mine novel queries for RAG corpus gaps.
The flywheel is what makes an agent "get better over time" instead of rotting.
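The log record and the three mining passes can be sketched as below. Field names follow the list above. Treating "nothing retrieved" as a signal for a corpus gap is an assumption; a real pipeline would likely use retrieval scores or query clustering.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class TurnLog:
    conversation_id: str
    model_version: str
    prompt_version: str
    user_msg: str
    reply: str
    retrieved_chunks: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    latency_ms: int = 0
    cost_usd: float = 0.0
    feedback: Optional[str] = None      # "up", "down", or None

def to_jsonl_line(log: TurnLog) -> str:
    """One JSON object per line keeps the log easy to append and to grep."""
    return json.dumps(asdict(log))

def mine(logs):
    return {
        "fine_tune_candidates": [l for l in logs if l.feedback == "down"],
        "eval_positives": [l for l in logs if l.feedback == "up"],
        "rag_gap_candidates": [l for l in logs if not l.retrieved_chunks],
    }
```

Logging the model and prompt versions on every turn is what lets you attribute a quality drop to a specific release later.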
11. The 6-month roadmap
For the typical AI agent product:
- M1: alpha (50 internal users).
- M2: beta (500 friendly customers).
- M3: GA — production SLOs met.
- M4: expand tools (more API integrations).
- M5: multi-language; multi-tenant.
- M6: optimised serving (self-host or distilled).
In parallel, every month:
- Eval set grows by 20%.
- Red-team set grows by 10%.
- Cost-per-conversation declines by 10–20%.
- User satisfaction ticks up.
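Monthly percentages compound, so it helps to see where they land after six months. A minimal sketch; the starting values (100 eval prompts, $0.045 per conversation) are taken from earlier sections as illustrative inputs.

```python
def compound(start: float, monthly_rate: float, months: int) -> float:
    """Value after `months` of growing (or shrinking) by `monthly_rate` each month."""
    return start * (1 + monthly_rate) ** months

eval_prompts = compound(100, 0.20, 6)          # about 299 prompts, roughly 3x
red_team_factor = compound(1.0, 0.10, 6)       # about 1.77x
cost_slow = compound(0.045, -0.10, 6)          # about $0.024 per conversation
cost_fast = compound(0.045, -0.20, 6)          # about $0.012 per conversation
```

A steady 10–20% monthly cost decline roughly halves to quarters the per-conversation cost by M6, which is a useful sanity check on whether the target is realistic.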
12. The handover document
When you leave this project (job change, promotion, project complete), the next engineer needs:
- Architecture diagram (data flow, dependencies, costs).
- Runbooks per alert.
- Eval suite + how to run it.
- Prompt versions + change log.
- Tool definitions + risk assessment.
- Postmortem archive.
- Top-3 known issues.
A good handover = your successor productive in week 1, not month 3.
13. Final reality check
Most AI agents in production right now are:
- Built in 2 weeks for the demo.
- Stalled at "works for the founder, breaks for users" for 6 months.
- Replaced with v2 after the team learns what they wished they'd known.
You're now in a position to write v1 properly. That's the difference this curriculum was meant to make.
14. What's next after 48 sessions
- Pick one of these tracks and go deeper. Ship a side project.
- Teach what you learned — blog, talk, podcast.
- Mentor someone earlier on the path.
- Repeat this kind of structured study every year. The field moves; you must too.
Congratulations on finishing the 48. Now do the work.
Reading material
Books:
- AI Engineering — Chip Huyen (the canonical book for production LLM systems; the chapter on agents in particular)
- Designing Machine Learning Systems — Chip Huyen (the foundation book for ML in production; required prior reading)
- Building LLM-powered Applications — Valentina Alto (practitioner walkthrough of RAG, agents, evals)
- Site Reliability Engineering + SRE Workbook — Beyer et al. (free Google books; the reliability foundation an agent rests on)
- Generative AI with LangChain — Ben Auffarth (the most-used production-pattern book for orchestration frameworks)
- All 47 prior session notes — re-read your own MD files; this is your strongest reference.
Papers:
- ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al. 2022 — the canonical paper that started production agent design.
- Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al. (Meta) 2023 — the canonical tool-use paper.
- Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al. 2023 — the canonical self-reflection / iterative-improvement agent.
- Voyager: An Open-Ended Embodied Agent with Large Language Models — Wang et al. (NVIDIA) 2023 — the canonical Minecraft agent that pioneered skill libraries + curriculum.
- A Survey on Large Language Model based Autonomous Agents — Wang et al. 2023 — the canonical survey of the agent field.
- Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic) 2022 — the safety foundation for an agent that talks to users.
Official docs:
- Anthropic — Building effective agents — the canonical 2024 essay from Anthropic; required reading.
- OpenAI — Building agents guide — OpenAI's own production-agent guide.
- LangChain — Agent documentation — the canonical agent framework (whether or not you use it).
- LlamaIndex — Agents — the alternative orchestration framework's agent docs.
- Pydantic AI — the modern type-safe agent framework from the Pydantic team.
- DSPy — agents — Stanford's prompt-as-code framework with optimizer-tuned agents.
- Vercel AI SDK — Agents — the canonical TS/JS agent framework.
- Promptfoo + LangSmith — Evals + LangSmith eval docs — the canonical eval-runner pair.
Blog posts:
- Eugene Yan — Patterns for LLM-based systems — the canonical production-pattern playbook for everything you'll build.
- Hamel Husain — Your AI Product Needs Evals — required reading on tying agent behaviour to a regression suite.
- Chip Huyen — Building LLM applications for production — the canonical essay that defined the engineering vocabulary.
- Simon Willison — LLM tag — the running practitioner archive; required reading.
- Anthropic — Engineering blog — Anthropic's own agent-building posts; required reading.
- Latent Space — Agent posts — long-running practitioner podcast + essays on agents in production.
In-depth research material
- LangChain — github.com/langchain-ai/langchain — ~90k ★, the most-used Python agent framework.
- LlamaIndex — github.com/run-llama/llama_index — ~36k ★, the alternative agent + RAG framework.
- Pydantic AI — github.com/pydantic/pydantic-ai — ~4k ★, the modern type-safe agent framework.
- Vercel AI SDK — github.com/vercel/ai — ~10k ★, the canonical TS/JS agent framework.
- DSPy — github.com/stanfordnlp/dspy — ~17k ★, Stanford's prompt-as-code framework.
- Anthropic Cookbook — github.com/anthropics/anthropic-cookbook — production agent recipes from Anthropic.
- OpenAI Cookbook — github.com/openai/openai-cookbook — ~58k ★, OpenAI's recipe collection.
- Promptfoo — github.com/promptfoo/promptfoo — ~5k ★, the canonical eval-runner.
- Langfuse — github.com/langfuse/langfuse — ~6k ★, OSS LLM tracing + eval.
- Helicone — github.com/Helicone/helicone — OSS LLM observability.
- Auto-GPT — github.com/Significant-Gravitas/AutoGPT — ~170k ★, the famous early agent project; useful as design reference + cautionary tale.
- CrewAI — github.com/crewAIInc/crewAI — ~26k ★, multi-agent orchestration framework.
- smolagents — github.com/huggingface/smolagents — Hugging Face's minimal-agent framework; clean reference implementation.
- Latent Space — Long-Context Agents podcast archive — running practitioner archive.
Videos
- Building Production AI Agents — Anthropic (Mike Krieger, Erik Schluntz) — Anthropic engineering · 1 h 02 min — the canonical 2024 talk pairing with the "Building effective agents" essay.
- State of GPT — Andrej Karpathy (Microsoft Build 2023) — Andrej Karpathy · 42 min — the canonical end-to-end framing of how the LLM stack works (required pre-capstone refresher).
- Your AI Product Needs Evals — Hamel Husain — Hamel Husain · 1 h 14 min — the production talk on eval-first agent development.
- Building LLM Applications — Chip Huyen (Stanford CS329S) — Chip Huyen · 1 h 18 min — the lecture version of her canonical essay.
- Designing Agents that Actually Work — Eugene Yan + Hamel Husain (Mastering LLMs workshop) — Eugene Yan, Hamel Husain · 1 h 32 min — practitioner workshop on patterns that survive production.
LeetCode — Design In Memory File System
- Link: https://leetcode.com/problems/design-in-memory-file-system/
- Difficulty: Hard
- Why this problem: A capstone problem too — multiple objects, hierarchies, mutations, persistence. Same shape as building a production system: many pieces, one consistent state. (We saw it in S24; come back to it as a final exam.)
- Time-box: 45 minutes. This time, don't look at the editorial. You've earned the right to try.
Post-session checklist
By the end of this session — and this 48-session series — you should be able to:
- Draw the 9-layer reference architecture of a production AI agent from memory.
- Plan a 1-week / 1-month / 1-quarter build order for a new agent product.
- Compute the cost of a conversation given token budgets and provider pricing.
- Stack timeouts, retries, circuit breakers, bulkheads for each external dependency.
- Wire evaluation, safety, observability into the release process.
- Run a postmortem, file action items, close the loop.
- Pick the right tools/architecture from each prior session, with justification.
- Solve design-in-memory-file-system once more, a fitting capstone for the series.
🎓 You've finished the 48 sessions. Now build something.