Day 28 — Putting It Together — A Production AI Agent (Capstone Day)
Final synthesis day. You've covered transformers, RAG, tools, evals, fine-tuning, serving, multimodal. Today you combine them into one complete agent design — a…
This is the synthesis day. Below is a template you can fill in for any real agent build — a checklist of the decisions the previous 27 days armed you to make.
🧠 Concept
Why it matters & the mental model.
1. Problem framing
- Who's the user, what task, what's "good enough"?
- Today's baseline (manual, scripted, simple LLM call)?
- The 3 KPIs the agent must move (pick three from, for example, task success, time saved, cost, NPS).
Pick concrete numbers up front. Vague KPIs ("better", "faster") never get measured and never get hit.
2. Architecture in one diagram
3. Model choice
- Default to hosted (Claude 3.5 Sonnet, GPT-4o) for capability and reliability.
- Smaller / OSS (Llama 3, Qwen 2.5, Gemma 3) for cost, privacy, latency.
- Multi-model: cheap router → strong model only when needed.
- Fine-tune only after you've shipped, measured, and identified format/behaviour gaps.
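The "cheap router → strong model" pattern can be sketched as below. This is a minimal heuristic sketch, not a recommended policy: the model labels `cheap-model` and `strong-model`, the `route_request` function, and the 500-character threshold are all hypothetical. In a real build the router is often a small classifier or a cheap LLM call rather than a length check.

```python
from dataclasses import dataclass


@dataclass
class Route:
    model: str   # hypothetical label, not a real API model identifier
    reason: str  # logged so routing decisions can be audited later


def route_request(prompt: str, needs_tools: bool, max_cheap_chars: int = 500) -> Route:
    """Send short, tool-free requests to the cheap model; escalate the rest."""
    if needs_tools:
        return Route("strong-model", "tool use requested")
    if len(prompt) > max_cheap_chars:
        return Route("strong-model", "long prompt")
    return Route("cheap-model", "short and tool-free")
```

Logging the `reason` alongside the outcome lets you measure how often escalation was actually needed before tuning the threshold.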
4. RAG stack (if applicable)
- Chunking strategy (contextual headers).
- Embedding model + dimensionality.
- Hybrid retrieval + reranker.
- Retrieval eval (recall@k, MRR) with a golden set.
- Versioning of (chunks, embeddings, prompts).
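The retrieval metrics named above can be computed in a few lines. The sketch below assumes a golden set shaped as a list of dicts with hypothetical keys `retrieved` (ranked document ids returned by your retriever) and `relevant` (the ids a human marked as correct); your own golden set format may differ.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: list[dict], k: int = 5) -> dict:
    """Average recall@k and MRR over every query in the golden set."""
    n = len(golden)
    recall = sum(recall_at_k(g["retrieved"], g["relevant"], k) for g in golden) / n
    mrr = sum(reciprocal_rank(g["retrieved"], g["relevant"]) for g in golden) / n
    return {"recall@k": recall, "mrr": mrr}
```

Store the scores next to the (chunks, embeddings, prompts) version that produced them, so a regression can be traced to the change that caused it.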
🛠 Deep Dive
Internals, code, architecture.
5. Tools
- 3-10 well-named, narrowly-scoped tools.
- Schemas explicit; descriptions tuned with eval feedback.
- Side-effect classes: read-only tools run automatically; write tools require confirmation.
- Idempotency keys on writes.
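One way to combine side-effect classes with idempotency keys is sketched below. The `Tool` and `ToolRunner` classes, the `request_id` argument, and the in-memory result cache are assumptions made for illustration; a production system would persist the keys in a database and pass them to the downstream API.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    func: Callable[..., Any]
    read_only: bool  # side-effect class: True = safe to run automatically


def idempotency_key(request_id: str, name: str, args: dict) -> str:
    """Stable key for one intended write: same request + tool + args."""
    payload = json.dumps({"req": request_id, "tool": name, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


@dataclass
class ToolRunner:
    tools: dict[str, Tool]
    _done: dict[str, Any] = field(default_factory=dict)  # key -> cached result

    def call(self, name: str, args: dict, request_id: str, confirmed: bool = False) -> Any:
        tool = self.tools[name]
        if tool.read_only:
            return tool.func(**args)  # read-only tools run without confirmation
        if not confirmed:
            raise PermissionError(f"{name} writes state and needs user confirmation")
        key = idempotency_key(request_id, name, args)
        if key not in self._done:  # a retry with the same key does not re-execute
            self._done[key] = tool.func(**args)
        return self._done[key]
```

Including `request_id` in the key means a retried step is deduplicated, while a genuinely new request with the same arguments still executes.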
6. Memory
- Short-term: full chat history within budget.
- Mid-term: summarised episodes.
- Long-term: vector store, retrieved on relevance.
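"Full chat history within budget" implies trimming once the history outgrows the budget. A minimal sketch follows, assuming messages are dicts with a `content` key and using a crude four-characters-per-token estimate; swap in your model's real tokenizer for accurate counts. Both function names are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Rough estimate (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def fit_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the most recent messages whose combined size fits the token budget."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break  # everything older than this is dropped
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

The messages that fall outside the budget are the natural input for the mid-term layer: summarise them into an episode instead of discarding them.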
7. Guardrails
- Max steps cap.
- Cost budget per session.
- Output schema validation.
- PII redaction in logs.
- Prompt injection defence (separate system/user roles, content filters, allow-list of tool args from retrieved content).
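The first three guardrails (step cap, cost budget, output schema validation) can be sketched with the standard library alone. `SessionGuard`, `GuardrailError`, and `validate_output` are hypothetical names, and the flat "field name to type" schema is a simplification; a schema library would handle nested structures.

```python
import json


class GuardrailError(Exception):
    """Raised when the agent loop must stop or an output must be rejected."""


class SessionGuard:
    def __init__(self, max_steps: int, max_cost_usd: float):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def record_step(self, step_cost_usd: float) -> None:
        """Call once per agent step; raises when either limit is exceeded."""
        self.steps += 1
        self.cost_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise GuardrailError("max steps exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise GuardrailError("session cost budget exceeded")


def validate_output(raw: str, required: dict[str, type]) -> dict:
    """Parse model output as JSON and check required fields and their types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailError(f"output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise GuardrailError("output must be a JSON object")
    for key, expected in required.items():
        if key not in data or not isinstance(data[key], expected):
            raise GuardrailError(f"field {key!r} missing or not {expected.__name__}")
    return data
```

Catching `GuardrailError` at the top of the agent loop gives one place to return a safe fallback response and log the reason.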
8. Eval harness
- Golden set (50-200, stratified by slice and difficulty).
- LLM-as-judge with calibrated rubric.
- Task-success metrics (binary or graded).
- Trajectory metrics (steps, tool errors).
- Cost / latency p50/p95.
- CI gate: any slice drop > X% blocks merge.
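The CI gate reduces to comparing per-slice scores between the baseline and the candidate. The sketch below assumes scores are fractions in [0, 1] keyed by slice name and that the drop is measured in absolute points; the threshold is left as a parameter because the right value of X depends on your eval noise. `failing_slices` is a hypothetical name.

```python
def failing_slices(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float,
) -> list[str]:
    """Return the slices whose score fell by more than max_drop (absolute)."""
    failures = []
    for slice_name, base_score in baseline.items():
        # A slice missing from the candidate run is treated as a score of 0.
        new_score = candidate.get(slice_name, 0.0)
        if base_score - new_score > max_drop:
            failures.append(slice_name)
    return sorted(failures)
```

In a CI job, exit with a non-zero status when the returned list is non-empty, and print the slice names so the author knows where to look.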
🚀 In Practice
Trade-offs, exercises, what to ship today.
9. Observability
- Per-turn structured trace (turn id, tokens in/out, tool calls, latency, cost).
- Cohort dashboards (model, prompt version, user segment).
- Sampling: 100% errors, 10% successes, 100% high-cost.
- Privacy: hash user PII before logging.
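A per-turn trace with the sampling and privacy rules above might look like the following sketch. The field names, the `high_cost_usd` cutoff, and the use of a keyed HMAC (rather than a plain hash) are assumptions; a keyed hash is chosen here because plain hashes of low-entropy identifiers such as emails can be reversed by guessing.

```python
import hashlib
import hmac
import json
import random


def hash_user_id(user_id: str, secret: bytes) -> str:
    """Keyed hash so the raw identifier never reaches the logs."""
    return hmac.new(secret, user_id.encode(), hashlib.sha256).hexdigest()[:16]


def should_log(is_error: bool, cost_usd: float, high_cost_usd: float,
               success_rate: float = 0.10, rng=random.random) -> bool:
    """Keep 100% of errors and high-cost turns, and a sample of the rest."""
    if is_error or cost_usd >= high_cost_usd:
        return True
    return rng() < success_rate


def build_trace(turn_id: str, user_id: str, secret: bytes, tokens_in: int,
                tokens_out: int, tool_calls: list[str], latency_ms: float,
                cost_usd: float) -> str:
    """Serialise one turn as a single structured JSON log line."""
    record = {
        "turn_id": turn_id,
        "user": hash_user_id(user_id, secret),
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tool_calls": tool_calls,
        "latency_ms": latency_ms,
        "cost_usd": cost_usd,
    }
    return json.dumps(record, sort_keys=True)
```

Adding the model name and prompt version to `record` is what makes the cohort dashboards possible later.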
10. Rollout
- Internal dogfood (week 1).
- Closed beta with feedback button (week 2-3).
- Shadow mode: run the agent on live traffic without serving its output, and compare results against the baseline.
- 5% canary → monitor metrics 3-7 days.
- Gradual rollout with revert button.
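The canary and gradual rollout steps both need a stable way to decide which users see the new agent. A common approach, sketched here under assumptions, is hash-based bucketing; `in_canary` and the salt value `agent-v2` are hypothetical.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "agent-v2") -> bool:
    """Deterministically place a user in the first `percent` of 10,000 buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Because the bucket depends only on the salt and user id, raising `percent` from 5 to 20 keeps every existing canary user in the cohort, and setting it to 0 acts as the revert button.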
11. Cost model
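Estimate per session: avg tokens per turn × $/token × turns + tool/API costs. Input and output tokens are usually priced differently, so price them separately when you need precision. Set budget alerts at 70% / 90% of monthly forecast.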
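The per-session estimate and the alert thresholds translate directly into code. In this sketch the function names are hypothetical and the price in the usage note is an illustrative number, not a quote from any provider.

```python
from typing import Optional


def session_cost(avg_tokens_per_turn: float, usd_per_token: float,
                 turns: int, tool_cost_usd: float = 0.0) -> float:
    """avg tokens per turn x $/token x turns + tool/API costs."""
    return avg_tokens_per_turn * usd_per_token * turns + tool_cost_usd


def budget_alert(spend_usd: float, monthly_forecast_usd: float) -> Optional[str]:
    """Return an alert level at 70% and 90% of the monthly forecast."""
    ratio = spend_usd / monthly_forecast_usd
    if ratio >= 0.9:
        return "critical"
    if ratio >= 0.7:
        return "warning"
    return None
```

For example, 2,000 tokens per turn at an assumed $0.000005 per token over 6 turns, plus $0.01 of tool calls, comes to about $0.07 per session; multiply by expected monthly sessions to get the forecast the alerts are measured against.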
12. Risks register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Hallucination on facts | Med | High | RAG + faithfulness eval + abstain when below threshold |
| Prompt injection | Low | High | Tool arg validation + content filters + role separation |
| Cost runaway | Med | Med | Budget alerts + per-session cap |
| Model deprecation | Med | Low | Abstract client, regression suite |
| Latency spikes | Med | Med | Streaming + circuit breakers + fallback model |
13. Final reflection
You have completed all 28 days. You can now answer at staff-engineer depth across DE / ML / LLM / OOP / SYS. The capstone deliverable, a real design doc that has been reviewed, is what proves it. Print it, share it with one senior engineer, and iterate.
Onwards.
Resources
- 🎥 Anthropic — Building Effective Agents (re-watch with system lens)
- 📖 Hamel Husain — Your AI product needs evals
- 📖 Eugene Yan — Patterns for Building LLM-based Systems
- 📖 Chip Huyen — Building LLM applications for production
Practice Problem: Word Ladder (Hard)