ai mladvanced 12m2026-06-26

Day 28 — Putting It Together — A Production AI Agent (Capstone Day)

Final synthesis day. You've covered transformers, RAG, tools, evals, fine-tuning, serving, multimodal. Today you combine them into one complete agent design — a…

This is the synthesis day. Below is a template you can fill in for any real agent build — a checklist of the decisions the previous 27 days armed you to make.

1. Problem framing

Who's the user, what task, what's "good enough"?
Today's baseline (manual, scripted, simple LLM call)?
The 3 KPIs the agent must move (task success, time saved, cost, NPS).

Pick concrete numbers up front. Vague KPIs ("better", "faster") never get measured and never get hit.

2. Architecture in one diagram

3. Model choice

Default to hosted (Claude 3.5 Sonnet, GPT-4o) for capability and reliability.
Smaller / OSS (Llama 3, Qwen 2.5, Gemma 3) for cost, privacy, latency.
Multi-model: cheap router → strong model only when needed.
Fine-tune only after you've shipped, measured, and identified format/behaviour gaps.

4. RAG stack (if applicable)

Chunking strategy (contextual headers).
Embedding model + dimensionality.
Hybrid retrieval + reranker.
Retrieval eval (recall@k, MRR) with a golden set.
Versioning of (chunks, embeddings, prompts).

5. Tools

3-10 well-named, narrowly-scoped tools.
Schemas explicit; descriptions tuned with eval feedback.
Side-effect classes: read-only auto, write requires confirmation.
Idempotency keys on writes.

6. Memory

Short-term: full chat history within budget.
Mid-term: summarised episodes.
Long-term: vector store, retrieved on relevance.

7. Guardrails

Max steps cap.
Cost budget per session.
Output schema validation.
PII redaction in logs.
Prompt injection defence (separate system/user roles, content filters, allow-list of tool args from retrieved content).

8. Eval harness

Golden set (50-200, stratified by slice and difficulty).
LLM-as-judge with calibrated rubric.
Task-success metrics (binary or graded).
Trajectory metrics (steps, tool errors).
Cost / latency p50/p95.
CI gate: any slice drop > X% blocks merge.

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Observability

Per-turn structured trace (turn id, tokens in/out, tool calls, latency, cost).
Cohort dashboards (model, prompt version, user segment).
Sampling: 100% errors, 10% successes, 100% high-cost.
Privacy: hash user PII before logging.

10. Rollout

Internal dogfood (week 1).
Closed beta with feedback button (week 2-3).
Shadow mode for live traffic compared to baseline.
5% canary → monitor metrics 3-7 days.
Gradual rollout with revert button.

11. Cost model

Estimate per session: avg tokens × $/token × turns + tool/API costs. Set budget alerts at 70% / 90% of monthly forecast.

12. Risks register

Risk	Likelihood	Impact	Mitigation
Hallucination on facts	Med	High	RAG + faithfulness eval + threshold abstain
Prompt injection	Low	High	Tool arg validation + content filters + role separation
Cost runaway	Med	Med	Budget alerts + per-session cap
Model deprecation	Med	Low	Abstract client, regression suite
Latency spikes	Med	Med	Streaming + circuit breakers + fallback model

You shipped 28 days. You can now answer at staff-engineer depth across DE / ML / LLM / OOP / SYS. The capstone deliverable — a real design doc, reviewed — is what proves it. Print it, share it with one senior engineer, iterate.

Onwards.

Key points

Resources

Practice Problem: Word Ladder (Hard)

← previous

Day 27 — API Design — REST, GraphQL, gRPC; Versioning, Pagination, Errors