ai ml · advanced · 12m · 2026-06-26

Day 28 — Putting It Together — A Production AI Agent (Capstone Day)

Final synthesis day. You've covered transformers, RAG, tools, evals, fine-tuning, serving, multimodal. Today you combine them into one complete agent design — a…

This is the synthesis day. Below is a template you can fill in for any real agent build — a checklist of the decisions the previous 27 days armed you to make.

🧠 Concept

Why it matters & the mental model.

1. Problem framing

  • Who's the user, what task, what's "good enough"?
  • Today's baseline (manual, scripted, simple LLM call)?
  • The 3 KPIs the agent must move (pick from task success, time saved, cost, NPS).

Pick concrete numbers up front. Vague KPIs ("better", "faster") never get measured and never get hit.

2. Architecture in one diagram

3. Model choice

  • Default to hosted (Claude 3.5 Sonnet, GPT-4o) for capability and reliability.
  • Smaller / OSS (Llama 3, Qwen 2.5, Gemma 3) for cost, privacy, latency.
  • Multi-model: cheap router → strong model only when needed.
  • Fine-tune only after you've shipped, measured, and identified format/behaviour gaps.
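
The multi-model bullet can be sketched as a small routing function. This is a rule-based stand-in under stated assumptions: the model names, the `needs_tools` flag and the 500-character threshold are all hypothetical, and in a real build the router might itself be a small model rather than a pair of `if` statements.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider exposes.
CHEAP_MODEL = "small-model"
STRONG_MODEL = "strong-model"


@dataclass
class Route:
    model: str
    reason: str


def route(prompt: str, needs_tools: bool, max_cheap_chars: int = 500) -> Route:
    """Send short, tool-free requests to the cheap model; escalate the rest."""
    if needs_tools:
        return Route(STRONG_MODEL, "tool use requested")
    if len(prompt) > max_cheap_chars:
        return Route(STRONG_MODEL, "long prompt")
    return Route(CHEAP_MODEL, "short, tool-free prompt")
```

Returning the reason alongside the model makes it easy to log why a request was escalated, which helps when tuning the threshold against eval results.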

4. RAG stack (if applicable)

  • Chunking strategy (contextual headers).
  • Embedding model + dimensionality.
  • Hybrid retrieval + reranker.
  • Retrieval eval (recall@k, MRR) with a golden set.
  • Versioning of (chunks, embeddings, prompts).
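
For the retrieval eval bullet, recall@k and MRR over a golden set can be computed in a few lines. A minimal sketch, assuming each golden example is a pair of (ranked retrieved document ids, set of relevant ids); the function names are illustrative, not from any particular library.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(golden: list[tuple[list[str], set[str]]], k: int = 5) -> dict[str, float]:
    """Average both metrics over the golden set."""
    n = len(golden)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in golden) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in golden) / n,
    }
```

Storing these scores next to the versions of chunks, embeddings and prompts that produced them is what makes a later regression traceable.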

🛠 Deep Dive

Internals, code, architecture.

5. Tools

  • 3-10 well-named, narrowly-scoped tools.
  • Schemas explicit; descriptions tuned with eval feedback.
  • Side-effect classes: read-only tools run automatically; write tools require confirmation.
  • Idempotency keys on writes.
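
The side-effect and idempotency bullets can be combined in one small runner. This is a sketch, not a framework API: `Tool`, `ToolRunner` and the `confirm` callback are hypothetical names, and the in-memory dict stands in for whatever durable store a real system would use.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Tool:
    name: str
    fn: Callable[..., Any]
    writes: bool = False  # read-only tools run without confirmation


class ToolRunner:
    def __init__(self, confirm: Callable[[str, dict], bool]):
        self.confirm = confirm
        self._results: dict[str, Any] = {}  # idempotency key -> earlier result

    @staticmethod
    def idempotency_key(tool_name: str, args: dict) -> str:
        # Assumption: the key is derived from the tool name and its arguments.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def run(self, tool: Tool, args: dict) -> Any:
        if not tool.writes:
            return tool.fn(**args)
        key = self.idempotency_key(tool.name, args)
        if key in self._results:  # retried write: reuse the earlier result
            return self._results[key]
        if not self.confirm(tool.name, args):
            raise PermissionError(f"write to {tool.name} was not confirmed")
        result = tool.fn(**args)
        self._results[key] = result
        return result
```

Deriving the key from the arguments means two intentionally identical writes are also deduplicated. If that is wrong for your domain, have the caller generate one key per intended action instead.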

6. Memory

  • Short-term: full chat history within budget.
  • Mid-term: summarised episodes.
  • Long-term: vector store, retrieved on relevance.
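
Keeping short-term history "within budget" usually means dropping the oldest turns first. A minimal sketch, assuming messages are dicts with a `content` field; `count_tokens` is a crude whitespace stand-in and should be replaced with your model's tokenizer.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words, not real model tokens.
    return len(text.split())


def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Keep the newest messages whose combined token count fits the budget."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

The messages that fall outside the budget are the natural input for the mid-term tier: summarise them into an episode before discarding them.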

7. Guardrails

  • Max steps cap.
  • Cost budget per session.
  • Output schema validation.
  • PII redaction in logs.
  • Prompt injection defence (separate system/user roles, content filters, allow-list for tool arguments that originate from retrieved content).
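
The first two guardrails, a step cap and a per-session cost budget, fit in one small object that the agent loop consults before every step. A sketch with hypothetical names (`SessionGuard`, `BudgetExceeded`); the limits themselves are yours to choose.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a session would exceed its step cap or cost budget."""


class SessionGuard:
    def __init__(self, max_steps: int, max_cost_usd: float):
        self.max_steps = max_steps
        self.max_cost_usd = max_cost_usd
        self.steps = 0
        self.cost_usd = 0.0

    def charge(self, step_cost_usd: float) -> None:
        """Record one agent step; raise before the session exceeds a limit."""
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} reached")
        if self.cost_usd + step_cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost budget of ${self.max_cost_usd:.2f} reached")
        self.steps += 1
        self.cost_usd += step_cost_usd
```

Raising an exception rather than returning a flag forces the loop to handle the stop explicitly, for example by returning a partial answer with an explanation.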

8. Eval harness

  • Golden set (50-200, stratified by slice and difficulty).
  • LLM-as-judge with calibrated rubric.
  • Task-success metrics (binary or graded).
  • Trajectory metrics (steps, tool errors).
  • Cost / latency p50/p95.
  • CI gate: any slice drop > X% blocks merge.
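
The CI gate reduces to a per-slice comparison between a baseline run and the candidate. A minimal sketch, assuming scores are fractions in [0, 1] keyed by slice name and that the drop is measured in absolute terms; `max_drop` is the X the checklist leaves for you to pick.

```python
def slice_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop: float,
) -> dict[str, float]:
    """Return the slices whose score fell by more than max_drop (absolute)."""
    drops: dict[str, float] = {}
    for name, base_score in baseline.items():
        # A slice missing from the candidate run counts as a score of 0.
        drop = base_score - candidate.get(name, 0.0)
        if drop > max_drop:
            drops[name] = drop
    return drops
```

In a CI job, a non-empty result would translate into a non-zero exit code so the merge is blocked, with the offending slices printed in the log.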

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Observability

  • Per-turn structured trace (turn id, tokens in/out, tool calls, latency, cost).
  • Cohort dashboards (model, prompt version, user segment).
  • Sampling: 100% errors, 10% successes, 100% high-cost.
  • Privacy: hash user PII before logging.
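
The sampling and privacy bullets can be sketched with the standard library alone. Assumptions to note: traces are dicts with optional `error` and `cost_usd` fields, the high-cost threshold is a parameter you choose, and the keyed hash uses a secret that lives outside the logs.

```python
import hashlib
import hmac
import random
from typing import Callable


def hash_user_id(user_id: str, secret: bytes) -> str:
    """Keyed hash so raw identifiers never reach the logs."""
    return hmac.new(secret, user_id.encode(), hashlib.sha256).hexdigest()


def should_log(
    trace: dict,
    high_cost_usd: float,
    success_rate: float = 0.10,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Keep every error and every high-cost trace; sample the rest."""
    if trace.get("error"):
        return True
    if trace.get("cost_usd", 0.0) >= high_cost_usd:
        return True
    return rng() < success_rate
```

A keyed hash (HMAC) is used instead of a bare SHA-256 because low-entropy identifiers such as email addresses can otherwise be recovered by hashing a list of guesses.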

10. Rollout

  1. Internal dogfood (week 1).
  2. Closed beta with feedback button (week 2-3).
  3. Shadow mode: run on live traffic and compare against the baseline.
  4. 5% canary → monitor metrics 3-7 days.
  5. Gradual rollout with revert button.
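
One common way to implement the canary step is deterministic bucketing on a hashed user id, so a given user stays in or out of the canary across sessions. A sketch under that assumption; the `salt` value is a hypothetical experiment name.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "agent-canary") -> bool:
    """Deterministically place roughly `percent`% of users in the canary."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Raising `percent` over time gives the gradual rollout, and setting it to 0 is the revert button. Changing the salt reshuffles which users land in the canary.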

11. Cost model

Estimate per session: avg tokens × $/token × turns + tool/API costs. Set budget alerts at 70% / 90% of monthly forecast.
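
The estimate and the alert thresholds translate directly into code. A minimal sketch of the formula as written above; the function names and the "warning"/"critical" labels are illustrative, and real pricing usually distinguishes input from output tokens, which this deliberately ignores.

```python
def session_cost(
    avg_tokens_per_turn: float,
    usd_per_token: float,
    turns: int,
    tool_cost_usd: float = 0.0,
) -> float:
    """avg tokens x $/token x turns + tool/API costs."""
    return avg_tokens_per_turn * usd_per_token * turns + tool_cost_usd


def alert_level(spend_usd: float, monthly_forecast_usd: float) -> str:
    """Map month-to-date spend onto the 70% / 90% alert thresholds."""
    ratio = spend_usd / monthly_forecast_usd
    if ratio >= 0.9:
        return "critical"
    if ratio >= 0.7:
        return "warning"
    return "ok"
```

For example, with a hypothetical price of $0.000005 per token, 2,000 tokens per turn, 6 turns and $0.01 of tool calls, a session comes to roughly $0.07.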

12. Risks register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Hallucination on facts | Med | High | RAG + faithfulness eval + threshold abstain |
| Prompt injection | Low | High | Tool arg validation + content filters + role separation |
| Cost runaway | Med | Med | Budget alerts + per-session cap |
| Model deprecation | Med | Low | Abstract client, regression suite |
| Latency spikes | Med | Med | Streaming + circuit breakers + fallback model |

13. Final reflection

You made it through all 28 days. You can now answer at staff-engineer depth across DE / ML / LLM / OOP / SYS. The capstone deliverable, a real design doc that has been reviewed, is what proves it. Print it, share it with one senior engineer, and iterate.

Onwards.
