Search Tech Journey

Find topics, journeys and posts

back to blog
ai mlintermediate 12m2026-06-09

Capstone — Building a Production AI Agent End-to-End

Session 48 of the 48-session learning series.

Why this session matters

This is the final session. Everything from the previous 47 — transformers, RAG, evaluation, serving, MLOps, system design, OOP, data engineering — converges in one architecture: a real, production AI agent. The kind you'd build for a customer, defend at an investor meeting, and explain to a regulator. Capstone time.

Agenda

  • Reference architecture — every layer named, every decision justified
  • Build order — what you ship in week 1, month 1, quarter 1
  • Cost, latency, reliability budgets — and the trade matrix
  • Evaluation, safety, observability — the production hygiene
  • The 6-month roadmap — what you build next

Pre-read (skim before the session)

Deep dive

1. Reference architecture

A production AI agent (customer-support example) has 9 layers:

[ UI / API ]                ← chat widget, voice, embed
       │
[ Gateway ]                 ← auth, rate limit, request log
       │
[ Orchestrator (Agent loop) ] ← plan → tool → reflect → respond
       │
       ├── [ LLM Provider ]      ← Claude / GPT / OS model on vLLM
       │
       ├── [ RAG Layer ]         ← vector + keyword search; rerank
       │
       ├── [ Tools / APIs ]      ← CRM, calendar, knowledge base
       │
       ├── [ Memory ]            ← session + persistent user memory
       │
       ├── [ Safety filters ]    ← input + output moderation
       │
       └── [ Eval / Monitoring ] ← every turn logged + scored

Each box is from one of the 47 prior sessions. The art is gluing them.

2. The build order

Week 1: prototype.

  • Pick a base model (API).
  • Hardcode system prompt.
  • Wire 2–3 simple tools (search, get-record).
  • Demo to a real user. Get feedback.

Month 1: alpha.

  • RAG layer over the actual knowledge corpus.
  • Eval set of 50 hand-crafted prompts.
  • Logging of every turn.
  • Basic safety filter (moderation API).
  • A staging environment with feature-flag rollout.

Quarter 1: production.

  • Multi-tenant isolation.
  • Full observability (S46) — latency, cost, quality SLOs.
  • Auto-eval per release (S25).
  • Prompt versioning + A/B testing (S42).
  • Red-team suite running per release (S45).
  • Incident response process.

Quarter 2+: scale.

  • Self-hosted inference if cost demands (S30, S34).
  • Fine-tuned domain model (S33).
  • Feedback flywheel: thumbs-down → DPO data → next fine-tune.
  • More tools, more sources, more languages.

3. The agent loop

loop:
    user_msg = get_user_message()
    plan = model.plan(state, user_msg, available_tools)
    if plan == REPLY:
        out = model.generate_reply(state)
        yield out
    elif plan == CALL_TOOL:
        tool_result = call_tool(plan.tool, plan.args)
        state.add(tool_result)
        # back to top
    if turn_count > MAX or budget_exceeded:
        yield "I'm stuck; let me get a human."
        break

Things to add:

  • Step budget (cost cap).
  • Hard timeout.
  • Loop detection.
  • Confirmation gating on irreversible actions.

4. Tool design

For each tool:

  • Clear schema (name, params, types, description, examples).
  • Whitelisted (don't expose eval).
  • Idempotent where possible.
  • Rate-limited per user.
  • Logged in full (request + response).
  • "Dangerous" tag for human confirmation.

Most agent failures are bad tool definitions, not bad LLMs.

5. Cost model

For a customer-support agent at 1M conversations/month:

avg conversation = 5 turns × 1500 input tokens + 300 output tokens
                 = 7500 in + 1500 out per convo

monthly tokens = 1M × (7500 in + 1500 out) = 7.5B in + 1.5B out

@ $3/1M in + $15/1M out (Claude-class)
  = $22,500/mo + $22,500/mo = $45,000/mo

With prompt caching (50% cache hit): ~$25,000/mo
With self-hosted int4: ~$10,000/mo if you have the engineering bandwidth

Show the math to non-engineering stakeholders early. Cost is everyone's problem.

6. Latency budget

End-to-end target: < 3 s perceived (streaming).

Components:

  • Retrieval: 100 ms
  • Reranker: 50 ms
  • LLM TTFT: 500 ms
  • LLM generation: streaming, 30 tokens/s
  • Tool calls (1–2): 300 ms each
  • Safety filter: 50 ms

Stack carefully. Parallelise where possible (retrieve while planning).

7. Reliability budget

Per dependency:

  • Model API (99.9% SLA from provider). Have a fallback model.
  • Vector DB (99.95%). Cache common queries.
  • CRM (varies). Soft-fail on retrieval errors.
  • Internal queue (99.99%). Persistent retry.

Composite SLO: take the product of per-dependency SLOs. For 6 dependencies at 99.9%, total = 99.4%. Manage user expectations accordingly.

8. Eval as a discipline

  • 100+ hand-crafted prompts per skill (S25).
  • LLM-as-judge with bias mitigations.
  • Per-tool unit tests.
  • Per-prompt regression test on every change.
  • A/B test of any user-visible change.
  • Production: thumbs-up/-down + every-N-th-turn human review.

Build the eval pipeline before the agent's complex. It compounds your iteration speed.

9. Safety in depth

(All from S45):

  • Input filter for known jailbreaks.
  • Wrapped delimiters for user + retrieved content.
  • Tool gating with confirmation on writes.
  • Output filter for PII + secrets.
  • Red-team suite weekly.
  • Incident response process.

10. The data flywheel

Production interactions are gold:

  • Log every turn with model version, prompt version, retrieved chunks, tool calls, latency, cost, feedback.
  • Mine thumbs-down for fine-tune data.
  • Mine thumbs-up for eval positives.
  • Mine novel queries for RAG corpus gaps.

The flywheel is what makes agents that "get better over time" rather than rotting.

11. The 6-month roadmap

For the typical AI agent product:

  • M1: alpha (50 internal users).
  • M2: beta (500 friendly customers).
  • M3: GA — production SLOs met.
  • M4: expand tools (more API integrations).
  • M5: multi-language; multi-tenant.
  • M6: optimised serving (self-host or distilled).

In parallel, every month:

  • Eval set grows by 20%.
  • Red-team set grows by 10%.
  • Cost-per-conversation declines by 10–20%.
  • User satisfaction ticks up.

12. The handover document

When you leave this project (job change, promotion, project complete), the next engineer needs:

  • Architecture diagram (data flow, dependencies, costs).
  • Runbooks per alert.
  • Eval suite + how to run it.
  • Prompt versions + change log.
  • Tool definitions + risk assessment.
  • Postmortem archive.
  • Top-3 known issues.

A good handover = your successor productive in week 1, not month 3.

13. Final reality check

Most AI agents in production right now are:

  • Built in 2 weeks for the demo.
  • Stalled at "works for the founder, breaks for users" for 6 months.
  • Replaced with v2 after the team learns what they wished they'd known.

You're now in a position to write v1 properly. That's the difference this curriculum was meant to make.

14. What's next after 48 sessions

  • Pick one of these tracks and go deeper. Ship a side project.
  • Teach what you learned — blog, talk, podcast.
  • Mentor someone earlier on the path.
  • Repeat this kind of structured study every year. The field moves; you must too.

Congratulations on finishing the 48. Now do the work.

Reading material

Books:

  • AI Engineering — Chip Huyen (the canonical book for production LLM systems; the chapter on agents in particular)
  • Designing Machine Learning Systems — Chip Huyen (the foundation book for ML in production; required prior reading)
  • Building LLM-powered Applications — Valentina Alto (practitioner walkthrough of RAG, agents, evals)
  • Site Reliability Engineering + SRE Workbook — Beyer et al. (free Google books; the reliability foundation an agent rests on)
  • Generative AI with LangChain — Ben Auffarth (the most-used production-pattern book for orchestration frameworks)
  • All 47 prior session notes — re-read your own MD files; this is your strongest reference.

Papers:

Official docs:

Blog posts:

In-depth research material

Videos

LeetCode — Design In Memory File System

  • Link: https://leetcode.com/problems/design-in-memory-file-system/
  • Difficulty: Hard
  • Why this problem: A capstone problem too — multiple objects, hierarchies, mutations, persistence. Same shape as building a production system: many pieces, one consistent state. (We saw it in S24; come back to it as a final exam.)
  • Time-box: 45 minutes. This time, don't look at the editorial. You've earned the right to try.

Post-session checklist

By the end of this session — and this 48-session series — you should be able to:

  • Draw the 9-layer reference architecture of a production AI agent from memory.
  • Plan a 1-week / 1-month / 1-quarter build order for a new agent product.
  • Compute the cost of a conversation given token budgets and provider pricing.
  • Stack timeouts, retries, circuit breakers, bulkheads for each external dependency.
  • Wire evaluation, safety, observability into the release process.
  • Run a postmortem, file action items, close the loop.
  • Pick the right tools/architecture from each prior session, with justification.
  • Solve design-in-memory-file-system once more — a fitting capstone for the series.

🎓 You've finished the 48 sessions. Now build something.


Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.