ai mladvanced 12m2026-06-09

Day 11 — Function Calling, Tool Use, and Agentic Loops

Tool calling turns LLMs from text generators into autonomous workers. Mastering the agent loop (plan → call → observe → continue) is the bedrock of every Copilo…

A "tool" is just a JSON schema the model can fill out; the runtime executes it and feeds the result back. This minimal contract is what makes agents general-purpose. Below is how to build one that actually works.

🧠 Concept

Why it matters & the mental model.

1. The loop

Implement it as: while the assistant message contains tool_use blocks, execute each tool, append tool_result blocks, call the model again. Stop when the model emits only text.

2. Tool design — the under-rated 80%

The single biggest predictor of agent quality is how clearly your tools are described and shaped. Rules:

One tool, one verb (search_jira_issues, not jira_do).
Schema field names are documentation: query: str # Lucene query string not q.
Optional parameters with sensible defaults; never required if there's a 90% case.
Error messages should tell the model what to do next ("since must be ISO-8601, e.g. 2025-01-15").
Idempotent where possible; if not, return the existing id on retry.

3. Native tool calling vs ReAct prompting

Native (Anthropic tools=, OpenAI tools=, Gemini): model returns structured JSON in a dedicated channel; runtime parses cleanly. Use this.
ReAct text prompting ("Thought: …\nAction: …\nObservation: …"): the original 2022 trick, still useful for models without native tools, but fragile (parsing errors, format drift).

4. Planning patterns

Tool-calling loop (default): model decides next step each turn — flexible, brittle on 10+ steps.
Plan-and-execute: model emits a full plan first (list of steps), executor runs each step (possibly with sub-agent), model re-plans on failure. Cuts steps and cost.
Reflexion: after a failed attempt, model writes a self-critique and retries with that context. Good for code/math.
Tree of Thoughts / MCTS: explore branches with a scoring function. Expensive, useful for hard reasoning.

🛠 Deep Dive

Internals, code, architecture.

5. Memory

Scratchpad: keep all tool calls + observations in context. Fine up to ~30 steps.
Summarised memory: every N turns, replace older turns with a compact summary. Extends horizon.
Episodic / vector memory: store past (task, plan, outcome) tuples in a vector store, retrieve on new tasks.
Structured memory: e.g. a YAML "world model" the agent reads/edits.

6. Guardrails

Max-steps cap (e.g. 25) with graceful "I couldn't complete; here's what I tried".
Tool whitelist + per-tool rate limit.
Side-effect classes: read-only tools auto-approved, write tools (send_email, charge_card) require explicit confirmation.
Output schemas: validate every tool result with Pydantic; surface schema violations to the model so it can correct.

7. Observability — non-negotiable in production

Log every turn: \{turn, model, input_tokens, output_tokens, tool_calls: [...], latency_ms, cost_usd\}. Use LangSmith / Arize / Helicone / OpenLLMetry or roll a simple JSONL trace. Without traces you cannot debug or improve.

8. Evals for agents

Single-step accuracy isn't enough. Multi-step eval needs:

Task success rate on a held-out set of 50-200 tasks.
Steps to completion (efficiency).
Cost per task ($/task).
Specific behaviour assertions (e.g. "did the agent call confirm_with_user before send_email?"). Frameworks: inspect_ai (UK AISI), langsmith evals, ragas for retrieval-flavoured agents.

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Multi-agent — only when you need it

Adding agents adds coordination cost and failure surface. Use only when roles are genuinely distinct: a planner + a coder + a reviewer outperforms one bigger agent on long coding tasks (cf. AutoGen, CrewAI, Devin-style swarms). Otherwise stay single-agent + good tools.

10. Cost & latency

Tool calling roughly doubles tokens per turn (system + tools schema + history). Trim tool descriptions aggressively.
Cache static prefixes (Anthropic prompt caching) — agent system prompts get free 90% off on repeat calls.
Stream the final text; never stream tool args (they must be complete).
Parallel tool calls when the model issues multiple in one turn — most APIs now support it.

11. Common failure modes

Symptom	Fix
Loops calling same tool with same args	Detect repeats; inject "you already did X with result Y — try something else."
Hallucinated tool names/fields	Strict schema validation + surface clear error.
Gives up too early	Raise model "temperature" slightly OR use plan-and-execute.
Forgets earlier observation	Summarise older turns; pin key facts in a system note.
Asks user before trying obvious tool	System prompt: "Prefer tool use over questions when you have enough info."

12. What to take away

"Walk me through your agent architecture." Strong answers: model + tool list + loop with bounded steps + memory strategy + eval harness + one specific failure mode you debugged. Bonus: distinguish ReAct from native tool calling.

Key points

Resources

🎥 Anthropic — Building Effective Agents
📖 Anthropic — Building Effective Agents (essay)
📖 OpenAI — Function calling guide
📖 ReAct paper — Yao et al.

Practice Problem: Decode String (Medium)

← previous

Day 10 — Concurrency Models — Threads, Asyncio, GIL, Actors

Day 12 — Lakehouse Architecture — Delta Lake / Iceberg / Hudi, ACID on Object Storage