Day 11 — Function Calling, Tool Use, and Agentic Loops
Tool calling turns LLMs from text generators into autonomous workers. Mastering the agent loop (plan → call → observe → continue) is the bedrock of every Copilo…
A "tool" is just a JSON schema the model can fill out; the runtime executes it and feeds the result back. This minimal contract is what makes agents general-purpose. Below is how to build one that actually works.
🧠 Concept
Why it matters & the mental model.
1. The loop
Implement it as: while the assistant message contains tool_use blocks, execute each tool, append tool_result blocks, call the model again. Stop when the model emits only text.
2. Tool design — the under-rated 80%
The single biggest predictor of agent quality is how clearly your tools are described and shaped. Rules:
- One tool, one verb (
search_jira_issues, notjira_do). - Schema field names are documentation:
query: str # Lucene query stringnotq. - Optional parameters with sensible defaults; never required if there's a 90% case.
- Error messages should tell the model what to do next ("
sincemust be ISO-8601, e.g. 2025-01-15"). - Idempotent where possible; if not, return the existing id on retry.
3. Native tool calling vs ReAct prompting
- Native (Anthropic
tools=, OpenAItools=, Gemini): model returns structured JSON in a dedicated channel; runtime parses cleanly. Use this. - ReAct text prompting ("Thought: …\nAction: …\nObservation: …"): the original 2022 trick, still useful for models without native tools, but fragile (parsing errors, format drift).
4. Planning patterns
- Tool-calling loop (default): model decides next step each turn — flexible, brittle on 10+ steps.
- Plan-and-execute: model emits a full plan first (list of steps), executor runs each step (possibly with sub-agent), model re-plans on failure. Cuts steps and cost.
- Reflexion: after a failed attempt, model writes a self-critique and retries with that context. Good for code/math.
- Tree of Thoughts / MCTS: explore branches with a scoring function. Expensive, useful for hard reasoning.
🛠 Deep Dive
Internals, code, architecture.
5. Memory
- Scratchpad: keep all tool calls + observations in context. Fine up to ~30 steps.
- Summarised memory: every N turns, replace older turns with a compact summary. Extends horizon.
- Episodic / vector memory: store past (task, plan, outcome) tuples in a vector store, retrieve on new tasks.
- Structured memory: e.g. a YAML "world model" the agent reads/edits.
6. Guardrails
- Max-steps cap (e.g. 25) with graceful "I couldn't complete; here's what I tried".
- Tool whitelist + per-tool rate limit.
- Side-effect classes: read-only tools auto-approved, write tools (send_email, charge_card) require explicit confirmation.
- Output schemas: validate every tool result with Pydantic; surface schema violations to the model so it can correct.
7. Observability — non-negotiable in production
Log every turn: \{turn, model, input_tokens, output_tokens, tool_calls: [...], latency_ms, cost_usd\}. Use LangSmith / Arize / Helicone / OpenLLMetry or roll a simple JSONL trace. Without traces you cannot debug or improve.
8. Evals for agents
Single-step accuracy isn't enough. Multi-step eval needs:
- Task success rate on a held-out set of 50-200 tasks.
- Steps to completion (efficiency).
- Cost per task ($/task).
- Specific behaviour assertions (e.g. "did the agent call
confirm_with_userbeforesend_email?"). Frameworks:inspect_ai(UK AISI),langsmith evals,ragasfor retrieval-flavoured agents.
🚀 In Practice
Trade-offs, exercises, what to ship today.
9. Multi-agent — only when you need it
Adding agents adds coordination cost and failure surface. Use only when roles are genuinely distinct: a planner + a coder + a reviewer outperforms one bigger agent on long coding tasks (cf. AutoGen, CrewAI, Devin-style swarms). Otherwise stay single-agent + good tools.
10. Cost & latency
- Tool calling roughly doubles tokens per turn (system + tools schema + history). Trim tool descriptions aggressively.
- Cache static prefixes (Anthropic prompt caching) — agent system prompts get free 90% off on repeat calls.
- Stream the final text; never stream tool args (they must be complete).
- Parallel tool calls when the model issues multiple in one turn — most APIs now support it.
11. Common failure modes
| Symptom | Fix |
|---|---|
| Loops calling same tool with same args | Detect repeats; inject "you already did X with result Y — try something else." |
| Hallucinated tool names/fields | Strict schema validation + surface clear error. |
| Gives up too early | Raise model "temperature" slightly OR use plan-and-execute. |
| Forgets earlier observation | Summarise older turns; pin key facts in a system note. |
| Asks user before trying obvious tool | System prompt: "Prefer tool use over questions when you have enough info." |
12. What to take away
"Walk me through your agent architecture." Strong answers: model + tool list + loop with bounded steps + memory strategy + eval harness + one specific failure mode you debugged. Bonus: distinguish ReAct from native tool calling.
Resources
- 🎥 Anthropic — Building Effective Agents
- 📖 Anthropic — Building Effective Agents (essay)
- 📖 OpenAI — Function calling guide
- 📖 ReAct paper — Yao et al.
Practice Problem: Decode String (Medium)