Function Calling, Tool Use, Agentic Loops
Session 21 of the 48-session learning series.
Why this session matters
This is Session 21 of 48 in the LLMs & Agents track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- From chat to tools — function-calling primitives in modern APIs
- The agent loop — plan, act, observe, repeat
- Designing tools — schemas, idempotency, error surfaces
- Stopping the loop — budgets, max-iterations, escape valves
- Patterns — ReAct, Plan-and-Execute, Multi-Agent, Tool Routing
Pre-read (skim before the session)
- OpenAI — Function calling guide
- Anthropic — Claude tool use
- ReAct: Synergizing Reasoning and Acting (Yao et al., 2022)
- Anthropic — Building effective agents (2024)
Deep dive
1. From completion to call
The base LLM emits text. Function calling = teach it to emit structured JSON that names a tool and its arguments, instead of (or in addition to) prose.
{
"tool_calls": [{
"id": "call_a1",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"city\": \"Hyderabad\"}"
}
}]
}
You parse, execute, feed the result back as a "tool" message; the model picks up where it left off. This loop is the substrate of every "agent".
2. Anatomy of an agent step
loop:
messages: [system, user, ... assistant tool_call, tool result, ...]
response = llm.complete(messages, tools=TOOLS)
if response.tool_calls:
for call in response.tool_calls:
result = dispatch(call)
messages.append({role: "tool", call_id: call.id, content: result})
continue
else:
return response.content # final answer
Three things to manage carefully:
- The state (messages list — grows; trim or summarise).
- The tool dispatch (sync or async; sandbox if untrusted).
- The stopping condition (max iterations, budget, idempotency check).
3. Tool schema design
Treat each tool as a small API:
{
"type": "function",
"function": {
"name": "search_orders",
"description": "Search orders by customer email or order id. Returns up to 20 matching orders.",
"parameters": {
"type": "object",
"properties": {
"email": {"type": "string", "description": "Customer email exact match"},
"order_id": {"type": "string", "description": "Order id, format ORD-12345"}
},
"required": []
}
}
}
Rules of thumb:
- Names verb-like and unambiguous —
search_ordersnotorders. - Descriptions matter more than you think. They're the model's only signal. Write them as if for a junior dev.
- Examples in the description when ambiguous — "format ORD-12345" beats explaining in prose.
- Few wide tools beat many narrow ones.
query_database(sql)is a bad single tool (unsafe + too open).search_orders,get_customer,cancel_orderis right.
4. Idempotency and safety
Agents will retry. They will hallucinate. They will call a tool twice with the same arguments after a partial failure. Make every state-changing tool idempotent by accepting a client-supplied request id:
{
"name": "refund_payment",
"parameters": {
"payment_id": "pi_abc",
"amount": 1000,
"idempotency_key": "agent_run_xyz_step_3"
}
}
For destructive actions, require confirmation: tool returns a confirmation token, model asks the user, calls again with token. Or wrap in a human-in-the-loop approval queue.
5. The stopping problem
Naive agent: while-true with no exit. Bad. Always cap:
MAX_ITER = 10
MAX_TOOLS_PER_TURN = 5
TOKEN_BUDGET = 50_000
for i in range(MAX_ITER):
resp = llm.complete(messages, tools=TOOLS, max_tokens=...)
if usage.total_tokens > TOKEN_BUDGET: bail("budget")
if not resp.tool_calls: return resp.content
if len(resp.tool_calls) > MAX_TOOLS_PER_TURN: bail("too many parallel calls")
... dispatch ...
bail("max iterations")
Loop without budget is a billing pager incident waiting to happen.
6. ReAct pattern
Yao et al. 2022 — interleave reasoning and acting. The model produces a "Thought:" before each tool call.
Thought: The user wants their last 3 orders. I should search by email.
Action: search_orders(email="dinesh@example.com")
Observation: [3 orders returned]
Thought: I have the orders. Format them clearly.
Final Answer: Here are your last 3 orders: ...
Modern function-calling APIs implicitly do this. ReAct is more useful when:
- Tools are expensive (model thinks before calling).
- You want auditable reasoning trails.
7. Plan-and-Execute
Two-stage agent:
- Planner: produces a step list (no tool calls yet).
- Executor: walks the steps, calling tools.
Cleaner for multi-step tasks (research, build me X). Cost: extra LLM call upfront. Lose flexibility if mid-plan you learn something new — usually mitigated by allowing re-plan.
8. Multi-agent — when (rarely)
The hype: spin up 5 agents that "collaborate". Reality: each new agent multiplies the cost and the chance of cascade failures. Most workflows are better as one well-tooled agent.
Multi-agent earns its keep when:
- Specialisation (researcher + writer + critic with very different prompts).
- Privilege separation (one agent can write code, another only review).
- Parallel exploration (5 agents try different approaches, pick best).
Default: one agent. Reach for multi-agent when forced.
9. Tool routing — the LLM is the router
When you have 100 tools, the system prompt + schemas explode and accuracy drops. Two patterns:
- Hierarchical menus — model first picks a category (
orders,billing,accounts); category tool returns sub-tools. - Retrieval-augmented tools — embed tool descriptions; at each step retrieve top-k relevant tools and only show those to the model.
Both keep the prompt small.
10. Production patterns from real agents
- Persist tool results. When a step fails mid-loop and you retry, replay observations.
- Audit log everything. Every prompt, every response, every tool call, every result. Token by token.
- Streaming — stream the model's text + tool calls; show progress to users. Latency feels half.
- Cancellation — user closes the chat; cancel any in-flight tool. Hand back partial state.
- Cost meter — every step prints tokens + $. Production-critical. Without it, agents silently 10x in cost overnight.
11. Evaluations for agents
Agents are notoriously hard to eval (covered more in S25). Quick wins:
- End-to-end task success — does the agent accomplish a labelled task?
- Tool-call F1 — did it call the right tool with right args, ignoring extra calls?
- Step efficiency — did it solve in N steps or 3N?
- LLM-as-judge for trace quality ("was this reasoning sensible?").
12. Frameworks landscape
| Framework | Style | When to pick |
|---|---|---|
| LangGraph | Graph of nodes (states + transitions) | Complex stateful workflows |
| LlamaIndex agents | Tools + memory abstractions | Document-heavy RAG agents |
| OpenAI Assistants | Hosted, file-handling built-in | Quick POC, don't want infra |
| Anthropic SDK + custom | Bare-metal loop | You want control; 80% of teams |
| AutoGen, CrewAI | Multi-agent first | Specialised multi-agent flows |
My bias: start with a 50-line custom loop. Reach for a framework only when state/graph complexity demands it.
Reading material
Books:
- AI Engineering — Chip Huyen, 2024 (chs. on agents, tool use, evaluation — the best current overview)
- Building LLMs for Production — Louis-François Bouchard & Louïs Peters (the chapter on agentic frameworks)
- Designing Machine Learning Systems — Chip Huyen (the systems thinking pairs perfectly with agents)
Papers:
- ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al. 2022 — the thought→action→observation loop every agent now uses.
- Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al. 2023 — self-supervised tool use.
- Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al. 2023 — self-critique loop for agents.
- Voyager: An Open-Ended Embodied Agent with LLMs — Wang et al. 2023 — the GPT-4 Minecraft agent with skill library.
- SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Yang et al. 2024 (Princeton) — the agent that solves real GitHub issues.
Official docs:
- OpenAI — Function calling guide — the JSON-schema tool-call API.
- Anthropic — Tool use docs — Claude's tool-calling reference.
- Anthropic — Building effective agents — design patterns (workflows vs agents) from Anthropic.
- LangGraph — docs — stateful, graph-based agent framework.
- Model Context Protocol (MCP) — Anthropic spec — the emerging standard for tool servers.
Blog posts:
- The Bitter Lesson of LLM Agents — Simon Willison — the practitioner's pithy summary of Anthropic's post.
- Lilian Weng — LLM Powered Autonomous Agents — the canonical agent survey post; planning, memory, tools.
- OpenAI — A practical guide to building agents — OpenAI's own playbook (2025).
In-depth research material
- SWE-agent — github.com/princeton-nlp/SWE-agent — ~17k ★, the Princeton agent that solves SWE-bench issues.
- Open Interpreter — github.com/OpenInterpreter/open-interpreter — ~59k ★, local code-execution agent.
- smolagents (HF) — github.com/huggingface/smolagents — ~22k ★, minimalist code-action agents from HF.
- crewAI — github.com/crewAIInc/crewAI — ~37k ★, multi-agent orchestration with roles.
- LangGraph — github.com/langchain-ai/langgraph — ~19k ★, the graph-state agent runtime.
- Tau-bench — Sierra (Bret Taylor's company) — the canonical real-world agent benchmark.
- GAIA: A Benchmark for General AI Assistants — Mialon et al. 2023 — the agentic benchmark Meta+HF use.
- Computer Use API — Anthropic blog — screenshots + mouse/keyboard as a tool.
- Hamel Husain — Your AI Product Needs Evals — why agent evals are the hardest part.
- Cognition — Don't Build Multi-Agents — the influential counter-take from the Devin team.
Videos
- How I Use "AI Agents" (Real World Examples) — Andrej Karpathy — Andrej Karpathy · 39 min — the practitioner's view from a co-author of GPT.
- Agentic Design Patterns — Andrew Ng keynote — Andrew Ng · 30 min — reflection, tool use, planning, multi-agent: the four-pattern taxonomy.
- Building Effective Agents — Anthropic (Erik Schluntz) — 38 min — walk-through of the Anthropic post by one of its authors.
- State of GPT — Andrej Karpathy (Microsoft Build) — 42 min — the agent-loop framing in his GPT tutorial; still the clearest explainer.
- Function Calling and ReAct in Practice — Jerry Liu (LlamaIndex) — 49 min — from the founder of LlamaIndex; hands-on examples.
LeetCode — Basic Calculator Ii
- Link: https://leetcode.com/problems/basic-calculator-ii/
- Difficulty: Medium
- Why this problem: Stack-based eval respecting precedence — same logic an agent uses to evaluate tool chains.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Implement a tool-use loop in <50 lines with iteration cap and token budget.
- Design 3 tool schemas with good descriptions; explain why each is right.
- Make every state-changing tool idempotent.
- Explain ReAct vs Plan-and-Execute and when to pick each.
- List 4 stopping conditions every agent loop must enforce.
- Solve
basic-calculator-ii— stack-based eval, same shape as evaluating a tool chain with precedence.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.