ai mlintermediate 12m2026-06-09

Function Calling, Tool Use, Agentic Loops

Session 21 of the 48-session learning series.

Why this session matters

This is Session 21 of 48 in the LLMs & Agents track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

From chat to tools — function-calling primitives in modern APIs
The agent loop — plan, act, observe, repeat
Designing tools — schemas, idempotency, error surfaces
Stopping the loop — budgets, max-iterations, escape valves
Patterns — ReAct, Plan-and-Execute, Multi-Agent, Tool Routing

Pre-read (skim before the session)

Deep dive

1. From completion to call

The base LLM emits text. Function calling = teach it to emit structured JSON that names a tool and its arguments, instead of (or in addition to) prose.

{
  "tool_calls": [{
    "id": "call_a1",
    "type": "function",
    "function": {
      "name": "get_weather",
      "arguments": "{\"city\": \"Hyderabad\"}"
    }
  }]
}

You parse, execute, feed the result back as a "tool" message; the model picks up where it left off. This loop is the substrate of every "agent".

2. Anatomy of an agent step

loop:
  messages: [system, user, ... assistant tool_call, tool result, ...]
  response = llm.complete(messages, tools=TOOLS)
  if response.tool_calls:
     for call in response.tool_calls:
         result = dispatch(call)
         messages.append({role: "tool", call_id: call.id, content: result})
     continue
  else:
     return response.content   # final answer

Three things to manage carefully:

The state (messages list — grows; trim or summarise).
The tool dispatch (sync or async; sandbox if untrusted).
The stopping condition (max iterations, budget, idempotency check).

3. Tool schema design

Treat each tool as a small API:

{
  "type": "function",
  "function": {
    "name": "search_orders",
    "description": "Search orders by customer email or order id. Returns up to 20 matching orders.",
    "parameters": {
      "type": "object",
      "properties": {
        "email":    {"type": "string", "description": "Customer email exact match"},
        "order_id": {"type": "string", "description": "Order id, format ORD-12345"}
      },
      "required": []
    }
  }
}

Rules of thumb:

Names verb-like and unambiguous — search_orders not orders.
Descriptions matter more than you think. They're the model's only signal. Write them as if for a junior dev.
Examples in the description when ambiguous — "format ORD-12345" beats explaining in prose.
Few wide tools beat many narrow ones. query_database(sql) is a bad single tool (unsafe + too open). search_orders, get_customer, cancel_order is right.

Agents will retry. They will hallucinate. They will call a tool twice with the same arguments after a partial failure. Make every state-changing tool idempotent by accepting a client-supplied request id:

{
  "name": "refund_payment",
  "parameters": {
    "payment_id": "pi_abc",
    "amount": 1000,
    "idempotency_key": "agent_run_xyz_step_3"
  }
}

For destructive actions, require confirmation: tool returns a confirmation token, model asks the user, calls again with token. Or wrap in a human-in-the-loop approval queue.

5. The stopping problem

Naive agent: while-true with no exit. Bad. Always cap:

MAX_ITER = 10
MAX_TOOLS_PER_TURN = 5
TOKEN_BUDGET = 50_000

for i in range(MAX_ITER):
    resp = llm.complete(messages, tools=TOOLS, max_tokens=...)
    if usage.total_tokens > TOKEN_BUDGET: bail("budget")
    if not resp.tool_calls: return resp.content
    if len(resp.tool_calls) > MAX_TOOLS_PER_TURN: bail("too many parallel calls")
    ... dispatch ...

bail("max iterations")

Loop without budget is a billing pager incident waiting to happen.

6. ReAct pattern

Yao et al. 2022 — interleave reasoning and acting. The model produces a "Thought:" before each tool call.

Thought: The user wants their last 3 orders. I should search by email.
Action: search_orders(email="dinesh@example.com")
Observation: [3 orders returned]
Thought: I have the orders. Format them clearly.
Final Answer: Here are your last 3 orders: ...

Modern function-calling APIs implicitly do this. ReAct is more useful when:

Tools are expensive (model thinks before calling).
You want auditable reasoning trails.

7. Plan-and-Execute

Two-stage agent:

Planner: produces a step list (no tool calls yet).
Executor: walks the steps, calling tools.

Cleaner for multi-step tasks (research, build me X). Cost: extra LLM call upfront. Lose flexibility if mid-plan you learn something new — usually mitigated by allowing re-plan.

8. Multi-agent — when (rarely)

The hype: spin up 5 agents that "collaborate". Reality: each new agent multiplies the cost and the chance of cascade failures. Most workflows are better as one well-tooled agent.

Multi-agent earns its keep when:

Specialisation (researcher + writer + critic with very different prompts).
Privilege separation (one agent can write code, another only review).
Parallel exploration (5 agents try different approaches, pick best).

Default: one agent. Reach for multi-agent when forced.

9. Tool routing — the LLM is the router

When you have 100 tools, the system prompt + schemas explode and accuracy drops. Two patterns:

Hierarchical menus — model first picks a category (orders, billing, accounts); category tool returns sub-tools.
Retrieval-augmented tools — embed tool descriptions; at each step retrieve top-k relevant tools and only show those to the model.

Both keep the prompt small.

10. Production patterns from real agents

Persist tool results. When a step fails mid-loop and you retry, replay observations.
Audit log everything. Every prompt, every response, every tool call, every result. Token by token.
Streaming — stream the model's text + tool calls; show progress to users. Latency feels half.
Cancellation — user closes the chat; cancel any in-flight tool. Hand back partial state.
Cost meter — every step prints tokens + $. Production-critical. Without it, agents silently 10x in cost overnight.

11. Evaluations for agents

Agents are notoriously hard to eval (covered more in S25). Quick wins:

End-to-end task success — does the agent accomplish a labelled task?
Tool-call F1 — did it call the right tool with right args, ignoring extra calls?
Step efficiency — did it solve in N steps or 3N?
LLM-as-judge for trace quality ("was this reasoning sensible?").

12. Frameworks landscape

Framework	Style	When to pick
LangGraph	Graph of nodes (states + transitions)	Complex stateful workflows
LlamaIndex agents	Tools + memory abstractions	Document-heavy RAG agents
OpenAI Assistants	Hosted, file-handling built-in	Quick POC, don't want infra
Anthropic SDK + custom	Bare-metal loop	You want control; 80% of teams
AutoGen, CrewAI	Multi-agent first	Specialised multi-agent flows

My bias: start with a 50-line custom loop. Reach for a framework only when state/graph complexity demands it.

Reading material

Books:

AI Engineering — Chip Huyen, 2024 (chs. on agents, tool use, evaluation — the best current overview)
Building LLMs for Production — Louis-François Bouchard & Louïs Peters (the chapter on agentic frameworks)
Designing Machine Learning Systems — Chip Huyen (the systems thinking pairs perfectly with agents)

Papers:

ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al. 2022 — the thought→action→observation loop every agent now uses.
Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al. 2023 — self-supervised tool use.
Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al. 2023 — self-critique loop for agents.
Voyager: An Open-Ended Embodied Agent with LLMs — Wang et al. 2023 — the GPT-4 Minecraft agent with skill library.
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering — Yang et al. 2024 (Princeton) — the agent that solves real GitHub issues.

Official docs:

OpenAI — Function calling guide — the JSON-schema tool-call API.
Anthropic — Tool use docs — Claude's tool-calling reference.
Anthropic — Building effective agents — design patterns (workflows vs agents) from Anthropic.
LangGraph — docs — stateful, graph-based agent framework.
Model Context Protocol (MCP) — Anthropic spec — the emerging standard for tool servers.

Blog posts:

The Bitter Lesson of LLM Agents — Simon Willison — the practitioner's pithy summary of Anthropic's post.
Lilian Weng — LLM Powered Autonomous Agents — the canonical agent survey post; planning, memory, tools.
OpenAI — A practical guide to building agents — OpenAI's own playbook (2025).

In-depth research material

SWE-agent — github.com/princeton-nlp/SWE-agent — ~17k ★, the Princeton agent that solves SWE-bench issues.
Open Interpreter — github.com/OpenInterpreter/open-interpreter — ~59k ★, local code-execution agent.
smolagents (HF) — github.com/huggingface/smolagents — ~22k ★, minimalist code-action agents from HF.
crewAI — github.com/crewAIInc/crewAI — ~37k ★, multi-agent orchestration with roles.
LangGraph — github.com/langchain-ai/langgraph — ~19k ★, the graph-state agent runtime.
Tau-bench — Sierra (Bret Taylor's company) — the canonical real-world agent benchmark.
GAIA: A Benchmark for General AI Assistants — Mialon et al. 2023 — the agentic benchmark Meta+HF use.
Computer Use API — Anthropic blog — screenshots + mouse/keyboard as a tool.
Hamel Husain — Your AI Product Needs Evals — why agent evals are the hardest part.
Cognition — Don't Build Multi-Agents — the influential counter-take from the Devin team.

Videos

How I Use "AI Agents" (Real World Examples) — Andrej Karpathy — Andrej Karpathy · 39 min — the practitioner's view from a co-author of GPT.
Agentic Design Patterns — Andrew Ng keynote — Andrew Ng · 30 min — reflection, tool use, planning, multi-agent: the four-pattern taxonomy.
Building Effective Agents — Anthropic (Erik Schluntz) — 38 min — walk-through of the Anthropic post by one of its authors.
State of GPT — Andrej Karpathy (Microsoft Build) — 42 min — the agent-loop framing in his GPT tutorial; still the clearest explainer.
Function Calling and ReAct in Practice — Jerry Liu (LlamaIndex) — 49 min — from the founder of LlamaIndex; hands-on examples.

LeetCode — Basic Calculator Ii

Link: https://leetcode.com/problems/basic-calculator-ii/
Difficulty: Medium
Why this problem: Stack-based eval respecting precedence — same logic an agent uses to evaluate tool chains.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Implement a tool-use loop in <50 lines with iteration cap and token budget.
Design 3 tool schemas with good descriptions; explain why each is right.
Make every state-changing tool idempotent.
Explain ReAct vs Plan-and-Execute and when to pick each.
List 4 stopping conditions every agent loop must enforce.
Solve basic-calculator-ii — stack-based eval, same shape as evaluating a tool chain with precedence.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Memory Model, GC, Heap, GC Leaks, Profiling

Designing a Chat System — Connections, Fanout, Storage, Delivery