Search Tech Journey

Find topics, journeys and posts

back to blog
ai mlintermediate 12m2026-06-09

Function Calling, Tool Use, Agentic Loops

Session 21 of the 48-session learning series.

Why this session matters

This is Session 21 of 48 in the LLMs & Agents track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

  • From chat to tools — function-calling primitives in modern APIs
  • The agent loop — plan, act, observe, repeat
  • Designing tools — schemas, idempotency, error surfaces
  • Stopping the loop — budgets, max-iterations, escape valves
  • Patterns — ReAct, Plan-and-Execute, Multi-Agent, Tool Routing

Pre-read (skim before the session)

Deep dive

1. From completion to call

The base LLM emits text. Function calling = teach it to emit structured JSON that names a tool and its arguments, instead of (or in addition to) prose.

{
  "tool_calls": [{
    "id": "call_a1",
    "type": "function",
    "function": {
      "name": "get_weather",
      "arguments": "{\"city\": \"Hyderabad\"}"
    }
  }]
}

You parse, execute, feed the result back as a "tool" message; the model picks up where it left off. This loop is the substrate of every "agent".

2. Anatomy of an agent step

loop:
  messages: [system, user, ... assistant tool_call, tool result, ...]
  response = llm.complete(messages, tools=TOOLS)
  if response.tool_calls:
     for call in response.tool_calls:
         result = dispatch(call)
         messages.append({role: "tool", call_id: call.id, content: result})
     continue
  else:
     return response.content   # final answer

Three things to manage carefully:

  1. The state (messages list — grows; trim or summarise).
  2. The tool dispatch (sync or async; sandbox if untrusted).
  3. The stopping condition (max iterations, budget, idempotency check).

3. Tool schema design

Treat each tool as a small API:

{
  "type": "function",
  "function": {
    "name": "search_orders",
    "description": "Search orders by customer email or order id. Returns up to 20 matching orders.",
    "parameters": {
      "type": "object",
      "properties": {
        "email":    {"type": "string", "description": "Customer email exact match"},
        "order_id": {"type": "string", "description": "Order id, format ORD-12345"}
      },
      "required": []
    }
  }
}

Rules of thumb:

  • Names verb-like and unambiguoussearch_orders not orders.
  • Descriptions matter more than you think. They're the model's only signal. Write them as if for a junior dev.
  • Examples in the description when ambiguous — "format ORD-12345" beats explaining in prose.
  • Few wide tools beat many narrow ones. query_database(sql) is a bad single tool (unsafe + too open). search_orders, get_customer, cancel_order is right.

4. Idempotency and safety

Agents will retry. They will hallucinate. They will call a tool twice with the same arguments after a partial failure. Make every state-changing tool idempotent by accepting a client-supplied request id:

{
  "name": "refund_payment",
  "parameters": {
    "payment_id": "pi_abc",
    "amount": 1000,
    "idempotency_key": "agent_run_xyz_step_3"
  }
}

For destructive actions, require confirmation: tool returns a confirmation token, model asks the user, calls again with token. Or wrap in a human-in-the-loop approval queue.

5. The stopping problem

Naive agent: while-true with no exit. Bad. Always cap:

MAX_ITER = 10
MAX_TOOLS_PER_TURN = 5
TOKEN_BUDGET = 50_000

for i in range(MAX_ITER):
    resp = llm.complete(messages, tools=TOOLS, max_tokens=...)
    if usage.total_tokens > TOKEN_BUDGET: bail("budget")
    if not resp.tool_calls: return resp.content
    if len(resp.tool_calls) > MAX_TOOLS_PER_TURN: bail("too many parallel calls")
    ... dispatch ...

bail("max iterations")

Loop without budget is a billing pager incident waiting to happen.

6. ReAct pattern

Yao et al. 2022 — interleave reasoning and acting. The model produces a "Thought:" before each tool call.

Thought: The user wants their last 3 orders. I should search by email.
Action: search_orders(email="dinesh@example.com")
Observation: [3 orders returned]
Thought: I have the orders. Format them clearly.
Final Answer: Here are your last 3 orders: ...

Modern function-calling APIs implicitly do this. ReAct is more useful when:

  • Tools are expensive (model thinks before calling).
  • You want auditable reasoning trails.

7. Plan-and-Execute

Two-stage agent:

  1. Planner: produces a step list (no tool calls yet).
  2. Executor: walks the steps, calling tools.

Cleaner for multi-step tasks (research, build me X). Cost: extra LLM call upfront. Lose flexibility if mid-plan you learn something new — usually mitigated by allowing re-plan.

8. Multi-agent — when (rarely)

The hype: spin up 5 agents that "collaborate". Reality: each new agent multiplies the cost and the chance of cascade failures. Most workflows are better as one well-tooled agent.

Multi-agent earns its keep when:

  • Specialisation (researcher + writer + critic with very different prompts).
  • Privilege separation (one agent can write code, another only review).
  • Parallel exploration (5 agents try different approaches, pick best).

Default: one agent. Reach for multi-agent when forced.

9. Tool routing — the LLM is the router

When you have 100 tools, the system prompt + schemas explode and accuracy drops. Two patterns:

  • Hierarchical menus — model first picks a category (orders, billing, accounts); category tool returns sub-tools.
  • Retrieval-augmented tools — embed tool descriptions; at each step retrieve top-k relevant tools and only show those to the model.

Both keep the prompt small.

10. Production patterns from real agents

  • Persist tool results. When a step fails mid-loop and you retry, replay observations.
  • Audit log everything. Every prompt, every response, every tool call, every result. Token by token.
  • Streaming — stream the model's text + tool calls; show progress to users. Latency feels half.
  • Cancellation — user closes the chat; cancel any in-flight tool. Hand back partial state.
  • Cost meter — every step prints tokens + $. Production-critical. Without it, agents silently 10x in cost overnight.

11. Evaluations for agents

Agents are notoriously hard to eval (covered more in S25). Quick wins:

  • End-to-end task success — does the agent accomplish a labelled task?
  • Tool-call F1 — did it call the right tool with right args, ignoring extra calls?
  • Step efficiency — did it solve in N steps or 3N?
  • LLM-as-judge for trace quality ("was this reasoning sensible?").

12. Frameworks landscape

FrameworkStyleWhen to pick
LangGraphGraph of nodes (states + transitions)Complex stateful workflows
LlamaIndex agentsTools + memory abstractionsDocument-heavy RAG agents
OpenAI AssistantsHosted, file-handling built-inQuick POC, don't want infra
Anthropic SDK + customBare-metal loopYou want control; 80% of teams
AutoGen, CrewAIMulti-agent firstSpecialised multi-agent flows

My bias: start with a 50-line custom loop. Reach for a framework only when state/graph complexity demands it.

Reading material

Books:

  • AI Engineering — Chip Huyen, 2024 (chs. on agents, tool use, evaluation — the best current overview)
  • Building LLMs for Production — Louis-François Bouchard & Louïs Peters (the chapter on agentic frameworks)
  • Designing Machine Learning Systems — Chip Huyen (the systems thinking pairs perfectly with agents)

Papers:

Official docs:

Blog posts:

In-depth research material

Videos

LeetCode — Basic Calculator Ii

Post-session checklist

By the end of this session you should be able to:

  • Implement a tool-use loop in <50 lines with iteration cap and token budget.
  • Design 3 tool schemas with good descriptions; explain why each is right.
  • Make every state-changing tool idempotent.
  • Explain ReAct vs Plan-and-Execute and when to pick each.
  • List 4 stopping conditions every agent loop must enforce.
  • Solve basic-calculator-ii — stack-based eval, same shape as evaluating a tool chain with precedence.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.