Search Tech Journey

Find topics, journeys and posts

back to blog
ai mladvanced 12m2026-06-09

Day 11 — Function Calling, Tool Use, and Agentic Loops

Tool calling turns LLMs from text generators into autonomous workers. Mastering the agent loop (plan → call → observe → continue) is the bedrock of every Copilo…

A "tool" is just a JSON schema the model can fill out; the runtime executes it and feeds the result back. This minimal contract is what makes agents general-purpose. Below is how to build one that actually works.

🧠 Concept

Why it matters & the mental model.

1. The loop

Implement it as: while the assistant message contains tool_use blocks, execute each tool, append tool_result blocks, call the model again. Stop when the model emits only text.

2. Tool design — the under-rated 80%

The single biggest predictor of agent quality is how clearly your tools are described and shaped. Rules:

  • One tool, one verb (search_jira_issues, not jira_do).
  • Schema field names are documentation: query: str # Lucene query string not q.
  • Optional parameters with sensible defaults; never required if there's a 90% case.
  • Error messages should tell the model what to do next ("since must be ISO-8601, e.g. 2025-01-15").
  • Idempotent where possible; if not, return the existing id on retry.

3. Native tool calling vs ReAct prompting

  • Native (Anthropic tools=, OpenAI tools=, Gemini): model returns structured JSON in a dedicated channel; runtime parses cleanly. Use this.
  • ReAct text prompting ("Thought: …\nAction: …\nObservation: …"): the original 2022 trick, still useful for models without native tools, but fragile (parsing errors, format drift).

4. Planning patterns

  • Tool-calling loop (default): model decides next step each turn — flexible, brittle on 10+ steps.
  • Plan-and-execute: model emits a full plan first (list of steps), executor runs each step (possibly with sub-agent), model re-plans on failure. Cuts steps and cost.
  • Reflexion: after a failed attempt, model writes a self-critique and retries with that context. Good for code/math.
  • Tree of Thoughts / MCTS: explore branches with a scoring function. Expensive, useful for hard reasoning.

🛠 Deep Dive

Internals, code, architecture.

5. Memory

  • Scratchpad: keep all tool calls + observations in context. Fine up to ~30 steps.
  • Summarised memory: every N turns, replace older turns with a compact summary. Extends horizon.
  • Episodic / vector memory: store past (task, plan, outcome) tuples in a vector store, retrieve on new tasks.
  • Structured memory: e.g. a YAML "world model" the agent reads/edits.

6. Guardrails

  • Max-steps cap (e.g. 25) with graceful "I couldn't complete; here's what I tried".
  • Tool whitelist + per-tool rate limit.
  • Side-effect classes: read-only tools auto-approved, write tools (send_email, charge_card) require explicit confirmation.
  • Output schemas: validate every tool result with Pydantic; surface schema violations to the model so it can correct.

7. Observability — non-negotiable in production

Log every turn: \{turn, model, input_tokens, output_tokens, tool_calls: [...], latency_ms, cost_usd\}. Use LangSmith / Arize / Helicone / OpenLLMetry or roll a simple JSONL trace. Without traces you cannot debug or improve.

8. Evals for agents

Single-step accuracy isn't enough. Multi-step eval needs:

  • Task success rate on a held-out set of 50-200 tasks.
  • Steps to completion (efficiency).
  • Cost per task ($/task).
  • Specific behaviour assertions (e.g. "did the agent call confirm_with_user before send_email?"). Frameworks: inspect_ai (UK AISI), langsmith evals, ragas for retrieval-flavoured agents.

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Multi-agent — only when you need it

Adding agents adds coordination cost and failure surface. Use only when roles are genuinely distinct: a planner + a coder + a reviewer outperforms one bigger agent on long coding tasks (cf. AutoGen, CrewAI, Devin-style swarms). Otherwise stay single-agent + good tools.

10. Cost & latency

  • Tool calling roughly doubles tokens per turn (system + tools schema + history). Trim tool descriptions aggressively.
  • Cache static prefixes (Anthropic prompt caching) — agent system prompts get free 90% off on repeat calls.
  • Stream the final text; never stream tool args (they must be complete).
  • Parallel tool calls when the model issues multiple in one turn — most APIs now support it.

11. Common failure modes

SymptomFix
Loops calling same tool with same argsDetect repeats; inject "you already did X with result Y — try something else."
Hallucinated tool names/fieldsStrict schema validation + surface clear error.
Gives up too earlyRaise model "temperature" slightly OR use plan-and-execute.
Forgets earlier observationSummarise older turns; pin key facts in a system note.
Asks user before trying obvious toolSystem prompt: "Prefer tool use over questions when you have enough info."

12. What to take away

"Walk me through your agent architecture." Strong answers: model + tool list + loop with bounded steps + memory strategy + eval harness + one specific failure mode you debugged. Bonus: distinguish ReAct from native tool calling.

Key points

    Resources

    Practice Problem: Decode String (Medium)