Tool Use and Function Calling

What

A mechanism that lets LLMs invoke external functions, APIs, and code during generation. The model decides when it needs a capability beyond text (math, data lookup, code execution), emits a structured tool call, receives the result, and incorporates it into its response. This transforms LLMs from text generators into general-purpose reasoning engines that can act on the world.

Why It Matters

  • Grounds LLMs in reality: models can look up current data instead of relying on stale training knowledge. A weather question gets a real API call, not a hallucinated answer
  • Extends capabilities: LLMs can’t do reliable arithmetic, but they can call a calculator. They can’t access databases directly, but they can issue SQL queries through a tool
  • Enables agents: tool use is the foundation of AI Agents — multi-step workflows where the model plans, acts, observes, and iterates
  • Production integration: every major API provider (Anthropic, OpenAI, Google) now supports function calling natively, making it the standard way to integrate LLMs into applications

How It Works

The Tool Use Loop

1. User sends a message
2. Model receives message + tool definitions (JSON schemas)
3. Model generates a response that may include a tool_use block:
   {"name": "get_price", "input": {"ticker": "AAPL"}}
4. Client executes the tool, sends result back as tool_result
5. Model incorporates the result and continues generating
6. Repeat steps 3-5 as needed (multi-tool, multi-turn)

Defining Tools

Tools are defined as JSON schemas that tell the model what functions are available, what they do, and what parameters they accept:

tools = [
    {
        "name": "search_database",
        "description": "Search the product database by name or category. "
                       "Returns matching products with prices.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search term (product name or category)"
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum results to return (default 5)"
                }
            },
            "required": ["query"]
        }
    }
]

Good tool descriptions are critical. The model uses them to decide when and how to call a tool. Vague descriptions lead to wrong tool calls.
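The schema is also useful on the client side: validating the model's arguments before executing a tool catches malformed calls early. A minimal hand-rolled check is sketched below (for production, a library such as jsonschema is the usual choice; `validate_input` is an illustrative helper, not part of any SDK):

```python
def validate_input(schema, input_data):
    """Minimal JSON Schema check: required keys present, types match."""
    type_map = {"string": str, "integer": int, "number": (int, float),
                "boolean": bool, "object": dict, "array": list}
    errors = []
    for key in schema.get("required", []):
        if key not in input_data:
            errors.append(f"missing required field: {key}")
    for key, value in input_data.items():
        prop = schema.get("properties", {}).get(key)
        if prop is None:
            errors.append(f"unexpected field: {key}")
        elif not isinstance(value, type_map[prop["type"]]):
            errors.append(f"{key}: expected {prop['type']}, "
                          f"got {type(value).__name__}")
    return errors

schema = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "max_results": {"type": "integer"},
    },
    "required": ["query"],
}
print(validate_input(schema, {"query": "laptops", "max_results": 5}))  # []
print(validate_input(schema, {"max_results": "5"}))
```

An empty error list means the call is safe to dispatch; otherwise the errors can be returned to the model as a tool result so it can correct the call.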

Model Context Protocol (MCP)

An open standard (introduced by Anthropic in Nov 2024, now under the Linux Foundation) for connecting LLMs to external tools and data:

  • MCP Server exposes: Tools (callable functions), Resources (data/files), Prompts (templates)
  • Transports: stdio (local processes), SSE, Streamable HTTP (remote servers)
  • Adopted by: OpenAI, Google, Microsoft, Cursor, Sourcegraph, and many more
  • Key benefit: write a tool server once, use it with any MCP-compatible client

Client (Claude, ChatGPT, IDE) ←→ MCP Protocol ←→ Server (your tools)
                                    │
                          Tools: search, execute, query
                          Resources: files, databases
                          Prompts: templates

Structured Output

Forcing models to output valid JSON (not just hoping they do):

| Approach              | How it works                                            | Used by                   |
|-----------------------|---------------------------------------------------------|---------------------------|
| API-level enforcement | Schema provided in API call, output guaranteed valid    | Anthropic, OpenAI, Google |
| Constrained decoding  | Token masking at each step — only valid tokens allowed  | Outlines, XGrammar, vLLM  |
| Grammar-guided        | Context-free grammar defines valid outputs              | llama.cpp, GBNF           |

Constrained decoding works by building a state machine from the JSON schema and masking out invalid tokens at each generation step. This guarantees syntactically valid output with zero probability of malformed JSON — though it cannot guarantee the values themselves are sensible.

Code Example

Tool Use with Anthropic API

import anthropic
import json
 
client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var
 
# Define tools
tools = [
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression. "
                       "Use for any arithmetic the user asks about.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Math expression to evaluate, e.g. '2**10 + 3*7'"
                }
            },
            "required": ["expression"]
        }
    },
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
]
 
def execute_tool(name, input_data):
    """Execute a tool call and return the result."""
    if name == "calculate":
        # In production, use a sandboxed evaluator -- eval() is unsafe
        result = eval(input_data["expression"])
        return str(result)
    elif name == "get_weather":
        # Stub -- in production, call a weather API
        return json.dumps({"temp_c": 12, "condition": "partly cloudy"})
    return f"Error: unknown tool '{name}'"
 
def chat_with_tools(user_message):
    messages = [{"role": "user", "content": user_message}]
 
    # Initial call -- model may request tool use
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
 
    # Handle tool use loop
    while response.stop_reason == "tool_use":
        # Extract tool calls from response
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
 
        # Send tool results back
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
 
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
 
    # Extract final text
    return "".join(b.text for b in response.content if b.type == "text")
 
# Example
answer = chat_with_tools("What's 2^20 + the temperature in Tallinn?")
print(answer)

Key Tradeoffs

| Decision         | Option A                          | Option B                                        |
|------------------|-----------------------------------|-------------------------------------------------|
| Tool granularity | Many specific tools (precise)     | Few general tools (flexible but ambiguous)      |
| Execution        | Client-side (simple, controlled)  | Server-side MCP (reusable, standardized)        |
| Validation       | Trust model output (fast)         | Schema validation before execution (safe)       |
| Parallelism      | Sequential tool calls (simple)    | Parallel tool calls (faster, harder to debug)   |
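On the parallelism tradeoff: when a single response contains several independent tool_use blocks, the calls can be executed concurrently with the standard library. A sketch, assuming tool calls arrive as plain dicts and `execute_tool` is a hypothetical stub standing in for the real dispatcher:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_tool(name, input_data):
    # Hypothetical stub -- replace with real tool dispatch
    return f"{name} result"

def run_tool_calls_parallel(tool_calls):
    """tool_calls: list of dicts like {"id": ..., "name": ..., "input": ...}.
    Executes all calls concurrently, preserving the original order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(execute_tool, c["name"], c["input"])
                   for c in tool_calls]
        return [{"type": "tool_result",
                 "tool_use_id": c["id"],
                 "content": f.result()}
                for c, f in zip(tool_calls, futures)]

calls = [{"id": "t1", "name": "get_weather", "input": {"city": "Tallinn"}},
         {"id": "t2", "name": "get_weather", "input": {"city": "Oslo"}}]
print(run_tool_calls_parallel(calls))
```

The speedup comes from overlapping I/O-bound tool calls; the debugging cost comes from interleaved side effects, so this is worth it only for tools that are independent and read-only.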

Common Pitfalls

  • Vague tool descriptions: “Do stuff with data” — the model won’t know when to use it. Write descriptions as if explaining to a new engineer
  • Too many tools: 50+ tools confuse the model. Group related tools or use routing (pick relevant tools per query)
  • No error handling: tool calls fail. Return clear error messages so the model can retry or explain the failure
  • Executing arbitrary code: eval() on model-generated expressions is a security risk. Sandbox all code execution
  • Ignoring tool call cost: each tool call is an extra API round-trip. Design tools to return useful data in one call rather than requiring 5 sequential calls
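For the error-handling pitfall, one pattern is to wrap every tool execution so that exceptions become structured error results the model can see and react to. A minimal sketch (the `is_error` field follows Anthropic's tool_result format; `flaky_tool` is a made-up example tool):

```python
def safe_execute(name, input_data, executor):
    """Run a tool, converting exceptions into an error payload
    instead of crashing the tool-use loop."""
    try:
        return {"content": executor(name, input_data), "is_error": False}
    except Exception as exc:
        # Give the model something actionable, not a bare stack trace
        return {"content": f"Tool '{name}' failed: {exc}", "is_error": True}

def flaky_tool(name, input_data):
    # Made-up tool that fails on certain inputs
    if input_data.get("city") == "Atlantis":
        raise ValueError("unknown city")
    return "12C, partly cloudy"

print(safe_execute("get_weather", {"city": "Tallinn"}, flaky_tool))
print(safe_execute("get_weather", {"city": "Atlantis"}, flaky_tool))
```

With a clear error message in the tool result, the model can retry with corrected arguments or explain the failure to the user instead of the loop dying on an unhandled exception.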

Exercises

  1. Build a simple tool-use loop: define a search_wikipedia tool (use the Wikipedia API), connect it to Claude or OpenAI, and ask questions that require lookup
  2. Create an MCP server (using mcp Python SDK) that exposes a SQLite database as tools: query_table, list_tables, describe_table. Connect it to Claude Code
  3. Implement constrained decoding: given a simple JSON schema {"name": str, "age": int}, write a token-masking function that only allows valid tokens at each position
  4. Build a multi-tool agent: give the model a calculator, a web search tool, and a file-write tool. Ask it to research a topic, compute some statistics, and save a summary

Self-Test Questions

  1. Explain the tool-use loop in 4 steps. Why does the model need to see the tool result before continuing?
  2. What makes a good tool description? What happens with vague descriptions?
  3. How does constrained decoding guarantee valid JSON output?
  4. What is MCP and why is it better than each application defining its own tool format?
  5. Why is eval() on model-generated code dangerous? What’s the safe alternative?