Tool Use and Function Calling
What
A mechanism that lets LLMs invoke external functions, APIs, and code during generation. The model decides when it needs a capability beyond text (math, data lookup, code execution), emits a structured tool call, receives the result, and incorporates it into its response. This transforms LLMs from text generators into general-purpose reasoning engines that can act on the world.
Why It Matters
- Grounds LLMs in reality: models can look up current data instead of relying on stale training knowledge. A weather question gets a real API call, not a hallucinated answer
- Extends capabilities: LLMs can’t do reliable arithmetic, but they can call a calculator. They can’t access databases directly, but they can issue SQL queries through a tool
- Enables agents: tool use is the foundation of AI Agents — multi-step workflows where the model plans, acts, observes, and iterates
- Production integration: every major API provider (Anthropic, OpenAI, Google) now supports function calling natively, making it the standard way to integrate LLMs into applications
How It Works
The Tool Use Loop
1. User sends a message
2. Model receives message + tool definitions (JSON schemas)
3. Model generates a response that may include a `tool_use` block:
   `{"name": "get_price", "input": {"ticker": "AAPL"}}`
4. Client executes the tool and sends the result back as a `tool_result`
5. Model incorporates the result and continues generating
6. Repeat steps 3-5 as needed (multi-tool, multi-turn)
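The messages exchanged in steps 3-4 can be sketched as plain dicts (shapes follow Anthropic's Messages API; the id `toolu_01` and the stub price value are made-up placeholders):

```python
import json

# Step 3: the model's turn contains a tool_use block (id is a placeholder)
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "get_price",
         "input": {"ticker": "AAPL"}},
    ],
}

# Step 4: the client executes the tool and replies with a tool_result
# whose tool_use_id ties the result back to the request above
user_turn = {
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": json.dumps({"ticker": "AAPL", "price": 123.45})},
    ],
}
```

The `tool_use_id` link is what lets the model match results to requests when it issued several tool calls in one turn.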
Defining Tools
Tools are defined as JSON schemas that tell the model what functions are available, what they do, and what parameters they accept:
```python
tools = [
    {
        "name": "search_database",
        "description": "Search the product database by name or category. "
                       "Returns matching products with prices.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search term (product name or category)"
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum results to return (default 5)"
                }
            },
            "required": ["query"]
        }
    }
]
```

Good tool descriptions are critical. The model uses them to decide when and how to call a tool. Vague descriptions lead to wrong tool calls.
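The schema is also useful on the client side: validating model-produced input before executing it catches malformed calls early. A minimal hand-rolled sketch of that check (real systems typically use the `jsonschema` package instead):

```python
# Map JSON Schema type names to Python types (subset, for illustration)
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "object": dict, "array": list}

def validate_input(schema: dict, payload: dict) -> list[str]:
    """Check a model-produced tool input against an input_schema.
    Returns a list of error messages (empty list = valid)."""
    errors = []
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required field: {key}")
    for key, value in payload.items():
        prop = schema.get("properties", {}).get(key)
        if prop is None:
            errors.append(f"unexpected field: {key}")
        elif not isinstance(value, TYPE_MAP[prop["type"]]):
            errors.append(f"{key}: expected {prop['type']}")
    return errors
```

If the list is non-empty, return it to the model as a tool error so it can retry with corrected arguments.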
Model Context Protocol (MCP)
An open standard (Anthropic, Nov 2024, now under Linux Foundation) that standardizes how LLMs connect to external tools and data:
- MCP Server exposes: Tools (callable functions), Resources (data/files), Prompts (templates)
- Transports: stdio (local processes), SSE, Streamable HTTP (remote servers)
- Adopted by: OpenAI, Google, Microsoft, Cursor, Sourcegraph, and many more
- Key benefit: write a tool server once, use it with any MCP-compatible client
```
Client (Claude, ChatGPT, IDE) ←→ MCP Protocol ←→ Server (your tools)
                                                  │
                                                  ├─ Tools: search, execute, query
                                                  ├─ Resources: files, databases
                                                  └─ Prompts: templates
```
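On the wire, MCP speaks JSON-RPC 2.0. A `tools/call` request/response pair looks roughly like this (shapes simplified from the spec; the tool name `query_table` and the id are hypothetical examples):

```python
import json

# Client -> server: invoke a tool by name with arguments
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "query_table", "arguments": {"sql": "SELECT 1"}},
}

# Server -> client: result content is a list of typed blocks
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "[(1,)]"}]},
}

wire = json.dumps(request)  # what actually travels over stdio or HTTP
```

Because every server speaks this same envelope, a client that understands `tools/list` and `tools/call` can drive any MCP server without custom integration code.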
Structured Output
Forcing models to output valid JSON (not just hoping they do):
| Approach | How it works | Used by |
|---|---|---|
| API-level enforcement | Schema provided in API call, output guaranteed valid | Anthropic, OpenAI, Google |
| Constrained decoding | Token masking at each step — only valid tokens allowed | Outlines, XGrammar, vLLM |
| Grammar-guided | Context-free grammar defines valid outputs | llama.cpp, GBNF |
Constrained decoding works by building a state machine from the JSON schema and masking out invalid tokens at each generation step. Zero probability of malformed output.
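A character-level toy of the idea (real systems like Outlines mask over the model's full token vocabulary and compile the state machine from an arbitrary schema; the fixed target shape `{"age": N}` here is chosen for brevity):

```python
DIGITS = set("0123456789")

def allowed_next(prefix: str) -> set[str]:
    """Characters allowed after `prefix` so output always matches {"age": N}."""
    template = '{"age": '
    if len(prefix) < len(template):
        return {template[len(prefix)]}   # structural characters are forced
    body = prefix[len(template):]
    if body.endswith("}"):
        return set()                     # object closed: generation stops
    if not body:
        return DIGITS                    # at least one digit required
    return DIGITS | {"}"}                # another digit, or close the object

def constrained_decode(model_scores, max_steps: int = 64) -> str:
    """Greedy decode: at each step, emit the highest-scoring ALLOWED char."""
    out = ""
    for _ in range(max_steps):           # guard against non-terminating models
        mask = allowed_next(out)
        if not mask:
            break
        out += max(mask, key=model_scores)
    return out

# A "model" that prefers junk ("x") still produces valid JSON,
# because "x" is masked out at every step:
scores = {"x": 9.0, "}": 5.0}
rank = lambda ch: scores.get(ch, int(ch) / 10 if ch.isdigit() else 0.0)
print(constrained_decode(rank))          # {"age": 9}
```

The key property: the model's preferences only choose among grammar-legal continuations, so malformed output is structurally impossible rather than merely unlikely.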
Code Example
Tool Use with Anthropic API
```python
import anthropic
import json

client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY env var

# Define tools
tools = [
    {
        "name": "calculate",
        "description": "Evaluate a mathematical expression. "
                       "Use for any arithmetic the user asks about.",
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Math expression to evaluate, e.g. '2**10 + 3*7'"
                }
            },
            "required": ["expression"]
        }
    },
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
        }
    }
]

def execute_tool(name, input_data):
    """Execute a tool call and return the result."""
    if name == "calculate":
        # In production, use a sandboxed evaluator
        result = eval(input_data["expression"])
        return str(result)
    elif name == "get_weather":
        # Stub -- in production, call a weather API
        return json.dumps({"temp_c": 12, "condition": "partly cloudy"})
    return f"Error: unknown tool '{name}'"

def chat_with_tools(user_message):
    messages = [{"role": "user", "content": user_message}]

    # Initial call -- model may request tool use
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )

    # Handle tool use loop
    while response.stop_reason == "tool_use":
        # Execute every tool call in the response
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        # Send tool results back
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )

    # Extract final text
    return "".join(b.text for b in response.content if b.type == "text")

# Example
answer = chat_with_tools("What's 2^20 + the temperature in Tallinn?")
print(answer)
```

Key Tradeoffs
| Decision | Option A | Option B |
|---|---|---|
| Tool granularity | Many specific tools (precise) | Few general tools (flexible but ambiguous) |
| Execution | Client-side (simple, controlled) | Server-side MCP (reusable, standardized) |
| Validation | Trust model output (fast) | Schema validation before execution (safe) |
| Parallelism | Sequential tool calls (simple) | Parallel tool calls (faster, harder to debug) |
Common Pitfalls
- Vague tool descriptions: “Do stuff with data” — the model won’t know when to use it. Write descriptions as if explaining to a new engineer
- Too many tools: 50+ tools confuse the model. Group related tools or use routing (pick relevant tools per query)
- No error handling: tool calls fail. Return clear error messages so the model can retry or explain the failure
- Executing arbitrary code: `eval()` on model-generated expressions is a security risk. Sandbox all code execution
- Ignoring tool call cost: each tool call is an extra API round-trip. Design tools to return useful data in one call rather than requiring 5 sequential calls
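One way to defuse the `eval()` pitfall for the calculator tool: walk the expression's AST and allow only arithmetic nodes. A minimal sketch using the stdlib `ast` module (production systems use real sandboxes or subprocess isolation):

```python
import ast
import operator

# Whitelist of permitted AST operator nodes -> their implementations
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
       ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    """Evaluate an arithmetic expression; reject everything else."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        # Function calls, attribute access, names, etc. all land here
        raise ValueError(f"disallowed node: {type(node).__name__}")
    return ev(ast.parse(expr, mode="eval"))
```

`safe_eval("2**10 + 3*7")` returns 1045, while `safe_eval("__import__('os').system('ls')")` raises `ValueError` because `Call` is not in the whitelist.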
Exercises
- Build a simple tool-use loop: define a `search_wikipedia` tool (use the Wikipedia API), connect it to Claude or OpenAI, and ask questions that require lookup
- Create an MCP server (using the `mcp` Python SDK) that exposes a SQLite database as tools: `query_table`, `list_tables`, `describe_table`. Connect it to Claude Code
- Implement constrained decoding: given a simple JSON schema `{"name": str, "age": int}`, write a token-masking function that only allows valid tokens at each position
- Build a multi-tool agent: give the model a calculator, a web search tool, and a file-write tool. Ask it to research a topic, compute some statistics, and save a summary
Self-Test Questions
- Explain the tool-use loop step by step. Why does the model need to see the tool result before continuing?
- What makes a good tool description? What happens with vague descriptions?
- How does constrained decoding guarantee valid JSON output?
- What is MCP and why is it better than each application defining its own tool format?
- Why is `eval()` on model-generated code dangerous? What’s the safe alternative?
Links
- AI Agents — agents use tools as part of multi-step workflows
- Modern AI Techniques — where tool use fits in the landscape
- Transformers — the architecture that makes tool use possible
- Key Papers