Tool Calling Patterns That Actually Work
Registry pattern, Pydantic validation, timeout handling, and the retry strategies that matter in production.
Tool calling is where agents go from “impressive demo” to “useful software.” It is also where most agent projects fall apart. The model hallucinates parameters, tools time out silently, errors get swallowed, and the agent loops forever calling the same broken tool. This guide covers the patterns that survive contact with real users.
We will build up from raw tool definitions to a full registry with validation, timeouts, retries, and proper error formatting. Every code example is production Python — no pseudocode, no “exercise for the reader.”
Why tool calling is the hardest part to get right
When Claude makes a tool call, three things can go wrong that do not happen in normal API usage:
- The model sends bad input. Claude is good at following schemas, but it is not perfect. It might send a string where you expect an integer, omit a required field, or pass an invalid enum value. If your tool function just crashes, the agent has no way to recover.
- The tool itself fails. Network timeouts, database errors, rate limits, file not found — all the normal failure modes of software. But now they are happening inside an automated loop where nobody is watching.
- The loop never terminates. The model calls a tool, gets an error, tries again with the same input, gets the same error, and repeats until you hit your token budget. Without explicit loop control, this will happen.
Getting tool calling right means handling all three failure modes explicitly.
The registry pattern
The naive approach is a big if/elif chain:
if tool_name == "search":
    result = search(tool_input["query"])
elif tool_name == "get_user":
    result = get_user(tool_input["user_id"])
elif tool_name == "send_email":
    result = send_email(tool_input["to"], tool_input["subject"], tool_input["body"])
This works for two tools. At ten tools it is unmaintainable. The registry pattern solves this by storing tools as data:
from core.tools import ToolRegistry
registry = ToolRegistry()
@registry.register(
    name="search_knowledge_base",
    description="Search the knowledge base for articles relevant to the user's question.",
    input_schema={
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search query"}
        },
        "required": ["query"]
    }
)
def search_kb(query: str) -> str:
    results = search(query)
    if not results:
        return "No relevant articles found."
    return "\n\n".join(f"**{r['title']}**\n{r['content']}" for r in results)
The ToolRegistry stores each tool’s name, description, function reference, schema, and configuration. When it is time to send tools to Claude, one call gives you everything:
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a helpful support agent.",
    tools=registry.to_claude_tools(),
    messages=messages,
)
to_claude_tools() iterates over all registered tools and formats them using each tool’s to_claude_schema() method, producing the exact list-of-dicts format Claude expects. No manual schema wrangling.
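For reference, here is roughly what that call returns for the tool registered above. The list-of-dicts shape with name, description, and input_schema is the standard Anthropic tools format; the exact output depends on how your registry builds each schema.

# Roughly what registry.to_claude_tools() returns for the tool above (illustrative):
[
    {
        "name": "search_knowledge_base",
        "description": "Search the knowledge base for articles relevant to the user's question.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"],
        },
    }
]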
You can also skip the decorator and register functions directly:
registry.register_function(
    func=search_kb,
    name="search_knowledge_base",
    description="Search the knowledge base",
    input_schema={...},
    timeout_seconds=10.0,
    max_retries=2,
)
This is useful when you are registering third-party functions or building tools dynamically from configuration.
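A minimal sketch of the config-driven case, assuming a hypothetical TOOL_CONFIG list that names the module and function for each tool (the config shape is illustrative, not part of the registry):

import importlib

# Hypothetical configuration; adapt the fields to whatever your config store holds.
TOOL_CONFIG = [
    {
        "name": "search_knowledge_base",
        "module": "tools.search",
        "attr": "search_kb",
        "description": "Search the knowledge base",
        "input_schema": {"type": "object", "properties": {}, "required": []},
        "timeout_seconds": 10.0,
    },
]

for cfg in TOOL_CONFIG:
    module = importlib.import_module(cfg["module"])
    registry.register_function(
        func=getattr(module, cfg["attr"]),
        name=cfg["name"],
        description=cfg["description"],
        input_schema=cfg["input_schema"],
        timeout_seconds=cfg["timeout_seconds"],
    )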
Input validation with Pydantic models
Raw dicts break in production. The model sends {"max_results": "five"} instead of {"max_results": 5}, your function throws a TypeError deep in business logic, and the error message is useless to both the model and your logs.
Pydantic validation catches these problems at the boundary, before your tool code runs:
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(description="Search query based on the user's question")
    max_results: int = Field(default=5, ge=1, le=20)
    category: str | None = Field(default=None, description="Optional category filter")

@registry.register(
    name="search_knowledge_base",
    description="Search the knowledge base for relevant articles.",
    input_model=SearchInput,
)
def search_kb(query: str, max_results: int = 5, category: str | None = None) -> str:
    # By the time we get here, inputs are guaranteed valid
    results = search(query, max_results=max_results, category=category)
    return format_results(results)
When you pass input_model instead of input_schema, the registry auto-generates the JSON Schema from your Pydantic model using model_json_schema(). One source of truth for both validation and the schema sent to Claude.
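You can inspect that schema yourself with the same call. model_json_schema() is the standard Pydantic v2 API; the raw output also includes title keys, which a registry may strip before sending to the API.

# Standard Pydantic v2 call; the registry presumably does something equivalent.
schema = SearchInput.model_json_schema()
# schema["required"] == ["query"]; max_results carries its default plus its
# minimum/maximum bounds, so Claude sees the same constraints you validate with.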
Here is what happens inside the registry when a tool is called with an input_model:
def _validate_input(self, tool: Tool, tool_input: dict) -> dict | ToolResult:
    if tool.input_model is not None:
        try:
            validated = tool.input_model(**tool_input)
            return validated.model_dump()
        except PydanticValidationError as e:
            error_msgs = "; ".join(
                f"{'.'.join(str(loc) for loc in err['loc'])}: {err['msg']}"
                for err in e.errors()
            )
            return ToolResult(
                tool_use_id="",
                content=f"Validation error: {error_msgs}. Please fix the input and try again.",
                is_error=True,
            )
    return tool_input
If validation fails, the tool never executes. Instead, the registry returns a ToolResult with is_error=True and a clear message telling Claude what went wrong. Claude can then fix its input and retry — this is self-healing behavior you get for free.
The key insight: Pydantic gives you default values, type coercion, range checks, and readable error messages. Raw dicts give you KeyError: 'query' at 3am.
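A quick illustration of what that buys you. The output comments are paraphrased; the exact error wording depends on your Pydantic version.

from pydantic import ValidationError

ok = SearchInput(query="refund policy", max_results="3")
print(ok.max_results)  # 3, coerced from the string "3"; category defaults to None

try:
    SearchInput(query="refund policy", max_results=50)  # violates le=20
except ValidationError as e:
    print(e.errors()[0]["msg"])  # e.g. "Input should be less than or equal to 20"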
Timeout handling
Some tools talk to external services. APIs go down, databases hang, DNS resolves slowly. Without a timeout, your agent loop blocks indefinitely on a single tool call.
The registry uses SIGALRM on Unix systems for clean timeout enforcement:
def _execute_with_timeout(self, tool: Tool, validated_input: dict) -> Any:
    try:
        old_handler = signal.signal(signal.SIGALRM, self._timeout_handler)
        signal.alarm(int(tool.timeout_seconds))
        try:
            result = tool.function(**validated_input)
        finally:
            signal.alarm(0)
            signal.signal(signal.SIGALRM, old_handler)
    except AttributeError:
        # SIGALRM not available (Windows) — run without timeout
        result = tool.function(**validated_input)
    return result

@staticmethod
def _timeout_handler(signum, frame):
    raise TimeoutError("Tool execution timed out")
A few things to note:
- Always restore the old handler. The finally block ensures you do not corrupt signal handling for the rest of the process, even if the tool raises an exception.
- Graceful fallback on Windows. SIGALRM is Unix-only. On Windows, the tool runs without a timeout rather than crashing. For cross-platform production code, you would use concurrent.futures with a thread pool instead; a sketch appears at the end of this section.
- Set timeouts per tool. A knowledge base search might need 10 seconds. An email send might need 30. A local computation needs 2. Configure this when you register each tool:
@registry.register(
    name="search_knowledge_base",
    description="Search the knowledge base",
    input_schema={...},
    timeout_seconds=10.0,  # Fast fail for search
)
def search_kb(query: str) -> str:
    ...

@registry.register(
    name="send_notification",
    description="Send an email notification",
    input_schema={...},
    timeout_seconds=30.0,  # External API, needs more time
)
def send_notification(to: str, subject: str, body: str) -> str:
    ...
When a timeout fires, it becomes a TimeoutError that the retry logic can catch.
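For the cross-platform case mentioned above, a thread-pool version might look like the sketch below. It is not the registry's implementation: a future lets you stop waiting after the deadline, but the worker thread itself keeps running until the tool returns.

from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError

_executor = ThreadPoolExecutor(max_workers=4)

def execute_with_deadline(func, kwargs: dict, timeout_seconds: float):
    # Run the tool in a worker thread and give up waiting after the deadline.
    future = _executor.submit(func, **kwargs)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeoutError:
        raise TimeoutError("Tool execution timed out") from None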
Retry strategies
Not all errors deserve a retry. A validation error will fail the same way every time. A TimeoutError or a ConnectionError might succeed on the next attempt.
The registry implements linear backoff with a configurable retry count:
start = time.monotonic()
last_error = None
for attempt in range(tool.max_retries + 1):
    try:
        result = self._execute_with_timeout(tool, validated_input)
        elapsed = (time.monotonic() - start) * 1000
        return ToolResult(
            tool_use_id=tool_use_id,
            content=result,
            execution_time_ms=elapsed,
        )
    except Exception as e:
        last_error = e
        if attempt < tool.max_retries:
            logger.warning(
                f"Tool '{tool_name}' attempt {attempt + 1} failed: {e}. Retrying..."
            )
            time.sleep(0.5 * (attempt + 1))  # Linear backoff

# All retries exhausted
elapsed = (time.monotonic() - start) * 1000
error_msg = f"Tool '{tool_name}' failed after {tool.max_retries + 1} attempts: {last_error}"
return ToolResult(
    tool_use_id=tool_use_id,
    content=f"Error: {error_msg}. Please try a different approach.",
    is_error=True,
    execution_time_ms=elapsed,
)
The backoff is 0.5 * (attempt + 1) seconds — 0.5s, 1.0s, 1.5s. Simple and predictable. For production systems hitting rate-limited APIs, you might want exponential backoff instead:
import random
delay = min(2 ** attempt + random.uniform(0, 1), 30)
time.sleep(delay)
The jitter (random component) prevents thundering herd problems when multiple agent instances retry the same service simultaneously.
Guidelines for retry configuration:
- Network calls (API requests, database queries): max_retries=2 or 3
- Local computation (parsing, formatting): max_retries=0; if it fails, it will fail again
- External writes (sending emails, creating records): max_retries=1 with idempotency checks (see the sketch below)
- Timeouts: always worth one retry; transient network issues are common
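The idempotency point deserves a concrete shape. A sketch, assuming a hypothetical email_client whose send call accepts an idempotency key (many providers do; check yours):

import uuid

def send_notification(to: str, subject: str, body: str) -> str:
    # Derive a stable key from the message itself, so a retry of the same
    # logical send reuses the key and the provider can deduplicate it.
    idempotency_key = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{to}|{subject}|{body}"))
    email_client.send(  # hypothetical client; substitute your provider's SDK
        to=to,
        subject=subject,
        body=body,
        idempotency_key=idempotency_key,
    )
    return f"Notification sent to {to}."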
Formatting results for Claude
When Claude calls a tool, it includes an id field in the tool_use content block. Your result must reference this ID, or the API will reject it. The ToolResult dataclass handles this:
@dataclass
class ToolResult:
    tool_use_id: str
    content: str
    is_error: bool = False
    execution_time_ms: float = 0.0
The registry formats these into the exact structure Claude expects:
def format_results_for_claude(self, results: list[ToolResult]) -> list[dict]:
    return [
        {
            "type": "tool_result",
            "tool_use_id": r.tool_use_id,
            "content": r.content,
            **({"is_error": True} if r.is_error else {}),
        }
        for r in results
    ]
The is_error flag matters. When set to True, Claude knows the tool failed and will either try a different approach or explain the failure to the user. Without it, Claude might try to interpret an error message as a successful result.
Always return strings from your tool functions. If your tool naturally returns a dict or list, the registry serializes it with json.dumps. But explicit string formatting gives Claude better context:
# Bad — Claude gets raw JSON
def search_kb(query: str) -> dict:
    return {"results": [...], "count": 3}

# Good — Claude gets readable text
def search_kb(query: str) -> str:
    results = search(query)
    if not results:
        return "No articles found matching that query."
    formatted = "\n\n".join(
        f"**{r['title']}** (ID: {r['id']})\n{r['content']}"
        for r in results
    )
    return f"Found {len(results)} article(s):\n\n{formatted}"
The agent loop pattern
Every agent follows the same core loop. Understanding it deeply is more valuable than any framework:
def run_agent(user_message: str, messages: list, registry: ToolRegistry) -> str:
    messages.append({"role": "user", "content": user_message})
    max_iterations = 10  # Safety valve

    for _ in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system="You are a helpful support agent.",
            tools=registry.to_claude_tools(),
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            # Model is done — extract final text
            text_blocks = [b.text for b in response.content if b.type == "text"]
            return "\n".join(text_blocks)

        # Extract tool calls from response
        tool_calls = [
            {"name": b.name, "input": b.input, "id": b.id}
            for b in response.content
            if b.type == "tool_use"
        ]
        if not tool_calls:
            # No tool_use blocks (e.g. stop_reason == "max_tokens"): stop cleanly
            break

        # Execute all tool calls
        results = registry.execute_many(tool_calls)

        # Format and send results back
        tool_results = registry.format_results_for_claude(results)
        messages.append({"role": "user", "content": tool_results})

    return "I've reached my maximum number of steps. Please try rephrasing your question."
The critical elements:
- max_iterations as a safety valve. Without this, a confused model can loop forever. Ten iterations is generous for most agents — if it has not answered after ten tool calls, something is wrong.
- Append the full response.content to messages. Not just the text — the tool use blocks too. Claude needs to see its own tool calls in the conversation history to maintain coherence.
- Send tool results as a “user” message. This is Claude’s API convention. Tool results are wrapped in {"role": "user", "content": [tool_results]}.
- Handle multiple tool calls per turn. Claude can request several tools at once. execute_many handles them all and returns a list of results.
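A usage sketch for the loop above, reusing the same messages list across turns so the conversation accumulates history:

messages: list[dict] = []
answer = run_agent("How do I reset my password?", messages, registry)
print(answer)

# messages now holds the user turn, Claude's tool calls, the tool results,
# and the final answer, so a follow-up call continues the same conversation:
followup = run_agent("And if I no longer have access to that email?", messages, registry)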
Common mistakes
Not handling tool errors. If your tool throws an unhandled exception, the agent loop crashes and the user gets nothing. Every tool should either return a string or let the registry catch and format the error.
Missing timeouts. A tool that calls an external API without a timeout will block your agent indefinitely. Set timeout_seconds on every tool that does I/O.
Infinite loops. Always use max_iterations in your agent loop. Also watch for “retry loops” where the model keeps calling the same tool with the same broken input. If you see the same tool called three times in a row with identical parameters, break out and return an error.
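One way to catch that, as a sketch you could drop into the agent loop above (the helper and its three-call threshold are illustrative):

import json
from collections import deque

_recent_calls: deque[str] = deque(maxlen=3)

def looks_stuck(tool_calls: list[dict]) -> bool:
    # Fingerprint this batch of calls; three identical batches in a row means
    # the model is not adapting to the error it keeps getting back.
    signature = json.dumps(
        [[c["name"], c["input"]] for c in tool_calls], sort_keys=True
    )
    _recent_calls.append(signature)
    return len(_recent_calls) == 3 and len(set(_recent_calls)) == 1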
Swallowing the is_error flag. If you format tool results without is_error: True on failures, Claude has no signal that something went wrong. It will cheerfully treat an error traceback as search results.
Not logging tool execution. When something goes wrong at 2am, you need to know which tool was called, with what input, how long it took, and what it returned. Log every execution:
logger.info(
    f"Tool '{tool_name}' | input={tool_input} | "
    f"result_length={len(result.content)} | "
    f"is_error={result.is_error} | "
    f"time={result.execution_time_ms:.0f}ms"
)
Too many tools. Claude handles 5-10 tools well. At 20+ tools, it starts making worse decisions about which tool to call. If you have many tools, consider grouping them or using a two-stage approach where Claude first picks a category, then picks a tool.
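A sketch of the two-stage idea, assuming one registry per category (the category names and per-category registries here are hypothetical):

# Stage 1: expose only a lightweight "pick_category" tool. Stage 2: once the
# model has picked, expose that category's own tools for the rest of the turn.
registries = {
    "billing": billing_registry,            # hypothetical per-category registries
    "knowledge": knowledge_registry,
    "notifications": notifications_registry,
}

def tools_for(category: str | None) -> list[dict]:
    if category is None:
        return picker_registry.to_claude_tools()  # contains only pick_category
    return registries[category].to_claude_tools()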
What’s next
Tool calling is the mechanism. The real challenge is designing the right tools for your use case and testing that Claude uses them correctly. The agent evaluation guide covers how to write test cases that verify tool usage, and the cost tracking guide shows how to keep your agent loops from burning through your budget.
The StartToAgent starter kit has all of these patterns built out with the full ToolRegistry, Pydantic validation, timeout handling, and retry logic ready to use. If you want to skip building the infrastructure and focus on your agent’s actual tools, check out the kit.