How to Optimize Tool Calling for AI Agents

To optimize tool calling, improve the full path from user request to tool selection, input generation, and task completion. The biggest gains usually come from clearer tool descriptions, simpler schemas, atomic tools, dynamic tool filtering, model-and-harness testing, and ongoing evals rather than simply switching to a larger model.

Garrett Scott, Head of Marketing

The challenge is reliability. There is a wide gap between an agent that has access to tools and an agent that selects the right tool, supplies the right inputs, completes the task, and does so at a cost suitable for production. Many teams discover this gap after a successful demo. The agent works in controlled examples, then real users ask messier questions, combine multiple requests, omit context, or use phrasing the system was not tested against.

What It Means to Optimize Tool Calling

Tool calling optimization should be measured across four separate dimensions.

Tool-Calling Metric	What It Measures	Why It Matters
Tool Correctness	Whether the agent selected the right tool for the request	Prevents the agent from using the wrong action, such as creating a record when it should search for one
Input Accuracy	Whether the agent passed the right arguments into the tool	Reduces failures caused by incorrect names, dates, IDs, filters, or query strings
Task Completion	Whether the full user request was completed successfully	Shows whether the agent actually solved the user’s problem, not just whether it made a tool call
Task Efficiency	How many tokens, calls, retries, and dollars were required	Helps teams control cost, latency, and production scalability

The first is tool correctness. This measures whether the agent selects the right tool for the task. If a user asks the agent to find a customer record, the agent should call the search or retrieval tool, not a create or update tool.

The second is input accuracy. This measures whether the agent extracts the correct arguments for the selected tool. An agent may choose the right tool but still fail by passing the wrong customer name, date range, workspace ID, or query string.

The third is task completion. This measures whether the full user request was completed successfully. Tool correctness and input accuracy matter because they support completion, but they do not always guarantee it. A tool may return incomplete data, require a follow-up action, or fail because the agent did not know how to continue.

The fourth is task efficiency. This measures how many tokens, tool calls, retries, and dollars were required to complete the task. A system that completes a request but spends three times the required token budget may still be unsuitable for production at scale.

These metrics should be tracked separately. If you only measure whether the task worked, you will not know whether failures come from poor tool selection, bad argument extraction, weak tool design, missing permissions, or excessive context.
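To make the separation concrete, here is a minimal sketch of scoring one agent run on each dimension independently. The `ToolCall`, `RunRecord`, and `scoreRun` names are hypothetical and not part of any library, and exact-match argument comparison is an assumption; a real suite might use fuzzier matching or a judge model.

```typescript
// Hypothetical shapes for one evaluated agent run; none of these names come from a real library.
interface ToolCall {
  name: string;
  args: Record<string, string | number | boolean>;
}

interface RunRecord {
  expected: ToolCall;      // the call a reviewer decided was correct
  actual: ToolCall | null; // what the agent emitted (null means no tool call)
  completed: boolean;      // did the full user request succeed?
  inputTokens: number;
  toolCalls: number;       // total calls, including retries
}

interface RunScore {
  toolCorrect: boolean;
  inputAccuracy: number; // fraction of expected arguments matched exactly
  completed: boolean;
  inputTokens: number;
  toolCalls: number;
}

function scoreRun(run: RunRecord): RunScore {
  const actual = run.actual;
  const toolCorrect = actual !== null && actual.name === run.expected.name;

  // Input accuracy is only counted when the right tool was chosen.
  const keys = Object.keys(run.expected.args);
  let matched = 0;
  if (toolCorrect && actual !== null) {
    for (const key of keys) {
      if (actual.args[key] === run.expected.args[key]) matched += 1;
    }
  }
  const inputAccuracy =
    keys.length === 0 ? (toolCorrect ? 1 : 0) : matched / keys.length;

  return {
    toolCorrect,
    inputAccuracy,
    completed: run.completed,
    inputTokens: run.inputTokens,
    toolCalls: run.toolCalls,
  };
}

// Right tool, one of two arguments wrong, task not completed.
const exampleScore = scoreRun({
  expected: { name: "search_customers", args: { query: "Acme", limit: 5 } },
  actual: { name: "search_customers", args: { query: "Acme", limit: 10 } },
  completed: false,
  inputTokens: 1800,
  toolCalls: 2,
});
```

In this example the run scores as correct on tool selection but only half right on inputs, which points at argument extraction rather than tool choice as the place to look.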

Optimizing Tool Calling for AI Agents

Start With Tool Design

When tool calling fails, many teams first consider upgrading to a larger model. This can help in some cases, but it is usually the wrong first lever. Tool design often has a larger impact because the model can only reason over the tools and schemas it receives.

A tool description is part of the prompt. If the description is too short, the model may not understand when to use the tool. If the description is too long, it adds token cost every time the tool is loaded and may distract the model with unnecessary detail.

The goal is a concise but informative description. It should explain what the tool does, when to use it, what it returns, and any important constraints. It should not include generic marketing language, irrelevant implementation detail, or lengthy examples unless they materially improve tool selection.

For example, “Search Notion” is too vague. “Searches connected Notion workspaces for pages matching a natural-language query and returns page titles, IDs, URLs, and matching snippets” gives the model enough context to choose the tool correctly.

Input design matters just as much. Engineers often expose tools in a way that mirrors backend function arguments or raw API endpoints. This can work well for code, but it is harder for a language model. The model has to infer every argument from natural language, and each nested field adds another chance for failure.

Flatter schemas are usually better. A tool with one well-described query parameter can outperform a tool with deeply nested parameters, even if both eventually call the same backend API. The model’s job should be to express the user’s intent clearly. The server-side tool implementation can handle the API-specific details.
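As an illustration, the sketch below contrasts a nested input that mirrors a raw API request body with a flat, task-oriented one. The `SchemaNode` type, the tool name `search_notion_pages`, and the `schemaDepth` helper are all hypothetical; the nested shape is invented for illustration and is not a copy of any real API.

```typescript
// Minimal JSON-Schema-like type, just enough for this sketch.
interface SchemaNode {
  type: "object" | "string" | "number" | "array";
  description?: string;
  properties?: Record<string, SchemaNode>;
  items?: SchemaNode;
}

// Mirrors a raw API request body: the model must fill three levels of nesting.
const nestedInput: SchemaNode = {
  type: "object",
  properties: {
    filter: {
      type: "object",
      properties: {
        property: { type: "string" },
        value: {
          type: "object",
          properties: { equals: { type: "string" } },
        },
      },
    },
  },
};

// Flat alternative: one well-described parameter that expresses intent.
const flatInput: SchemaNode = {
  type: "object",
  properties: {
    query: {
      type: "string",
      description: "What the user is looking for, in plain language.",
    },
  },
};

// Hypothetical tool definition using the flat input.
const searchNotionTool = {
  name: "search_notion_pages",
  description:
    "Searches connected Notion workspaces for pages matching a natural-language query and returns page titles, IDs, URLs, and matching snippets.",
  inputSchema: flatInput,
};

// Counts how many levels of object nesting the model has to fill in.
function schemaDepth(node: SchemaNode): number {
  if (node.type === "array" && node.items) return schemaDepth(node.items);
  if (node.type !== "object" || !node.properties) return 0;
  const childDepths = Object.values(node.properties).map((child) =>
    schemaDepth(child)
  );
  return 1 + Math.max(0, ...childDepths);
}
```

A depth check like this could run in CI as a rough guard against schemas drifting toward backend-shaped inputs. The translation from `query` to the API's filter format would then live in the server-side tool implementation.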

Wrap Multi-Step Workflows Into Atomic Tools

Many useful agent actions require more than one API call. Searching for a document may require one request to find matching files and another request to retrieve the file content. Creating a CRM follow-up task may require finding the contact, validating the account, and then creating the task.

If each API endpoint is exposed as a separate tool, the agent has to orchestrate the sequence. This creates more failure points. It may call the first tool correctly, misunderstand the result, skip the next step, or pass the wrong ID into the second call.

A better approach is to wrap common multi-step sequences into atomic tools. Instead of exposing “search page” and “get page contents” separately, expose a tool that searches for a page and returns its contents. Instead of exposing separate low-level CRM endpoints, expose a tool that finds a contact and creates a note or task in one operation.

Atomic tools reduce reasoning burden. They also make evaluation easier because the expected behavior is clearer. The agent does not need to know how the external API is structured. It only needs to understand the user’s task and choose the tool designed for the task.
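The following sketch shows the shape of such a wrapper. It uses an in-memory array as a stand-in for a document API, and the function names (`searchPageIds`, `getPageContents`, `findPageWithContents`) are hypothetical. A real implementation would make asynchronous API calls; the steps are synchronous here only to keep the example self-contained.

```typescript
interface Page {
  id: string;
  title: string;
  body: string;
}

// In-memory stand-in for a document API (made-up data).
const pages: Page[] = [
  { id: "p1", title: "Q3 Launch Plan", body: "Ship the beta in September." },
  { id: "p2", title: "Onboarding Checklist", body: "Create accounts, then schedule training." },
];

// Low-level steps that mirror two separate API endpoints.
function searchPageIds(query: string): string[] {
  const q = query.toLowerCase();
  return pages
    .filter((p) => p.title.toLowerCase().includes(q))
    .map((p) => p.id);
}

function getPageContents(id: string): Page | undefined {
  return pages.find((p) => p.id === id);
}

// Atomic tool: the model supplies one query and gets content back.
// The ID hand-off between the two steps never passes through the model.
function findPageWithContents(
  query: string
): { found: boolean; title?: string; body?: string } {
  const [firstId] = searchPageIds(query);
  if (firstId === undefined) return { found: false };
  const page = getPageContents(firstId);
  if (page === undefined) return { found: false };
  return { found: true, title: page.title, body: page.body };
}
```

Only `findPageWithContents` would be exposed to the agent in this design; the two lower-level functions stay internal, so there is no intermediate result for the model to misread.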

The tradeoff is flexibility. Atomic tools are more opinionated than raw endpoint tools. They should be designed around common product workflows rather than every possible API operation. For production agents, the tradeoff is often worth it because reliability matters more than exposing every backend primitive to the model.

Load Only the Tools the Task Needs

Every tool placed in the model’s context has a cost. It consumes tokens, increases prompt size, and gives the model another possible option to consider. A large tool set can make an agent less accurate because the model must distinguish between many similar tools.

Tool filtering is one of the highest-impact optimizations for production agents. Instead of loading every available tool, load only the tools relevant to the current task, integration, user permissions, or conversation state.

If a user is working inside Salesforce, the agent may not need tools for Slack, Notion, Google Drive, and Zendesk. If the user asks to summarize documents, the agent may need retrieval tools but not write-action tools. If the user has not authenticated a specific integration, those tools should not be available at all.

This improves both accuracy and cost. The model sees fewer choices, so tool selection becomes easier. The prompt also becomes smaller, reducing input token usage. In agent systems with dozens or hundreds of possible actions, dynamic tool filtering is not optional. It is part of the reliability layer.
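A minimal sketch of this kind of filter follows. The `ToolDef` and `RequestContext` shapes and the tool names in `catalog` are hypothetical. How `allowWrites` and `activeIntegration` get set (a classifier, a planner model, UI state) is left open and is an assumption of this example.

```typescript
interface ToolDef {
  name: string;
  integration: string;
  kind: "read" | "write";
}

interface RequestContext {
  connectedIntegrations: string[]; // integrations the user has authenticated
  activeIntegration?: string;      // the app the user is currently working in, if known
  allowWrites: boolean;            // false for summarize or search style requests
}

// Returns only the tools worth placing in the model's context for this request.
function selectTools(all: ToolDef[], ctx: RequestContext): ToolDef[] {
  return all.filter((tool) => {
    if (!ctx.connectedIntegrations.includes(tool.integration)) return false;
    if (
      ctx.activeIntegration !== undefined &&
      tool.integration !== ctx.activeIntegration
    ) {
      return false;
    }
    if (tool.kind === "write" && !ctx.allowWrites) return false;
    return true;
  });
}

// Hypothetical catalog spanning several integrations.
const catalog: ToolDef[] = [
  { name: "salesforce_search_records", integration: "salesforce", kind: "read" },
  { name: "salesforce_update_record", integration: "salesforce", kind: "write" },
  { name: "slack_send_message", integration: "slack", kind: "write" },
  { name: "zendesk_search_tickets", integration: "zendesk", kind: "read" },
];

// A read-only request made from inside Salesforce by a user who has not connected Zendesk.
const loaded = selectTools(catalog, {
  connectedIntegrations: ["salesforce", "slack"],
  activeIntegration: "salesforce",
  allowWrites: false,
});
```

Here four candidate tools shrink to one, `salesforce_search_records`, before the model sees anything.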

Match the Model to the Harness

Model choice still matters, but it should be evaluated together with the harness, tool schemas, descriptions, and task set. A frontier model is not automatically the best tool-calling model for every agent.

In our own testing, a small, cost-efficient model paired with well-designed tools completed the task about as often as a frontier model while using roughly 3x fewer input tokens. In some cases, smaller models may be accurate enough for production at a much lower cost. In other cases, a more capable model may be justified because the task requires complex planning, ambiguity resolution, or multi-step reasoning.

The key is to test combinations rather than assume. A model can perform well with one provider’s tools but may perform differently with another provider’s schema style. A tool set with simple atomic actions may work well with a smaller model, while a raw API-style tool set may require more reasoning capacity.

Teams should evaluate model and harness combinations using their own tasks. Public benchmarks are useful for orientation, but production agents fail on the details of a specific workflow, user base, and integration set.
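One way to act on those evaluations is to pick the cheapest combination that clears a completion-rate bar on your own task set. The sketch below assumes a hypothetical `ComboResult` shape; the labels and numbers in `placeholderResults` are placeholders for illustration, not measured results.

```typescript
interface ComboResult {
  model: string;          // placeholder label, not a real benchmark entry
  harness: string;        // which tool set and schema style was used
  completionRate: number; // 0..1, measured on your own task set
  avgInputTokens: number; // mean input tokens per task
}

// Returns the lowest-token combination that meets the completion threshold,
// or undefined if none qualifies.
function pickCombo(
  results: ComboResult[],
  minCompletionRate: number
): ComboResult | undefined {
  const acceptable = results.filter(
    (r) => r.completionRate >= minCompletionRate
  );
  acceptable.sort((a, b) => a.avgInputTokens - b.avgInputTokens);
  return acceptable[0];
}

// Placeholder numbers only; substitute results from your own evals.
const placeholderResults: ComboResult[] = [
  { model: "large-model", harness: "raw-api-tools", completionRate: 0.9, avgInputTokens: 9000 },
  { model: "small-model", harness: "atomic-tools", completionRate: 0.88, avgInputTokens: 3000 },
  { model: "small-model", harness: "raw-api-tools", completionRate: 0.6, avgInputTokens: 7000 },
];

const chosen = pickCombo(placeholderResults, 0.85);
```

Treating the threshold as an explicit parameter keeps the accuracy-versus-cost decision visible rather than buried in a default model choice.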

Build an Evaluation Loop

Tool-calling quality should be tested the way software behavior is tested. Without an evaluation suite, teams end up relying on demos, intuition, and manual spot checks. This is not enough for production agents.

A useful evaluation suite does not need to be large at first. Start with 15 to 30 representative tasks. Include straightforward prompts, vague prompts, edge cases, and prompts resembling real user behavior. Real users often omit context, use shorthand, combine requests, or ask for something the agent cannot safely complete.

Each test should define the expected tool call or tool sequence. This allows you to score tool correctness automatically. For input accuracy and task completion, an LLM-as-judge can help evaluate whether the agent passed the right arguments and completed the request.

The evaluation suite should run whenever you change tool descriptions, add new tools, modify schemas, update prompts, change routing logic, or switch models. Tool-calling performance can drift even when the product change seems small. A new tool may introduce confusion with an existing tool. A longer description may increase cost without improving accuracy. A model upgrade may change argument extraction behavior.

The goal is to turn tool-calling quality into a number you can move. Track pass rate, tool correctness, input accuracy, completion rate, token usage, latency, and cost per successful task.
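A suite-level summary could be computed along these lines. The `EvalCase` and `EvalOutcome` shapes are hypothetical, and exact sequence matching is an assumption; some teams may prefer to accept any ordering or a superset of the expected tools.

```typescript
interface EvalCase {
  id: string;
  prompt: string;
  expectedTools: string[]; // expected tool sequence, in order
}

interface EvalOutcome {
  caseId: string;
  calledTools: string[];
  completed: boolean;
  costUsd: number;
}

function sameSequence(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((toolName, i) => toolName === b[i]);
}

function summarize(cases: EvalCase[], outcomes: EvalOutcome[]) {
  const byId = new Map<string, EvalOutcome>();
  for (const outcome of outcomes) byId.set(outcome.caseId, outcome);

  let toolCorrect = 0;
  let completed = 0;
  let totalCost = 0;
  for (const evalCase of cases) {
    const outcome = byId.get(evalCase.id);
    if (outcome === undefined) continue; // a missing outcome counts as a failure
    if (sameSequence(evalCase.expectedTools, outcome.calledTools)) toolCorrect += 1;
    if (outcome.completed) completed += 1;
    totalCost += outcome.costUsd;
  }

  const total = cases.length;
  return {
    toolCorrectness: total === 0 ? 0 : toolCorrect / total,
    completionRate: total === 0 ? 0 : completed / total,
    // Failed runs still cost money, so total spend is divided by successes only.
    costPerSuccessfulTask: completed === 0 ? null : totalCost / completed,
  };
}

// Two made-up cases: one passes, one picks the wrong tool and fails.
const suiteSummary = summarize(
  [
    { id: "t1", prompt: "Find the Acme account", expectedTools: ["crm_search"] },
    { id: "t2", prompt: "Add a follow-up task for Acme", expectedTools: ["crm_create_task"] },
  ],
  [
    { caseId: "t1", calledTools: ["crm_search"], completed: true, costUsd: 0.02 },
    { caseId: "t2", calledTools: ["crm_search"], completed: false, costUsd: 0.04 },
  ]
);
```

Running a summary like this on every change to descriptions, schemas, prompts, or models gives a before-and-after number for each of the metrics listed above.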

High-level takeaways

The takeaways from our tests reveal that LLM choice has the largest impact on tool calling performance. This makes sense, as model providers like OpenAI and Anthropic have set their sights on building better models for tool calling.

The other settings saw either negligible or mixed effects on tool performance.

Why agents struggle with tools at scale

It’s easy to forget that tools are still just tokens at the end of the day. LLMs decide when to call a tool based on the tool name, tool description, input names, and input descriptions. The code that makes up the tool or function call runs in your backend, not on the model provider’s side. The result of that code is then returned to the LLM.

The tool calling process means that:

Tool descriptions eat up tokens in the context window
Two round trips are required - one for the initial tool call intent and another for the tool call results

As you give your agent more and more capabilities in the form of tools, you must also reckon with tools eating up your context window and with breakdowns caused by too many tools.
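To get a feel for how quickly tool definitions add up, a rough estimate can be sketched as below. The `ToolSpec` shape and `estimateToolTokens` helper are hypothetical, and the four-characters-per-token ratio is only a coarse rule of thumb for English text. Actual counts depend on the provider's tokenizer and on how it serializes tool definitions, so treat the output as an order-of-magnitude guide.

```typescript
interface ToolSpec {
  name: string;
  description: string;
  inputSchema: object;
}

// Coarse heuristic: about 4 characters per token. This is an assumption,
// not a tokenizer; use the provider's own token counting for real budgets.
function estimateToolTokens(tools: ToolSpec[]): number {
  const serialized = JSON.stringify(tools);
  return Math.ceil(serialized.length / 4);
}

// Builds n similar hypothetical tools to compare small and large tool sets.
function makeTools(n: number): ToolSpec[] {
  return Array.from({ length: n }, (_, i) => ({
    name: `tool_${i}`,
    description: "Searches records matching a natural-language query and returns matching IDs and titles.",
    inputSchema: {
      type: "object",
      properties: { query: { type: "string", description: "Search text." } },
    },
  }));
}

const fiveToolEstimate = estimateToolTokens(makeTools(5));
const eightyToolEstimate = estimateToolTokens(makeTools(80));
```

Because every request carries the full set of loaded definitions, the difference between the two estimates is paid on each call, not once.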

A scenario with more than 80 tools may be a bit extreme, as your product may never need that many. However, filtering and tool selection will still have performance and cost benefits. Let’s talk about how to engineer an agent system that can handle an increasing array of tools.

How to implement tools at scale

If tools eat up tokens and hog the context window, the fix is to load only the relevant tools. Here are a few designs that work:

Let users decide

How many times has one of your users said, “Whoa I didn’t know the product could do that!” You’ve probably experienced it yourself when you find a new iPhone setting or keyboard hotkey.

Letting users decide what tools to give your agent not only limits the number of tools, but also makes it apparent to your users what tools are available.

For example, an agent that uses ActionKit to provide tools for different integration providers could let users enable or disable each integration, so only the tools for enabled integrations are loaded.

Multi-agent pattern

If you want your agent to decide on the right subset of tools needed to complete the user’s prompt, patterns like routing and orchestration can filter tools for your agent (3).

In the planner-worker implementation, a “planner” agent creates a plan for which integrations are necessary for the task. For example, if a user asks for their email inbox, the planner agent will respond with ['gmail']. If a user asks about Salesforce and Gmail, the planner agent will respond with ['salesforce', 'gmail'].

// Imports assume the AI SDK ("ai"), its OpenAI provider, and Zod.
import { generateObject, type ModelMessage } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Planner step: ask a small model which integrations the conversation needs.
export async function planWork(integrations: Array<string>, messages: Array<ModelMessage>) {
	const objectPrompt: ModelMessage = {
		role: "user",
		content: `Decide what integrations are needed to complete the task.
			For generic requests, do not include any integrations.`
	};
	const objectMessages = [...messages, objectPrompt];
	const { object: integrationPlan } = await generateObject({
		model: openai('gpt-5-nano'),
		schema: z.object({
			// Constrain the answer to known integrations when any are configured.
			integrations: integrations.length > 0 ? z.array(z.enum(integrations as [string, ...string[]])) : z.array(z.string()),
			integrationSpecificPrompt: z.array(z.string()),
		}),
		system: `You have access to these integrations: ${integrations.join(", ")}`,
		messages: objectMessages
	});
	return integrationPlan;
}

Based on the plan’s list of integrations, ActionKit builds tools dynamically, providing the right descriptions and input schemas for the different actions in an integration like Gmail.

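import { tool, jsonSchema } from "ai";

// toolFunction (an ActionKit tool definition) and paragonUserToken are defined elsewhere in the application.
tool({
	description: toolFunction.function.description,
	inputSchema: jsonSchema(toolFunction.function.parameters),
	// Forward the model's arguments to ActionKit, which runs the action for this user.
	execute: async (params: any) => {
		const response = await fetch(
			`https://actionkit.useparagon.com/projects/<project_id>/actions`,
			{
				method: "POST",
				body: JSON.stringify({
					action: toolFunction.function.name,
					parameters: params,
				}),
				headers: {
					Authorization: `Bearer ${paragonUserToken}`,
					"Content-Type": "application/json",
				},
			}
		);
		const output = await response.json();
		// Surface API errors to the agent loop instead of returning them as results.
		if (!response.ok) {
			throw new Error(JSON.stringify(output, null, 2));
		}
		return output;
	},
})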

A worker agent then uses only the integration-specific tools in its request to the OpenAI API.

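import { streamText, stepCountIs } from "ai";
import { openai } from "@ai-sdk/openai";

// revisedMessages and toolsForIntegration come from the planner step and tool builder above.
const result = streamText({
	model: openai('gpt-5-nano'),
	system: `You MUST use the available tools to help with the user's request.
	Do not just describe what you would do - actually call the tools! Do NOT forget inputs.`,
	messages: revisedMessages,
	// Cap the tool-calling loop at five steps.
	stopWhen: stepCountIs(5),
	tools: toolsForIntegration,
});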

End-to-end, the agent system loads only the relevant tools based on the user prompt.

MCP provided tools

MCP servers are similar to multi-agent patterns. Rather than a “planner” agent that lives in your application backend, an MCP server can decide on tools and dynamically provide them to your agent.

An important distinction here: MCP servers do NOT inherently provide tool selection out of the box. The Model Context Protocol is a standard for providing your agent with prompts, sampling, and tools. MCP servers can help select tools, but the standard does not require them to do so.

In summary, don’t rely on 3rd-party MCP servers to solve tool selection for your agent. You can build your own MCP server that handles tool selection, or opt for an agent pattern like worker-planner to handle the MCP-provided tools.

Using Paragon’s ActionKit MCP server as an example, we used the same multi-agent pattern to plan and select tools, but what the ActionKit MCP server can uniquely do is provide server-specific context, such as magic links that authenticate users directly in the chat.

Check out the ActionKit MCP server to try out different 3rd-party integration tools right in Cursor, Claude, or your own MCP client.

Secure Tool Calling Matters

Tool calling gives an agent the ability to act in external systems, so security cannot be treated as a later concern. The agent should only have access to tools the user is allowed to use and data the user is allowed to access.

For SaaS integrations, this usually means handling OAuth, scopes, token refresh, and user-level or tenant-level authorization. If an agent is acting inside a customer’s Salesforce, Google Drive, Slack, or Zendesk account, the system must know which user authorized the connection and what that user is allowed to do.

Read and write actions should be separated. A retrieval tool that searches documents should not require broad write permissions. A tool used to update CRM records should be loaded only when the user request requires it and the user has the correct authorization.

This is especially important for B2B SaaS products building agents for their own customers. Each customer may connect their own workspace, with different permissions and administrators. The agent architecture needs to prevent cross-tenant access and avoid exposing tools tied to disconnected or expired accounts.
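A minimal sketch of such a gate is shown below. The `IntegrationConnection` and `SecuredTool` shapes, the scope strings, and the `authorizedTools` function are hypothetical. Real systems would also handle token refresh and provider-specific permission models, which this example leaves out.

```typescript
interface IntegrationConnection {
  tenantId: string;
  userId: string;
  integration: string;
  scopes: string[];  // scopes granted during the OAuth flow
  expiresAt: number; // epoch milliseconds
}

interface SecuredTool {
  name: string;
  integration: string;
  requiredScopes: string[];
}

// Returns only tools backed by a live connection that belongs to this tenant
// and user and that carries every scope the tool requires.
function authorizedTools(
  tools: SecuredTool[],
  connections: IntegrationConnection[],
  tenantId: string,
  userId: string,
  now: number
): SecuredTool[] {
  const usable = connections.filter(
    (c) => c.tenantId === tenantId && c.userId === userId && c.expiresAt > now
  );
  return tools.filter((tool) =>
    usable.some(
      (c) =>
        c.integration === tool.integration &&
        tool.requiredScopes.every((scope) => c.scopes.includes(scope))
    )
  );
}

// Hypothetical tools and connections for two tenants.
const securedCatalog: SecuredTool[] = [
  { name: "drive_search_files", integration: "googledrive", requiredScopes: ["files.read"] },
  { name: "drive_update_file", integration: "googledrive", requiredScopes: ["files.read", "files.write"] },
  { name: "slack_send_message", integration: "slack", requiredScopes: ["chat.write"] },
];

const storedConnections: IntegrationConnection[] = [
  { tenantId: "tenant-a", userId: "u1", integration: "googledrive", scopes: ["files.read"], expiresAt: 2000 },
  { tenantId: "tenant-b", userId: "u9", integration: "slack", scopes: ["chat.write"], expiresAt: 2000 },
  { tenantId: "tenant-a", userId: "u1", integration: "slack", scopes: ["chat.write"], expiresAt: 500 },
];

const allowedForU1 = authorizedTools(securedCatalog, storedConnections, "tenant-a", "u1", 1000);
```

In this example the user gets only the read-only Drive tool: the write tool lacks a granted scope, the user's own Slack connection has expired, and the other tenant's Slack connection is never considered.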

Where Build Versus Buy Fits

Teams can build their own tool-calling layer. For one or two integrations, this can be reasonable. You can write the schemas, design the descriptions, wrap API sequences, manage authentication, handle errors, and evaluate performance yourself.

The cost changes as the number of integrations and actions grows. Each new SaaS app introduces a new authentication model, API structure, rate limit pattern, permission system, and set of edge cases. Each tool needs to be described in a way the model can use reliably. Each change needs to be tested. Over time, the integration layer can become a standing engineering commitment.

A tool provider helps by packaging third-party actions into agent-usable tools and handling parts of the authentication and integration layer. Paragon’s ActionKit is designed for AI agents and product workflows that need to take actions across third-party SaaS apps. It exposes third-party actions as JSON Schema tools and supports adoption through API-based workflows or an MCP server, depending on the stack.

This does not remove the need to evaluate your own agent. Tool-calling performance still depends on the model, prompts, task design, and product context. But for teams supporting many SaaS integrations, using infrastructure built for agent actions can reduce the amount of connector and tool maintenance required internally.

Frequently Asked Questions

What does it mean to optimize tool calling?

Optimizing tool calling means improving how reliably an AI agent selects the right tool, supplies the correct inputs, completes the user’s task, and does so at an acceptable token and cost budget. These are related but separate dimensions.

Does a bigger model fix tool-calling problems?

Not usually as the first step. Larger models can help with complex reasoning, but poor tool descriptions, overly complex schemas, too many tools in context, and weak evaluation practices often cause the biggest failures.

Why does my agent pick the wrong tool?

Agents often pick the wrong tool because the available tools are too similar, the descriptions are vague, or too many irrelevant tools are loaded into context. Clear descriptions and dynamic tool filtering usually improve selection.

Why does my agent pass the wrong inputs?

Input errors often come from schemas that are too nested, too abstract, or too closely modeled on backend API arguments. Flatter inputs and more task-oriented tools make argument extraction easier for the model.

How do I reduce token cost in tool calling?

Reduce the number of tools loaded into context and keep descriptions concise while still giving the model enough information to choose correctly. Avoid loading disconnected integrations, irrelevant actions, or long descriptions for tools unrelated to the task.

Conclusion

Tool calling improves when the system is designed around how language models actually choose tools and extract arguments. The highest-impact changes are usually practical: write better descriptions, simplify schemas, wrap multi-step API calls into atomic tools, filter tools dynamically, and evaluate every meaningful change.

For teams building AI agents across many SaaS integrations, the core question is whether maintaining agent-ready tools is the best use of engineering time. If your product needs reliable third-party actions with managed authentication across customer-connected apps, Paragon is worth evaluating.
