How to Optimize Tool Calling for AI Agents
To optimize tool calling, improve the full path from user request to tool selection, input generation, and task completion. The biggest gains usually come from clearer tool descriptions, simpler schemas, atomic tools, dynamic tool filtering, model-and-harness testing, and ongoing evals rather than simply switching to a larger model.

Garrett Scott, Head of Marketing
The challenge is reliability. There is a wide gap between an agent that has access to tools and an agent that selects the right tool, supplies the right inputs, completes the task, and does so at a cost suitable for production. Many teams discover this gap after a successful demo. The agent works in controlled examples, then real users ask messier questions, combine multiple requests, omit context, or use phrasing the system was not tested against.
What It Means to Optimize Tool Calling
Tool calling optimization should be measured across four separate dimensions.
| Tool-Calling Metric | What It Measures | Why It Matters |
|---|---|---|
| Tool Correctness | Whether the agent selected the right tool for the request | Prevents the agent from using the wrong action, such as creating a record when it should search for one |
| Input Accuracy | Whether the agent passed the right arguments into the tool | Reduces failures caused by incorrect names, dates, IDs, filters, or query strings |
| Task Completion | Whether the full user request was completed successfully | Shows whether the agent actually solved the user’s problem, not just whether it made a tool call |
| Task Efficiency | How many tokens, calls, retries, and dollars were required | Helps teams control cost, latency, and production scalability |
The first is tool correctness. This measures whether the agent selects the right tool for the task. If a user asks the agent to find a customer record, the agent should call the search or retrieval tool, not a create or update tool.
The second is input accuracy. This measures whether the agent extracts the correct arguments for the selected tool. An agent may choose the right tool but still fail by passing the wrong customer name, date range, workspace ID, or query string.
The third is task completion. This measures whether the full user request was completed successfully. Tool correctness and input accuracy matter because they support completion, but they do not always guarantee it. A tool may return incomplete data, require a follow-up action, or fail because the agent did not know how to continue.
The fourth is task efficiency. This measures how many tokens, tool calls, retries, and dollars were required to complete the task. A system that completes a request but spends three times the required token budget may still be unsuitable for production at scale.
These metrics should be tracked separately. If you only measure whether the task worked, you will not know whether failures come from poor tool selection, bad argument extraction, weak tool design, missing permissions, or excessive context.
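As a minimal sketch of what tracking these dimensions separately could look like, the Python below records one result per test run and reports each metric on its own. The `RunResult` fields, the `summarize` helper, and the sample values are illustrative assumptions, not measurements or part of any particular eval framework.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one agent run on one test task (hypothetical fields)."""
    right_tool: bool     # tool correctness
    right_inputs: bool   # input accuracy
    completed: bool      # task completion
    input_tokens: int    # one component of task efficiency


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Report each dimension separately instead of one pass/fail number."""
    n = len(runs)
    if n == 0:
        raise ValueError("no runs to summarize")
    return {
        "tool_correctness": sum(r.right_tool for r in runs) / n,
        "input_accuracy": sum(r.right_inputs for r in runs) / n,
        "task_completion": sum(r.completed for r in runs) / n,
        "avg_input_tokens": sum(r.input_tokens for r in runs) / n,
    }


# Placeholder runs, invented for illustration only.
runs = [
    RunResult(True, True, True, 1200),
    RunResult(True, False, False, 1500),
    RunResult(False, False, False, 900),
    RunResult(True, True, False, 2400),
]
report = summarize(runs)
for metric, value in report.items():
    print(f"{metric}: {value}")
```

In this made-up sample, tool selection is mostly right while completion is low, which points at argument extraction or follow-up handling rather than tool choice. A single pass/fail score would hide that distinction.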
Optimizing Tool Calling for AI Agents
Start With Tool Design
When tool calling fails, many teams first consider upgrading to a larger model. This can help in some cases, but it is usually the wrong first lever. Tool design often has a larger impact because the model can only reason over the tools and schemas it receives.
A tool description is part of the prompt. If the description is too short, the model may not understand when to use the tool. If the description is too long, it adds token cost every time the tool is loaded and may distract the model with unnecessary detail.
The goal is a concise but informative description. It should explain what the tool does, when to use it, what it returns, and any important constraints. It should not include generic marketing language, irrelevant implementation detail, or lengthy examples unless they materially improve tool selection.
For example, “Search Notion” is too vague. “Searches connected Notion workspaces for pages matching a natural-language query and returns page titles, IDs, URLs, and matching snippets” gives the model enough context to choose the tool correctly.
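To make the contrast concrete, here is a hedged sketch of how the longer description might sit inside a JSON Schema style tool definition. The tool name `search_notion_pages`, the field layout, and the added "does not create or edit pages" constraint are assumptions for illustration and are not tied to a specific provider's format.

```python
import json

# Hypothetical tool definition. The layout mirrors a common function-calling
# shape (name, description, JSON Schema parameters) but is an assumption.
search_notion_pages = {
    "name": "search_notion_pages",
    "description": (
        "Searches connected Notion workspaces for pages matching a "
        "natural-language query and returns page titles, IDs, URLs, and "
        "matching snippets. Use for finding existing pages; does not create "
        "or edit pages."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "What to look for, in plain language.",
            }
        },
        "required": ["query"],
    },
}

# The serialized definition is what ends up in the model's context,
# so its length is a direct token cost on every request that loads it.
serialized = json.dumps(search_notion_pages, indent=2)
print(serialized)
print(f"{len(serialized)} characters")
```

The description states what the tool does, when to use it, what it returns, and one constraint, which are the four elements described above.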
Input design matters just as much. Engineers often expose tools in a way that mirrors backend function arguments or raw API endpoints. This works well for code, but it is harder for a language model. The model has to infer every argument from natural language, and each nested field adds another chance for failure.
Flatter schemas are usually better. A tool with one well-described query parameter can outperform a tool with deeply nested parameters, even if both eventually call the same backend API. The model’s job should be to express the user’s intent clearly. The server-side tool implementation can handle the API-specific details.
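One way to picture this split is a tool that accepts flat, model-friendly arguments and builds the nested API payload on the server side. The sketch below assumes a hypothetical search API; the payload shape and the names `build_search_payload` and `updated_after` are invented for illustration.

```python
from typing import Optional


def build_search_payload(query: str, updated_after: Optional[str] = None) -> dict:
    """Translate flat tool arguments into a nested payload (hypothetical API).

    The model only fills in `query` and, optionally, `updated_after`.
    Pagination defaults and filter nesting stay out of the tool schema.
    """
    payload: dict = {
        "query": {"text": query},
        "page": {"size": 10},
    }
    if updated_after is not None:
        payload["filter"] = {
            "timestamp": {"field": "last_edited", "after": updated_after}
        }
    return payload


print(build_search_payload("q3 roadmap"))
print(build_search_payload("q3 roadmap", updated_after="2024-01-01"))
```

The model sees two simple parameters, while the nesting the backend requires is produced deterministically in code rather than inferred from natural language.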
Wrap Multi-Step Workflows Into Atomic Tools
Many useful agent actions require more than one API call. Searching for a document may require one request to find matching files and another request to retrieve the file content. Creating a CRM follow-up task may require finding the contact, validating the account, and then creating the task.
If each API endpoint is exposed as a separate tool, the agent has to orchestrate the sequence. This creates more failure points. It may call the first tool correctly, misunderstand the result, skip the next step, or pass the wrong ID into the second call.
A better approach is to wrap common multi-step sequences into atomic tools. Instead of exposing “search page” and “get page contents” separately, expose one tool that searches for a page and returns its contents. Instead of exposing separate low-level CRM endpoints, expose one tool that finds a contact and creates a note or task in a single operation.
Atomic tools reduce reasoning burden. They also make evaluation easier because the expected behavior is clearer. The agent does not need to know how the external API is structured. It only needs to understand the user’s task and choose the tool designed for the task.
The tradeoff is flexibility. Atomic tools are more opinionated than raw endpoint tools. They should be designed around common product workflows rather than every possible API operation. For production agents, the tradeoff is often worth it because reliability matters more than exposing every backend primitive to the model.
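The page-search case above could be wrapped roughly as follows. This is a sketch under stated assumptions: the in-memory `PAGES` dictionary stands in for a real API, and the function names are hypothetical.

```python
# In-memory stand-in for a document API (illustrative data only).
PAGES = {
    "p1": {"title": "Q3 Roadmap", "content": "Ship filtering in July."},
    "p2": {"title": "Onboarding Guide", "content": "Start with the quickstart."},
}


def search_pages(query: str) -> list[str]:
    """Low-level step 1: IDs of pages whose title contains the query."""
    q = query.lower()
    return [pid for pid, page in PAGES.items() if q in page["title"].lower()]


def get_page_content(page_id: str) -> str:
    """Low-level step 2: fetch the content for one page ID."""
    return PAGES[page_id]["content"]


def find_page_with_content(query: str) -> dict:
    """Atomic tool: the only function exposed to the model.

    It chains search and retrieval in code, so the model never has to
    carry a page ID from one tool call into the next.
    """
    ids = search_pages(query)
    if not ids:
        return {"found": False, "message": f"No page matched '{query}'."}
    page_id = ids[0]
    return {
        "found": True,
        "page_id": page_id,
        "title": PAGES[page_id]["title"],
        "content": get_page_content(page_id),
    }


print(find_page_with_content("roadmap"))
print(find_page_with_content("pricing"))
```

Returning a structured "not found" result, rather than raising, gives the agent something it can explain to the user instead of a failed call.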
Load Only the Tools the Task Needs
Every tool placed in the model’s context has a cost. It consumes tokens, increases prompt size, and gives the model another possible option to consider. A large tool set can make an agent less accurate because the model must distinguish between many similar tools.
Tool filtering is one of the highest-impact optimizations for production agents. Instead of loading every available tool, load only the tools relevant to the current task, integration, user permissions, or conversation state.
If a user is working inside Salesforce, the agent may not need tools for Slack, Notion, Google Drive, and Zendesk. If the user asks to summarize documents, the agent may need retrieval tools but not write-action tools. If the user has not authenticated a specific integration, those tools should not be available at all.
This improves both accuracy and cost. The model sees fewer choices, so tool selection becomes easier. The prompt also becomes smaller, reducing input token usage. In agent systems with dozens or hundreds of possible actions, dynamic tool filtering is not optional. It is part of the reliability layer.
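A minimal sketch of dynamic filtering, assuming each tool is tagged with its integration and whether it writes data. The `Tool` record, the tool names, and the `select_tools` helper are hypothetical; a real system would derive the connected set from its own auth state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tool:
    name: str
    integration: str
    writes: bool  # True for tools that create or modify data


ALL_TOOLS = [
    Tool("salesforce_search_records", "salesforce", writes=False),
    Tool("salesforce_update_record", "salesforce", writes=True),
    Tool("slack_send_message", "slack", writes=True),
    Tool("notion_search_pages", "notion", writes=False),
]


def select_tools(
    tools: list[Tool],
    connected: set[str],
    active_integration: Optional[str] = None,
    allow_writes: bool = False,
) -> list[Tool]:
    """Return only the tools worth placing in the model's context."""
    selected = []
    for tool in tools:
        if tool.integration not in connected:
            continue  # user has not authenticated this integration
        if active_integration is not None and tool.integration != active_integration:
            continue  # unrelated to where the user is working
        if tool.writes and not allow_writes:
            continue  # read-only task: hide write actions
        selected.append(tool)
    return selected


tools_for_turn = select_tools(
    ALL_TOOLS,
    connected={"salesforce", "notion"},
    active_integration="salesforce",
)
print([t.name for t in tools_for_turn])
```

With these inputs only the Salesforce search tool is loaded, so the model chooses among one option instead of four.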
Match the Model to the Harness
Model choice still matters, but it should be evaluated together with the harness, tool schemas, descriptions, and task set. A frontier model is not automatically the best tool-calling model for every agent.
In our own testing, a small, cost-efficient model paired with well-designed tools completed the task about as often as a frontier model while using roughly one-third of the input tokens. In some cases, a smaller model may be accurate enough for production at a much lower cost. In other cases, a more capable model may be justified because the task requires complex planning, ambiguity resolution, or multi-step reasoning.
The key is to test combinations rather than assume. A model can perform well with one provider’s tools but may perform differently with another provider’s schema style. A tool set with simple atomic actions may work well with a smaller model, while a raw API-style tool set may require more reasoning capacity.
Teams should evaluate model and harness combinations using their own tasks. Public benchmarks are useful for orientation, but production agents fail on the details of a specific workflow, user base, and integration set.
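One hedged way to compare combinations is to group recorded runs by model-and-tool-set pairing and report completion rate next to input tokens per successful task. The combination labels and every number below are placeholders invented for illustration, not results from any test.

```python
from collections import defaultdict

# Placeholder run records; all values are made up for this sketch.
RUNS = [
    {"combo": "small model + atomic tools", "completed": True, "input_tokens": 1000},
    {"combo": "small model + atomic tools", "completed": True, "input_tokens": 1200},
    {"combo": "small model + atomic tools", "completed": False, "input_tokens": 800},
    {"combo": "large model + raw API tools", "completed": True, "input_tokens": 3000},
    {"combo": "large model + raw API tools", "completed": True, "input_tokens": 3500},
    {"combo": "large model + raw API tools", "completed": True, "input_tokens": 2500},
]


def compare(runs: list[dict]) -> dict[str, dict[str, float]]:
    """Summarize each model/harness combination on the same task set."""
    totals = defaultdict(lambda: {"runs": 0, "successes": 0, "input_tokens": 0})
    for run in runs:
        t = totals[run["combo"]]
        t["runs"] += 1
        t["successes"] += int(run["completed"])
        t["input_tokens"] += run["input_tokens"]

    report = {}
    for combo, t in totals.items():
        report[combo] = {
            "completion_rate": t["successes"] / t["runs"],
            # Failed runs still consume tokens, so divide the total
            # spend by the number of successes.
            "tokens_per_success": (
                t["input_tokens"] / t["successes"] if t["successes"] else float("inf")
            ),
        }
    return report


report = compare(RUNS)
for combo, stats in report.items():
    print(combo, stats)
```

Putting both numbers side by side keeps the decision explicit: a combination can win on completion rate and still lose on cost per successful task, or the reverse.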
Build an Evaluation Loop
Tool-calling quality should be tested the way software behavior is tested. Without an evaluation suite, teams end up relying on demos, intuition, and manual spot checks. This is not enough for production agents.
A useful evaluation suite does not need to be large at first. Start with 15 to 30 representative tasks. Include straightforward prompts, vague prompts, edge cases, and prompts resembling real user behavior. Real users often omit context, use shorthand, combine requests, or ask for something the agent cannot safely complete.
Each test should define the expected tool call or tool sequence. This allows you to score tool correctness automatically. For input accuracy and task completion, an LLM-as-judge can help evaluate whether the agent passed the right arguments and completed the request.
The evaluation suite should run whenever you change tool descriptions, add new tools, modify schemas, update prompts, change routing logic, or switch models. Tool-calling performance can drift even when the product change seems small. A new tool may introduce confusion with an existing tool. A longer description may increase cost without improving accuracy. A model upgrade may change argument extraction behavior.
The goal is to turn tool-calling quality into a number you can move. Track pass rate, tool correctness, input accuracy, completion rate, token usage, latency, and cost per successful task.
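The automatic part of that loop, scoring tool correctness against an expected tool sequence, can be sketched in a few lines. Everything here is an assumption for illustration: the test cases, the tool names, and especially `stub_agent`, a keyword router standing in for a real agent that would return the names of the tools it called.

```python
from typing import Callable

# Hypothetical eval cases: each prompt lists the tool sequence we expect.
EVAL_CASES = [
    {"prompt": "Find the Acme account", "expected_tools": ["search_contacts"]},
    {"prompt": "Add a follow-up task for Acme", "expected_tools": ["create_followup_task"]},
    {"prompt": "acme?", "expected_tools": ["search_contacts"]},  # vague shorthand
]


def stub_agent(prompt: str) -> list[str]:
    """Placeholder for a real agent run; returns the tools it 'called'."""
    text = prompt.lower()
    if "task" in text:
        return ["create_followup_task"]
    if "find" in text:
        return ["search_contacts"]
    return []


def score_tool_correctness(
    agent: Callable[[str], list[str]], cases: list[dict]
) -> tuple[float, list[str]]:
    """Return the pass rate and the prompts whose tool sequence was wrong."""
    failures = [c["prompt"] for c in cases if agent(c["prompt"]) != c["expected_tools"]]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


pass_rate, failures = score_tool_correctness(stub_agent, EVAL_CASES)
print(f"tool correctness: {pass_rate:.2f}")
print("failed prompts:", failures)
```

The stub fails on the vague prompt, which is the kind of case a demo rarely covers. Rerunning the same function after each description, schema, or model change shows whether the number moved.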
Secure Tool Calling Matters
Tool calling gives an agent the ability to act in external systems, so security cannot be treated as a later concern. The agent should only have access to tools the user is allowed to use and data the user is allowed to access.
For SaaS integrations, this usually means handling OAuth, scopes, token refresh, and user-level or tenant-level authorization. If an agent is acting inside a customer’s Salesforce, Google Drive, Slack, or Zendesk account, the system must know which user authorized the connection and what that user is allowed to do.
Read and write actions should be separated. A retrieval tool that searches documents should not require broad write permissions. A tool used to update CRM records should be loaded only when the user request requires it and the user has the correct authorization.
This is especially important for B2B SaaS products building agents for their own customers. Each customer may connect their own workspace, with different permissions and administrators. The agent architecture needs to prevent cross-tenant access and avoid exposing tools tied to disconnected or expired accounts.
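A minimal sketch of the checks described in this section, run before any tool executes. The `Connection` record, the scope names, and the `authorize` helper are hypothetical; a real implementation would read this state from its OAuth and tenant store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """One user's authorized link to an integration (hypothetical record)."""
    tenant_id: str
    user_id: str
    integration: str
    scopes: frozenset
    expired: bool = False


class ToolAccessError(Exception):
    """Raised when a tool call must not proceed."""


def authorize(conn: Connection, tenant_id: str, integration: str, required_scope: str) -> None:
    """Raise ToolAccessError unless this connection may run the tool."""
    if conn.tenant_id != tenant_id:
        raise ToolAccessError("cross-tenant access denied")
    if conn.integration != integration:
        raise ToolAccessError("connection belongs to a different integration")
    if conn.expired:
        raise ToolAccessError("connection expired; re-authentication required")
    if required_scope not in conn.scopes:
        raise ToolAccessError(f"missing scope: {required_scope}")


conn = Connection("tenant-a", "user-1", "salesforce", frozenset({"read"}))

authorize(conn, "tenant-a", "salesforce", "read")  # read tool: allowed
try:
    authorize(conn, "tenant-a", "salesforce", "write")  # write tool: blocked
except ToolAccessError as err:
    print("blocked:", err)
```

Keeping read and write as separate scopes means a retrieval tool never needs broad write permission, and the same check can decide which tools are loaded into context in the first place.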
Where Build Versus Buy Fits
Teams can build their own tool-calling layer. For one or two integrations, this can be reasonable. You can write the schemas, design the descriptions, wrap API sequences, manage authentication, handle errors, and evaluate performance yourself.
The cost changes as the number of integrations and actions grows. Each new SaaS app introduces a new authentication model, API structure, rate limit pattern, permission system, and set of edge cases. Each tool needs to be described in a way the model can use reliably. Each change needs to be tested. Over time, the integration layer can become a standing engineering commitment.
A tool provider helps by packaging third-party actions into agent-usable tools and handling parts of the authentication and integration layer. Paragon’s ActionKit is designed for AI agents and product workflows that need to take actions across third-party SaaS apps. It exposes third-party actions as JSON Schema tools and supports adoption through API-based workflows or an MCP server, depending on the stack.
This does not remove the need to evaluate your own agent. Tool-calling performance still depends on the model, prompts, task design, and product context. But for teams supporting many SaaS integrations, using infrastructure built for agent actions can reduce the amount of connector and tool maintenance required internally.
Frequently Asked Questions
What does it mean to optimize tool calling?
Optimizing tool calling means improving how reliably an AI agent selects the right tool, supplies the correct inputs, completes the user’s task, and does so at an acceptable token and cost budget. These are related but separate dimensions.
Does a bigger model fix tool-calling problems?
Not usually as the first step. Larger models can help with complex reasoning, but poor tool descriptions, overly complex schemas, too many tools in context, and weak evaluation practices often cause the biggest failures.
Why does my agent pick the wrong tool?
Agents often pick the wrong tool because the available tools are too similar, the descriptions are vague, or too many irrelevant tools are loaded into context. Clear descriptions and dynamic tool filtering usually improve selection.
Why does my agent pass the wrong inputs?
Input errors often come from schemas that are too nested, too abstract, or too closely modeled on backend API arguments. Flatter inputs and more task-oriented tools make argument extraction easier for the model.
How do I reduce token cost in tool calling?
Reduce the number of tools loaded into context and keep descriptions concise while still giving the model enough information to choose correctly. Avoid loading disconnected integrations, irrelevant actions, or long descriptions for tools unrelated to the task.
Conclusion
Tool calling improves when the system is designed around how language models actually choose tools and extract arguments. The highest-impact changes are usually practical: write better descriptions, simplify schemas, wrap multi-step API calls into atomic tools, filter tools dynamically, and evaluate every meaningful change.
For teams building AI agents across many SaaS integrations, the core question is whether maintaining agent-ready tools is the best use of engineering time. If your product needs reliable third-party actions with managed authentication across customer-connected apps, Paragon is worth evaluating.




