
Optimizing Tool Calling

Learn what you can do to optimize tool calling for your AI agent

This article has been updated with the release of new LLMs, specifically the update to o3 released on April 16, 2025.

Tools for LLMs have turned AI applications from question-answer machines into agents that can perform actions ranging from searching the web to calling 3rd-party APIs to query, create, and update data. On the surface, building tools for agents isn't that hard: define a function in code, provide a good description, and your AI agent can use that tool to call APIs and perform custom logic. However, in practice tool calling can be inaccurate and hard to debug. That's the problem we wanted to tackle with a series of AI agent evaluations, which we'll cover in this article.

As a result of working with many enterprise AI customers, our team wanted to answer questions such as:

  • How do we improve tool calling?

  • Does changing the system prompt help?

  • Would different agent architectures improve tool calling?

  • The list goes on…

In summary, we wanted to understand how different levers would affect tool calling performance, and ultimately how best to optimize tool calling.

Background on Tool Calling

Before we go further, let’s first quickly refresh our memories of how agent tool calling works (feel free to scroll past this section and on to the “Methodology” section if you’re already familiar).

Tool calling is how AI agents can call custom code that extends an LLM's baseline chat capabilities. In our querying vs data ingestion article, examples included tools that allow your agent to query Salesforce or search messages in Slack.

Tool calling can also be used to write data in 3rd-party platforms, such as sending a message or creating a contact in Salesforce.

Implementing tools starts with providing the following JSON format to an LLM. The JSON schema is standard across different LLM providers such as OpenAI and Anthropic, which makes it easy to slot the same tools into different models. Notice that the tool metadata JSON includes a name, description, and parameters - necessary for an AI agent to decide when to call the tool, what inputs are needed, and how to pick out the inputs from a natural language prompt.

{
    "name": "SALESFORCE_SEARCH_RECORDS_CONTACT",
    "description": "Triggered when a user wants to Search Contact in Salesforce",
    "parameters": {
        "type": "object",
        "properties": {
            "filterFormula": {
                "type": "string",
                "description": "Filter search : Search for records that match specified filters."
            },
            "includeAllFields": {
                "type": "boolean",
                "description": "Include All Fields"
            },
            "paginationParameters": {
                "type": "object",
                "description": "Pagination parameters for paginated results",
                "properties": {
                    "pageCursor": {
                        "type": "string",
                        "description": "The cursor indicating the current page"
                    }
                },
                "required": [],
                "additionalProperties": false
            }
        },
        "required": [],
        "additionalProperties": false

When our LLM decides that a tool call is necessary, we can then return a tool message with the output of the custom code that corresponds to the tool call. This is how the output of our custom code is handed back to the LLM so it can synthesize a response.

In our custom function implementation, we used ActionKit, an API that gives AI agents access to 1,000+ 3rd-party API tools. With ActionKit, creating and calling tools can be done programmatically, like in our example below, rather than manually.

import json
import os
import time

import jwt
import requests
from langchain_core.messages import ToolMessage

# `message` is the LLM's response containing the requested tool calls
outputs = []
for tool_call in message.tool_calls:
    # Sign a short-lived JWT to authenticate the request to ActionKit
    current_time = time.time()
    encoded_jwt = jwt.encode({
        "sub": "username",
        "iat": current_time,
        "exp": current_time + (60 * 60 * 24 * 7)
    }, os.environ['PARAGON_SIGNING_KEY'].replace("\\n", "\n"), algorithm="RS256")

    # Map the LLM's tool call onto an ActionKit action and its parameters
    run_actions_body = {
        "action": tool_call["name"],
        "parameters": tool_call["args"]
    }
    response = requests.post(
        "https://actionkit.useparagon.com/projects/" + os.environ['PARAGON_PROJECT_ID'] + "/actions",
        headers={"Authorization": "Bearer " + encoded_jwt},
        json=run_actions_body
    )
    tool_result = response.json()

    # Hand the tool output back to the LLM as a ToolMessage
    outputs.append(
        ToolMessage(
            content=json.dumps(tool_result),
            name=tool_call["name"],
            tool_call_id=tool_call["id"],
        )
    )

Now that we have a good picture of what tool calling is and how tools are implemented, let’s dive into how to optimize tool calling!

Methodology: A Quantitative Approach

When we talk about how to optimize tool calling, we first need to define how to measure tool calling performance.

For our evaluations, we decided on:

  1. Tool Correctness: does our agent pick the correct tool for the given task?

  2. Task Completion: does our agent use the tool correctly to complete the task?

Here’s an example using the table below.

  1. In the first test case, our agent failed in both tool correctness and task completion as it was unable to use a Salesforce query tool for its intended purpose.

  2. In the second test case, our agent used the correct tool, but whether it was because of improper inputs or an error thrown by the 3rd-party API, our agent failed to complete the task given in the original prompt.

Evaluation Framework

We used DeepEval’s testing framework for evaluating metrics across 50 test cases designed for tool calls. Each metric is scored on a scale from 0 to 1, and is calculated in the following ways:

Tool Correctness

Tool correctness is a deterministic metric that compares the tools our agent actually called against the expected tools defined in each test case.

Task Completion

Task completion is an LLM-scored metric that takes in the agent's response, tool calls, tool inputs, tool outputs, and the original prompt to evaluate whether the intended task was completed.
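
To make this concrete, here's a minimal sketch of what a single DeepEval test case for tool correctness might look like. This is not our exact harness: the prompt, output, and ToolCall usage are illustrative, and the exact arguments may vary across DeepEval versions.

from deepeval import evaluate
from deepeval.metrics import ToolCorrectnessMetric
from deepeval.test_case import LLMTestCase, ToolCall

# Illustrative single-tool test case: the prompt, the agent's one-shot answer,
# the tool the agent actually called, and the tool we expected it to call.
test_case = LLMTestCase(
    input="Find the contact named Jane Doe in Salesforce",
    actual_output="I found 1 contact named Jane Doe in Salesforce.",
    tools_called=[ToolCall(name="SALESFORCE_SEARCH_RECORDS_CONTACT")],
    expected_tools=[ToolCall(name="SALESFORCE_SEARCH_RECORDS_CONTACT")],
)

# Tool correctness is a deterministic comparison of called vs. expected tools;
# task completion (not shown here) is judged by an LLM from the full interaction.
evaluate(test_cases=[test_case], metrics=[ToolCorrectnessMetric()])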

Test Cases

With the metrics explained, let's move on to our test cases. Our team put together a series of 50 test cases.

  • Each test case has a prompt and the expected tools to be used (sketched in code after this list)

  • Across our 50 test cases, we evaluated six different 3rd-party providers

    • CRMs: Salesforce and Hubspot

    • Messaging: Slack and Gmail

    • File Storage: Google Drive and Notion

  • 36 of our 50 test cases were single-tool tasks

    • e.g. What files do I have in my Google Drive?

  • 14 of our 50 test cases involved chaining/multiple tools

    • e.g. Are there any documents in my Notion and Drive about similar topics?
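
As a rough illustration, a test case pairs a prompt with the tool(s) we expect the agent to call. The tool names below are placeholders, not our actual test bank.

# Hypothetical shape of our test bank (tool names are placeholders)
TEST_CASES = [
    {   # single-tool task
        "prompt": "What files do I have in my Google Drive?",
        "expected_tools": ["GOOGLE_DRIVE_SEARCH_FILES"],
    },
    {   # chaining / multi-tool task
        "prompt": "Are there any documents in my Notion and Drive about similar topics?",
        "expected_tools": ["NOTION_SEARCH_PAGES", "GOOGLE_DRIVE_SEARCH_FILES"],
    },
]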

Here are some more examples from our test bank:

Testing Axes: What should we improve?

I mentioned that this research started as a series of questions that all boil down to how to improve tool calling performance. However, that question needs to be broken down into a series of sub-questions that make up our testing axes.

  • What’s the best LLM for tool calling?

    • Is it OpenAI’s 4o or o3?

    • Is it Anthropic's Claude 3.5 Sonnet?

  • Do better system prompts improve tool calling?

  • How does having more/fewer tools affect an agent's ability to properly call tools?

  • How much do better tool descriptions improve performance?

Each question became an axis we evaluated against.

We approached these axes from easiest to hardest to implement when using an agent application framework. It's fairly easy to swap LLMs in and out of your application framework. It takes more effort to route questions to different agents with different numbers of tools. It's extremely manual to write better tool descriptions per tool.

We'll cover the lift each axis had on tool calling performance, and whether or not the effort to improve each axis was worth the incremental results.

Evaluation Results

Each axis was evaluated independently, meaning we weren't trying to build the best-performing model; instead, the goal was to distinguish which axis would achieve the most performance lift. The following is the baseline we established:

OpenAI’s gpt-4o model with the following system prompt:


OpenAI's gpt-4o was selected because of its wide adoption and because we wanted to see whether "reasoning" models would outperform a non-reasoning model like gpt-4o when it comes to tool calling. Our system prompt, while not descriptive, was written to guide our agent to immediately call a tool. This is due to the nature of our automated testing framework, which required us to evaluate our agent's immediate response to a prompt (one shot), in contrast to a back-and-forth chat interaction.
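
For reference, here's a minimal sketch of what one baseline, one-shot run could look like in a LangChain-style setup (consistent with the ToolMessage snippet earlier). The all_tools list and the exact wiring are assumptions for illustration, not our verbatim harness.

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

# Baseline agent: gpt-4o with all ~20 ActionKit tool schemas bound to a single model.
# all_tools stands in for the list of tool JSON definitions shown earlier.
llm = ChatOpenAI(model="gpt-4o").bind_tools(all_tools)

def run_one_shot(system_prompt: str, prompt: str):
    # One-shot evaluation: we score the agent's immediate response (including any
    # tool calls it makes) to a single prompt, with no back-and-forth chat.
    return llm.invoke([SystemMessage(content=system_prompt), HumanMessage(content=prompt)])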

LLMs:

For our first axis, we originally evaluated three LLMs: OpenAI's gpt-4o, OpenAI's o3-mini, and Claude 3.5 Sonnet. Across our 50 test cases for tool correctness and task completion, gpt-4o outperformed the others in terms of selecting the right tool. However, 3.5-sonnet and o3-mini outperformed it in terms of actually completing the intended task.

The results:

LLM              Tool Correctness    Task Completion
3.5-sonnet       67.6%               65%
o3-mini          68.6%               61.1%
gpt-4o           74.8%               53.0%

This is an interesting result because it signals that when 3.5-sonnet and o3-mini pick the right tool, they usually complete the task as well. While GPT-4o picks the right tool more often, it has trouble executing tasks properly. This was evident in test cases where GPT-4o needed to generate SQL.

Contrast that with 3.5-sonnet and o3-mini’s responses to the same prompts.

Additionally, when looking deeper into why 3.5-sonnet and o3-mini were outperformed by gpt-4o in tool correctness, we noticed that it was sometimes because they asked follow-up questions rather than using a tool immediately. Here are some examples from o3-mini:

Update:

Since our original evaluation, OpenAI has released newer models, including an update to their o3 model that they tout as their most powerful reasoning model. We ran our evaluation with the new o3 snapshot, o3-2025-04-16, and saw a significant performance improvement in tool calling compared to its predecessors.

LLM              Tool Correctness    Task Completion
3.5-sonnet       67.6%               65%
o3-mini          68.6%               61.1%
gpt-4o           74.8%               53.0%
o3-2025-04-16    77.9%               69.6%

The new o3 model is a clear winner in both tool correctness and task completion, signaling that tool calling performance is not at its limit yet.

  • Newer models will only continue to improve at tool use, making tool calling optimization an iterative exercise as new models and methods emerge.

Tool use will only become more widespread as newer models are able to use tools more reliably

System Prompt

For this axis, we tested our base system prompt against a light system prompt and a descriptive system prompt.

The light system prompt:



Helps users perform actions in 3rd-party applications. Users will ask to create records and search for records in CRMs like Salesforce and Hubspot. Users will also ask you to send messages/emails and search for messages/emails in messaging platforms like Slack and Gmail. Lastly, users will ask to search through pages and documents like in Google Drive and Notion.

The descriptive system prompt:




Overall results:

System Prompt    Tool Correctness    Task Completion
Base             74.8%               53.0%
Light            76.8%               52.1%
Descriptive      72.9%               53.8%

Given the sample size, these marginal differences weren't significant. However, the results become more interesting when isolating the outcomes from test cases that were more complex and required multiple tool calls.

Multi-tool test cases:

System Prompt    Tool Correctness    Task Completion
Base             44.1%               37.5%
Light            51.2%               42.9%
Descriptive      51.2%               53.9%

For more complex test cases, we see more of a positive impact from more descriptive system prompts. While the sample size for multi-tool test cases is smaller, since it's just a subset of our test cases, we saw that agents with more descriptive system prompts were more likely to try multiple tools. In the example below, we see that rules and examples can help agents try multiple tools and retry tools, which is necessary for more complex use cases.


Agent Routing and Number of Tools

This was the axis we were most excited to evaluate. With concepts like agent swarms, agent architectures with guidance from leaders like Anthropic, and directed agent workflows from frameworks like LangGraph, we wanted to see the impact a more sophisticated agent implementation could bring.

For this evaluation, we only tested the “Routing” architecture where prompts are “routed” to specific agents that specialize in a certain type of task.

In our case, we spun up multiple agents and equipped each agent with only ~5 tools from the same integration, compared to the ~20 tools that a single agent had to handle in our previous axes. This tested not only the "Routing" architecture but also how an agent performs with a smaller number of tools (5 vs 20 different tools). A rough sketch of this setup follows.
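
Here's a rough sketch of that routing setup, not our exact implementation: the per-integration tool lists are placeholders, and because routing was "perfect" in our evaluation, the test case itself determined which specialized agent handled the prompt.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

# Each integration-specific agent only sees ~5 tools from its own integration,
# instead of the ~20-tool set a single agent handled in the earlier axes.
# salesforce_tools, slack_tools, notion_tools stand in for those tool schemas.
agents = {
    "salesforce": llm.bind_tools(salesforce_tools),
    "slack": llm.bind_tools(slack_tools),
    "notion": llm.bind_tools(notion_tools),
}

def route(prompt: str, integration: str):
    # "Perfect" routing: the test case tells us which specialized agent to use.
    # A production router would classify the prompt first, e.g. with another LLM call.
    return agents[integration].invoke(prompt)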

GPT-4o results:

Routing          Tool Correctness    Task Completion
Routing          73.3%               55.8%
No-Routing       74.8%               53.0%

Routing didn’t seem to have much effect on the gpt-4o model. This actually made sense given that where the gpt-4o model fell short (in our LLM evaluations) was not in selecting the right tool, but in using a tool to complete the task.

Routing to an agent with fewer tools should naturally improve tool correctness, as there are fewer tools for an agent to choose from when a tool intent is present in a prompt. We didn't see this with gpt-4o (perhaps because its tool correctness was already fairly high), but we saw a more positive impact when testing routing with 3.5-sonnet.

3.5-sonnet results:

Routing          Tool Correctness    Task Completion
Routing          75.8%               60.3%
No-Routing       67.6%               65%

While it was surprising to see that task completion performance dropped, agent routing had a significant positive impact on tool correctness, which was a weakness of 3.5-sonnet in the original evaluation. What this tells us is that different agent architectures and routing may not be a catch-all solution, but they can address specific weaknesses of an AI agent.

In future evals, we will explore leveraging the best LLM for each step of the process, such as using gpt-4o for tool selection and 3.5-sonnet for tool use. Click here to subscribe for future updates.

Description Quality

The last axis we tested was description quality, where we compared regular tool descriptions (including parameter descriptions) against extra-detailed descriptions.

For our baseline tool descriptions, we used ActionKit's default descriptions, which are already optimized for AI agent use with concise wording and examples where relevant.

{
    "type": "function",
    "function": {
        "name": "NOTION_GET_PAGE_CONTENT",
        "description": "Triggered when a user wants to get a page content in Notion",
        "parameters": {
            "type": "object",
            "properties": {
                "blockId": {
                    "type": "string",
                    "description": "Page ID : Specify a Block or Page ID to receive all of its block\\u2019s children in order. (example: \\"59833787-2cf9-4fdf-8782-e53db20768a5\\")"
                }
            },
            "required": [
                "blockId"
            ],
            "additionalProperties": false

In our extra-detailed descriptions, we followed best practices found in resources like LangChain's best practices for tool calling and LlamaIndex's guidance on better tools for agents. Most of our original ActionKit tools had concise names and descriptions, but for each tool in our "extra-detailed" set, we added additional examples and made sure to include a return schema in the description as well.

{
    "type": "function",
    "function": {
        "name": "HUBSPOT_SEARCH_RECORDS_CONTACTS",
        "description": "Triggered when a user wants to Search contacts in Hubspot. \\n Returns: an array of contact records",
        "parameters": {
            "type": "object",
            "properties": {
                "filterFormula": {
                    "type": "string",
                    "description": "Filter search : Search for Records that match specified filters. (example: email, firstname, lastname, jobtitle, lifecyclestage)"
                },
                "paginationParameters": {
                    "type": "object",
                    "description": "Pagination parameters for paginated results",
                    "properties": {
                        "pageCursor": {
                            "type": "string",
                            "description": "The cursor indicating the current page"
                        }
                    },
                    "required": [],
                    "additionalProperties": false
                }
            },
            "required": [],
            "additionalProperties": false

The results:

Description         Tool Correctness    Task Completion
regular-detailed    74.8%               53.0%
extra-detailed      74.5%               53.7%

Multi-tool Test Cases

Description         Tool Correctness    Task Completion
regular-detailed    44.1%               37.5%
extra-detailed      50%                 50%

Similar to the system prompt axis, we observed negligible differences in the overall results, but a measurable improvement in test cases that involve multiple tool calls.

Takeaways and Implications

Based on all of the results, we saw the most significant impact on our metrics coming from LLM choice. OpenAI's newest o3 update, o3-2025-04-16, performed the best across both tool correctness and task completion. Tool calling performance will only improve as models get better. Consistent testing processes are the most future-proof way to ensure you're taking advantage of the best models for tool calling.

Enhancing system prompts and tool descriptions had negligible effects on overall tool correctness and task completion, but did have a positive impact on complex test cases that required multiple tools and chaining.

Lastly, using routing to vary the number of tools per agent did not have a noticeable impact on gpt-4o's ability to select the right tool, but it did improve tool correctness when using Claude 3.5 Sonnet.

This signals that a grid search, where we try every combination of our axes (LLM, system prompt, architecture, and descriptions), is necessary to identify the best combination of settings for optimizing tool calling performance.

Limitations and Further Analysis

In this article, the breadth of different axes alongside the measurement framework gives us a very good preliminary understanding of how to optimize tool calling. Of course, there are many aspects that would further improve this evaluation:

  1. More test cases

    1. We had to manually write test cases rather than synthetically generate them, because of the way tool correctness and task completion were evaluated

  2. More test runs

    1. LLMs can respond differently even with the same prompt; additional test runs would add confidence to how an agent will respond “on average”

  3. Exploring combinations (grid search)

    1. We tested each axis independently, where each axis started from gpt-4o with a very simple system prompt

    2. A more exhaustive exercise would have been to explore system prompts across each LLM, routing vs. no-routing across each LLM + system prompt combination, and so on (a rough sketch of such a grid search follows this list)
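
As a rough sketch of what that grid search could look like (run_evaluation is a placeholder for an evaluation harness like the DeepEval setup sketched earlier, not a real function from this article):

from itertools import product

# Hypothetical grid over the axes covered in this article
llms = ["gpt-4o", "o3-2025-04-16", "claude-3-5-sonnet"]
system_prompts = ["base", "light", "descriptive"]
routing_options = [True, False]
description_styles = ["regular-detailed", "extra-detailed"]

for llm, prompt, routing, descriptions in product(llms, system_prompts, routing_options, description_styles):
    # run_evaluation would bind the chosen settings and score all 50 test cases
    scores = run_evaluation(llm=llm, system_prompt=prompt, routing=routing, tool_descriptions=descriptions)
    print(llm, prompt, routing, descriptions, scores)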

Another limitation of this exercise is the nature of test cases for tool calling. These test cases are "one-shot" evaluations where a single response is evaluated after a given prompt. If you recall the o3-mini responses where our AI agent chose to ask follow-up/clarifying questions rather than deciding on a tool, the evaluations deemed that interaction a failure.

These one-shot tests do simulate use cases where an agent needs to perform an interaction outside of a chat interface. However, for a chat interaction, test cases should simulate a chat. It's more difficult to create evaluations and test cases that involve a back-and-forth chat interaction, but not impossible.

Another really interesting topic we didn't cover in this article, but will explore in a future one, is the impact of alternative multi-agent architectures. In this evaluation, we used a "perfect" routing system where we guaranteed that a prompt was routed to the correct agent with the smaller tool set. Other interesting areas to further explore:

  • Imperfect routing with a routing agent

  • Generator-Evaluator

  • Prompt Chaining

  • Tool Repair

To learn more about these architectures, see LangGraph's documentation, Anthropic's research, and Vercel's tool calling documentation.

Conclusion

In this article, we managed to answer many of the questions we first sought to answer:

  • How do we improve tool calling?

  • Does changing the system prompt help?

  • Would different agent architectures improve tool calling?

The results emphasize the importance of LLM choice, as well as the potential for improving tool calling performance from better system prompts and descriptions. It also brought us exciting new questions about agent architecture and spoke to the potential of agents with 3rd-party actions.

If you found this article interesting and useful, check out the agent-focused API we referenced in this article: ActionKit. If you'd like to talk in person, reach out to the team at Paragon! We specialize in helping AI SaaS companies build out 3rd-party integrations, and we'd love to learn about your specific use cases.

Jack Mu, Developer Advocate

