Tool Correctness

The Tool Correctness metric evaluates whether the AI agent called the correct tools/functions to handle the user's request.

When building AI agents that interact with external tools (APIs, databases, search engines), it's essential to verify that the agent selects the right tools for each task. An agent that calls irrelevant functions wastes resources, while one that skips necessary tools produces incomplete results. This metric compares the tools actually called against the expected tool set, with support for strict and flexible matching modes.

This is a zero-cost metric — it uses purely algorithmic comparison with no LLM API calls, making it ideal for large-scale agent evaluation.

How It Works

The metric supports three evaluation modes:

Non-Exact Match (Default)

Calculates the proportion of expected tools that were actually called, regardless of order or extra tools.

Exact Match

Requires the exact same set of tools to be called — no extra tools, no missing tools.

Ordering Check

Uses Longest Common Subsequence (LCS) to evaluate if tools were called in the correct order, with weighted scoring.
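As a rough illustration of the ordering check, here is an LCS-based score (an illustrative sketch, not the library's actual code; the real weighted scoring may differ):

```python
def ordering_score(expected, called):
    """Score tool-call order via Longest Common Subsequence (LCS).

    Illustrative only: normalizes LCS length by the number of
    expected tools; the library's actual weighting may differ.
    """
    m, n = len(expected), len(called)
    # dp[i][j] holds the LCS length of expected[:i] and called[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if expected[i] == called[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / m if m else 1.0

# Tools called in the correct relative order score higher:
print(ordering_score(["search", "book"], ["search", "sort", "book"]))  # 1.0
print(ordering_score(["search", "book"], ["book", "search"]))          # 0.5
```

Because LCS only cares about relative order, extra tools interleaved between expected ones do not lower the score.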

Parameters

| Parameter      | Type  | Default | Description                  |
|----------------|-------|---------|------------------------------|
| threshold      | float | 0.5     | Minimum score to pass        |
| exact_match    | bool  | False   | Require exact tool set match |
| check_ordering | bool  | False   | Evaluate tool call order     |

Required Fields

| Field          | Required |
|----------------|----------|
| tools_called   | Yes      |
| expected_tools | Yes      |

Usage

from eval_lib import ToolCorrectnessMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="Search for hotels in Paris and book the cheapest one.",
    actual_output="I found and booked Hotel Le Marais for $120/night.",
    tools_called=["search_hotels", "sort_by_price", "book_hotel"],
    expected_tools=["search_hotels", "book_hotel"],
)

# Non-exact: passes (all expected tools were called)
metric = ToolCorrectnessMetric(threshold=0.5)

# Exact: fails (extra tool "sort_by_price" was called)
metric_exact = ToolCorrectnessMetric(threshold=0.5, exact_match=True)

# With ordering: checks if search happened before booking
metric_ordered = ToolCorrectnessMetric(threshold=0.5, check_ordering=True)

results = asyncio.run(evaluate([test_case], [metric, metric_exact, metric_ordered]))

Scoring Examples

Non-Exact Match

expected_tools = ["search", "book"]
tools_called = ["search", "validate", "book"]
# Score: 1.0 (all expected tools found)

tools_called = ["search"]
# Score: 0.5 (1 of 2 expected tools found)

tools_called = ["validate"]
# Score: 0.0 (none of expected tools found)
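The non-exact scores above follow directly from the proportion formula; a minimal sketch (illustrative, not the library's internal code):

```python
def non_exact_score(expected, called):
    """Fraction of expected tools found among the called tools.

    Illustrative sketch: order and extra tools are ignored.
    """
    if not expected:
        return 1.0
    found = sum(1 for tool in expected if tool in called)
    return found / len(expected)

print(non_exact_score(["search", "book"], ["search", "validate", "book"]))  # 1.0
print(non_exact_score(["search", "book"], ["search"]))                      # 0.5
print(non_exact_score(["search", "book"], ["validate"]))                    # 0.0
```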

Exact Match

expected_tools = ["search", "book"]
tools_called = ["search", "book"]
# Score: 1.0 (exact match)

tools_called = ["search", "validate", "book"]
# Score: 0.0 (extra tool present)
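Exact matching reduces to set equality; a hedged sketch (the library's implementation may additionally account for duplicate calls):

```python
def exact_score(expected, called):
    """1.0 only if the called tools match the expected tools exactly.

    Illustrative sketch using set equality: any extra or missing
    tool drops the score to 0.0.
    """
    return 1.0 if set(called) == set(expected) else 0.0

print(exact_score(["search", "book"], ["search", "book"]))              # 1.0
print(exact_score(["search", "book"], ["search", "validate", "book"]))  # 0.0
```

This all-or-nothing behavior is why the exact-match example in the Usage section fails when the extra "sort_by_price" tool is called.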

Cost

0 LLM API calls — purely algorithmic comparison.