Tool Correctness¶
The Tool Correctness metric evaluates whether the AI agent called the correct tools/functions to handle the user's request.
When building AI agents that interact with external tools (APIs, databases, search engines), it's essential to verify that the agent selects the right tools for each task. An agent that calls irrelevant functions wastes resources, while one that skips necessary tools produces incomplete results. This metric compares the tools actually called against the expected tool set, with support for strict and flexible matching modes.
This is a zero-cost metric — it uses purely algorithmic comparison with no LLM API calls, making it ideal for large-scale agent evaluation.
How It Works¶
The metric supports three evaluation modes:
Non-Exact Match (Default)¶
Calculates the proportion of expected tools that were actually called, regardless of order or extra tools.
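A minimal sketch of this scoring, assuming tools are matched by name and duplicates are ignored (the metric's internal implementation may differ):

```python
def non_exact_score(expected_tools: list[str], tools_called: list[str]) -> float:
    """Proportion of expected tools that appear among the called tools."""
    if not expected_tools:
        return 1.0
    called = set(tools_called)
    matched = sum(1 for tool in expected_tools if tool in called)
    return matched / len(expected_tools)

# Extra tools never lower the score; missing expected tools do.
non_exact_score(["search", "book"], ["search", "validate", "book"])  # → 1.0
non_exact_score(["search", "book"], ["search"])                      # → 0.5
```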
Exact Match¶
Requires the exact same set of tools to be called — no extra tools, no missing tools.
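In set terms, this mode can be sketched as follows (assuming only set membership matters, not call order or duplicates):

```python
def exact_match_score(expected_tools: list[str], tools_called: list[str]) -> float:
    """1.0 only when the called tools equal the expected set,
    with nothing extra and nothing missing; otherwise 0.0."""
    return 1.0 if set(tools_called) == set(expected_tools) else 0.0

exact_match_score(["search", "book"], ["search", "book"])              # → 1.0
exact_match_score(["search", "book"], ["search", "validate", "book"])  # → 0.0
```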
Ordering Check¶
Uses the Longest Common Subsequence (LCS) algorithm to evaluate whether tools were called in the expected order, with weighted scoring.
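The exact weighting the metric applies is not documented here; as a sketch, the ordering score can be approximated by normalizing the LCS length by the number of expected tools:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def ordering_score(expected_tools: list[str], tools_called: list[str]) -> float:
    """Fraction of expected tools called in the expected relative order.
    An assumed normalization; the library may weight this differently."""
    if not expected_tools:
        return 1.0
    return lcs_length(expected_tools, tools_called) / len(expected_tools)

ordering_score(["search", "book"], ["search", "validate", "book"])  # → 1.0
ordering_score(["search", "book"], ["book", "search"])              # → 0.5
```

Note that extra interleaved tools do not break the ordering check, since LCS only requires the expected tools to appear in the right relative order.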
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| `threshold` | `float` | `0.5` | Minimum score required to pass |
| `exact_match` | `bool` | `False` | Require an exact tool set match |
| `check_ordering` | `bool` | `False` | Evaluate tool call order |
Required Fields¶
| Field | Required |
|---|---|
| `tools_called` | Yes |
| `expected_tools` | Yes |
Usage¶
```python
from eval_lib import ToolCorrectnessMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="Search for hotels in Paris and book the cheapest one.",
    actual_output="I found and booked Hotel Le Marais for $120/night.",
    tools_called=["search_hotels", "sort_by_price", "book_hotel"],
    expected_tools=["search_hotels", "book_hotel"],
)

# Non-exact (default): passes, since all expected tools were called
metric = ToolCorrectnessMetric(threshold=0.5)

# Exact: fails, since the extra tool "sort_by_price" was called
metric_exact = ToolCorrectnessMetric(threshold=0.5, exact_match=True)

# With ordering: checks that search happened before booking
metric_ordered = ToolCorrectnessMetric(threshold=0.5, check_ordering=True)

results = asyncio.run(evaluate([test_case], [metric, metric_exact, metric_ordered]))
```
Scoring Examples¶
Non-Exact Match¶
```python
expected_tools = ["search", "book"]

tools_called = ["search", "validate", "book"]
# Score: 1.0 (all expected tools found)

tools_called = ["search"]
# Score: 0.5 (1 of 2 expected tools found)

tools_called = ["validate"]
# Score: 0.0 (none of the expected tools found)
```
Exact Match¶
```python
expected_tools = ["search", "book"]

tools_called = ["search", "book"]
# Score: 1.0 (exact match)

tools_called = ["search", "validate", "book"]
# Score: 0.0 (extra tool present)
```
Cost¶
0 LLM API calls — purely algorithmic comparison.