Tool Error Detection

The Tool Error Detection metric identifies errors in an AI agent's tool/function usage patterns.

AI agents that use tools can fail in subtle ways beyond simply calling the wrong function. They might pass incorrect parameter types, call tools in the wrong order (e.g., trying to process data before fetching it), ignore error responses from tools, or repeatedly retry failed operations without adjusting their approach. These errors are hard to catch with simple pass/fail testing but can significantly degrade the user experience.

This metric uses an LLM judge to analyze the agent's tool usage trace and identify specific error patterns, returning both the error type and a confidence score.

Error Types Detected

| Error Type | Description |
|---|---|
| `parameter_error` | Wrong types, missing required params, invalid values |
| `invalid_function` | Calling non-existent or unavailable functions |
| `sequence_error` | Calling tools in the wrong order |
| `result_ignored` | Ignoring tool results or error responses |
| `repeated_failure` | Making the same error multiple times |
| `error_handling` | Poor error handling, false success claims |
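
To make these categories concrete, here is a rule-based sketch of what two of them look like in a tool-call trace. This is an illustration only: the actual metric uses an LLM judge, and the `detect_simple_errors` helper and trace format below are invented for this example, not part of the library.

```python
# Illustration only: crude rule-based checks for two of the error types
# above (invalid_function, repeated_failure). The real metric delegates
# this judgment to an LLM; this sketch just shows the patterns it looks for.

def detect_simple_errors(trace, available_tools):
    """Scan an ordered tool-call trace for two obvious error patterns.

    Each trace entry is a dict like {"tool": str, "ok": bool}.
    Returns a list of (error_type, description) tuples.
    """
    errors = []
    # invalid_function: the agent called a tool that does not exist
    for call in trace:
        if call["tool"] not in available_tools:
            errors.append(("invalid_function", f"unknown tool {call['tool']!r}"))
    # repeated_failure: the same tool failed twice in a row, unchanged
    for prev, curr in zip(trace, trace[1:]):
        if prev["tool"] == curr["tool"] and not prev["ok"] and not curr["ok"]:
            errors.append(("repeated_failure", f"{curr['tool']!r} retried after failing"))
    return errors

trace = [
    {"tool": "get_user_orders", "ok": False},
    {"tool": "get_user_orders", "ok": False},  # same call retried blindly
    {"tool": "calc_totals", "ok": True},       # not in the available tool set
]
print(detect_simple_errors(trace, {"get_user_orders", "calculate_total"}))
```

An LLM judge catches far more than rules like these can (e.g. `result_ignored` requires reading the agent's final answer against the tool output), which is why the metric is judge-based.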

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | required | LLM model (`"gpt-4o"`, `"anthropic:claude-3-5-sonnet-latest"`, `"google:gemini-2.0-flash"`, `"ollama:llama3"`, or a `CustomLLMClient`) |
| `threshold` | `float` | `0.7` | Minimum confidence threshold |
| `error_types` | `list[str]` | all types | Which error types to check for |
| `verbose` | `bool` | `False` | Enable detailed logging |

Required Fields

| Field | Required |
|---|---|
| `input` | Yes |
| `actual_output` | Yes |
| `tools_called` | Recommended |
| `reasoning` | Recommended |
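
Only `input` and `actual_output` are mandatory; `tools_called` and `reasoning` give the judge the trace context it needs to spot sequencing and result-handling errors. A sketch of the two field sets as plain dicts (values are made up for illustration):

```python
# Minimal vs. recommended field sets for this metric, as plain dicts.
# Field names come from the table above; the values are illustrative.

minimal = {
    "input": "Get the user's order history.",
    "actual_output": "The user has 12 orders.",
}

recommended = {
    **minimal,  # required fields plus the trace context
    "tools_called": ["get_user_orders"],
    "reasoning": "Called get_user_orders, which returned 12 orders.",
}

print(sorted(recommended))
```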

Usage

```python
from eval_lib import ToolsErrorMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="Get the user's order history and calculate the total spent.",
    actual_output="The user has spent $1,500 total across 12 orders.",
    tools_called=["get_user_orders", "calculate_total"],
    reasoning="Called get_user_orders which returned 12 orders, then calculate_total to sum them up."
)

metric = ToolsErrorMetric(
    model="gpt-4o",
    threshold=0.7,
    error_types=["parameter_error", "sequence_error", "result_ignored"]
)

results = asyncio.run(evaluate([test_case], [metric]))
```

Check Only Specific Error Types

```python
# Only check for parameter and sequence errors
metric = ToolsErrorMetric(
    model="gpt-4o",
    threshold=0.7,
    error_types=["parameter_error", "sequence_error"]
)
```

Result Format

The evaluation log includes detected errors with confidence scores:

```python
result.evaluation_log = {
    "errors_detected": [
        {
            "type": "sequence_error",
            "confidence": 0.85,
            "description": "calculate_total was called before get_user_orders returned"
        }
    ]
}
```
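
Downstream code will usually want only the errors whose confidence clears the metric's `threshold`. A minimal sketch over a log shaped like the example above (the second, low-confidence entry is made up to show the filtering):

```python
# Illustration: keep only high-confidence errors from an evaluation log
# shaped like the example above. The 0.40 parameter_error entry is
# invented here to demonstrate that the filter drops it.

evaluation_log = {
    "errors_detected": [
        {"type": "sequence_error", "confidence": 0.85,
         "description": "calculate_total was called before get_user_orders returned"},
        {"type": "parameter_error", "confidence": 0.40,
         "description": "possible wrong argument type"},
    ]
}

threshold = 0.7  # mirrors the metric's threshold parameter
flagged = [e for e in evaluation_log["errors_detected"]
           if e["confidence"] >= threshold]
print([e["type"] for e in flagged])  # only the 0.85 sequence_error survives
```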

Cost

1 LLM API call per evaluation.