Tool Error Detection

The Tool Error Detection metric identifies errors in an AI agent's tool/function usage patterns.

AI agents that use tools can fail in subtle ways beyond simply calling the wrong function. They might pass incorrect parameter types, call tools in the wrong order (e.g., trying to process data before fetching it), ignore error responses from tools, or repeatedly retry failed operations without adjusting their approach. These errors are hard to catch with simple pass/fail testing but can significantly degrade the user experience.

This metric uses an LLM judge to analyze the agent's tool usage trace and identify specific error patterns, returning both the error type and a confidence score.

Error Types Detected

| Error Type | Description |
|---|---|
| `parameter_error` | Wrong types, missing required params, invalid values |
| `invalid_function` | Calling non-existent or unavailable functions |
| `sequence_error` | Calling tools in the wrong order |
| `result_ignored` | Ignoring tool results or error responses |
| `repeated_failure` | Making the same error multiple times |
| `error_handling` | Poor error handling, false success claims |
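
To make these categories concrete, here is a rule-based sketch of what two of them look like in a tool-call trace. This is an illustration only: the actual metric uses an LLM judge, and the `detect_simple_errors` helper and trace format below are invented for this example, not part of the library.

```python
# Illustration only: crude rule-based checks for two of the error types
# above (invalid_function, repeated_failure). The real metric delegates
# this judgment to an LLM; this sketch just shows the patterns it looks for.

def detect_simple_errors(trace, available_tools):
    """Scan an ordered tool-call trace for two obvious error patterns.

    Each trace entry is a dict like {"tool": str, "ok": bool}.
    Returns a list of (error_type, description) tuples.
    """
    errors = []
    # invalid_function: the agent called a tool that does not exist
    for call in trace:
        if call["tool"] not in available_tools:
            errors.append(("invalid_function", f"unknown tool {call['tool']!r}"))
    # repeated_failure: the same tool failed twice in a row, unchanged
    for prev, curr in zip(trace, trace[1:]):
        if prev["tool"] == curr["tool"] and not prev["ok"] and not curr["ok"]:
            errors.append(("repeated_failure", f"{curr['tool']!r} retried after failing"))
    return errors

trace = [
    {"tool": "get_user_orders", "ok": False},
    {"tool": "get_user_orders", "ok": False},  # same call retried blindly
    {"tool": "calc_totals", "ok": True},       # not in the available tool set
]
print(detect_simple_errors(trace, {"get_user_orders", "calculate_total"}))
```

An LLM judge catches far more than rules like these can (e.g. `result_ignored` requires reading the agent's final answer against the tool output), which is why the metric is judge-based.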

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | required | LLM model (`"gpt-4o"`, `"anthropic:claude-3-5-sonnet-latest"`, `"google:gemini-2.0-flash"`, `"ollama:llama3"`, or a `CustomLLMClient`) |
| `threshold` | `float` | `0.7` | Minimum confidence threshold |
| `error_types` | `list[str]` | all types | Which error types to check for |
| `verbose` | `bool` | `False` | Enable detailed logging |

Required Fields

| Field | Required |
|---|---|
| `input` | Yes |
| `actual_output` | Yes |
| `tools_called` | Recommended |
| `reasoning` | Recommended |
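
Only `input` and `actual_output` are mandatory; `tools_called` and `reasoning` give the judge the trace context it needs to spot sequencing and result-handling errors. A sketch of the two field sets as plain dicts (values are made up for illustration):

```python
# Minimal vs. recommended field sets for this metric, as plain dicts.
# Field names come from the table above; the values are illustrative.

minimal = {
    "input": "Get the user's order history.",
    "actual_output": "The user has 12 orders.",
}

recommended = {
    **minimal,  # required fields plus the trace context
    "tools_called": ["get_user_orders"],
    "reasoning": "Called get_user_orders, which returned 12 orders.",
}

print(sorted(recommended))
```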

Usage

```python
from eval_lib import ToolsErrorMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="Get the user's order history and calculate the total spent.",
    actual_output="The user has spent $1,500 total across 12 orders.",
    tools_called=["get_user_orders", "calculate_total"],
    reasoning="Called get_user_orders which returned 12 orders, then calculate_total to sum them up."
)

metric = ToolsErrorMetric(
    model="gpt-4o",
    threshold=0.7,
    error_types=["parameter_error", "sequence_error", "result_ignored"]
)

results = asyncio.run(evaluate([test_case], [metric]))
```

Check Only Specific Error Types

```python
# Only check for parameter and sequence errors
metric = ToolsErrorMetric(
    model="gpt-4o",
    threshold=0.7,
    error_types=["parameter_error", "sequence_error"]
)
```

Result Format

The evaluation log includes detected errors with confidence scores:

```python
result.evaluation_log = {
    "errors_detected": [
        {
            "type": "sequence_error",
            "confidence": 0.85,
            "description": "calculate_total was called before get_user_orders returned"
        }
    ]
}
```
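
Downstream code will usually want only the errors whose confidence clears the metric's `threshold`. A minimal sketch over a log shaped like the example above (the second, low-confidence entry is made up to show the filtering):

```python
# Illustration: keep only high-confidence errors from an evaluation log
# shaped like the example above. The 0.40 parameter_error entry is
# invented here to demonstrate that the filter drops it.

evaluation_log = {
    "errors_detected": [
        {"type": "sequence_error", "confidence": 0.85,
         "description": "calculate_total was called before get_user_orders returned"},
        {"type": "parameter_error", "confidence": 0.40,
         "description": "possible wrong argument type"},
    ]
}

threshold = 0.7  # mirrors the metric's threshold parameter
flagged = [e for e in evaluation_log["errors_detected"]
           if e["confidence"] >= threshold]
print([e["type"] for e in flagged])  # only the 0.85 sequence_error survives
```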

Cost

1 LLM API call per evaluation.