# Agent Metrics
Eval AI Library provides specialized metrics for evaluating AI agents — systems that use tools, maintain conversations, and complete tasks autonomously.
Unlike RAG metrics that focus on answer quality, agent metrics evaluate behavioral properties: Did the agent call the right tools? Did it complete the user's task? Does it stay in character? Does it remember context across conversation turns?
## Available Metrics
| Metric | What It Measures | LLM Calls | Default Threshold |
|---|---|---|---|
| Tool Correctness | Whether correct tools were called | 0 | 0.5 |
| Task Success Rate | Whether the user's goal was achieved | 4 | 0.7 |
| Role Adherence | How well the AI maintains its role | 2 | 0.7 |
| Knowledge Retention | How well context is remembered across turns | 2 | 0.7 |
| Tool Error Detection | Errors in tool/function usage | 1 | 0.7 |
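Tool Correctness is the only metric in the table that makes no LLM calls, so it can be scored deterministically. The following is a minimal sketch of one plausible scoring rule (fraction of expected tools actually called) — not necessarily the library's exact formula, which may also weigh ordering or penalize extra calls:

```python
def tool_correctness(tools_called: list[str], expected_tools: list[str]) -> float:
    """Fraction of expected tools that were actually called.

    Illustrative sketch only; the library's exact formula may differ
    (e.g. call ordering, penalties for unexpected extra calls).
    """
    if not expected_tools:
        return 1.0  # nothing was required, so nothing is missing
    hits = sum(1 for tool in expected_tools if tool in tools_called)
    return hits / len(expected_tools)

print(tool_correctness(["search_flights", "book_flight"],
                       ["search_flights", "book_flight"]))  # 1.0
print(tool_correctness(["search_flights"],
                       ["search_flights", "book_flight"]))  # 0.5
```

With the default threshold of 0.5, an agent that called only one of two expected tools would sit exactly at the pass boundary under this rule.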
## Test Cases for Agents

### Single-Turn (Tool Use)
```python
from eval_lib import EvalTestCase

test_case = EvalTestCase(
    input="Book a flight to Paris for next Monday",
    actual_output="I've booked a flight to Paris for next Monday, March 10th.",
    tools_called=["search_flights", "book_flight"],
    expected_tools=["search_flights", "book_flight"],
)
```
### Multi-Turn (Conversation)
```python
from eval_lib import ConversationalEvalTestCase, EvalTestCase

conversation = ConversationalEvalTestCase(
    chatbot_role="You are a financial advisor assistant.",
    turns=[
        EvalTestCase(
            input="I have $10,000 to invest.",
            actual_output="Let's discuss your risk tolerance and investment timeline.",
        ),
        EvalTestCase(
            input="I'm moderate risk, investing for 5 years.",
            actual_output="For a moderate risk profile with a 5-year horizon, I recommend a diversified portfolio: 60% stocks, 30% bonds, 10% alternatives.",
        ),
    ],
)
```
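Conversational metrics such as Role Adherence and Knowledge Retention judge the dialogue as a whole, so every turn's input and output must be visible to the judge model together. A minimal sketch of flattening turns into a single transcript — the library's internal prompt format is an implementation detail and may differ:

```python
def format_transcript(turns: list[tuple[str, str]], role: str) -> str:
    """Render (user_input, assistant_output) pairs as one judge-readable
    transcript. Illustrative only; the library's actual prompt
    construction may differ."""
    lines = [f"System role: {role}"]
    for i, (user, assistant) in enumerate(turns, start=1):
        lines.append(f"Turn {i} - User: {user}")
        lines.append(f"Turn {i} - Assistant: {assistant}")
    return "\n".join(lines)

transcript = format_transcript(
    [("I have $10,000 to invest.",
      "Let's discuss your risk tolerance and investment timeline.")],
    role="You are a financial advisor assistant.",
)
print(transcript)
```

This is why multi-turn metrics cost extra LLM calls: the judge must reason over the accumulated transcript, not a single input/output pair.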
## Choosing Metrics
The right combination of metrics depends on your agent's architecture:
- Tool-using agents (API calls, function calling) — focus on Tool Correctness and Tool Error Detection to ensure the agent picks the right tools and uses them correctly.
- Conversational assistants (customer support, advisors) — use Task Success, Role Adherence, and Knowledge Retention to verify the agent helps users effectively while staying in character.
- Complex agents with tools + conversation — combine all metrics for comprehensive evaluation.
| Scenario | Recommended Metrics |
|---|---|
| Tool-using agents | Tool Correctness + Tool Error Detection |
| Conversational assistants | Task Success + Role Adherence + Knowledge Retention |
| Multi-turn agents with tools | All agent metrics |
| Task completion evaluation | Task Success Rate |
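Whichever combination you choose, each metric compares its score against its threshold, and a result passes when the score meets or exceeds it. A sketch of that pass/fail rule using the default thresholds from the table at the top (the dictionary keys here are illustrative labels, not library identifiers):

```python
# Default thresholds from the Available Metrics table above.
DEFAULT_THRESHOLDS = {
    "tool_correctness": 0.5,
    "task_success_rate": 0.7,
    "role_adherence": 0.7,
    "knowledge_retention": 0.7,
    "tool_error_detection": 0.7,
}

def passing_metrics(scores: dict[str, float]) -> dict[str, bool]:
    """Mark each metric as pass/fail against its default threshold."""
    return {name: score >= DEFAULT_THRESHOLDS[name]
            for name, score in scores.items()}

print(passing_metrics({"tool_correctness": 0.5, "role_adherence": 0.65}))
# {'tool_correctness': True, 'role_adherence': False}
```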
## Quick Example
```python
import asyncio

from eval_lib import (
    evaluate,
    EvalTestCase,
    ToolCorrectnessMetric,
    TaskSuccessRateMetric,
)

test_case = EvalTestCase(
    input="What's the weather in Tokyo and convert 100 USD to JPY?",
    actual_output="Tokyo weather: 22°C, sunny. 100 USD = 15,000 JPY.",
    tools_called=["get_weather", "convert_currency"],
    expected_tools=["get_weather", "convert_currency"],
)

metrics = [
    ToolCorrectnessMetric(threshold=0.5),
    TaskSuccessRateMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```