Agent Metrics

Eval AI Library provides specialized metrics for evaluating AI agents — systems that use tools, maintain conversations, and complete tasks autonomously.

Unlike RAG metrics that focus on answer quality, agent metrics evaluate behavioral properties: Did the agent call the right tools? Did it complete the user's task? Does it stay in character? Does it remember context across conversation turns?

Available Metrics

| Metric | What It Measures | LLM Calls | Default Threshold |
|---|---|---|---|
| Tool Correctness | Whether the correct tools were called | 0 | 0.5 |
| Task Success Rate | Whether the user's goal was achieved | 4 | 0.7 |
| Role Adherence | How well the AI maintains its assigned role | 2 | 0.7 |
| Knowledge Retention | How well context is remembered across turns | 2 | 0.7 |
| Tool Error Detection | Errors in tool/function usage | 1 | 0.7 |

Test Cases for Agents

Single-Turn (Tool Use)

```python
from eval_lib import EvalTestCase

test_case = EvalTestCase(
    input="Book a flight to Paris for next Monday",
    actual_output="I've booked a flight to Paris for next Monday, March 10th.",
    tools_called=["search_flights", "book_flight"],
    expected_tools=["search_flights", "book_flight"],
)
```
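Tool Correctness needs no LLM calls because it compares `tools_called` directly against `expected_tools`. A minimal sketch of one plausible scoring rule (exact-match overlap; the library's actual formula may also weigh call order or penalize extra calls):

```python
def tool_correctness_score(tools_called, expected_tools):
    """Fraction of expected tools that were actually called.

    Illustrative only -- not the library's implementation.
    """
    if not expected_tools:
        return 1.0
    hits = sum(1 for tool in expected_tools if tool in tools_called)
    return hits / len(expected_tools)

# Both expected tools were called, so the score is 1.0,
# which clears the 0.5 default threshold.
score = tool_correctness_score(
    ["search_flights", "book_flight"],
    ["search_flights", "book_flight"],
)
print(score)
```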

Multi-Turn (Conversation)

```python
from eval_lib import ConversationalEvalTestCase, EvalTestCase

conversation = ConversationalEvalTestCase(
    chatbot_role="You are a financial advisor assistant.",
    turns=[
        EvalTestCase(
            input="I have $10,000 to invest.",
            actual_output="Let's discuss your risk tolerance and investment timeline."
        ),
        EvalTestCase(
            input="I'm moderate risk, investing for 5 years.",
            actual_output="For a moderate risk profile with a 5-year horizon, I recommend a diversified portfolio: 60% stocks, 30% bonds, 10% alternatives."
        ),
    ]
)
```
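Multi-turn metrics such as Knowledge Retention and Role Adherence read the whole conversation, so when debugging a low score it helps to flatten the turns into a readable transcript. A small stdlib-only helper (the turn pairs here mirror the example above; `format_transcript` is a hypothetical utility, not part of the library):

```python
def format_transcript(turns):
    """Render (input, actual_output) turn pairs as a user/assistant transcript."""
    lines = []
    for turn in turns:
        lines.append(f"User: {turn['input']}")
        lines.append(f"Assistant: {turn['actual_output']}")
    return "\n".join(lines)

transcript = format_transcript([
    {"input": "I have $10,000 to invest.",
     "actual_output": "Let's discuss your risk tolerance and investment timeline."},
    {"input": "I'm moderate risk, investing for 5 years.",
     "actual_output": "For a moderate risk profile with a 5-year horizon, "
                      "I recommend a diversified portfolio."},
])
print(transcript)
```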

Choosing Metrics

The right combination of metrics depends on your agent's architecture:

  • Tool-using agents (API calls, function calling) — focus on Tool Correctness and Tool Error Detection to ensure the agent picks the right tools and uses them correctly.
  • Conversational assistants (customer support, advisors) — use Task Success, Role Adherence, and Knowledge Retention to verify the agent helps users effectively while staying in character.
  • Complex agents with tools + conversation — combine all metrics for comprehensive evaluation.

| Scenario | Recommended Metrics |
|---|---|
| Tool-using agents | Tool Correctness + Tool Error Detection |
| Conversational assistants | Task Success + Role Adherence + Knowledge Retention |
| Multi-turn agents with tools | All agent metrics |
| Task completion evaluation | Task Success Rate |
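If you configure evaluations per scenario, the table above can be encoded as a simple lookup. The metric class names mirror those documented in this guide, but the mapping itself is an illustrative sketch, not an API the library exposes:

```python
# Scenario -> recommended metric classes (sketch; adjust to your agent).
SCENARIO_METRICS = {
    "tool_using": [
        "ToolCorrectnessMetric",
        "ToolErrorDetectionMetric",
    ],
    "conversational": [
        "TaskSuccessRateMetric",
        "RoleAdherenceMetric",
        "KnowledgeRetentionMetric",
    ],
    "multi_turn_with_tools": [
        "ToolCorrectnessMetric",
        "ToolErrorDetectionMetric",
        "TaskSuccessRateMetric",
        "RoleAdherenceMetric",
        "KnowledgeRetentionMetric",
    ],
}

print(SCENARIO_METRICS["tool_using"])
```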

Quick Example

```python
import asyncio
from eval_lib import (
    evaluate,
    EvalTestCase,
    ToolCorrectnessMetric,
    TaskSuccessRateMetric,
)

test_case = EvalTestCase(
    input="What's the weather in Tokyo and convert 100 USD to JPY?",
    actual_output="Tokyo weather: 22°C, sunny. 100 USD = 15,000 JPY.",
    tools_called=["get_weather", "convert_currency"],
    expected_tools=["get_weather", "convert_currency"],
)

metrics = [
    ToolCorrectnessMetric(threshold=0.5),
    TaskSuccessRateMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```