# Agent Metrics
Eval AI Library provides specialized metrics for evaluating AI agents — systems that use tools, maintain conversations, and complete tasks autonomously.
Unlike RAG metrics that focus on answer quality, agent metrics evaluate behavioral properties: Did the agent call the right tools? Did it complete the user's task? Does it stay in character? Does it remember context across conversation turns?
## Available Metrics
| Metric | What It Measures | LLM Calls | Default Threshold |
|---|---|---|---|
| Tool Correctness | Whether correct tools were called | 0 | 0.5 |
| Task Success Rate | Whether the user's goal was achieved | 4 | 0.7 |
| Role Adherence | How well the AI maintains its role | 2 | 0.7 |
| Knowledge Retention | How well context is remembered across turns | 2 | 0.7 |
| Tool Error Detection | Errors in tool/function usage | 1 | 0.7 |
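Tool Correctness is the only metric in the table that makes no LLM calls, so it can be scored deterministically. The following is a minimal sketch of one plausible scoring rule (fraction of expected tools actually called) — not necessarily the library's exact formula, which may also weigh ordering or penalize extra calls:

```python
def tool_correctness(tools_called: list[str], expected_tools: list[str]) -> float:
    """Fraction of expected tools that were actually called.

    Illustrative sketch only; the library's exact formula may differ
    (e.g. call ordering, penalties for unexpected extra calls).
    """
    if not expected_tools:
        return 1.0  # nothing was required, so nothing is missing
    hits = sum(1 for tool in expected_tools if tool in tools_called)
    return hits / len(expected_tools)

print(tool_correctness(["search_flights", "book_flight"],
                       ["search_flights", "book_flight"]))  # 1.0
print(tool_correctness(["search_flights"],
                       ["search_flights", "book_flight"]))  # 0.5
```

With the default threshold of 0.5, an agent that called only one of two expected tools would sit exactly at the pass boundary under this rule.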
## Test Cases for Agents

### Single-Turn (Tool Use)
```python
from eval_lib import EvalTestCase

test_case = EvalTestCase(
    input="Book a flight to Paris for next Monday",
    actual_output="I've booked a flight to Paris for next Monday, March 10th.",
    tools_called=["search_flights", "book_flight"],
    expected_tools=["search_flights", "book_flight"],
)
```
### Multi-Turn (Conversation)
```python
from eval_lib import ConversationalEvalTestCase, EvalTestCase

conversation = ConversationalEvalTestCase(
    chatbot_role="You are a financial advisor assistant.",
    turns=[
        EvalTestCase(
            input="I have $10,000 to invest.",
            actual_output="Let's discuss your risk tolerance and investment timeline.",
        ),
        EvalTestCase(
            input="I'm moderate risk, investing for 5 years.",
            actual_output="For a moderate risk profile with a 5-year horizon, I recommend a diversified portfolio: 60% stocks, 30% bonds, 10% alternatives.",
        ),
    ],
)
```
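Conversational metrics such as Role Adherence and Knowledge Retention judge the dialogue as a whole, so every turn's input and output must be visible to the judge model together. A minimal sketch of flattening turns into a single transcript — the library's internal prompt format is an implementation detail and may differ:

```python
def format_transcript(turns: list[tuple[str, str]], role: str) -> str:
    """Render (user_input, assistant_output) pairs as one judge-readable
    transcript. Illustrative only; the library's actual prompt
    construction may differ."""
    lines = [f"System role: {role}"]
    for i, (user, assistant) in enumerate(turns, start=1):
        lines.append(f"Turn {i} - User: {user}")
        lines.append(f"Turn {i} - Assistant: {assistant}")
    return "\n".join(lines)

transcript = format_transcript(
    [("I have $10,000 to invest.",
      "Let's discuss your risk tolerance and investment timeline.")],
    role="You are a financial advisor assistant.",
)
print(transcript)
```

This is why multi-turn metrics cost extra LLM calls: the judge must reason over the accumulated transcript, not a single input/output pair.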
## Choosing Metrics
The right combination of metrics depends on your agent's architecture:
- Tool-using agents (API calls, function calling) — focus on Tool Correctness and Tool Error Detection to ensure the agent picks the right tools and uses them correctly.
- Conversational assistants (customer support, advisors) — use Task Success, Role Adherence, and Knowledge Retention to verify the agent helps users effectively while staying in character.
- Complex agents with tools + conversation — combine all metrics for comprehensive evaluation.
| Scenario | Recommended Metrics |
|---|---|
| Tool-using agents | Tool Correctness + Tool Error Detection |
| Conversational assistants | Task Success + Role Adherence + Knowledge Retention |
| Multi-turn agents with tools | All agent metrics |
| Task completion evaluation | Task Success Rate |
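Whichever combination you choose, each metric compares its score against its threshold, and a result passes when the score meets or exceeds it. A sketch of that pass/fail rule using the default thresholds from the table at the top (the dictionary keys here are illustrative labels, not library identifiers):

```python
# Default thresholds from the Available Metrics table above.
DEFAULT_THRESHOLDS = {
    "tool_correctness": 0.5,
    "task_success_rate": 0.7,
    "role_adherence": 0.7,
    "knowledge_retention": 0.7,
    "tool_error_detection": 0.7,
}

def passing_metrics(scores: dict[str, float]) -> dict[str, bool]:
    """Mark each metric as pass/fail against its default threshold."""
    return {name: score >= DEFAULT_THRESHOLDS[name]
            for name, score in scores.items()}

print(passing_metrics({"tool_correctness": 0.5, "role_adherence": 0.65}))
# {'tool_correctness': True, 'role_adherence': False}
```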
## Quick Example
```python
import asyncio

from eval_lib import (
    evaluate,
    EvalTestCase,
    ToolCorrectnessMetric,
    TaskSuccessRateMetric,
)

test_case = EvalTestCase(
    input="What's the weather in Tokyo and convert 100 USD to JPY?",
    actual_output="Tokyo weather: 22°C, sunny. 100 USD = 15,000 JPY.",
    tools_called=["get_weather", "convert_currency"],
    expected_tools=["get_weather", "convert_currency"],
)

metrics = [
    ToolCorrectnessMetric(threshold=0.5),
    TaskSuccessRateMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```