Task Success Rate¶
The Task Success Rate metric evaluates whether the AI assistant successfully helped the user achieve their goal. It works with both single-turn and multi-turn conversations.
How It Works¶
- Goal Inference — analyzes the dialogue to infer the user's goal
- Criteria Generation — generates specific success criteria for that goal (minimum 1-2)
- Verdict Generation — evaluates each criterion on the 5-level verdict scale
- Score Aggregation — combines the per-criterion verdicts using a temperature-controlled softmax
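The aggregation step can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes the 5-level verdicts are mapped to numeric scores in [0, 1] and that the softmax weighting favors lower verdicts, so one failing criterion drags the aggregate down and a lower temperature makes that effect harsher.

```python
import math

def aggregate(verdict_scores, temperature=0.5):
    """Softmax-weighted mean of per-criterion verdict scores.

    Weights grow as scores shrink (note the negative sign), so low
    verdicts dominate the aggregate; a higher temperature smooths the
    weighting toward a plain average.
    """
    weights = [math.exp(-s / temperature) for s in verdict_scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, verdict_scores)) / total

# Example: three criteria on a 5-level scale mapped to [0, 0.25, 0.5, 0.75, 1.0]
scores = [1.0, 0.75, 0.25]
strict = aggregate(scores, temperature=0.2)   # pulled toward the 0.25 verdict
lenient = aggregate(scores, temperature=2.0)  # close to the plain mean
```

With the sample scores above, the strict aggregate lands well below the lenient one, even though the plain mean of the three verdicts is about 0.67.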
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness |
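To make the `threshold` parameter concrete: the evaluation passes when the final aggregated score meets or exceeds it. A minimal sketch of that gate (the function name here is illustrative, not part of the library's API):

```python
def passes(score: float, threshold: float = 0.7) -> bool:
    """A test case passes when its aggregated score meets the threshold."""
    return score >= threshold

print(passes(0.82, threshold=0.7))  # True: score clears the threshold
print(passes(0.64, threshold=0.7))  # False: score falls short
```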
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
Usage¶
Single-Turn¶
```python
import asyncio

from eval_lib import TaskSuccessRateMetric, EvalTestCase, evaluate

test_case = EvalTestCase(
    input="Help me write a regex to match email addresses.",
    actual_output="Here's a regex for email matching: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}`. This handles most standard email formats.",
)

metric = TaskSuccessRateMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))
```
Multi-Turn (Conversational)¶
```python
import asyncio

from eval_lib import (
    ConversationalEvalTestCase,
    EvalTestCase,
    TaskSuccessRateMetric,
    evaluate_conversations,
)

conversation = ConversationalEvalTestCase(
    turns=[
        EvalTestCase(
            input="I need to set up a CI/CD pipeline for my Node.js project.",
            actual_output="I can help with that! Are you using GitHub Actions, GitLab CI, or another platform?",
        ),
        EvalTestCase(
            input="GitHub Actions.",
            actual_output="Here's a workflow file for your Node.js project: [provides complete .github/workflows/ci.yml with test, lint, and deploy steps]",
        ),
        EvalTestCase(
            input="Can you add caching for node_modules?",
            actual_output="Sure! Add this caching step: [provides cache action configuration with hash-based key]",
        ),
    ]
)

metric = TaskSuccessRateMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate_conversations([conversation], [metric]))
```
Cost¶
4 LLM API calls per evaluation.
Tips¶
- Use with Tool Correctness for agents that use tools to complete tasks
- For multi-turn conversations, the metric considers the entire dialogue trajectory
- Set a lower threshold (e.g., 0.5) for complex, open-ended tasks