Task Success Rate¶
The Task Success Rate metric evaluates whether the AI assistant successfully helped the user achieve their goal. It works with both single-turn and multi-turn conversations.
How It Works¶
- Goal Inference — analyzes the dialogue to infer the user's goal
- Criteria Generation — generates specific success criteria for that goal (minimum 1-2)
- Verdict Generation — evaluates each criterion on the 5-level verdict scale
- Score Aggregation — combines the per-criterion verdicts using a temperature-controlled softmax
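The aggregation step can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes the 5-level verdicts are mapped to numeric scores in [0, 1] and that the softmax weighting favors lower verdicts, so one failing criterion drags the aggregate down and a lower temperature makes that effect harsher.

```python
import math

def aggregate(verdict_scores, temperature=0.5):
    """Softmax-weighted mean of per-criterion verdict scores.

    Weights grow as scores shrink (note the negative sign), so low
    verdicts dominate the aggregate; a higher temperature smooths the
    weighting toward a plain average.
    """
    weights = [math.exp(-s / temperature) for s in verdict_scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, verdict_scores)) / total

# Example: three criteria on a 5-level scale mapped to [0, 0.25, 0.5, 0.75, 1.0]
scores = [1.0, 0.75, 0.25]
strict = aggregate(scores, temperature=0.2)   # pulled toward the 0.25 verdict
lenient = aggregate(scores, temperature=2.0)  # close to the plain mean
```

With the sample scores above, the strict aggregate lands well below the lenient one, even though the plain mean of the three verdicts is about 0.67.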
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness |
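To make the `threshold` parameter concrete: the evaluation passes when the final aggregated score meets or exceeds it. A minimal sketch of that gate (the function name here is illustrative, not part of the library's API):

```python
def passes(score: float, threshold: float = 0.7) -> bool:
    """A test case passes when its aggregated score meets the threshold."""
    return score >= threshold

print(passes(0.82, threshold=0.7))  # True: score clears the threshold
print(passes(0.64, threshold=0.7))  # False: score falls short
```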
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
Usage¶
Single-Turn¶
```python
import asyncio

from eval_lib import TaskSuccessRateMetric, EvalTestCase, evaluate

test_case = EvalTestCase(
    input="Help me write a regex to match email addresses.",
    actual_output="Here's a regex for email matching: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}`. This handles most standard email formats.",
)

metric = TaskSuccessRateMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))
```
Multi-Turn (Conversational)¶
```python
import asyncio

from eval_lib import (
    ConversationalEvalTestCase,
    EvalTestCase,
    TaskSuccessRateMetric,
    evaluate_conversations,
)

conversation = ConversationalEvalTestCase(
    turns=[
        EvalTestCase(
            input="I need to set up a CI/CD pipeline for my Node.js project.",
            actual_output="I can help with that! Are you using GitHub Actions, GitLab CI, or another platform?",
        ),
        EvalTestCase(
            input="GitHub Actions.",
            actual_output="Here's a workflow file for your Node.js project: [provides complete .github/workflows/ci.yml with test, lint, and deploy steps]",
        ),
        EvalTestCase(
            input="Can you add caching for node_modules?",
            actual_output="Sure! Add this caching step: [provides cache action configuration with hash-based key]",
        ),
    ]
)

metric = TaskSuccessRateMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate_conversations([conversation], [metric]))
```
Cost¶
4 LLM API calls per evaluation.
Tips¶
- Use with Tool Correctness for agents that use tools to complete tasks
- For multi-turn conversations, the metric considers the entire dialogue trajectory
- Set a lower threshold (e.g., 0.5) for complex, open-ended tasks