Task Success Rate

The Task Success Rate metric evaluates whether the AI assistant successfully helped the user achieve their goal. It works with both single-turn and multi-turn conversations.

How It Works

  1. Goal Inference — analyzes the dialogue to understand the user's goal
  2. Criteria Generation — generates specific success criteria (a minimum of 1–2)
  3. Verdict Generation — evaluates each criterion using the 5-level scale
  4. Score Aggregation — combines verdicts using temperature-controlled softmax
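
The aggregation step can be sketched as follows. This is an illustrative reconstruction, not eval_lib's actual implementation: the verdict labels, their numeric mapping, and the exact weighting scheme are assumptions.

```python
import math

# Hypothetical mapping from the 5-level verdict scale to numeric scores.
# The label names here are illustrative; eval_lib may use different ones.
VERDICT_SCORES = {
    "fully_met": 1.0,
    "mostly_met": 0.75,
    "partially_met": 0.5,
    "barely_met": 0.25,
    "not_met": 0.0,
}

def aggregate(verdicts, temperature=0.5):
    """Temperature-controlled softmax aggregation (sketch).

    Each criterion's score is weighted by softmax(-score / temperature),
    so failed criteria pull the aggregate down harder at low temperatures,
    while high temperatures approach a plain average.
    """
    scores = [VERDICT_SCORES[v] for v in verdicts]
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

In this sketch, a temperature near zero makes the aggregate approach the worst verdict's score, so a single failed criterion can fail the whole task; larger values move the result toward an unweighted mean.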

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | required | LLM model (`"gpt-4o"`, `"anthropic:claude-3-5-sonnet-latest"`, `"google:gemini-2.0-flash"`, `"ollama:llama3"`, or a `CustomLLMClient`) |
| `threshold` | `float` | `0.7` | Minimum score to pass |
| `temperature` | `float` | `0.5` | Aggregation strictness |

Required Fields

| Field | Required |
| --- | --- |
| `input` | Yes |
| `actual_output` | Yes |

Usage

Single-Turn

from eval_lib import TaskSuccessRateMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="Help me write a regex to match email addresses.",
    actual_output="Here's a regex for email matching: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}`. This handles most standard email formats."
)

metric = TaskSuccessRateMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))

Multi-Turn (Conversational)

from eval_lib import (
    ConversationalEvalTestCase,
    EvalTestCase,
    evaluate_conversations,
    TaskSuccessRateMetric,
)
import asyncio

conversation = ConversationalEvalTestCase(
    turns=[
        EvalTestCase(
            input="I need to set up a CI/CD pipeline for my Node.js project.",
            actual_output="I can help with that! Are you using GitHub Actions, GitLab CI, or another platform?"
        ),
        EvalTestCase(
            input="GitHub Actions.",
            actual_output="Here's a workflow file for your Node.js project: [provides complete .github/workflows/ci.yml with test, lint, and deploy steps]"
        ),
        EvalTestCase(
            input="Can you add caching for node_modules?",
            actual_output="Sure! Add this caching step: [provides cache action configuration with hash-based key]"
        ),
    ]
)

metric = TaskSuccessRateMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate_conversations([conversation], [metric]))

Cost

4 LLM API calls per evaluation.

Tips

  • Pair with Tool Correctness for agents that use tools to complete tasks
  • For multi-turn conversations, the metric considers the entire dialogue trajectory
  • Set a lower threshold (e.g., 0.5) for complex, open-ended tasks