Answer Relevancy

The Answer Relevancy metric evaluates how well a chatbot's answer addresses the user's intent and question.

How It Works

```mermaid
graph TD
    A[Input Question] --> B[1. Infer User Intent]
    A --> C[Actual Output]
    C --> D[2. Extract Atomic Statements]
    D --> E[3. Generate Verdicts per Statement]
    B --> E
    E --> F[4. Aggregate Scores via Softmax]
    F --> G[Final Score 0.0-1.0]
```

1. Intent Inference — analyzes the user's question to understand their true intent
2. Statement Extraction — breaks the answer into atomic, verifiable statements
3. Verdict Generation — evaluates each statement against the inferred intent using a 5-level scale
4. Score Aggregation — combines verdicts using temperature-controlled softmax
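The four steps above can be sketched as a simple pipeline. The helper functions here are hypothetical stand-ins for the metric's internal LLM calls, not the library's real API; the aggregation shown is a plain mean rather than the actual softmax.

```python
from typing import List

def infer_intent(question: str) -> str:
    # Step 1: a real implementation would prompt the LLM here.
    return f"User wants to know: {question}"

def extract_statements(answer: str) -> List[str]:
    # Step 2: naive sentence split; the metric uses an LLM to get atomic statements.
    return [s.strip() for s in answer.split(".") if s.strip()]

def generate_verdicts(intent: str, statements: List[str]) -> List[str]:
    # Step 3: classify each statement on the 5-level scale via the LLM.
    return ["fully" for _ in statements]  # placeholder verdicts

def aggregate(verdicts: List[str]) -> float:
    # Step 4: map verdicts to weights and combine (simple mean shown here;
    # the real metric uses a temperature-controlled softmax).
    weights = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}
    scores = [weights[v] for v in verdicts]
    return sum(scores) / len(scores) if scores else 0.0

def answer_relevancy(question: str, answer: str) -> float:
    intent = infer_intent(question)
    statements = extract_statements(answer)
    verdicts = generate_verdicts(intent, statements)
    return aggregate(verdicts)
```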

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | 0.6 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness (0.1 = strict, 1.0 = lenient) |

Required Fields

| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | No |
| retrieval_context | No |

Usage

```python
from eval_lib import AnswerRelevancyMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="How do I reset my password?",
    actual_output="To reset your password, go to Settings > Security > Change Password. Enter your current password and then your new password twice. Click Save to confirm."
)

metric = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5
)

results = asyncio.run(evaluate([test_case], [metric]))
```

Scoring Details

The metric uses the standard verdict system:

| Verdict | Weight | Example |
|---|---|---|
| fully | 1.0 | Statement directly answers the question |
| mostly | 0.9 | Statement is highly relevant with minor gaps |
| partial | 0.7 | Statement is somewhat relevant |
| minor | 0.3 | Statement barely touches on the topic |
| none | 0.0 | Statement is completely off-topic |
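The verdict scale above translates directly into a lookup table. This is a small illustration of the mapping, not the library's internal data structure:

```python
# Verdict-to-weight mapping, taken from the scale above.
VERDICT_WEIGHTS = {
    "fully": 1.0,
    "mostly": 0.9,
    "partial": 0.7,
    "minor": 0.3,
    "none": 0.0,
}

def verdicts_to_weights(verdicts):
    """Map a list of verdict labels to their numeric weights."""
    return [VERDICT_WEIGHTS[v] for v in verdicts]

verdicts_to_weights(["fully", "partial", "none"])  # -> [1.0, 0.7, 0.0]
```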

Temperature Effect

  • t=0.1 (strict): A single off-topic statement can significantly lower the score
  • t=0.5 (balanced): Score reflects the arithmetic mean of verdict weights
  • t=1.0 (lenient): Score is forgiving if most statements are relevant
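The temperature behavior can be illustrated with a minimal sketch. This is one plausible softmax-style aggregation that reproduces the strict/lenient trend described above; it is not the library's exact formula (for example, it does not recover the arithmetic mean at exactly t=0.5):

```python
import math

def aggregate_softmax(weights, temperature=0.5):
    """Sketch of temperature-controlled aggregation over verdict weights.

    A softmax over the *negated* weights gives low-scoring (off-topic)
    statements more mass as temperature decreases, so a single bad
    statement drags the score down more at t=0.1 than at t=1.0.
    """
    if not weights:
        return 0.0
    exps = [math.exp(-w / temperature) for w in weights]
    total = sum(exps)
    return sum(w * e / total for w, e in zip(weights, exps))

verdict_weights = [1.0, 1.0, 0.9, 0.0]  # three relevant statements, one off-topic
strict = aggregate_softmax(verdict_weights, temperature=0.1)
lenient = aggregate_softmax(verdict_weights, temperature=1.0)
# strict < lenient: the off-topic statement hurts more at low temperature
```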

Cost

4 LLM API calls per evaluation:

  1. Intent inference
  2. Statement extraction
  3. Verdict generation
  4. Score explanation

Example Scenarios

High Score (0.95+)

```python
EvalTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris. It is located in northern France along the Seine River."
)
# All statements directly address the question
```

Low Score (< 0.5)

```python
EvalTestCase(
    input="What is the capital of France?",
    actual_output="France is a beautiful country with great food. The Eiffel Tower is a popular tourist attraction. French wine is world-renowned."
)
# Statements are about France but don't answer the question
```