Answer Relevancy

The Answer Relevancy metric evaluates how well a chatbot's answer addresses the user's intent and question.

How It Works

```mermaid
graph TD
    A[Input Question] --> B[1. Infer User Intent]
    A --> C[Actual Output]
    C --> D[2. Extract Atomic Statements]
    D --> E[3. Generate Verdicts per Statement]
    B --> E
    E --> F[4. Aggregate Scores via Softmax]
    F --> G[Final Score 0.0-1.0]
```

1. Intent Inference — analyzes the user's question to understand their true intent
2. Statement Extraction — breaks the answer into atomic, verifiable statements
3. Verdict Generation — evaluates each statement against the inferred intent using a 5-level scale
4. Score Aggregation — combines verdicts using temperature-controlled softmax
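The four steps above can be sketched as a simple pipeline. The helper functions here are hypothetical stand-ins for the metric's internal LLM calls, not the library's real API; the aggregation shown is a plain mean rather than the actual softmax.

```python
from typing import List

def infer_intent(question: str) -> str:
    # Step 1: a real implementation would prompt the LLM here.
    return f"User wants to know: {question}"

def extract_statements(answer: str) -> List[str]:
    # Step 2: naive sentence split; the metric uses an LLM to get atomic statements.
    return [s.strip() for s in answer.split(".") if s.strip()]

def generate_verdicts(intent: str, statements: List[str]) -> List[str]:
    # Step 3: classify each statement on the 5-level scale via the LLM.
    return ["fully" for _ in statements]  # placeholder verdicts

def aggregate(verdicts: List[str]) -> float:
    # Step 4: map verdicts to weights and combine (simple mean shown here;
    # the real metric uses a temperature-controlled softmax).
    weights = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}
    scores = [weights[v] for v in verdicts]
    return sum(scores) / len(scores) if scores else 0.0

def answer_relevancy(question: str, answer: str) -> float:
    intent = infer_intent(question)
    statements = extract_statements(answer)
    verdicts = generate_verdicts(intent, statements)
    return aggregate(verdicts)
```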

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | 0.6 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness (0.1 = strict, 1.0 = lenient) |

Required Fields

| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | No |
| retrieval_context | No |

Usage

```python
from eval_lib import AnswerRelevancyMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="How do I reset my password?",
    actual_output="To reset your password, go to Settings > Security > Change Password. Enter your current password and then your new password twice. Click Save to confirm."
)

metric = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5
)

results = asyncio.run(evaluate([test_case], [metric]))
```

Scoring Details

The metric uses the standard verdict system:

| Verdict | Weight | Example |
|---|---|---|
| fully | 1.0 | Statement directly answers the question |
| mostly | 0.9 | Statement is highly relevant with minor gaps |
| partial | 0.7 | Statement is somewhat relevant |
| minor | 0.3 | Statement barely touches on the topic |
| none | 0.0 | Statement is completely off-topic |
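The verdict scale above translates directly into a lookup table. This is a small illustration of the mapping, not the library's internal data structure:

```python
# Verdict-to-weight mapping, taken from the scale above.
VERDICT_WEIGHTS = {
    "fully": 1.0,
    "mostly": 0.9,
    "partial": 0.7,
    "minor": 0.3,
    "none": 0.0,
}

def verdicts_to_weights(verdicts):
    """Map a list of verdict labels to their numeric weights."""
    return [VERDICT_WEIGHTS[v] for v in verdicts]

verdicts_to_weights(["fully", "partial", "none"])  # -> [1.0, 0.7, 0.0]
```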

Temperature Effect

  • t=0.1 (strict): A single off-topic statement can significantly lower the score
  • t=0.5 (balanced): Score reflects the arithmetic mean of verdict weights
  • t=1.0 (lenient): Score is forgiving if most statements are relevant
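The temperature behavior can be illustrated with a minimal sketch. This is one plausible softmax-style aggregation that reproduces the strict/lenient trend described above; it is not the library's exact formula (for example, it does not recover the arithmetic mean at exactly t=0.5):

```python
import math

def aggregate_softmax(weights, temperature=0.5):
    """Sketch of temperature-controlled aggregation over verdict weights.

    A softmax over the *negated* weights gives low-scoring (off-topic)
    statements more mass as temperature decreases, so a single bad
    statement drags the score down more at t=0.1 than at t=1.0.
    """
    if not weights:
        return 0.0
    exps = [math.exp(-w / temperature) for w in weights]
    total = sum(exps)
    return sum(w * e / total for w, e in zip(weights, exps))

verdict_weights = [1.0, 1.0, 0.9, 0.0]  # three relevant statements, one off-topic
strict = aggregate_softmax(verdict_weights, temperature=0.1)
lenient = aggregate_softmax(verdict_weights, temperature=1.0)
# strict < lenient: the off-topic statement hurts more at low temperature
```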

Cost

4 LLM API calls per evaluation:

  1. Intent inference
  2. Statement extraction
  3. Verdict generation
  4. Score explanation

Example Scenarios

High Score (0.95+)

```python
EvalTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris. It is located in northern France along the Seine River."
)
# All statements directly address the question
```

Low Score (< 0.5)

```python
EvalTestCase(
    input="What is the capital of France?",
    actual_output="France is a beautiful country with great food. The Eiffel Tower is a popular tourist attraction. French wine is world-renowned."
)
# Statements are about France but don't answer the question
```