# Answer Relevancy
The Answer Relevancy metric evaluates how well a chatbot's answer addresses the user's intent and question.
## How It Works

```mermaid
graph TD
    A[Input Question] --> B[1. Infer User Intent]
    A --> C[Actual Output]
    C --> D[2. Extract Atomic Statements]
    D --> E[3. Generate Verdicts per Statement]
    B --> E
    E --> F[4. Aggregate Scores via Softmax]
    F --> G[Final Score 0.0-1.0]
```

- **Intent Inference** — analyzes the user's question to understand their true intent
- **Statement Extraction** — breaks the answer into atomic, verifiable statements
- **Verdict Generation** — evaluates each statement against the inferred intent using a five-level scale
- **Score Aggregation** — combines verdicts using temperature-controlled softmax
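The four steps above can be sketched end to end. Everything in this sketch is a stand-in: intent inference and verdict generation are stubbed with simple keyword overlap, and aggregation is simplified to a plain mean of verdict weights. The real metric performs steps 1–3 with LLM calls and step 4 with temperature-controlled softmax.

```python
import re

# Verdict weights from the scoring table below (subset, for illustration).
WEIGHTS = {"fully": 1.0, "partial": 0.7, "none": 0.0}

def infer_intent(question):
    # 1. Intent inference, stubbed as a bag of keywords.
    return set(re.findall(r"\w+", question.lower()))

def extract_statements(answer):
    # 2. Split the answer into rough "atomic" statements at sentence breaks.
    return [s.strip() for s in answer.split(".") if s.strip()]

def judge(statement, intent):
    # 3. Verdict per statement, stubbed by word overlap with the intent.
    overlap = len(intent & set(re.findall(r"\w+", statement.lower())))
    return "fully" if overlap >= 2 else "partial" if overlap == 1 else "none"

def answer_relevancy(question, answer):
    intent = infer_intent(question)
    verdicts = [judge(s, intent) for s in extract_statements(answer)]
    # 4. Aggregation simplified here to a plain mean of verdict weights.
    return sum(WEIGHTS[v] for v in verdicts) / len(verdicts)

score = answer_relevancy(
    "How do I reset my password?",
    "To reset your password, go to Settings. Click Save to confirm.",
)
# One statement is fully relevant, one is off-topic, so the score lands mid-range.
```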
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | 0.6 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness (0.1 = strict, 1.0 = lenient) |
## Required Fields

| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | No |
| retrieval_context | No |
## Usage

```python
import asyncio

from eval_lib import AnswerRelevancyMetric, EvalTestCase, evaluate

test_case = EvalTestCase(
    input="How do I reset my password?",
    actual_output=(
        "To reset your password, go to Settings > Security > Change Password. "
        "Enter your current password and then your new password twice. "
        "Click Save to confirm."
    ),
)

metric = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5,
)

results = asyncio.run(evaluate([test_case], [metric]))
```
## Scoring Details

The metric uses the standard verdict system:

| Verdict | Weight | Example |
|---|---|---|
| fully | 1.0 | Statement directly answers the question |
| mostly | 0.9 | Statement is highly relevant with minor gaps |
| partial | 0.7 | Statement is somewhat relevant |
| minor | 0.3 | Statement barely touches on the topic |
| none | 0.0 | Statement is completely off-topic |
## Temperature Effect
- t=0.1 (strict): A single off-topic statement can significantly lower the score
- t=0.5 (balanced): Score reflects the arithmetic mean of verdict weights
- t=1.0 (lenient): Score is forgiving if most statements are relevant
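The temperature behavior can be seen with a small numeric sketch. The library's exact aggregation formula is not documented here; the version below (a softmax over negated verdict weights) is one plausible implementation that reproduces the strict-to-lenient behavior described above.

```python
import math

# Verdict weights from the scoring table.
WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def aggregate(verdicts, temperature):
    # Assumed formula: softmax over *negated* weights, so lower temperature
    # concentrates mass on the worst verdicts and pulls the score down.
    w = [WEIGHTS[v] for v in verdicts]
    exps = [math.exp(-x / temperature) for x in w]
    z = sum(exps)
    return sum(e / z * x for e, x in zip(exps, w))

verdicts = ["fully", "fully", "none"]  # two on-topic statements, one off-topic
for t in (0.1, 0.5, 1.0):
    print(f"t={t}: {aggregate(verdicts, t):.3f}")
```

Under this sketch, at t=0.1 the single off-topic statement dominates and the score collapses toward 0.0, while larger temperatures move it back toward the plain mean of the verdict weights.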
## Cost
4 LLM API calls per evaluation:
- Intent inference
- Statement extraction
- Verdict generation
- Score explanation
## Example Scenarios

### High Score (0.95+)
```python
EvalTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris. It is located in northern France along the Seine River.",
)
# All statements directly address the question
```