Contextual Relevancy¶
The Contextual Relevancy metric evaluates how well the retrieved context supports the user's question and intent. Unlike Contextual Precision and Recall, which compare the retrieved context against the reference answer, this metric measures the alignment between the question itself and the retrieved context.
This is one of the most important retriever metrics because it directly measures whether your search pipeline returns useful information. A low score means the retriever is returning noise — documents that may be topically related but don't actually help answer the question. This wastes context window tokens and can confuse the generator LLM, leading to lower-quality answers.
How It Works¶
- Intent Inference — analyzes the user's question to understand their true intent
- Context Evaluation — generates verdicts for each context segment's relevance to the intent
- Score Aggregation — combines verdicts using temperature-controlled softmax
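The docs don't spell out the aggregation formula, so here is a minimal sketch of one plausible temperature-controlled softmax: per-segment relevance verdicts in [0, 1] are weighted by a softmax over their negated values, so irrelevant segments receive larger weights (and drag the score down harder) as the temperature decreases. The weighting scheme and the `aggregate_verdicts` helper are illustrative assumptions, not the library's actual implementation.

```python
import math

def aggregate_verdicts(verdicts, temperature=0.5):
    """Combine per-segment relevance verdicts (each in [0, 1]) into one score.

    Hypothetical scheme: softmax weights are computed over the negated
    verdicts scaled by temperature, so low (irrelevant) verdicts get
    larger weights as temperature drops -- i.e. aggregation gets stricter.
    """
    if not verdicts:
        return 0.0
    logits = [-v / temperature for v in verdicts]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * v for w, v in zip(weights, verdicts))

# All segments fully relevant -> score is 1.0 at any temperature
print(aggregate_verdicts([1.0, 1.0, 1.0]))
# One irrelevant segment pulls the score well below the plain mean (2/3)
print(aggregate_verdicts([1.0, 1.0, 0.0]))
```

With this weighting, a lower `temperature` punishes noisy retrievals more aggressively, which matches the "aggregation strictness" description of the parameter.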
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | 0.6 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness |
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| retrieval_context | Yes |
Usage¶
```python
import asyncio

from eval_lib import ContextualRelevancyMetric, EvalTestCase, evaluate

test_case = EvalTestCase(
    input="How do I configure CORS in FastAPI?",
    actual_output="Use CORSMiddleware from fastapi.middleware.cors.",
    retrieval_context=[
        "FastAPI provides CORSMiddleware for handling Cross-Origin Resource Sharing. Import it from fastapi.middleware.cors and add it to your app.",
        "CORS allows web applications running at one origin to access resources from a different origin.",
        "Django REST Framework uses django-cors-headers package for CORS configuration.",
    ],
)

metric = ContextualRelevancyMetric(model="gpt-4o", threshold=0.6)
results = asyncio.run(evaluate([test_case], [metric]))
```
Cost¶
3 LLM API calls per evaluation.
When to Use¶
- Evaluating retriever quality independently from generation
- Identifying noisy or irrelevant chunks in retrieval results
- Tuning retrieval parameters (chunk size, top-k, similarity threshold)
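For the tuning use case, a common pattern is to sweep top-k and watch how the average relevancy of the retrieved window changes. The sketch below uses a naive keyword-overlap scorer as a cheap stand-in for the LLM-judged metric (both `keyword_overlap` and `score_top_k` are hypothetical helpers, not part of eval_lib); in practice you would run `ContextualRelevancyMetric` at each k instead.

```python
def keyword_overlap(question: str, chunk: str) -> float:
    """Stand-in relevance scorer: fraction of question words found in the chunk."""
    q = {w.lower().strip("?,.") for w in question.split()}
    c = {w.lower().strip("?,.") for w in chunk.split()}
    return len(q & c) / len(q)

def score_top_k(question: str, ranked_chunks: list[str], k: int) -> float:
    """Average stand-in relevance over the top-k retrieved chunks."""
    top = ranked_chunks[:k]
    return sum(keyword_overlap(question, ch) for ch in top) / len(top)

question = "How do I configure CORS in FastAPI?"
ranked = [
    "FastAPI provides CORSMiddleware to configure CORS for your app.",
    "CORS allows cross-origin requests between web origins.",
    "Django uses django-cors-headers for CORS configuration.",
]

# If the average relevance drops sharply as k grows, the extra chunks are
# mostly noise and a smaller top-k (or a higher similarity threshold) may help.
for k in (1, 2, 3):
    print(k, round(score_top_k(question, ranked, k), 2))
```

The same loop structure applies when the scorer is the real metric: evaluate one test case per k value and pick the k where the contextual relevancy score plateaus.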