Contextual Recall¶
The Contextual Recall metric evaluates what fraction of the claims in the expected output are supported by the retrieved context. It measures the completeness of retrieval — whether the retriever found all the information needed to produce a correct answer.
A low Contextual Recall score indicates that important information is missing from the retrieved context. This typically means the knowledge base is incomplete, the chunk size is too small (splitting relevant information across chunks), or the similarity threshold is too high (filtering out relevant documents). This metric requires an expected_output to compare against.
How It Works¶
- Claim Extraction — extracts individual claims from the expected output
- Support Check — determines if each claim is supported by the retrieval context
- Recall Calculation — supported_claims / total_claims
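The three steps above can be sketched as follows. In the real metric an LLM extracts the claims and judges support; here the claims are passed in directly and a naive substring check stands in for the LLM judgment, so the helper below is purely illustrative.

```python
from typing import List

def contextual_recall(expected_claims: List[str],
                      retrieval_context: List[str]) -> float:
    """Fraction of expected-output claims supported by the retrieved context."""
    if not expected_claims:
        return 0.0

    # Support Check: the actual metric asks an LLM whether each claim is
    # entailed by the context; a substring match is a toy stand-in.
    def is_supported(claim: str) -> bool:
        return any(claim.lower() in chunk.lower() for chunk in retrieval_context)

    # Recall Calculation: supported_claims / total_claims
    supported = sum(is_supported(c) for c in expected_claims)
    return supported / len(expected_claims)
```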
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| expected_output | Yes |
| retrieval_context | Yes |
Usage¶
```python
from eval_lib import ContextualRecallMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What are the benefits of exercise?",
    actual_output="Exercise improves cardiovascular health and reduces stress.",
    expected_output="Exercise improves cardiovascular health, strengthens muscles, reduces stress, and boosts mental health.",
    retrieval_context=[
        "Regular exercise strengthens the heart and improves cardiovascular health.",
        "Physical activity is known to reduce stress hormones like cortisol.",
        "Exercise releases endorphins, which help improve mood and mental well-being."
    ]
)

metric = ContextualRecallMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))
```
Interpretation¶
- 1.0 — all claims in the expected output are supported by context
- 0.75 — 3 out of 4 claims are supported
- 0.5 — only half the expected information was retrieved
- 0.0 — context doesn't support any expected claims
Cost¶
2 LLM API calls per evaluation.
When to Use¶
- Evaluating retrieval completeness
- Ensuring your retriever captures all relevant information
- Identifying missing context in your knowledge base