Contextual Recall¶
The Contextual Recall metric evaluates what fraction of the claims in the expected output are supported by the retrieved context. It measures the completeness of retrieval — whether the retriever found all the information needed to produce a correct answer.
A low Contextual Recall score indicates that important information is missing from the retrieved context. This typically means the knowledge base is incomplete, the chunk size is too small (splitting relevant information across chunks), or the similarity threshold is too high (filtering out relevant documents). This metric requires an expected_output to compare against.
How It Works¶
- Claim Extraction — extracts individual claims from the expected output
- Support Check — determines if each claim is supported by the retrieval context
- Recall Calculation — supported_claims / total_claims
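The three steps above can be sketched as follows. In the real metric an LLM extracts the claims and judges support; here the claims are passed in directly and a naive substring check stands in for the LLM judgment, so the helper below is purely illustrative.

```python
from typing import List

def contextual_recall(expected_claims: List[str],
                      retrieval_context: List[str]) -> float:
    """Fraction of expected-output claims supported by the retrieved context."""
    if not expected_claims:
        return 0.0

    # Support Check: the actual metric asks an LLM whether each claim is
    # entailed by the context; a substring match is a toy stand-in.
    def is_supported(claim: str) -> bool:
        return any(claim.lower() in chunk.lower() for chunk in retrieval_context)

    # Recall Calculation: supported_claims / total_claims
    supported = sum(is_supported(c) for c in expected_claims)
    return supported / len(expected_claims)
```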
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| expected_output | Yes |
| retrieval_context | Yes |
Usage¶
```python
from eval_lib import ContextualRecallMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What are the benefits of exercise?",
    actual_output="Exercise improves cardiovascular health and reduces stress.",
    expected_output="Exercise improves cardiovascular health, strengthens muscles, reduces stress, and boosts mental health.",
    retrieval_context=[
        "Regular exercise strengthens the heart and improves cardiovascular health.",
        "Physical activity is known to reduce stress hormones like cortisol.",
        "Exercise releases endorphins, which help improve mood and mental well-being."
    ]
)

metric = ContextualRecallMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))
```
Interpretation¶
- 1.0 — all claims in the expected output are supported by context
- 0.75 — 3 out of 4 claims are supported
- 0.5 — only half the expected information was retrieved
- 0.0 — context doesn't support any expected claims
Cost¶
2 LLM API calls per evaluation.
When to Use¶
- Evaluating retrieval completeness
- Ensuring your retriever captures all relevant information
- Identifying missing context in your knowledge base