Contextual Recall

The Contextual Recall metric evaluates how many claims in the reference answer are supported by the retrieved context. It measures the completeness of retrieval — whether the retriever found all the information needed to produce a correct answer.

A low Contextual Recall score indicates that important information is missing from the retrieved context. This typically means the knowledge base is incomplete, the chunk size is too small (splitting relevant information across chunks), or the similarity threshold is too high (filtering out relevant documents). This metric requires an expected_output to compare against.

How It Works

  1. Claim Extraction — extracts individual claims from the expected output
  2. Support Check — determines if each claim is supported by the retrieval context
  3. Recall Calculation — computes supported_claims / total_claims
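The final step above is a simple ratio. A minimal sketch of the scoring arithmetic, assuming the claim extraction and support checks have already been performed by the LLM judge (the function below is illustrative, not the library's internal API):

```python
def contextual_recall(verdicts: list[bool]) -> float:
    """verdicts[i] is True if claim i from the expected output
    is supported by the retrieval context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# 4 claims extracted from the expected output, 3 supported by the context
score = contextual_recall([True, True, True, False])
print(score)  # 0.75
```

With the default threshold of 0.7, this example would pass (0.75 ≥ 0.7).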

Parameters

Parameter   Type    Default    Description
model       str     required   LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient)
threshold   float   0.7        Minimum score to pass

Required Fields

Field Required
input Yes
expected_output Yes
retrieval_context Yes

Usage

from eval_lib import ContextualRecallMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What are the benefits of exercise?",
    actual_output="Exercise improves cardiovascular health and reduces stress.",
    expected_output="Exercise improves cardiovascular health, strengthens muscles, reduces stress, and boosts mental health.",
    retrieval_context=[
        "Regular exercise strengthens the heart and improves cardiovascular health.",
        "Physical activity is known to reduce stress hormones like cortisol.",
        "Exercise releases endorphins, which help improve mood and mental well-being."
    ]
)

metric = ContextualRecallMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))

Interpretation

  • 1.0 — all claims in the expected output are supported by context
  • 0.75 — 3 out of 4 claims are supported
  • 0.5 — only half the expected information was retrieved
  • 0.0 — context doesn't support any expected claims
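Applying this scale to the Usage example above: the expected output contains four claims, and the retrieved context supports three of them ("strengthens muscles" is not covered by any chunk). The claim-by-claim breakdown below is an illustration of how the judge's verdicts would combine, not the library's internal representation:

```python
# Per-claim support verdicts for the expected output in the Usage example
claims = {
    "improves cardiovascular health": True,   # covered by the first chunk
    "strengthens muscles": False,             # missing from the context
    "reduces stress": True,                   # covered by the second chunk
    "boosts mental health": True,             # covered by the third chunk
}

score = sum(claims.values()) / len(claims)
print(score)          # 0.75
print(score >= 0.7)   # True — passes the default threshold
```

A score of 0.75 here points directly at the gap: the knowledge base has no document about muscle strengthening.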

Cost

2 LLM API calls per evaluation.

When to Use

  • Evaluating retrieval completeness
  • Ensuring your retriever captures all relevant information
  • Identifying missing context in your knowledge base