Contextual Precision¶
The Contextual Precision metric measures the precision of retrieved context chunks — whether relevant chunks are ranked higher than irrelevant ones.
In RAG systems, the order of retrieved documents matters. Most LLMs pay more attention to context that appears earlier in the prompt. If irrelevant chunks are ranked above relevant ones, the generator may focus on the wrong information and produce a lower-quality answer. This metric uses a Precision@K formula from information retrieval to measure ranking quality.
How It Works¶
Uses a Precision@K formula inspired by information retrieval:

\[
\text{Contextual Precision} = \frac{1}{R} \sum_{k=1}^{n} \left( \frac{\text{relevant chunks in top } k}{k} \times v_k \right)
\]

Where \(v_k = 1\) if chunk \(k\) is relevant to the reference answer, \(0\) otherwise; \(n\) is the number of retrieved chunks; and \(R\) is the total number of relevant chunks.
- Relevance Check — for each context chunk, determines if it's relevant to the expected output
- Ranking Evaluation — measures whether relevant chunks appear before irrelevant ones
- Precision Calculation — computes weighted precision based on chunk positions
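The three steps above reduce to a short computation once each chunk has a relevance verdict. The sketch below is a standalone illustration of the weighted Precision@K formula, not the library's internal implementation:

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Weighted Precision@K over the positions of relevant chunks.

    `relevance[k]` is the LLM's verdict for the chunk at rank k+1.
    Each relevant chunk contributes the precision at its own rank;
    the result is the mean of those contributions.
    """
    relevant_seen = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            total += relevant_seen / k  # precision at rank k
    num_relevant = sum(relevance)
    return total / num_relevant if num_relevant else 0.0

# Relevant chunks ranked first: perfect score
print(contextual_precision([True, True, False]))   # 1.0
# An irrelevant chunk ranked above a relevant one lowers the score
print(contextual_precision([True, False, True]))   # (1/1 + 2/3) / 2 ≈ 0.83
```

Note that only the *positions* of relevant chunks matter: trailing irrelevant chunks never reduce the score, while a single irrelevant chunk ranked first penalizes every relevant chunk after it.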
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
| top_k | int | None | Limit evaluation to top K chunks |
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | Yes |
| retrieval_context | Yes |
Usage¶
```python
from eval_lib import ContextualPrecisionMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What is transfer learning?",
    actual_output="Transfer learning reuses a pre-trained model on a new task.",
    expected_output="Transfer learning is a technique where a model trained on one task is reused as the starting point for a model on a second task.",
    retrieval_context=[
        "Transfer learning is an ML technique where a model developed for one task is reused for a different task.",  # Relevant
        "The weather in Tokyo is sunny with temperatures around 25°C.",  # Irrelevant
        "Fine-tuning is a common approach in transfer learning where pre-trained weights are adjusted.",  # Relevant
    ],
)

metric = ContextualPrecisionMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))
```
Cost¶
N LLM API calls, where N = number of context chunks. Each chunk is evaluated independently.
Interpretation¶
- 1.0 — all relevant chunks are ranked before irrelevant ones
- 0.5-0.8 — relevant chunks are mixed with irrelevant ones
- < 0.5 — irrelevant chunks are ranked higher than relevant ones (poor retrieval)
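As a hand computation of the middle band: suppose the three chunks from the Usage example come back in the order relevant, irrelevant, relevant. With two relevant chunks total, only the relevant positions contribute precision-at-rank terms:

```python
# Ordering: [relevant, irrelevant, relevant], 2 relevant chunks total.
# Rank 1 contributes 1/1 (1 relevant seen out of 1);
# rank 2 is skipped (v_2 = 0);
# rank 3 contributes 2/3 (2 relevant seen out of 3).
score = (1/1 + 2/3) / 2
print(round(score, 2))  # 0.83
```

Swapping the irrelevant chunk to the front would drop the score to \((1/2 + 2/3)/2 \approx 0.58\), which is why retrieval rerankers directly improve this metric.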