Contextual Precision¶
The Contextual Precision metric measures the precision of retrieved context chunks — whether relevant chunks are ranked higher than irrelevant ones.
In RAG systems, the order of retrieved documents matters. Most LLMs pay more attention to context that appears earlier in the prompt. If irrelevant chunks are ranked above relevant ones, the generator may focus on the wrong information and produce a lower-quality answer. This metric uses a Precision@K formula from information retrieval to measure ranking quality.
How It Works¶
Uses a Precision@K formula inspired by information retrieval:

\[
\text{Contextual Precision} = \frac{1}{R} \sum_{k=1}^{n} \left( \frac{\text{relevant chunks in top } k}{k} \times v_k \right)
\]

Where \(v_k = 1\) if chunk \(k\) is relevant to the reference answer, \(0\) otherwise; \(n\) is the number of retrieved chunks; and \(R\) is the total number of relevant chunks.
- Relevance Check — for each context chunk, determines if it's relevant to the expected output
- Ranking Evaluation — measures whether relevant chunks appear before irrelevant ones
- Precision Calculation — computes weighted precision based on chunk positions
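The three steps above reduce to a short computation once each chunk has a relevance verdict. The sketch below is a standalone illustration of the weighted Precision@K formula, not the library's internal implementation:

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Weighted Precision@K over the positions of relevant chunks.

    `relevance[k]` is the LLM's verdict for the chunk at rank k+1.
    Each relevant chunk contributes the precision at its own rank;
    the result is the mean of those contributions.
    """
    relevant_seen = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            total += relevant_seen / k  # precision at rank k
    num_relevant = sum(relevance)
    return total / num_relevant if num_relevant else 0.0

# Relevant chunks ranked first: perfect score
print(contextual_precision([True, True, False]))   # 1.0
# An irrelevant chunk ranked above a relevant one lowers the score
print(contextual_precision([True, False, True]))   # (1/1 + 2/3) / 2 ≈ 0.83
```

Note that only the *positions* of relevant chunks matter: trailing irrelevant chunks never reduce the score, while a single irrelevant chunk ranked first penalizes every relevant chunk after it.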
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
| top_k | int | None | Limit evaluation to top K chunks |
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | Yes |
| retrieval_context | Yes |
Usage¶
```python
from eval_lib import ContextualPrecisionMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What is transfer learning?",
    actual_output="Transfer learning reuses a pre-trained model on a new task.",
    expected_output="Transfer learning is a technique where a model trained on one task is reused as the starting point for a model on a second task.",
    retrieval_context=[
        "Transfer learning is an ML technique where a model developed for one task is reused for a different task.",  # Relevant
        "The weather in Tokyo is sunny with temperatures around 25°C.",  # Irrelevant
        "Fine-tuning is a common approach in transfer learning where pre-trained weights are adjusted.",  # Relevant
    ],
)

metric = ContextualPrecisionMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))
```
Cost¶
N LLM API calls, where N = number of context chunks. Each chunk is evaluated independently.
Interpretation¶
- 1.0 — all relevant chunks are ranked before irrelevant ones
- 0.5-0.8 — relevant chunks are mixed with irrelevant ones
- < 0.5 — irrelevant chunks are ranked higher than relevant ones (poor retrieval)
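As a hand computation of the middle band: suppose the three chunks from the Usage example come back in the order relevant, irrelevant, relevant. With two relevant chunks total, only the relevant positions contribute precision-at-rank terms:

```python
# Ordering: [relevant, irrelevant, relevant], 2 relevant chunks total.
# Rank 1 contributes 1/1 (1 relevant seen out of 1);
# rank 2 is skipped (v_2 = 0);
# rank 3 contributes 2/3 (2 relevant seen out of 3).
score = (1/1 + 2/3) / 2
print(round(score, 2))  # 0.83
```

Swapping the irrelevant chunk to the front would drop the score to \((1/2 + 2/3)/2 \approx 0.58\), which is why retrieval rerankers directly improve this metric.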