
Contextual Precision

The Contextual Precision metric measures the precision of retrieved context chunks — whether relevant chunks are ranked higher than irrelevant ones.

In RAG systems, the order of retrieved documents matters. Most LLMs pay more attention to context that appears earlier in the prompt. If irrelevant chunks are ranked above relevant ones, the generator may focus on the wrong information and produce a lower-quality answer. This metric uses a Precision@K formula from information retrieval to measure ranking quality.

How It Works

Uses a Precision@K formula inspired by information retrieval:

\[ \text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant chunks in the top } K} \]

Where \(v_k = 1\) if chunk \(k\) is relevant to the expected output and \(0\) otherwise, and \(\text{Precision@k}\) is the fraction of relevant chunks among the first \(k\).

  1. Relevance Check — for each context chunk, determines if it's relevant to the expected output
  2. Ranking Evaluation — measures whether relevant chunks appear before irrelevant ones
  3. Precision Calculation — computes weighted precision based on chunk positions
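The three steps above can be sketched as a small standalone function. This is an illustration of the Precision@K arithmetic, not eval_lib's internal implementation; the binary verdicts stand in for the per-chunk LLM relevance judgments from step 1:

```python
def context_precision_at_k(verdicts: list[int]) -> float:
    """Compute Context Precision@K from binary relevance verdicts.

    verdicts[k-1] is 1 if chunk k is relevant to the expected output,
    0 otherwise. In the real metric these verdicts come from an LLM
    relevance check on each chunk.
    """
    relevant = sum(verdicts)
    if relevant == 0:
        return 0.0
    score = 0.0
    seen_relevant = 0
    for k, v in enumerate(verdicts, start=1):
        if v:
            seen_relevant += 1
            # Precision@k = relevant chunks within the top k, divided by k
            score += seen_relevant / k
    return score / relevant

# Chunks ordered [relevant, irrelevant, relevant]:
# (1/1 + 2/3) / 2 = 0.8333...
print(round(context_precision_at_k([1, 0, 1]), 4))  # 0.8333
```

Note that only the positions of relevant chunks contribute to the sum, so pushing an irrelevant chunk above a relevant one always lowers the score.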

Parameters

Parameter   Type    Default    Description
---------   -----   --------   -----------
model       str     required   LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient)
threshold   float   0.7        Minimum score to pass
top_k       int     None       Limit evaluation to the top K chunks

Required Fields

Field               Required
-----               --------
input               Yes
actual_output       Yes
expected_output     Yes
retrieval_context   Yes

Usage

from eval_lib import ContextualPrecisionMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What is transfer learning?",
    actual_output="Transfer learning reuses a pre-trained model on a new task.",
    expected_output="Transfer learning is a technique where a model trained on one task is reused as the starting point for a model on a second task.",
    retrieval_context=[
        "Transfer learning is an ML technique where a model developed for one task is reused for a different task.",  # Relevant
        "The weather in Tokyo is sunny with temperatures around 25°C.",  # Irrelevant
        "Fine-tuning is a common approach in transfer learning where pre-trained weights are adjusted.",  # Relevant
    ]
)

metric = ContextualPrecisionMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate([test_case], [metric]))

Cost

N LLM API calls, where N = number of context chunks. Each chunk is evaluated independently.

Interpretation

  • 1.0 — all relevant chunks are ranked before irrelevant ones
  • 0.5-0.8 — relevant chunks are mixed with irrelevant ones
  • < 0.5 — irrelevant chunks are ranked higher than relevant ones (poor retrieval)
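As a sanity check on these bands, the same Precision@K arithmetic applied to three orderings of two relevant chunks (R) and one irrelevant chunk (I) — a self-contained sketch, not part of eval_lib:

```python
def score(verdicts: list[int]) -> float:
    # Context Precision@K from binary relevance verdicts (1 = relevant).
    rel = sum(verdicts)
    total = 0.0
    seen = 0
    for k, v in enumerate(verdicts, start=1):
        if v:
            seen += 1
            total += seen / k  # Precision@k at each relevant position
    return total / rel if rel else 0.0

# Two relevant chunks and one irrelevant, in three orders:
print(score([1, 1, 0]))  # R, R, I -> 1.0     (all relevant ranked first)
print(score([1, 0, 1]))  # R, I, R -> ~0.833  (mixed)
print(score([0, 1, 1]))  # I, R, R -> ~0.583  (irrelevant ranked first)
```

Swapping a single relevant/irrelevant pair moves the score by a sizable margin, which is why the metric is sensitive to ranking and not just to how many relevant chunks were retrieved.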