RAG Metrics¶
Eval AI Library provides a comprehensive suite of metrics for evaluating Retrieval-Augmented Generation (RAG) systems. These metrics assess different aspects of RAG quality — from answer relevancy to factual faithfulness and retrieval quality.
Available Metrics¶
| Metric | What It Measures | LLM Calls | Default Threshold |
|---|---|---|---|
| Answer Relevancy | How well the answer addresses the user's intent | 4 | 0.6 |
| Answer Precision | How precisely the answer matches expected output | 0 | 0.6 |
| Faithfulness | Factual consistency with retrieval context | 3 | 0.7 |
| Contextual Relevancy | How well retrieved context supports the question | 3 | 0.6 |
| Contextual Precision | Precision of retrieved context chunks | N per chunk | 0.7 |
| Contextual Recall | Coverage of reference answer by context | 2 | 0.7 |
| Bias Detection | Bias and prejudice in AI output | 1 | 0.8 |
| Toxicity Detection | Toxicity level in AI output | 1 | 0.7 |
How RAG Evaluation Works¶
A RAG system has two key components: a Retriever (finds relevant documents from a knowledge base) and a Generator (LLM that produces an answer based on retrieved context). Each component can fail independently, and the metrics help identify weaknesses at every stage.
```mermaid
graph TD
    Q["Question (input)"] --> R[Retriever]
    R --> C["Context Chunks (retrieval_context)"]
    C --> G[Generator LLM]
    Q --> G
    G --> A["Answer (actual_output)"]
    subgraph retriever_metrics ["Retriever Metrics"]
        M1[Contextual Relevancy]
        M2[Contextual Precision]
        M3[Contextual Recall]
    end
    subgraph answer_metrics ["Answer Metrics"]
        M4[Answer Relevancy]
        M5[Faithfulness]
        M6[Answer Precision]
    end
    subgraph safety_metrics ["Safety Metrics"]
        M7[Bias]
        M8[Toxicity]
    end
    C -.-> M1
    C -.-> M2
    C -.-> M3
    A -.-> M4
    A -.-> M5
    A -.-> M6
    A -.-> M7
    A -.-> M8
```
Choosing Metrics¶
Retriever Quality¶
Use Contextual Relevancy, Contextual Precision, and Contextual Recall to evaluate how well your retriever finds the right information.
- Contextual Relevancy — checks if retrieved documents are relevant to the question (detects noise)
- Contextual Precision — checks if relevant documents are ranked higher than irrelevant ones
- Contextual Recall — checks if all important information is found (detects missing context)
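The ranking intuition behind Contextual Precision can be made concrete with the standard average-precision idea: relevant chunks ranked ahead of irrelevant ones score higher. This is an illustrative sketch, not the library's implementation (which judges relevance per chunk with an LLM, as the table above notes):

```python
def contextual_precision(relevances: list[bool]) -> float:
    """Average precision over retrieved chunks in rank order.

    relevances[k] is True if the chunk at rank k+1 was judged relevant.
    Rankings that place relevant chunks first score higher.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(relevances, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0
```

With this formula, a relevant chunk ranked first (`[True, False]`) scores 1.0, while the same chunk ranked second (`[False, True]`) scores only 0.5 — same retrieval, worse ordering.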
Answer Quality¶
Use Answer Relevancy and Answer Precision to evaluate how well the generated answer addresses the question.
- Answer Relevancy — evaluates from the user's intent perspective (no reference answer needed)
- Answer Precision — compares with a reference answer (requires expected_output)
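Since the table above lists 0 LLM calls for Answer Precision, it is presumably a lexical comparison against expected_output. A hypothetical token-overlap sketch of that idea — the library's actual algorithm may differ:

```python
def token_precision(actual_output: str, expected_output: str) -> float:
    """Fraction of answer tokens that also appear in the reference answer.

    Purely illustrative: a case-insensitive bag-of-words comparison,
    not the library's actual Answer Precision algorithm.
    """
    actual = actual_output.lower().split()
    expected = set(expected_output.lower().split())
    if not actual:
        return 0.0
    return sum(t in expected for t in actual) / len(actual)
```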
Factual Accuracy¶
Use Faithfulness to ensure the answer is grounded in the retrieval context and doesn't hallucinate. This is critical for systems handling legal, medical, or financial data.
Safety¶
Use Bias and Toxicity to ensure outputs are safe and unbiased. Especially important for customer-facing chatbots.
Verdict System¶
Most RAG metrics use a 5-level verdict system for nuanced evaluation:
| Verdict | Weight | Meaning |
|---|---|---|
| fully | 1.0 | Criterion fully satisfied |
| mostly | 0.9 | Largely satisfied with minor gaps |
| partial | 0.7 | Partially satisfied |
| minor | 0.3 | Minimally addressed |
| none | 0.0 | Not satisfied at all |
Verdicts are aggregated using Temperature-Controlled Score Aggregation for a final score between 0.0 and 1.0.
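The exact aggregation formula is internal to the library, but the temperature idea can be sketched as a softmax-weighted mean in which lower temperature shifts weight toward the worst verdicts, producing a stricter score. Everything below (including the default temperature) is illustrative, not the library's code:

```python
import math

# Verdict weights from the table above
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def aggregate(verdicts: list[str], temperature: float = 0.5) -> float:
    """Softmax-weighted mean of verdict weights (illustrative sketch).

    Low temperature -> low verdicts dominate (strict);
    high temperature -> approaches the plain mean (lenient).
    """
    scores = [VERDICT_WEIGHTS[v] for v in verdicts]
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```

Under this scheme, mixing one `fully` with one `none` yields a score well below the plain mean of 0.5, because the strict weighting penalizes the failed criterion.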
Quick Example¶
```python
import asyncio

from eval_lib import (
    evaluate,
    EvalTestCase,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
)

test_case = EvalTestCase(
    input="What causes climate change?",
    actual_output="Climate change is primarily caused by greenhouse gas emissions from burning fossil fuels.",
    expected_output="The main cause of climate change is the emission of greenhouse gases, particularly from fossil fuel combustion.",
    retrieval_context=[
        "Burning fossil fuels releases carbon dioxide and other greenhouse gases into the atmosphere.",
        "These greenhouse gases trap heat, leading to global warming and climate change.",
    ],
)

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
    ContextualRelevancyMetric(model="gpt-4o", threshold=0.6),
    ContextualRecallMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```