RAG Metrics¶
Eval AI Library provides a comprehensive suite of metrics for evaluating Retrieval-Augmented Generation (RAG) systems. These metrics assess different aspects of RAG quality — from answer relevancy to factual faithfulness and retrieval quality.
Available Metrics¶
| Metric | What It Measures | LLM Calls | Default Threshold |
|---|---|---|---|
| Answer Relevancy | How well the answer addresses the user's intent | 4 | 0.6 |
| Answer Precision | How precisely the answer matches expected output | 0 | 0.6 |
| Faithfulness | Factual consistency with retrieval context | 3 | 0.7 |
| Contextual Relevancy | How well retrieved context supports the question | 3 | 0.6 |
| Contextual Precision | Precision of retrieved context chunks | N per chunk | 0.7 |
| Contextual Recall | Coverage of reference answer by context | 2 | 0.7 |
| Bias Detection | Bias and prejudice in AI output | 1 | 0.8 |
| Toxicity Detection | Toxicity level in AI output | 1 | 0.7 |
How RAG Evaluation Works¶
A RAG system has two key components: a Retriever (finds relevant documents from a knowledge base) and a Generator (LLM that produces an answer based on retrieved context). Each component can fail independently, and the metrics help identify weaknesses at every stage.
```mermaid
graph TD
    Q["Question (input)"] --> R[Retriever]
    R --> C["Context Chunks (retrieval_context)"]
    C --> G[Generator LLM]
    Q --> G
    G --> A["Answer (actual_output)"]
    subgraph retriever_metrics ["Retriever Metrics"]
        M1[Contextual Relevancy]
        M2[Contextual Precision]
        M3[Contextual Recall]
    end
    subgraph answer_metrics ["Answer Metrics"]
        M4[Answer Relevancy]
        M5[Faithfulness]
        M6[Answer Precision]
    end
    subgraph safety_metrics ["Safety Metrics"]
        M7[Bias]
        M8[Toxicity]
    end
    C -.-> M1
    C -.-> M2
    C -.-> M3
    A -.-> M4
    A -.-> M5
    A -.-> M6
    A -.-> M7
    A -.-> M8
```
Choosing Metrics¶
Retriever Quality¶
Use Contextual Relevancy, Contextual Precision, and Contextual Recall to evaluate how well your retriever finds the right information.
- Contextual Relevancy — checks if retrieved documents are relevant to the question (detects noise)
- Contextual Precision — checks if relevant documents are ranked higher than irrelevant ones
- Contextual Recall — checks if all important information is found (detects missing context)
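The ranking intuition behind Contextual Precision can be made concrete with the standard average-precision idea: relevant chunks ranked ahead of irrelevant ones score higher. This is an illustrative sketch, not the library's implementation (which judges relevance per chunk with an LLM, as the table above notes):

```python
def contextual_precision(relevances: list[bool]) -> float:
    """Average precision over retrieved chunks in rank order.

    relevances[k] is True if the chunk at rank k+1 was judged relevant.
    Rankings that place relevant chunks first score higher.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(relevances, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0
```

With this formula, a relevant chunk ranked first (`[True, False]`) scores 1.0, while the same chunk ranked second (`[False, True]`) scores only 0.5 — same retrieval, worse ordering.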
Answer Quality¶
Use Answer Relevancy and Answer Precision to evaluate how well the generated answer addresses the question.
- Answer Relevancy — evaluates from the user's intent perspective (no reference answer needed)
- Answer Precision — compares with a reference answer (requires expected_output)
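Since the table above lists 0 LLM calls for Answer Precision, it is presumably a lexical comparison against expected_output. A hypothetical token-overlap sketch of that idea — the library's actual algorithm may differ:

```python
def token_precision(actual_output: str, expected_output: str) -> float:
    """Fraction of answer tokens that also appear in the reference answer.

    Purely illustrative: a case-insensitive bag-of-words comparison,
    not the library's actual Answer Precision algorithm.
    """
    actual = actual_output.lower().split()
    expected = set(expected_output.lower().split())
    if not actual:
        return 0.0
    return sum(t in expected for t in actual) / len(actual)
```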
Factual Accuracy¶
Use Faithfulness to ensure the answer is grounded in the retrieval context and doesn't hallucinate. This is critical for systems handling legal, medical, or financial data.
Safety¶
Use Bias and Toxicity to ensure outputs are safe and unbiased. Especially important for customer-facing chatbots.
Verdict System¶
Most RAG metrics use a 5-level verdict system for nuanced evaluation:
| Verdict | Weight | Meaning |
|---|---|---|
| fully | 1.0 | Criterion fully satisfied |
| mostly | 0.9 | Largely satisfied with minor gaps |
| partial | 0.7 | Partially satisfied |
| minor | 0.3 | Minimally addressed |
| none | 0.0 | Not satisfied at all |
Verdicts are aggregated using Temperature-Controlled Score Aggregation for a final score between 0.0 and 1.0.
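The exact aggregation formula is internal to the library, but the temperature idea can be sketched as a softmax-weighted mean in which lower temperature shifts weight toward the worst verdicts, producing a stricter score. Everything below (including the default temperature) is illustrative, not the library's code:

```python
import math

# Verdict weights from the table above
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def aggregate(verdicts: list[str], temperature: float = 0.5) -> float:
    """Softmax-weighted mean of verdict weights (illustrative sketch).

    Low temperature -> low verdicts dominate (strict);
    high temperature -> approaches the plain mean (lenient).
    """
    scores = [VERDICT_WEIGHTS[v] for v in verdicts]
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)
```

Under this scheme, mixing one `fully` with one `none` yields a score well below the plain mean of 0.5, because the strict weighting penalizes the failed criterion.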
Quick Example¶
```python
import asyncio

from eval_lib import (
    evaluate,
    EvalTestCase,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric,
)

test_case = EvalTestCase(
    input="What causes climate change?",
    actual_output="Climate change is primarily caused by greenhouse gas emissions from burning fossil fuels.",
    expected_output="The main cause of climate change is the emission of greenhouse gases, particularly from fossil fuel combustion.",
    retrieval_context=[
        "Burning fossil fuels releases carbon dioxide and other greenhouse gases into the atmosphere.",
        "These greenhouse gases trap heat, leading to global warming and climate change.",
    ],
)

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
    ContextualRelevancyMetric(model="gpt-4o", threshold=0.6),
    ContextualRecallMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```