
Faithfulness

The Faithfulness metric evaluates factual consistency between the AI's answer and the retrieval context. It detects hallucinations — cases where the answer contains information not supported by the provided context.

How It Works

graph TD
    A[Actual Output] --> B[1. Extract Factual Statements]
    B --> C[2. Check Each Statement Against Context]
    D[Retrieval Context] --> C
    C --> E[3. Generate Verdicts]
    E --> F[4. Aggregate via Softmax]
    F --> G[Final Score 0.0-1.0]

  1. Statement Extraction — breaks the answer into individual factual claims
  2. Faithfulness Check — evaluates each claim against the retrieval context
  3. Verdict Generation — assigns a 5-level verdict to each statement
  4. Score Aggregation — combines verdicts using temperature-controlled softmax
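The aggregation step can be sketched in pure Python. The verdict-to-score mapping and the exact softmax formulation below are assumptions for illustration only (the library's internal formula may differ); the property they demonstrate is that lower temperatures shift weight toward unsupported statements, making the final score stricter.

```python
import math

# Assumed mapping from the 5-level verdicts to numeric scores (illustrative).
VERDICT_SCORES = {
    "fully_supported": 1.0,
    "mostly_supported": 0.75,
    "partially_supported": 0.5,
    "mostly_unsupported": 0.25,
    "unsupported": 0.0,
}

def aggregate(scores, temperature=0.5):
    """Temperature-controlled softmax aggregation (illustrative sketch).

    Weights are softmax(-score / temperature), so unsupported claims
    (low scores) receive more weight as temperature decreases.
    """
    exps = [math.exp(-s / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * s for w, s in zip(weights, scores))

# All statements fully supported -> perfect score regardless of temperature.
print(aggregate([1.0, 1.0], temperature=0.5))
```

With a mix of supported and unsupported statements, the same claim list scores lower at `temperature=0.1` than at `temperature=1.0`, which is why low temperature behaves as a strict mode.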

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
| temperature | float | 0.5 | Softmax aggregation strictness (lower = stricter) |

Required Fields

| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| retrieval_context | Yes |

Retrieval context required

Faithfulness checks if the answer is grounded in the context, so retrieval_context must be provided.

Usage

from eval_lib import FaithfulnessMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="When was Python created?",
    actual_output="Python was created by Guido van Rossum and first released in 1991. It is one of the most popular programming languages today.",
    retrieval_context=[
        "Python was conceived in the late 1980s by Guido van Rossum at CWI in the Netherlands. The first version (0.9.0) was released in February 1991.",
        "Python has consistently ranked among the top programming languages since the 2010s."
    ]
)

metric = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5
)

results = asyncio.run(evaluate([test_case], [metric]))

Detecting Hallucinations

Faithful Answer (High Score)

# Context says: "The Earth orbits the Sun at about 150 million km"
# Answer: "The Earth is approximately 150 million kilometers from the Sun."
# Score: ~0.95 (statement directly supported by context)

Hallucinated Answer (Low Score)

# Context says: "The Earth orbits the Sun at about 150 million km"
# Answer: "The Earth is 150 million km from the Sun and has 2 moons."
# Score: ~0.50 (second claim not in context — hallucination)

Cost

3 LLM API calls per evaluation:

  1. Extract factual statements from answer
  2. Generate faithfulness verdicts
  3. Generate explanation

Tips

  • Higher threshold (0.8-0.9) for medical, legal, or financial applications where hallucinations are dangerous
  • Lower threshold (0.5-0.6) for creative or conversational applications where some elaboration is acceptable
  • Use temperature=0.1 for strict mode where any unsupported claim is heavily penalized
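To see why temperature=0.1 acts as a strict mode, here is a toy comparison using an assumed softmax aggregation over statement scores (the library's internal formula may differ): one hallucinated statement among three supported ones drags the score down sharply as temperature drops.

```python
import math

def aggregate(scores, temperature):
    # Softmax over negated scores: lower temperature concentrates
    # weight on the least-supported statement (assumed formulation).
    exps = [math.exp(-s / temperature) for s in scores]
    total = sum(exps)
    return sum((e / total) * s for e, s in zip(exps, scores))

claims = [1.0, 1.0, 1.0, 0.0]  # three supported, one hallucinated
for t in (0.9, 0.5, 0.1):
    print(f"temperature={t}: score={aggregate(claims, t):.3f}")
```

At temperature=0.1 the single unsupported claim dominates the weighting and the score collapses toward 0, while at 0.9 the score stays near the plain average of the claims.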