# Faithfulness
The Faithfulness metric evaluates factual consistency between the AI's answer and the retrieval context. It detects hallucinations — cases where the answer contains information not supported by the provided context.
## How It Works
```mermaid
graph TD
    A[Actual Output] --> B[1. Extract Factual Statements]
    B --> C[2. Check Each Statement Against Context]
    D[Retrieval Context] --> C
    C --> E[3. Generate Verdicts]
    E --> F[4. Aggregate via Softmax]
    F --> G[Final Score 0.0-1.0]
```

1. Statement Extraction — breaks the answer into individual factual claims
2. Faithfulness Check — evaluates each claim against the retrieval context
3. Verdict Generation — assigns a 5-level verdict to each statement
4. Score Aggregation — combines verdicts using temperature-controlled softmax
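The library's exact aggregation formula is internal, but the idea can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical 5-level verdict scale mapped to scores in [0, 1] and a softmax over the *negated* scores, so that unsupported statements dominate the weighted average as the temperature drops:

```python
import math

# Hypothetical verdict-to-score mapping — illustrative only,
# not the library's actual scale.
VERDICT_SCORES = {"yes": 1.0, "mostly": 0.75, "partial": 0.5, "barely": 0.25, "no": 0.0}

def aggregate(verdicts, temperature=0.5):
    """Temperature-controlled softmax aggregation: lower temperature
    gives unsupported statements (low scores) more weight."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    # Softmax over negated scores: the lower a score, the larger its weight.
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(round(aggregate(["yes", "yes", "no"]), 2))  # → 0.21
```

Note how a single unsupported claim pulls the score well below the plain mean (0.67): the softmax weighting is deliberately pessimistic about hallucinations.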
## Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | required | LLM model (`"gpt-4o"`, `"anthropic:claude-3-5-sonnet-latest"`, `"google:gemini-2.0-flash"`, `"ollama:llama3"`, or a `CustomLLMClient`) |
| `threshold` | `float` | `0.7` | Minimum score to pass |
| `temperature` | `float` | `0.5` | Aggregation strictness (lower is stricter) |
## Required Fields
| Field | Required |
|---|---|
| `input` | Yes |
| `actual_output` | Yes |
| `retrieval_context` | Yes |
**Retrieval context required:** Faithfulness checks whether the answer is grounded in the context, so `retrieval_context` must be provided.
## Usage
```python
import asyncio

from eval_lib import EvalTestCase, FaithfulnessMetric, evaluate

test_case = EvalTestCase(
    input="When was Python created?",
    actual_output="Python was created by Guido van Rossum and first released in 1991. It is one of the most popular programming languages today.",
    retrieval_context=[
        "Python was conceived in the late 1980s by Guido van Rossum at CWI in the Netherlands. The first version (0.9.0) was released in February 1991.",
        "Python has consistently ranked among the top programming languages since the 2010s.",
    ],
)

metric = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5,
)

results = asyncio.run(evaluate([test_case], [metric]))
```
## Detecting Hallucinations

### Faithful Answer (High Score)
```python
# Context says: "The Earth orbits the Sun at about 150 million km"
# Answer: "The Earth is approximately 150 million kilometers from the Sun."
# Score: ~0.95 (statement directly supported by context)
```
### Hallucinated Answer (Low Score)
```python
# Context says: "The Earth orbits the Sun at about 150 million km"
# Answer: "The Earth is 150 million km from the Sun and has 2 moons."
# Score: ~0.50 (second claim not in context — hallucination)
```
## Cost
3 LLM API calls per evaluation:
- Extract factual statements from answer
- Generate faithfulness verdicts
- Generate explanation
## Tips
- Higher threshold (0.8-0.9) for medical, legal, or financial applications where hallucinations are dangerous
- Lower threshold (0.5-0.6) for creative or conversational applications where some elaboration is acceptable
- Use `temperature=0.1` for strict mode, where any unsupported claim is heavily penalized
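To see why a low temperature acts as strict mode, here is an illustrative comparison using an assumed softmax-over-negated-scores aggregation (the library's exact formula may differ). One unsupported claim among four supported ones collapses the score at `temperature=0.1`, while the default `0.5` is more forgiving:

```python
import math

# Illustrative aggregation — assumed formula, not the library's internal one.
def softmax_aggregate(scores, temperature):
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Four supported statements (1.0) and one unsupported one (0.0).
scores = [1.0, 1.0, 1.0, 1.0, 0.0]
strict = softmax_aggregate(scores, temperature=0.1)   # near-min behavior
default = softmax_aggregate(scores, temperature=0.5)

print(round(strict, 2), round(default, 2))  # → 0.0 0.35
```

At low temperature the softmax weights concentrate almost entirely on the worst verdict, so a single hallucination fails the test case regardless of how many claims were supported.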