# Faithfulness
The Faithfulness metric evaluates factual consistency between the AI's answer and the retrieval context. It detects hallucinations — cases where the answer contains information not supported by the provided context.
## How It Works
```mermaid
graph TD
    A[Actual Output] --> B[1. Extract Factual Statements]
    B --> C[2. Check Each Statement Against Context]
    D[Retrieval Context] --> C
    C --> E[3. Generate Verdicts]
    E --> F[4. Aggregate via Softmax]
    F --> G[Final Score 0.0-1.0]
```

1. Statement Extraction — breaks the answer into individual factual claims
2. Faithfulness Check — evaluates each claim against the retrieval context
3. Verdict Generation — assigns a 5-level verdict to each statement
4. Score Aggregation — combines verdicts using temperature-controlled softmax
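The library's exact aggregation formula is internal, but the idea can be sketched in plain Python. This is a minimal illustration, assuming a hypothetical 5-level verdict scale mapped to scores in [0, 1] and a softmax over the *negated* scores, so that unsupported statements dominate the weighted average as the temperature drops:

```python
import math

# Hypothetical verdict-to-score mapping — illustrative only,
# not the library's actual scale.
VERDICT_SCORES = {"yes": 1.0, "mostly": 0.75, "partial": 0.5, "barely": 0.25, "no": 0.0}

def aggregate(verdicts, temperature=0.5):
    """Temperature-controlled softmax aggregation: lower temperature
    gives unsupported statements (low scores) more weight."""
    scores = [VERDICT_SCORES[v] for v in verdicts]
    # Softmax over negated scores: the lower a score, the larger its weight.
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

print(round(aggregate(["yes", "yes", "no"]), 2))  # → 0.21
```

Note how a single unsupported claim pulls the score well below the plain mean (0.67): the softmax weighting is deliberately pessimistic about hallucinations.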
## Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | required | LLM model (`"gpt-4o"`, `"anthropic:claude-3-5-sonnet-latest"`, `"google:gemini-2.0-flash"`, `"ollama:llama3"`, or a `CustomLLMClient`) |
| `threshold` | `float` | `0.7` | Minimum score to pass |
| `temperature` | `float` | `0.5` | Aggregation strictness (lower is stricter) |
## Required Fields
| Field | Required |
|---|---|
| `input` | Yes |
| `actual_output` | Yes |
| `retrieval_context` | Yes |
**Retrieval context required:** Faithfulness checks whether the answer is grounded in the context, so `retrieval_context` must be provided.
## Usage
```python
import asyncio

from eval_lib import EvalTestCase, FaithfulnessMetric, evaluate

test_case = EvalTestCase(
    input="When was Python created?",
    actual_output="Python was created by Guido van Rossum and first released in 1991. It is one of the most popular programming languages today.",
    retrieval_context=[
        "Python was conceived in the late 1980s by Guido van Rossum at CWI in the Netherlands. The first version (0.9.0) was released in February 1991.",
        "Python has consistently ranked among the top programming languages since the 2010s.",
    ],
)

metric = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5,
)

results = asyncio.run(evaluate([test_case], [metric]))
```
## Detecting Hallucinations

### Faithful Answer (High Score)
```python
# Context says: "The Earth orbits the Sun at about 150 million km"
# Answer: "The Earth is approximately 150 million kilometers from the Sun."
# Score: ~0.95 (statement directly supported by context)
```
### Hallucinated Answer (Low Score)
```python
# Context says: "The Earth orbits the Sun at about 150 million km"
# Answer: "The Earth is 150 million km from the Sun and has 2 moons."
# Score: ~0.50 (second claim not in context — hallucination)
```
## Cost
3 LLM API calls per evaluation:
- Extract factual statements from answer
- Generate faithfulness verdicts
- Generate explanation
## Tips
- Higher threshold (0.8-0.9) for medical, legal, or financial applications where hallucinations are dangerous
- Lower threshold (0.5-0.6) for creative or conversational applications where some elaboration is acceptable
- Use `temperature=0.1` for strict mode, where any unsupported claim is heavily penalized
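To see why a low temperature acts as strict mode, here is an illustrative comparison using an assumed softmax-over-negated-scores aggregation (the library's exact formula may differ). One unsupported claim among four supported ones collapses the score at `temperature=0.1`, while the default `0.5` is more forgiving:

```python
import math

# Illustrative aggregation — assumed formula, not the library's internal one.
def softmax_aggregate(scores, temperature):
    weights = [math.exp(-s / temperature) for s in scores]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# Four supported statements (1.0) and one unsupported one (0.0).
scores = [1.0, 1.0, 1.0, 1.0, 0.0]
strict = softmax_aggregate(scores, temperature=0.1)   # near-min behavior
default = softmax_aggregate(scores, temperature=0.5)

print(round(strict, 2), round(default, 2))  # → 0.0 0.35
```

At low temperature the softmax weights concentrate almost entirely on the worst verdict, so a single hallucination fails the test case regardless of how many claims were supported.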