Knowledge Retention¶
The Knowledge Retention metric evaluates how well the AI remembers and uses context from earlier parts of a conversation.
In multi-turn conversations, users share information progressively — names, preferences, constraints, prior decisions. A well-designed conversational AI should retain this context and apply it in subsequent turns without asking the user to repeat themselves. Forgetting context leads to frustrating user experiences (e.g., a food recommendation bot that forgets about a stated allergy, or a support agent that asks for the order number twice).
This metric analyzes the full conversation history and evaluates whether the AI's later responses demonstrate awareness of information shared in earlier turns.
How It Works¶
- Dialogue Formatting — formats the full conversation history
- Retention Assessment — evaluates whether the AI retains and applies earlier context
- Verdict Generation — produces a single verdict for overall retention quality
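The first step above can be sketched as a simple transcript builder. This is an illustrative assumption about how dialogue formatting works, not the library's actual internals; the function name and turn structure are hypothetical:

```python
# Hedged sketch of the dialogue-formatting step (assumed, not eval_lib's
# actual implementation): flatten (user, assistant) turns into a single
# transcript the judge LLM can read in one pass.
def format_dialogue(turns: list[tuple[str, str]]) -> str:
    lines = []
    for user_msg, ai_msg in turns:
        lines.append(f"User: {user_msg}")
        lines.append(f"AI: {ai_msg}")
    return "\n".join(lines)

transcript = format_dialogue([
    ("My name is Alice and I'm allergic to peanuts.",
     "Nice to meet you, Alice! I'll make note of your peanut allergy."),
    ("Can you suggest a dessert?",
     "How about a chocolate lava cake? It's peanut-free!"),
])
print(transcript)
```

The retention assessment and verdict steps then run over this transcript, so later turns can be judged against everything the user stated earlier.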
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness |
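How the threshold gates the final result can be sketched as follows. This is a hedged illustration of the pass/fail rule only; the library's internal names and aggregation logic may differ:

```python
# Hedged sketch (not eval_lib's actual code): the metric passes when the
# aggregated retention score meets or exceeds the configured threshold.
def metric_passes(score: float, threshold: float = 0.7) -> bool:
    return score >= threshold

print(metric_passes(0.85))  # good retention -> passes
print(metric_passes(0.40))  # forgot earlier context -> fails
```

Raising `threshold` toward 1.0 makes the metric stricter about lapses in retention; lowering it tolerates occasional misses.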
Required Fields¶
Requires a ConversationalEvalTestCase — a list of EvalTestCase turns, each with an input and an actual_output.
Usage¶
```python
from eval_lib import (
    KnowledgeRetentionMetric,
    ConversationalEvalTestCase,
    EvalTestCase,
    evaluate_conversations,
)
import asyncio

conversation = ConversationalEvalTestCase(
    turns=[
        EvalTestCase(
            input="My name is Alice and I'm allergic to peanuts.",
            actual_output="Nice to meet you, Alice! I'll make note of your peanut allergy.",
        ),
        EvalTestCase(
            input="Can you suggest a dessert?",
            actual_output="How about a chocolate lava cake? It's peanut-free, so it's safe for you, Alice!",
        ),
        EvalTestCase(
            input="What about a snack?",
            # The AI remembers the allergy from turn 1
            actual_output="I'd suggest fruit and cheese. No peanuts involved!",
        ),
    ]
)

metric = KnowledgeRetentionMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate_conversations([conversation], [metric]))
```
Cost¶
2 LLM API calls per evaluation.
When to Use¶
- Long multi-turn conversations
- Personal assistants that need to remember user preferences
- Customer support scenarios with context carryover
- Any conversational AI where forgetting context would degrade experience