Knowledge Retention¶
The Knowledge Retention metric evaluates how well the AI remembers and uses context from earlier parts of a conversation.
In multi-turn conversations, users share information progressively — names, preferences, constraints, prior decisions. A well-designed conversational AI should retain this context and apply it in subsequent turns without asking the user to repeat themselves. Forgetting context leads to frustrating user experiences (e.g., a food recommendation bot that forgets about a stated allergy, or a support agent that asks for the order number twice).
This metric analyzes the full conversation history and evaluates whether the AI's later responses demonstrate awareness of information shared in earlier turns.
How It Works¶
- Dialogue Formatting — formats the full conversation history
- Retention Assessment — evaluates whether the AI retains and applies earlier context
- Verdict Generation — produces a single verdict for overall retention quality
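The first step above can be sketched as a simple transcript builder. This is an illustrative assumption about how dialogue formatting works, not the library's actual internals; the function name and turn structure are hypothetical:

```python
# Hedged sketch of the dialogue-formatting step (assumed, not eval_lib's
# actual implementation): flatten (user, assistant) turns into a single
# transcript the judge LLM can read in one pass.
def format_dialogue(turns: list[tuple[str, str]]) -> str:
    lines = []
    for user_msg, ai_msg in turns:
        lines.append(f"User: {user_msg}")
        lines.append(f"AI: {ai_msg}")
    return "\n".join(lines)

transcript = format_dialogue([
    ("My name is Alice and I'm allergic to peanuts.",
     "Nice to meet you, Alice! I'll make note of your peanut allergy."),
    ("Can you suggest a dessert?",
     "How about a chocolate lava cake? It's peanut-free!"),
])
print(transcript)
```

The retention assessment and verdict steps then run over this transcript, so later turns can be judged against everything the user stated earlier.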
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | 0.7 | Minimum score to pass |
| temperature | float | 0.5 | Aggregation strictness |
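How the threshold gates the final result can be sketched as follows. This is a hedged illustration of the pass/fail rule only; the library's internal names and aggregation logic may differ:

```python
# Hedged sketch (not eval_lib's actual code): the metric passes when the
# aggregated retention score meets or exceeds the configured threshold.
def metric_passes(score: float, threshold: float = 0.7) -> bool:
    return score >= threshold

print(metric_passes(0.85))  # good retention -> passes
print(metric_passes(0.40))  # forgot earlier context -> fails
```

Raising `threshold` toward 1.0 makes the metric stricter about lapses in retention; lowering it tolerates occasional misses.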
Required Fields¶
Requires a ConversationalEvalTestCase — a list of EvalTestCase turns, each with an input and an actual_output.
Usage¶
```python
from eval_lib import (
    KnowledgeRetentionMetric,
    ConversationalEvalTestCase,
    EvalTestCase,
    evaluate_conversations,
)
import asyncio

conversation = ConversationalEvalTestCase(
    turns=[
        EvalTestCase(
            input="My name is Alice and I'm allergic to peanuts.",
            actual_output="Nice to meet you, Alice! I'll make note of your peanut allergy.",
        ),
        EvalTestCase(
            input="Can you suggest a dessert?",
            actual_output="How about a chocolate lava cake? It's peanut-free, so it's safe for you, Alice!",
        ),
        EvalTestCase(
            input="What about a snack?",
            # The AI remembers the allergy from turn 1
            actual_output="I'd suggest fruit and cheese. No peanuts involved!",
        ),
    ]
)

metric = KnowledgeRetentionMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate_conversations([conversation], [metric]))
```
Cost¶
2 LLM API calls per evaluation.
When to Use¶
- Long multi-turn conversations
- Personal assistants that need to remember user preferences
- Customer support scenarios with context carryover
- Any conversational AI where forgetting context would degrade experience