Knowledge Retention

The Knowledge Retention metric evaluates how well the AI remembers and uses context from earlier parts of a conversation.

In multi-turn conversations, users share information progressively — names, preferences, constraints, prior decisions. A well-designed conversational AI should retain this context and apply it in subsequent turns without asking the user to repeat themselves. Forgetting context leads to frustrating user experiences (e.g., a food recommendation bot that forgets about a stated allergy, or a support agent that asks for the order number twice).

This metric analyzes the full conversation history and evaluates whether the AI's later responses demonstrate awareness of information shared in earlier turns.

How It Works

  1. Dialogue Formatting — formats the full conversation history
  2. Retention Assessment — evaluates whether the AI retains and applies earlier context
  3. Verdict Generation — single verdict for overall retention quality
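
The three steps above can be sketched in plain Python. This is an illustrative stand-in, not the library's internals: `assess_retention` replaces the real LLM judge with a simple substring check, and all function names here are hypothetical.

```python
def format_dialogue(turns):
    """Step 1: flatten (user, assistant) turns into a single transcript."""
    return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in turns)

def assess_retention(transcript, facts):
    """Step 2 (stand-in for the LLM judge): check whether facts shared
    earlier still surface in the conversation transcript."""
    return [fact for fact in facts if fact.lower() in transcript.lower()]

def verdict(retained, facts, threshold=0.7):
    """Step 3: a single verdict from the overall retention ratio."""
    score = len(retained) / len(facts) if facts else 1.0
    return score, score >= threshold

turns = [
    ("My name is Alice and I'm allergic to peanuts.",
     "Nice to meet you, Alice! I'll make note of your peanut allergy."),
    ("Can you suggest a dessert?",
     "How about a chocolate lava cake? It's peanut-free, Alice!"),
]
facts = ["Alice", "peanut"]  # context the AI is expected to retain
transcript = format_dialogue(turns)
score, passed = verdict(assess_retention(transcript, facts), facts)
```

Here both facts reappear in later turns, so the sketch yields a perfect retention score and a passing verdict.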

Parameters

Parameter     Type    Default    Description
model         str     required   LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient instance)
threshold     float   0.7        Minimum score to pass
temperature   float   0.5        Aggregation strictness

Required Fields

Requires a ConversationalEvalTestCase, whose turns field holds one EvalTestCase (input plus actual_output) per conversation turn.
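
The expected shape can be sketched with plain dataclasses as stand-ins (the real classes live in eval_lib; only the fields shown in the Usage section below are assumed here):

```python
from dataclasses import dataclass, field

@dataclass
class EvalTestCase:
    """Stand-in: one conversation turn."""
    input: str          # the user's message
    actual_output: str  # the AI's response

@dataclass
class ConversationalEvalTestCase:
    """Stand-in: an ordered list of turns forming one conversation."""
    turns: list = field(default_factory=list)

case = ConversationalEvalTestCase(turns=[
    EvalTestCase(input="Hi, I'm Bob.", actual_output="Hello, Bob!"),
])
```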

Usage

from eval_lib import (
    KnowledgeRetentionMetric,
    ConversationalEvalTestCase,
    EvalTestCase,
    evaluate_conversations,
)
import asyncio

conversation = ConversationalEvalTestCase(
    turns=[
        EvalTestCase(
            input="My name is Alice and I'm allergic to peanuts.",
            actual_output="Nice to meet you, Alice! I'll make note of your peanut allergy."
        ),
        EvalTestCase(
            input="Can you suggest a dessert?",
            actual_output="How about a chocolate lava cake? It's peanut-free, so it's safe for you, Alice!"
        ),
        EvalTestCase(
            input="What about a snack?",
            actual_output="I'd suggest fruit and cheese. No peanuts involved!"
            # AI remembers the allergy from turn 1
        ),
    ]
)

metric = KnowledgeRetentionMetric(model="gpt-4o", threshold=0.7)
results = asyncio.run(evaluate_conversations([conversation], [metric]))

Cost

2 LLM API calls per evaluation.
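
For budgeting, the total call count can be estimated as follows, assuming one evaluation per (conversation, metric) pair; that scaling is an assumption, not something the library documents:

```python
CALLS_PER_EVALUATION = 2  # from the Cost section above

def estimated_calls(num_conversations: int, num_metrics: int = 1) -> int:
    # Assumption: each (conversation, metric) pair counts as one evaluation.
    return CALLS_PER_EVALUATION * num_conversations * num_metrics
```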

When to Use

  • Long multi-turn conversations
  • Personal assistants that need to remember user preferences
  • Customer support scenarios with context carryover
  • Any conversational AI where forgetting context would degrade experience