
Configuration

Environment Variables

Eval AI Library uses environment variables for LLM provider authentication. Set the variables for the providers you intend to use.

LLM Providers

| Variable | Provider | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI | For OpenAI models |
| ANTHROPIC_API_KEY | Anthropic | For Claude models |
| GOOGLE_API_KEY | Google | For Gemini models |
| AZURE_OPENAI_API_KEY | Azure OpenAI | For Azure deployments |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI | For Azure deployments |
| AZURE_OPENAI_DEPLOYMENT | Azure OpenAI | For Azure deployments |
| OLLAMA_API_BASE_URL | Ollama | Optional (default: http://localhost:11434/v1) |
| OLLAMA_API_KEY | Ollama | Optional |
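For a quick start, the variables can be exported in the shell before running evaluations (placeholder values shown; substitute your own keys):

```shell
# Only set the variables for the providers you actually use.
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
# Optional: point Ollama at a non-default host.
export OLLAMA_API_BASE_URL="http://localhost:11434/v1"
```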

Using .env Files

You can use a .env file with python-dotenv:

.env:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AI...

Then load it at startup, before creating any metrics:

from dotenv import load_dotenv
load_dotenv()

from eval_lib import evaluate, AnswerRelevancyMetric
# Now API keys are available

Model Specification

Models are specified using the provider:model_name format:

# OpenAI (default provider if no prefix)
model = "gpt-4o"
model = "openai:gpt-4o"

# Anthropic
model = "anthropic:claude-3-5-sonnet-latest"

# Google Gemini
model = "google:gemini-2.0-flash"

# Ollama (local)
model = "ollama:llama3"

# Azure OpenAI
model = "azure:gpt-4o"

You can also use LLMDescriptor for explicit configuration:

from eval_lib import LLMDescriptor, Provider

model = LLMDescriptor(provider=Provider.OPENAI, model="gpt-4o")
model = LLMDescriptor(provider=Provider.ANTHROPIC, model="claude-3-5-sonnet-latest")

Metric Configuration

Every metric accepts these common parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | – | LLM model to use for evaluation |
| threshold | float | varies | Minimum score to pass (0.0-1.0) |
| verbose | bool | False | Enable detailed logging |

Default Thresholds

| Metric | Default Threshold |
|---|---|
| Answer Relevancy | 0.6 |
| Answer Precision | 0.6 |
| Faithfulness | 0.7 |
| Contextual Relevancy | 0.6 |
| Contextual Precision | 0.7 |
| Contextual Recall | 0.7 |
| Bias Detection | 0.8 |
| Toxicity Detection | 0.7 |
| Tool Correctness | 0.5 |
| Task Success Rate | 0.7 |
| Role Adherence | 0.7 |
| Knowledge Retention | 0.7 |
| Tool Error Detection | 0.7 |
| All Security Metrics | 0.7 |
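The defaults above can be captured in a plain mapping for quick pass/fail checks (an illustrative snippet only; the library applies thresholds internally):

```python
# A subset of the default thresholds from the table above.
DEFAULT_THRESHOLDS = {
    "answer_relevancy": 0.6,
    "faithfulness": 0.7,
    "bias_detection": 0.8,
    "tool_correctness": 0.5,
}

def passes(metric: str, score: float) -> bool:
    # A test case passes when its score meets or exceeds the threshold.
    return score >= DEFAULT_THRESHOLDS[metric]
```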

Temperature Parameter

Metrics that use verdict aggregation accept a temperature parameter that controls scoring strictness:

# Strict scoring — penalizes poor verdicts heavily
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=0.1)

# Balanced scoring (default for most metrics)
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=0.5)

# Lenient scoring — forgiving of weak verdicts
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=1.0)

See Score Aggregation for details on how temperature affects scoring.
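One way to picture the effect is a power-mean over verdict scores (a hypothetical model for intuition only, not the library's actual formula; see Score Aggregation for that): lower temperature pulls the aggregate toward the weakest verdict.

```python
def aggregate(verdict_scores: list[float], temperature: float) -> float:
    # Power mean with exponent p = temperature: small p approaches the
    # geometric mean (strict, dominated by weak verdicts), while
    # p = 1.0 is the plain arithmetic mean (more lenient).
    p = temperature
    n = len(verdict_scores)
    return (sum(v ** p for v in verdict_scores) / n) ** (1 / p)

scores = [1.0, 1.0, 0.2]          # one weak verdict
strict = aggregate(scores, 0.1)   # ~0.60: the weak verdict drags the score down
lenient = aggregate(scores, 1.0)  # ~0.73: plain average
```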


Evaluation Options

The evaluate() function accepts these options:

results = await evaluate(
    test_cases=test_cases,       # List[EvalTestCase]
    metrics=metrics,             # List[MetricPattern]
    verbose=True,                # Show progress in console
    show_dashboard=False,        # Open dashboard after evaluation
    session_name="my-run"        # Name for dashboard session
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| test_cases | List[EvalTestCase] | required | Test cases to evaluate |
| metrics | List[MetricPattern] | required | Metrics to apply |
| verbose | bool | True | Show console progress |
| show_dashboard | bool | False | Open dashboard in browser |
| session_name | str | None | Session name for caching |

Test Case Schema

EvalTestCase

from eval_lib import EvalTestCase

test_case = EvalTestCase(
    input="User's question",                    # Required
    actual_output="AI's response",              # Required
    expected_output="Reference answer",          # Optional
    retrieval_context=["context chunk 1", ...],  # Optional
    tools_called=["tool1", "tool2"],            # Optional (for agent metrics)
    expected_tools=["tool1", "tool2"],          # Optional (for agent metrics)
    reasoning="Chain of thought",               # Optional
    name="Test case label",                     # Optional
)

ConversationalEvalTestCase

from eval_lib import ConversationalEvalTestCase, EvalTestCase

conversation = ConversationalEvalTestCase(
    chatbot_role="System prompt / role description",  # Optional
    name="Conversation label",                        # Optional
    turns=[
        EvalTestCase(input="...", actual_output="..."),
        EvalTestCase(input="...", actual_output="..."),
    ]
)

Which fields are needed?

Different metrics require different fields. See each metric's documentation for its specific requirements.