Configuration¶
Environment Variables¶
Eval AI Library uses environment variables for LLM provider authentication. Set the variables for the providers you intend to use.
LLM Providers¶
| Variable | Provider | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI | For OpenAI models |
| ANTHROPIC_API_KEY | Anthropic | For Claude models |
| GOOGLE_API_KEY | Google | For Gemini models |
| AZURE_OPENAI_API_KEY | Azure OpenAI | For Azure deployments |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI | For Azure deployments |
| AZURE_OPENAI_DEPLOYMENT | Azure OpenAI | For Azure deployments |
| OLLAMA_API_BASE_URL | Ollama | Optional (default: http://localhost:11434/v1) |
| OLLAMA_API_KEY | Ollama | Optional |
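For example, you might export only the keys for the providers you use (the values below are placeholders, not real keys):

```shell
# Export keys only for the providers you plan to use (placeholder values).
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Ollama needs no key by default; override the base URL only if your
# local server is not on the default address.
export OLLAMA_API_BASE_URL="http://localhost:11434/v1"
```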
Using .env Files¶
You can use a .env file with python-dotenv:
```python
from dotenv import load_dotenv

load_dotenv()

from eval_lib import evaluate, AnswerRelevancyMetric
# Now API keys are available
```
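A matching .env file might look like this (placeholder values; keep this file out of version control):

```shell
# .env — placeholder values, not real keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```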
Model Specification¶
Models are specified using the `provider:model_name` format:

```python
# OpenAI (default provider if no prefix)
model = "gpt-4o"
model = "openai:gpt-4o"

# Anthropic
model = "anthropic:claude-3-5-sonnet-latest"

# Google Gemini
model = "google:gemini-2.0-flash"

# Ollama (local)
model = "ollama:llama3"

# Azure OpenAI
model = "azure:gpt-4o"
```
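The library resolves these strings internally; as an illustration of the prefix convention only (the helper below is hypothetical, not part of eval_lib), a spec without a `provider:` prefix falls back to OpenAI:

```python
def parse_model(spec: str, default_provider: str = "openai") -> tuple[str, str]:
    """Split 'provider:model' into a (provider, model) pair.

    Hypothetical sketch: specs without a prefix fall back to the
    default provider, matching the convention described above.
    """
    provider, sep, model = spec.partition(":")
    if not sep:  # no ':' found — bare model name
        return default_provider, spec
    return provider, model

print(parse_model("gpt-4o"))        # ('openai', 'gpt-4o')
print(parse_model("ollama:llama3"))  # ('ollama', 'llama3')
```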
You can also use LLMDescriptor for explicit configuration:
```python
from eval_lib import LLMDescriptor, Provider

model = LLMDescriptor(provider=Provider.OPENAI, model="gpt-4o")
model = LLMDescriptor(provider=Provider.ANTHROPIC, model="claude-3-5-sonnet-latest")
```
Metric Configuration¶
Every metric accepts these common parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | — | LLM model to use for evaluation |
| threshold | float | varies | Minimum score to pass (0.0-1.0) |
| verbose | bool | False | Enable detailed logging |
Default Thresholds¶
| Metric | Default Threshold |
|---|---|
| Answer Relevancy | 0.6 |
| Answer Precision | 0.6 |
| Faithfulness | 0.7 |
| Contextual Relevancy | 0.6 |
| Contextual Precision | 0.7 |
| Contextual Recall | 0.7 |
| Bias Detection | 0.8 |
| Toxicity Detection | 0.7 |
| Tool Correctness | 0.5 |
| Task Success Rate | 0.7 |
| Role Adherence | 0.7 |
| Knowledge Retention | 0.7 |
| Tool Error Detection | 0.7 |
| All Security Metrics | 0.7 |
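eval_lib applies these thresholds internally when scoring; as an illustration only (the dictionary and helper below are hypothetical, not library APIs), the pass/fail rule is simply `score >= threshold`:

```python
# Hypothetical sketch of the pass/fail rule, using defaults from the table.
DEFAULT_THRESHOLDS = {
    "answer_relevancy": 0.6,
    "faithfulness": 0.7,
    "bias_detection": 0.8,
}

def passes(metric: str, score: float) -> bool:
    """Return True when the score meets the metric's default threshold."""
    return score >= DEFAULT_THRESHOLDS[metric]

print(passes("faithfulness", 0.75))      # True  (0.75 >= 0.7)
print(passes("answer_relevancy", 0.55))  # False (0.55 < 0.6)
```

Overriding `threshold` when constructing a metric replaces the default in this comparison.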
Temperature Parameter¶
Metrics that use verdict aggregation accept a temperature parameter that controls scoring strictness:
```python
# Strict scoring — penalizes poor verdicts heavily
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=0.1)

# Balanced scoring (default for most metrics)
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=0.5)

# Lenient scoring — forgiving of weak verdicts
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=1.0)
```
See Score Aggregation for details on how temperature affects scoring.
Evaluation Options¶
The evaluate() function accepts these options:
```python
results = await evaluate(
    test_cases=test_cases,    # List[EvalTestCase]
    metrics=metrics,          # List[MetricPattern]
    verbose=True,             # Show progress in console
    show_dashboard=False,     # Open dashboard after evaluation
    session_name="my-run",    # Name for dashboard session
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| test_cases | List[EvalTestCase] | required | Test cases to evaluate |
| metrics | List[MetricPattern] | required | Metrics to apply |
| verbose | bool | True | Show console progress |
| show_dashboard | bool | False | Open dashboard in browser |
| session_name | str | None | Session name for caching |
Test Case Schema¶
EvalTestCase¶
```python
from eval_lib import EvalTestCase

test_case = EvalTestCase(
    input="User's question",                     # Required
    actual_output="AI's response",               # Required
    expected_output="Reference answer",          # Optional
    retrieval_context=["context chunk 1", ...],  # Optional
    tools_called=["tool1", "tool2"],             # Optional (for agent metrics)
    expected_tools=["tool1", "tool2"],           # Optional (for agent metrics)
    reasoning="Chain of thought",                # Optional
    name="Test case label",                      # Optional
)
```
ConversationalEvalTestCase¶
```python
from eval_lib import ConversationalEvalTestCase, EvalTestCase

conversation = ConversationalEvalTestCase(
    chatbot_role="System prompt / role description",  # Optional
    name="Conversation label",                        # Optional
    turns=[
        EvalTestCase(input="...", actual_output="..."),
        EvalTestCase(input="...", actual_output="..."),
    ],
)
```
Which fields are needed?
Different metrics require different fields. See each metric's documentation for its specific requirements.