Configuration¶
Environment Variables¶
Eval AI Library uses environment variables for LLM provider authentication. Set the variables for the providers you intend to use.
LLM Providers¶
| Variable | Provider | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI | For OpenAI models |
| ANTHROPIC_API_KEY | Anthropic | For Claude models |
| GOOGLE_API_KEY | Google | For Gemini models |
| AZURE_OPENAI_API_KEY | Azure OpenAI | For Azure deployments |
| AZURE_OPENAI_ENDPOINT | Azure OpenAI | For Azure deployments |
| AZURE_OPENAI_DEPLOYMENT | Azure OpenAI | For Azure deployments |
| OLLAMA_API_BASE_URL | Ollama | Optional (default: http://localhost:11434/v1) |
| OLLAMA_API_KEY | Ollama | Optional |
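For example, you might export only the keys for the providers you use (the values below are placeholders, not real keys):

```shell
# Export keys only for the providers you plan to use (placeholder values).
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

# Ollama needs no key by default; override the base URL only if your
# local server is not on the default address.
export OLLAMA_API_BASE_URL="http://localhost:11434/v1"
```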
Using .env Files¶
You can use a .env file with python-dotenv:
```python
from dotenv import load_dotenv

load_dotenv()

from eval_lib import evaluate, AnswerRelevancyMetric
# Now API keys are available
```
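A matching .env file might look like this (placeholder values; keep this file out of version control):

```shell
# .env — placeholder values, not real keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```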
Model Specification¶
Models are specified using the `provider:model_name` format:

```python
# OpenAI (default provider if no prefix)
model = "gpt-4o"
model = "openai:gpt-4o"

# Anthropic
model = "anthropic:claude-3-5-sonnet-latest"

# Google Gemini
model = "google:gemini-2.0-flash"

# Ollama (local)
model = "ollama:llama3"

# Azure OpenAI
model = "azure:gpt-4o"
```
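The library resolves these strings internally; as an illustration of the prefix convention only (the helper below is hypothetical, not part of eval_lib), a spec without a `provider:` prefix falls back to OpenAI:

```python
def parse_model(spec: str, default_provider: str = "openai") -> tuple[str, str]:
    """Split 'provider:model' into a (provider, model) pair.

    Hypothetical sketch: specs without a prefix fall back to the
    default provider, matching the convention described above.
    """
    provider, sep, model = spec.partition(":")
    if not sep:  # no ':' found — bare model name
        return default_provider, spec
    return provider, model

print(parse_model("gpt-4o"))        # ('openai', 'gpt-4o')
print(parse_model("ollama:llama3"))  # ('ollama', 'llama3')
```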
You can also use LLMDescriptor for explicit configuration:
```python
from eval_lib import LLMDescriptor, Provider

model = LLMDescriptor(provider=Provider.OPENAI, model="gpt-4o")
model = LLMDescriptor(provider=Provider.ANTHROPIC, model="claude-3-5-sonnet-latest")
```
Metric Configuration¶
Every metric accepts these common parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | — | LLM model to use for evaluation |
| threshold | float | varies | Minimum score to pass (0.0-1.0) |
| verbose | bool | False | Enable detailed logging |
Default Thresholds¶
| Metric | Default Threshold |
|---|---|
| Answer Relevancy | 0.6 |
| Answer Precision | 0.6 |
| Faithfulness | 0.7 |
| Contextual Relevancy | 0.6 |
| Contextual Precision | 0.7 |
| Contextual Recall | 0.7 |
| Bias Detection | 0.8 |
| Toxicity Detection | 0.7 |
| Tool Correctness | 0.5 |
| Task Success Rate | 0.7 |
| Role Adherence | 0.7 |
| Knowledge Retention | 0.7 |
| Tool Error Detection | 0.7 |
| All Security Metrics | 0.7 |
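eval_lib applies these thresholds internally when scoring; as an illustration only (the dictionary and helper below are hypothetical, not library APIs), the pass/fail rule is simply `score >= threshold`:

```python
# Hypothetical sketch of the pass/fail rule, using defaults from the table.
DEFAULT_THRESHOLDS = {
    "answer_relevancy": 0.6,
    "faithfulness": 0.7,
    "bias_detection": 0.8,
}

def passes(metric: str, score: float) -> bool:
    """Return True when the score meets the metric's default threshold."""
    return score >= DEFAULT_THRESHOLDS[metric]

print(passes("faithfulness", 0.75))      # True  (0.75 >= 0.7)
print(passes("answer_relevancy", 0.55))  # False (0.55 < 0.6)
```

Overriding `threshold` when constructing a metric replaces the default in this comparison.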
Temperature Parameter¶
Metrics that use verdict aggregation accept a temperature parameter that controls scoring strictness:
```python
# Strict scoring — penalizes poor verdicts heavily
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=0.1)

# Balanced scoring (default for most metrics)
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=0.5)

# Lenient scoring — forgiving of weak verdicts
metric = AnswerRelevancyMetric(model="gpt-4o", threshold=0.7, temperature=1.0)
```
See Score Aggregation for details on how temperature affects scoring.
Evaluation Options¶
The evaluate() function accepts these options:
```python
results = await evaluate(
    test_cases=test_cases,    # List[EvalTestCase]
    metrics=metrics,          # List[MetricPattern]
    verbose=True,             # Show progress in console
    show_dashboard=False,     # Open dashboard after evaluation
    session_name="my-run",    # Name for dashboard session
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| test_cases | List[EvalTestCase] | required | Test cases to evaluate |
| metrics | List[MetricPattern] | required | Metrics to apply |
| verbose | bool | True | Show console progress |
| show_dashboard | bool | False | Open dashboard in browser |
| session_name | str | None | Session name for caching |
Test Case Schema¶
EvalTestCase¶
```python
from eval_lib import EvalTestCase

test_case = EvalTestCase(
    input="User's question",                     # Required
    actual_output="AI's response",               # Required
    expected_output="Reference answer",          # Optional
    retrieval_context=["context chunk 1", ...],  # Optional
    tools_called=["tool1", "tool2"],             # Optional (for agent metrics)
    expected_tools=["tool1", "tool2"],           # Optional (for agent metrics)
    reasoning="Chain of thought",                # Optional
    name="Test case label",                      # Optional
)
```
ConversationalEvalTestCase¶
```python
from eval_lib import ConversationalEvalTestCase, EvalTestCase

conversation = ConversationalEvalTestCase(
    chatbot_role="System prompt / role description",  # Optional
    name="Conversation label",                        # Optional
    turns=[
        EvalTestCase(input="...", actual_output="..."),
        EvalTestCase(input="...", actual_output="..."),
    ],
)
```
Which fields are needed?
Different metrics require different fields. See each metric's documentation for its specific requirements.