Custom Metrics¶
The Custom Evaluation Metric lets you define your own evaluation criteria without writing metric code. It uses LLM-based verdict generation with the same temperature-controlled aggregation as built-in metrics.
How It Works¶
- Criteria Processing — if you provide a high-level description, the LLM auto-generates specific evaluation criteria
- Verdict Generation — evaluates the output against each criterion using the 5-level verdict scale
- Score Aggregation — combines verdicts using temperature-controlled softmax
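The aggregation in step 3 can be sketched as follows. This is an illustrative stand-in, not the library's actual implementation: it assumes verdict weights are combined with a softmax-weighted average in which lower temperatures shift mass toward the worst verdicts, consistent with the low-temperature "strict" usage shown later on this page.

```python
import math

# 5-level verdict weights from the Scoring section below
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def aggregate(verdicts: list[str], temperature: float = 0.8) -> float:
    """Softmax-weighted average of verdict weights (illustrative sketch).

    Lower temperature -> worst verdicts dominate (stricter score);
    higher temperature -> closer to a plain average.
    """
    weights = [VERDICT_WEIGHTS[v] for v in verdicts]
    # exp(-w / T): low-weight verdicts receive more softmax mass as T shrinks
    exps = [math.exp(-w / temperature) for w in weights]
    return sum(w * e for w, e in zip(weights, exps)) / sum(exps)

strict = aggregate(["fully", "fully", "partial"], temperature=0.2)
lenient = aggregate(["fully", "fully", "partial"], temperature=0.8)
print(f"strict={strict:.3f} lenient={lenient:.3f}")  # strict < lenient
```

With identical verdicts, the sketch shows why the same output can pass at the default temperature of 0.8 yet fail at a stricter setting like 0.2.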
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient instance) |
| threshold | float | required | Minimum score to pass |
| name | str | required | Name for the metric |
| criteria | str | required | What to evaluate (high-level description) |
| evaluation_steps | list[str] | None | Specific evaluation steps (auto-generated from criteria if not provided) |
| temperature | float | 0.8 | Aggregation strictness (lower is stricter) |
Usage¶
Simple — Auto-Generated Steps¶
```python
from eval_lib import CustomEvalMetric, EvalTestCase, evaluate
import asyncio

metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="TechnicalAccuracy",
    criteria="Evaluate the technical accuracy and depth of the AI's explanation of programming concepts."
)

test_case = EvalTestCase(
    input="Explain how garbage collection works in Python.",
    actual_output="Python uses reference counting as its primary garbage collection mechanism. Each object maintains a count of references pointing to it. When the count drops to zero, the memory is freed. Python also has a generational garbage collector to handle circular references."
)

results = asyncio.run(evaluate([test_case], [metric]))
```
Advanced — Custom Steps¶
```python
metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="CustomerServiceQuality",
    criteria="Evaluate the quality of customer service responses",
    evaluation_steps=[
        "Check if the response acknowledges the customer's issue",
        "Verify that a solution or next step is provided",
        "Assess the tone for empathy and professionalism",
        "Check if the response is clear and free of jargon",
        "Verify the response is concise but complete"
    ],
    temperature=0.5
)
```
Domain-Specific Examples¶
```python
# Medical Information Quality
medical_metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.8,
    name="MedicalInfoQuality",
    criteria="Evaluate the accuracy, safety, and appropriateness of medical information. Check for disclaimers about consulting healthcare professionals.",
    temperature=0.2  # Strict for safety
)

# Code Review Quality
code_review_metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="CodeReviewQuality",
    criteria="Evaluate the quality of code review feedback: completeness, actionability, and identification of potential bugs or improvements."
)

# Educational Content
education_metric = CustomEvalMetric(
    model="anthropic:claude-3-5-sonnet-latest",
    threshold=0.7,
    name="EducationalClarity",
    criteria="Evaluate educational content for clarity, progressive complexity, use of examples, and engagement."
)
```
Scoring¶
The standard 5-level verdict system is applied to each evaluation criterion:
| Verdict | Weight |
|---|---|
| fully | 1.0 |
| mostly | 0.9 |
| partial | 0.7 |
| minor | 0.3 |
| none | 0.0 |
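As a quick sanity check of the weights table, here is a worked example using a plain unweighted mean. The library's actual score uses the temperature-controlled softmax aggregation described above, so real scores skew lower at strict temperatures:

```python
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

# Four criteria received these verdicts from the LLM
verdicts = ["fully", "mostly", "partial", "minor"]
weights = [VERDICT_WEIGHTS[v] for v in verdicts]

score = sum(weights) / len(weights)  # (1.0 + 0.9 + 0.7 + 0.3) / 4 = 0.725
print(score >= 0.7)  # a 0.7 threshold would pass under this simple mean
```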
Cost¶
1-2 LLM API calls per evaluation: 1 when evaluation_steps is provided, 2 when steps are auto-generated from criteria.