Custom Metrics

The Custom Evaluation Metric (CustomEvalMetric) lets you define your own evaluation criteria without writing any metric code. It uses LLM-based verdict generation with the same temperature-controlled aggregation as the built-in metrics.

How It Works

  1. Criteria Processing — if you provide a high-level description, the LLM auto-generates specific evaluation criteria
  2. Verdict Generation — evaluates the output against each criterion using the 5-level verdict scale
  3. Score Aggregation — combines verdicts using temperature-controlled softmax
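In code, the three stages can be sketched roughly as follows. The function name, the stubbed judge, and the auto-generated step are all illustrative, not eval_lib internals; the verdict labels and weights come from the Scoring section below.

```python
# Illustrative sketch of the three stages. The real metric makes LLM
# calls for steps 1 and 2; both are stubbed here.
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def run_custom_metric(criteria, actual_output, evaluation_steps=None, judge=None):
    # 1. Criteria processing: auto-generate steps when none are given
    #    (an LLM call in the real metric; a template here).
    steps = evaluation_steps or [f"Assess the output against: {criteria}"]
    # 2. Verdict generation: one 5-level verdict per step (stubbed judge).
    judge = judge or (lambda step, output: "fully")
    verdicts = [judge(step, actual_output) for step in steps]
    # 3. Score aggregation: the real metric uses temperature-controlled
    #    softmax; an unweighted mean stands in here.
    weights = [VERDICT_WEIGHTS[v] for v in verdicts]
    return sum(weights) / len(weights)
```

With a judge that answers "mostly" for every step, the stand-in mean yields 0.9; the real softmax aggregation would pull the score lower at low temperatures.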

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | required | Minimum score to pass |
| name | str | required | Name for the metric |
| criteria | str | required | What to evaluate (high-level description) |
| evaluation_steps | list[str] | None | Specific evaluation steps (auto-generated from criteria if not provided) |
| temperature | float | 0.8 | Aggregation strictness; lower values penalize weak verdicts more heavily |

Usage

Simple — Auto-Generated Steps

from eval_lib import CustomEvalMetric, EvalTestCase, evaluate
import asyncio

metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="TechnicalAccuracy",
    criteria="Evaluate the technical accuracy and depth of the AI's explanation of programming concepts."
)

test_case = EvalTestCase(
    input="Explain how garbage collection works in Python.",
    actual_output="Python uses reference counting as its primary garbage collection mechanism. Each object maintains a count of references pointing to it. When the count drops to zero, the memory is freed. Python also has a generational garbage collector to handle circular references."
)

results = asyncio.run(evaluate([test_case], [metric]))

Advanced — Custom Steps

metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="CustomerServiceQuality",
    criteria="Evaluate the quality of customer service responses",
    evaluation_steps=[
        "Check if the response acknowledges the customer's issue",
        "Verify that a solution or next step is provided",
        "Assess the tone for empathy and professionalism",
        "Check if the response is clear and free of jargon",
        "Verify the response is concise but complete"
    ],
    temperature=0.5
)

Domain-Specific Examples

# Medical Information Quality
medical_metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.8,
    name="MedicalInfoQuality",
    criteria="Evaluate the accuracy, safety, and appropriateness of medical information. Check for disclaimers about consulting healthcare professionals.",
    temperature=0.2  # Strict for safety
)

# Code Review Quality
code_review_metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="CodeReviewQuality",
    criteria="Evaluate the quality of code review feedback: completeness, actionability, and identification of potential bugs or improvements."
)

# Educational Content
education_metric = CustomEvalMetric(
    model="anthropic:claude-3-5-sonnet-latest",
    threshold=0.7,
    name="EducationalClarity",
    criteria="Evaluate educational content for clarity, progressive complexity, use of examples, and engagement."
)

Scoring

The standard 5-level verdict system is applied to each evaluation criterion:

| Verdict | Weight |
|---|---|
| fully | 1.0 |
| mostly | 0.9 |
| partial | 0.7 |
| minor | 0.3 |
| none | 0.0 |
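The temperature parameter controls how sharply the aggregation penalizes weak verdicts. The exact formula is internal to eval_lib; the sketch below shows one plausible temperature-controlled softmax, in which lower temperatures shift influence toward the worst verdicts and so produce a stricter score.

```python
import math

# Assumed verdict-to-weight mapping from the table above.
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def aggregate(verdicts, temperature=0.8):
    """One plausible temperature-controlled softmax aggregation (an
    assumption, not eval_lib's confirmed implementation)."""
    weights = [VERDICT_WEIGHTS[v] for v in verdicts]
    # Softmax over the negative weights: worse verdicts get larger
    # coefficients, and lower temperatures sharpen that effect.
    exps = [math.exp(-w / temperature) for w in weights]
    total = sum(exps)
    return sum(w * e / total for w, e in zip(weights, exps))

verdicts = ["fully", "mostly", "partial"]
print(f"lenient (T=0.8): {aggregate(verdicts, 0.8):.3f}")  # ~0.847
print(f"strict  (T=0.2): {aggregate(verdicts, 0.2):.3f}")  # ~0.788
```

Under this formulation, three "fully" verdicts still score 1.0 at any temperature, while any mixed set of verdicts scores lower as the temperature drops.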

Cost

1-2 LLM API calls per evaluation (1 if steps are provided, 2 if auto-generated).
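The rule above can be stated directly in code; the helper name is illustrative:

```python
def llm_calls_per_evaluation(evaluation_steps=None):
    # One verdict-generation call always; one extra call only when the
    # evaluation steps must first be auto-generated from the criteria.
    return 1 if evaluation_steps else 2
```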