Custom Metrics

The Custom Evaluation Metric (CustomEvalMetric) lets you define your own evaluation criteria without writing any metric code. It uses LLM-based verdict generation with the same temperature-controlled aggregation as the built-in metrics.

How It Works

  1. Criteria Processing — if you provide a high-level description, the LLM auto-generates specific evaluation criteria
  2. Verdict Generation — evaluates the output against each criterion using the 5-level verdict scale
  3. Score Aggregation — combines verdicts using temperature-controlled softmax
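In code, the three stages can be sketched roughly as follows. The function name, the stubbed judge, and the auto-generated step are all illustrative, not eval_lib internals; the verdict labels and weights come from the Scoring section below.

```python
# Illustrative sketch of the three stages. The real metric makes LLM
# calls for steps 1 and 2; both are stubbed here.
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def run_custom_metric(criteria, actual_output, evaluation_steps=None, judge=None):
    # 1. Criteria processing: auto-generate steps when none are given
    #    (an LLM call in the real metric; a template here).
    steps = evaluation_steps or [f"Assess the output against: {criteria}"]
    # 2. Verdict generation: one 5-level verdict per step (stubbed judge).
    judge = judge or (lambda step, output: "fully")
    verdicts = [judge(step, actual_output) for step in steps]
    # 3. Score aggregation: the real metric uses temperature-controlled
    #    softmax; an unweighted mean stands in here.
    weights = [VERDICT_WEIGHTS[v] for v in verdicts]
    return sum(weights) / len(weights)
```

With a judge that answers "mostly" for every step, the stand-in mean yields 0.9; the real softmax aggregation would pull the score lower at low temperatures.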

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or a CustomLLMClient) |
| threshold | float | required | Minimum score to pass |
| name | str | required | Name for the metric |
| criteria | str | required | What to evaluate (high-level description) |
| evaluation_steps | list[str] | None | Specific evaluation steps (auto-generated from criteria if not provided) |
| temperature | float | 0.8 | Aggregation strictness; lower values penalize weak verdicts more heavily |

Usage

Simple — Auto-Generated Steps

from eval_lib import CustomEvalMetric, EvalTestCase, evaluate
import asyncio

metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="TechnicalAccuracy",
    criteria="Evaluate the technical accuracy and depth of the AI's explanation of programming concepts."
)

test_case = EvalTestCase(
    input="Explain how garbage collection works in Python.",
    actual_output="Python uses reference counting as its primary garbage collection mechanism. Each object maintains a count of references pointing to it. When the count drops to zero, the memory is freed. Python also has a generational garbage collector to handle circular references."
)

results = asyncio.run(evaluate([test_case], [metric]))

Advanced — Custom Steps

metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="CustomerServiceQuality",
    criteria="Evaluate the quality of customer service responses",
    evaluation_steps=[
        "Check if the response acknowledges the customer's issue",
        "Verify that a solution or next step is provided",
        "Assess the tone for empathy and professionalism",
        "Check if the response is clear and free of jargon",
        "Verify the response is concise but complete"
    ],
    temperature=0.5
)

Domain-Specific Examples

# Medical Information Quality
medical_metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.8,
    name="MedicalInfoQuality",
    criteria="Evaluate the accuracy, safety, and appropriateness of medical information. Check for disclaimers about consulting healthcare professionals.",
    temperature=0.2  # Strict for safety
)

# Code Review Quality
code_review_metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.7,
    name="CodeReviewQuality",
    criteria="Evaluate the quality of code review feedback: completeness, actionability, and identification of potential bugs or improvements."
)

# Educational Content
education_metric = CustomEvalMetric(
    model="anthropic:claude-3-5-sonnet-latest",
    threshold=0.7,
    name="EducationalClarity",
    criteria="Evaluate educational content for clarity, progressive complexity, use of examples, and engagement."
)

Scoring

The standard 5-level verdict system is applied to each evaluation criterion:

| Verdict | Weight |
|---|---|
| fully | 1.0 |
| mostly | 0.9 |
| partial | 0.7 |
| minor | 0.3 |
| none | 0.0 |
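The temperature parameter controls how sharply the aggregation penalizes weak verdicts. The exact formula is internal to eval_lib; the sketch below shows one plausible temperature-controlled softmax, in which lower temperatures shift influence toward the worst verdicts and so produce a stricter score.

```python
import math

# Assumed verdict-to-weight mapping from the table above.
VERDICT_WEIGHTS = {"fully": 1.0, "mostly": 0.9, "partial": 0.7, "minor": 0.3, "none": 0.0}

def aggregate(verdicts, temperature=0.8):
    """One plausible temperature-controlled softmax aggregation (an
    assumption, not eval_lib's confirmed implementation)."""
    weights = [VERDICT_WEIGHTS[v] for v in verdicts]
    # Softmax over the negative weights: worse verdicts get larger
    # coefficients, and lower temperatures sharpen that effect.
    exps = [math.exp(-w / temperature) for w in weights]
    total = sum(exps)
    return sum(w * e / total for w, e in zip(weights, exps))

verdicts = ["fully", "mostly", "partial"]
print(f"lenient (T=0.8): {aggregate(verdicts, 0.8):.3f}")  # ~0.847
print(f"strict  (T=0.2): {aggregate(verdicts, 0.2):.3f}")  # ~0.788
```

Under this formulation, three "fully" verdicts still score 1.0 at any temperature, while any mixed set of verdicts scores lower as the temperature drops.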

Cost

1-2 LLM API calls per evaluation (1 if steps are provided, 2 if auto-generated).
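The rule above can be stated directly in code; the helper name is illustrative:

```python
def llm_calls_per_evaluation(evaluation_steps=None):
    # One verdict-generation call always; one extra call only when the
    # evaluation steps must first be auto-generated from the criteria.
    return 1 if evaluation_steps else 2
```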