Score Aggregation¶
Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean — a novel approach for combining multiple verdict scores into a single metric score.
The Problem¶
When evaluating AI outputs, individual statements receive different verdict levels (fully, mostly, partial, minor, none). Simply averaging these can miss important nuances — a single hallucination in an otherwise perfect answer might be critical in healthcare but acceptable in casual conversation.
The Solution: Generalized Power Mean¶
The score_agg() function uses a generalized power mean with temperature-controlled strictness:
\[ M_p(x_1, ..., x_n) = \left(\frac{1}{n} \sum_{i=1}^{n} x_i^p\right)^{1/p} \]
Where p is derived from the temperature parameter.
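As a minimal sketch of the math (not the library's implementation), the generalized power mean itself can be computed directly for a given `p`; the `eps` guard for zeros under negative `p` mirrors the `eps_for_neg_p` parameter described below:

```python
import math

def power_mean(scores, p, eps=1e-9):
    """Generalized power mean M_p; eps guards zeros when p < 0."""
    if p < 0:
        scores = [max(s, eps) for s in scores]
    if p == 0:
        # Limit case p -> 0 is the geometric mean
        return math.exp(sum(math.log(max(s, eps)) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1 / p)

verdicts = [1.0, 0.9, 0.7, 0.0]
power_mean(verdicts, p=-8.0)   # dragged toward the minimum
power_mean(verdicts, p=1.0)    # arithmetic mean: 0.65
power_mean(verdicts, p=12.25)  # pulled toward the maximum
```

Negative `p` makes low scores dominate the mean, while large positive `p` lets high scores dominate, which is exactly what the temperature mapping below exploits.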
Temperature Mapping¶
| Temperature | Power (p) | Behavior | Use Case |
|---|---|---|---|
| 0.1 | -8.0 | Close to minimum | Safety-critical (medical, legal) |
| 0.3 | -2.5 | Below arithmetic mean | Important accuracy |
| 0.5 | 1.0 | Arithmetic mean | General evaluation |
| 0.7 | 4.6 | Above arithmetic mean | Lenient assessment |
| 1.0 | 12.25 | Close to maximum | Creative tasks |
How It Works¶
```python
from eval_lib import score_agg

verdicts = [1.0, 0.9, 0.7, 0.0]  # fully, mostly, partial, none

# Strict: penalizes the "none" heavily
score_agg(verdicts, temperature=0.1)  # ≈ 0.05

# Balanced: arithmetic mean
score_agg(verdicts, temperature=0.5)  # ≈ 0.62

# Lenient: forgiving of the "none"
score_agg(verdicts, temperature=1.0)  # ≈ 0.95
```
Parameters¶
```python
def score_agg(
    scores: list[float],
    temperature: float = 0.5,
    penalty: float = 0.1,
    eps_for_neg_p: float = 1e-9
) -> float:
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `scores` | `list[float]` | required | Verdict weights (0.0–1.0) |
| `temperature` | `float` | `0.5` | Strictness (0.1 = strict, 1.0 = lenient) |
| `penalty` | `float` | `0.1` | Additional penalty for zero scores |
| `eps_for_neg_p` | `float` | `1e-9` | Small value substituted for zeros to avoid division by zero when p is negative |
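The exact temperature-to-p mapping and penalty rule are internal to the library. The sketch below is one plausible reading, not the actual implementation: it linearly interpolates `p` between the anchor points of the temperature table above and subtracts `penalty` scaled by the share of zero scores. All names and the penalty formula here are assumptions.

```python
import math

# Anchor points (temperature, p) taken from the temperature mapping table
_T_TO_P = [(0.1, -8.0), (0.3, -2.5), (0.5, 1.0), (0.7, 4.6), (1.0, 12.25)]

def _p_from_temperature(t: float) -> float:
    """Linearly interpolate p between the documented anchor points."""
    if t <= _T_TO_P[0][0]:
        return _T_TO_P[0][1]
    for (t0, p0), (t1, p1) in zip(_T_TO_P, _T_TO_P[1:]):
        if t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    return _T_TO_P[-1][1]

def score_agg_sketch(scores, temperature=0.5, penalty=0.1, eps_for_neg_p=1e-9):
    p = _p_from_temperature(temperature)
    xs = [max(s, eps_for_neg_p) for s in scores] if p < 0 else list(scores)
    if abs(p) < 1e-12:  # limit p -> 0 is the geometric mean
        mean = math.exp(sum(math.log(max(x, eps_for_neg_p)) for x in xs) / len(xs))
    else:
        mean = (sum(x ** p for x in xs) / len(xs)) ** (1 / p)
    # Hypothetical penalty rule: subtract `penalty` scaled by the share of zeros
    zeros = sum(1 for s in scores if s == 0.0)
    return max(0.0, mean - penalty * zeros / len(scores))

score_agg_sketch([1.0, 0.9, 0.7, 0.0], temperature=0.5)  # ≈ 0.62
```

At `temperature=0.5` this reproduces the documented ≈ 0.62 (arithmetic mean 0.65 minus a quarter-weighted penalty of 0.025); at other temperatures the library's exact values will differ from this sketch.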
Verdict Weights¶
All metrics using verdict aggregation share this standard mapping:
| Verdict | Weight |
|---|---|
| `fully` | 1.0 |
| `mostly` | 0.9 |
| `partial` | 0.7 |
| `minor` | 0.3 |
| `none` | 0.0 |
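This mapping is a plain lookup; converting a list of verdict labels into scores for aggregation might look like the following (function name is illustrative, not the library's API):

```python
# Standard verdict-to-weight mapping from the table above
VERDICT_WEIGHTS = {
    "fully": 1.0,
    "mostly": 0.9,
    "partial": 0.7,
    "minor": 0.3,
    "none": 0.0,
}

def verdicts_to_scores(verdicts):
    """Map verdict labels to their standard weights."""
    return [VERDICT_WEIGHTS[v] for v in verdicts]

verdicts_to_scores(["fully", "mostly", "partial", "none"])
# → [1.0, 0.9, 0.7, 0.0]
```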
Visual Comparison¶
```
Verdicts: [1.0, 0.9, 0.7, 0.3, 0.0]

t=0.1 (strict):   ████░░░░░░░░░░░░ 0.08
t=0.3:            ████████░░░░░░░░ 0.45
t=0.5 (balanced): █████████░░░░░░░ 0.58
t=0.7:            ███████████░░░░░ 0.72
t=1.0 (lenient):  ██████████████░░ 0.92
```
Practical Guidelines¶
Use Strict Temperature (0.1-0.3)¶
- Medical/healthcare AI — any hallucination is dangerous
- Legal document analysis — accuracy is critical
- Financial calculations — errors have real consequences
- Security evaluations — any vulnerability matters
Use Balanced Temperature (0.4-0.6)¶
- General Q&A systems
- Customer support bots
- Educational content
- Standard RAG evaluations
Use Lenient Temperature (0.7-1.0)¶
- Creative writing assistants
- Brainstorming tools
- Casual conversation bots
- Exploratory search
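If you want these bands as a single source of truth in your own evaluation config, a tiny helper (purely illustrative, not part of the library) could encode them:

```python
# Guideline temperature bands, keyed by use-case category (names are ours)
GUIDELINE_TEMPERATURES = {
    "safety_critical": 0.1,  # medical, legal, financial, security
    "general": 0.5,          # Q&A, support bots, RAG evaluations
    "creative": 0.9,         # writing assistants, brainstorming
}

def pick_temperature(use_case: str) -> float:
    """Return a guideline temperature, defaulting to balanced."""
    return GUIDELINE_TEMPERATURES.get(use_case, 0.5)
```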
Using Temperature in Metrics¶
```python
# Strict evaluation for medical Q&A
metric = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.9,
    temperature=0.1  # Any unfaithful statement tanks the score
)

# Balanced evaluation for general assistant
metric = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5  # Standard arithmetic mean
)

# Lenient evaluation for creative writing
metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.6,
    name="CreativityScore",
    criteria="Evaluate creative writing quality",
    temperature=0.8  # Forgiving of individual weak areas
)
```