
Score Aggregation

Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean — a novel approach for combining multiple verdict scores into a single metric score.

The Problem

When evaluating AI outputs, individual statements receive different verdict levels (fully, mostly, partial, minor, none). Simply averaging these can miss important nuances — a single hallucination in an otherwise perfect answer might be critical in healthcare but acceptable in casual conversation.

The Solution: Generalized Power Mean

The score_agg() function uses a generalized power mean with temperature-controlled strictness:

\[ M_p(x_1, ..., x_n) = \left(\frac{1}{n} \sum_{i=1}^{n} x_i^p\right)^{1/p} \]

Where p is derived from the temperature parameter.
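
The mean itself is only a few lines of Python. The sketch below is a standalone illustration (`power_mean` is not a library export) showing how the exponent p shifts the result between the minimum and maximum of the inputs:

```python
import math

def power_mean(xs: list[float], p: float) -> float:
    """Generalized power mean M_p of positive values xs."""
    if p == 0:
        # Limit case p -> 0 is the geometric mean
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

scores = [1.0, 0.9, 0.7, 0.3]
print(power_mean(scores, -8.0))   # pulled toward min(scores) = 0.3
print(power_mean(scores, 1.0))    # arithmetic mean = 0.725
print(power_mean(scores, 12.25))  # pulled toward max(scores) = 1.0
```

Large negative p makes one bad score dominate; large positive p lets the best scores dominate.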

Temperature Mapping

| Temperature | Power (p) | Behavior | Use Case |
|-------------|-----------|----------|----------|
| 0.1 | -8.0 | Close to minimum | Safety-critical (medical, legal) |
| 0.3 | -2.5 | Below arithmetic mean | Important accuracy |
| 0.5 | 1.0 | Arithmetic mean | General evaluation |
| 0.7 | 4.6 | Above arithmetic mean | Lenient assessment |
| 1.0 | 12.25 | Close to maximum | Creative tasks |
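
The exact temperature-to-p function is internal to the library. As a rough sketch, linearly interpolating between the anchor points in the table reproduces the documented values (`temp_to_power` is a hypothetical helper, not a library export):

```python
# Anchor points taken from the temperature mapping table above.
TEMP_TO_P = [(0.1, -8.0), (0.3, -2.5), (0.5, 1.0), (0.7, 4.6), (1.0, 12.25)]

def temp_to_power(t: float) -> float:
    """Piecewise-linear interpolation over the documented anchors."""
    ts = [pt[0] for pt in TEMP_TO_P]
    ps = [pt[1] for pt in TEMP_TO_P]
    if t <= ts[0]:
        return ps[0]
    if t >= ts[-1]:
        return ps[-1]
    for (t0, p0), (t1, p1) in zip(TEMP_TO_P, TEMP_TO_P[1:]):
        if t0 <= t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)

print(temp_to_power(0.5))  # 1.0 (arithmetic mean)
print(temp_to_power(0.4))  # -0.75, midway between -2.5 and 1.0
```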

How It Works

```python
from eval_lib import score_agg

verdicts = [1.0, 0.9, 0.7, 0.0]  # fully, mostly, partial, none

# Strict: penalizes the "none" heavily
score_agg(verdicts, temperature=0.1)  # ≈ 0.05

# Balanced: arithmetic mean
score_agg(verdicts, temperature=0.5)  # ≈ 0.62

# Lenient: forgiving of the "none"
score_agg(verdicts, temperature=1.0)  # ≈ 0.95
```

Parameters

```python
def score_agg(
    scores: list[float],
    temperature: float = 0.5,
    penalty: float = 0.1,
    eps_for_neg_p: float = 1e-9
) -> float:
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `scores` | `list[float]` | required | Verdict weights (0.0-1.0) |
| `temperature` | `float` | 0.5 | Strictness (0.1 = strict, 1.0 = lenient) |
| `penalty` | `float` | 0.1 | Additional penalty for zero scores |
| `eps_for_neg_p` | `float` | 1e-9 | Small value to avoid division by zero |
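
One plausible reading of how `penalty` and `eps_for_neg_p` interact (an illustrative sketch, not the library's source; the exact internals may differ): for negative p, a zero score would make x^p undefined, so zeros are clamped to a tiny epsilon, and each zero additionally subtracts `penalty` from the result.

```python
def aggregate(scores, p, penalty=0.1, eps_for_neg_p=1e-9):
    """Illustrative power-mean aggregation with zero-score handling."""
    n_zeros = sum(1 for x in scores if x == 0.0)
    if p < 0:
        # 0 ** p is undefined for negative p, so clamp zeros to epsilon
        scores = [max(x, eps_for_neg_p) for x in scores]
    m = (sum(x ** p for x in scores) / len(scores)) ** (1 / p)
    # Each zero score subtracts an extra flat penalty, floored at 0
    return max(0.0, m - penalty * n_zeros)
```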

Verdict Weights

All metrics using verdict aggregation share this standard mapping:

| Verdict | Weight |
|---------|--------|
| fully | 1.0 |
| mostly | 0.9 |
| partial | 0.7 |
| minor | 0.3 |
| none | 0.0 |
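
In code this mapping is just a lookup table (`VERDICT_WEIGHTS` is an illustrative name, not a documented export):

```python
VERDICT_WEIGHTS = {
    "fully": 1.0,
    "mostly": 0.9,
    "partial": 0.7,
    "minor": 0.3,
    "none": 0.0,
}

# Convert LLM verdict labels to numeric weights for aggregation
verdicts = ["fully", "mostly", "partial", "none"]
weights = [VERDICT_WEIGHTS[v] for v in verdicts]
print(weights)  # [1.0, 0.9, 0.7, 0.0]
```

The resulting weight list is what gets passed to `score_agg()`.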

Visual Comparison

```
Verdicts: [1.0, 0.9, 0.7, 0.3, 0.0]

t=0.1 (strict):    ████░░░░░░░░░░░░  0.08
t=0.3:             ████████░░░░░░░░  0.45
t=0.5 (balanced):  █████████░░░░░░░  0.58
t=0.7:             ███████████░░░░░  0.72
t=1.0 (lenient):   ██████████████░░  0.92
```
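
Bars like these can be rendered in a few lines (an illustrative sketch; widths are rounded onto a 16-character scale, so they may differ slightly from the hand-drawn figure above):

```python
def bar(score: float, width: int = 16) -> str:
    """Render a score in [0, 1] as a filled/empty block bar."""
    filled = round(score * width)
    return "█" * filled + "░" * (width - filled)

for label, s in [("t=0.1", 0.08), ("t=0.3", 0.45), ("t=0.5", 0.58),
                 ("t=0.7", 0.72), ("t=1.0", 0.92)]:
    print(f"{label}: {bar(s)}  {s:.2f}")
```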

Practical Guidelines

Use Strict Temperature (0.1-0.3)

  • Medical/healthcare AI — any hallucination is dangerous
  • Legal document analysis — accuracy is critical
  • Financial calculations — errors have real consequences
  • Security evaluations — any vulnerability matters

Use Balanced Temperature (0.4-0.6)

  • General Q&A systems
  • Customer support bots
  • Educational content
  • Standard RAG evaluations

Use Lenient Temperature (0.7-1.0)

  • Creative writing assistants
  • Brainstorming tools
  • Casual conversation bots
  • Exploratory search

Using Temperature in Metrics

```python
# Strict evaluation for medical Q&A
metric = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.9,
    temperature=0.1  # Any unfaithful statement tanks the score
)

# Balanced evaluation for general assistant
metric = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5  # Standard arithmetic mean
)

# Lenient evaluation for creative writing
metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.6,
    name="CreativityScore",
    criteria="Evaluate creative writing quality",
    temperature=0.8  # Forgiving of individual weak areas
)
```