Score Aggregation¶
Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean — a novel approach for combining multiple verdict scores into a single metric score.
The Problem¶
When evaluating AI outputs, individual statements receive different verdict levels (fully, mostly, partial, minor, none). Simply averaging these can miss important nuances — a single hallucination in an otherwise perfect answer might be critical in healthcare but acceptable in casual conversation.
The Solution: Generalized Power Mean¶
The score_agg() function uses a generalized power mean with temperature-controlled strictness:
\[ M_p(x_1, ..., x_n) = \left(\frac{1}{n} \sum_{i=1}^{n} x_i^p\right)^{1/p} \]
Where p is derived from the temperature parameter.
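As a minimal sketch of the math (not the library's implementation), the generalized power mean itself can be computed directly for a given `p`; the `eps` guard for zeros under negative `p` mirrors the `eps_for_neg_p` parameter described below:

```python
import math

def power_mean(scores, p, eps=1e-9):
    """Generalized power mean M_p; eps guards zeros when p < 0."""
    if p < 0:
        scores = [max(s, eps) for s in scores]
    if p == 0:
        # Limit case p -> 0 is the geometric mean
        return math.exp(sum(math.log(max(s, eps)) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1 / p)

verdicts = [1.0, 0.9, 0.7, 0.0]
power_mean(verdicts, p=-8.0)   # dragged toward the minimum
power_mean(verdicts, p=1.0)    # arithmetic mean: 0.65
power_mean(verdicts, p=12.25)  # pulled toward the maximum
```

Negative `p` makes low scores dominate the mean, while large positive `p` lets high scores dominate, which is exactly what the temperature mapping below exploits.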
Temperature Mapping¶
| Temperature | Power (p) | Behavior | Use Case |
|---|---|---|---|
| 0.1 | -8.0 | Close to minimum | Safety-critical (medical, legal) |
| 0.3 | -2.5 | Below arithmetic mean | Important accuracy |
| 0.5 | 1.0 | Arithmetic mean | General evaluation |
| 0.7 | 4.6 | Above arithmetic mean | Lenient assessment |
| 1.0 | 12.25 | Close to maximum | Creative tasks |
How It Works¶
```python
from eval_lib import score_agg

verdicts = [1.0, 0.9, 0.7, 0.0]  # fully, mostly, partial, none

# Strict: penalizes the "none" heavily
score_agg(verdicts, temperature=0.1)  # ≈ 0.05

# Balanced: arithmetic mean
score_agg(verdicts, temperature=0.5)  # ≈ 0.62

# Lenient: forgiving of the "none"
score_agg(verdicts, temperature=1.0)  # ≈ 0.95
```
Parameters¶
```python
def score_agg(
    scores: list[float],
    temperature: float = 0.5,
    penalty: float = 0.1,
    eps_for_neg_p: float = 1e-9
) -> float:
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `scores` | `list[float]` | required | Verdict weights (0.0–1.0) |
| `temperature` | `float` | `0.5` | Strictness (0.1 = strict, 1.0 = lenient) |
| `penalty` | `float` | `0.1` | Additional penalty for zero scores |
| `eps_for_neg_p` | `float` | `1e-9` | Small value substituted for zeros to avoid division by zero when p is negative |
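The exact temperature-to-p mapping and penalty rule are internal to the library. The sketch below is one plausible reading, not the actual implementation: it linearly interpolates `p` between the anchor points of the temperature table above and subtracts `penalty` scaled by the share of zero scores. All names and the penalty formula here are assumptions.

```python
import math

# Anchor points (temperature, p) taken from the temperature mapping table
_T_TO_P = [(0.1, -8.0), (0.3, -2.5), (0.5, 1.0), (0.7, 4.6), (1.0, 12.25)]

def _p_from_temperature(t: float) -> float:
    """Linearly interpolate p between the documented anchor points."""
    if t <= _T_TO_P[0][0]:
        return _T_TO_P[0][1]
    for (t0, p0), (t1, p1) in zip(_T_TO_P, _T_TO_P[1:]):
        if t <= t1:
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)
    return _T_TO_P[-1][1]

def score_agg_sketch(scores, temperature=0.5, penalty=0.1, eps_for_neg_p=1e-9):
    p = _p_from_temperature(temperature)
    xs = [max(s, eps_for_neg_p) for s in scores] if p < 0 else list(scores)
    if abs(p) < 1e-12:  # limit p -> 0 is the geometric mean
        mean = math.exp(sum(math.log(max(x, eps_for_neg_p)) for x in xs) / len(xs))
    else:
        mean = (sum(x ** p for x in xs) / len(xs)) ** (1 / p)
    # Hypothetical penalty rule: subtract `penalty` scaled by the share of zeros
    zeros = sum(1 for s in scores if s == 0.0)
    return max(0.0, mean - penalty * zeros / len(scores))

score_agg_sketch([1.0, 0.9, 0.7, 0.0], temperature=0.5)  # ≈ 0.62
```

At `temperature=0.5` this reproduces the documented ≈ 0.62 (arithmetic mean 0.65 minus a quarter-weighted penalty of 0.025); at other temperatures the library's exact values will differ from this sketch.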
Verdict Weights¶
All metrics using verdict aggregation share this standard mapping:
| Verdict | Weight |
|---|---|
| `fully` | 1.0 |
| `mostly` | 0.9 |
| `partial` | 0.7 |
| `minor` | 0.3 |
| `none` | 0.0 |
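This mapping is a plain lookup; converting a list of verdict labels into scores for aggregation might look like the following (function name is illustrative, not the library's API):

```python
# Standard verdict-to-weight mapping from the table above
VERDICT_WEIGHTS = {
    "fully": 1.0,
    "mostly": 0.9,
    "partial": 0.7,
    "minor": 0.3,
    "none": 0.0,
}

def verdicts_to_scores(verdicts):
    """Map verdict labels to their standard weights."""
    return [VERDICT_WEIGHTS[v] for v in verdicts]

verdicts_to_scores(["fully", "mostly", "partial", "none"])
# → [1.0, 0.9, 0.7, 0.0]
```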
Visual Comparison¶
```
Verdicts: [1.0, 0.9, 0.7, 0.3, 0.0]

t=0.1 (strict):   ████░░░░░░░░░░░░ 0.08
t=0.3:            ████████░░░░░░░░ 0.45
t=0.5 (balanced): █████████░░░░░░░ 0.58
t=0.7:            ███████████░░░░░ 0.72
t=1.0 (lenient):  ██████████████░░ 0.92
```
Practical Guidelines¶
Use Strict Temperature (0.1-0.3)¶
- Medical/healthcare AI — any hallucination is dangerous
- Legal document analysis — accuracy is critical
- Financial calculations — errors have real consequences
- Security evaluations — any vulnerability matters
Use Balanced Temperature (0.4-0.6)¶
- General Q&A systems
- Customer support bots
- Educational content
- Standard RAG evaluations
Use Lenient Temperature (0.7-1.0)¶
- Creative writing assistants
- Brainstorming tools
- Casual conversation bots
- Exploratory search
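If you want these bands as a single source of truth in your own evaluation config, a tiny helper (purely illustrative, not part of the library) could encode them:

```python
# Guideline temperature bands, keyed by use-case category (names are ours)
GUIDELINE_TEMPERATURES = {
    "safety_critical": 0.1,  # medical, legal, financial, security
    "general": 0.5,          # Q&A, support bots, RAG evaluations
    "creative": 0.9,         # writing assistants, brainstorming
}

def pick_temperature(use_case: str) -> float:
    """Return a guideline temperature, defaulting to balanced."""
    return GUIDELINE_TEMPERATURES.get(use_case, 0.5)
```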
Using Temperature in Metrics¶
```python
# Strict evaluation for medical Q&A
metric = FaithfulnessMetric(
    model="gpt-4o",
    threshold=0.9,
    temperature=0.1  # Any unfaithful statement tanks the score
)

# Balanced evaluation for general assistant
metric = AnswerRelevancyMetric(
    model="gpt-4o",
    threshold=0.7,
    temperature=0.5  # Standard arithmetic mean
)

# Lenient evaluation for creative writing
metric = CustomEvalMetric(
    model="gpt-4o",
    threshold=0.6,
    name="CreativityScore",
    criteria="Evaluate creative writing quality",
    temperature=0.8  # Forgiving of individual weak areas
)
```