Answer Precision¶
The Answer Precision metric measures how precisely the actual output matches the expected output using purely algorithmic methods — no LLM calls required.
How It Works¶
The metric combines five sub-scores using a weighted power mean:
```mermaid
graph TD
    A[Actual Output] --> C[1. Exact Match]
    B[Expected Output] --> C
    A --> D[2. Character Similarity]
    B --> D
    A --> E[3. Token Precision]
    B --> E
    A --> F[4. Numeric Agreement]
    B --> F
    A --> G[5. Token Containment]
    B --> G
    C --> H[Weighted Power Mean]
    D --> H
    E --> H
    F --> H
    G --> H
    H --> I[Final Score 0.0-1.0]
```
Sub-Score Components¶
| Component | Weight | Description |
|---|---|---|
| Exact Match | 0.25 | Binary: 1.0 if outputs are identical (case-insensitive) |
| Character Similarity | 0.25 | SequenceMatcher ratio between strings |
| Token Precision | 0.20 | Overlap coefficient of token sets |
| Numeric Agreement | 0.15 | Agreement of numeric values (with tolerance) |
| Token Containment | 0.15 | How many expected tokens appear in actual output |
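The weights above can be combined as a weighted power mean. The sketch below is a minimal illustration of the standard formula, score = (Σ wᵢ · sᵢᵖ)^(1/p), using the weights from the table and the default exponent p = 0.3; the library's internal implementation may differ in details (e.g. edge-case handling for zero sub-scores).

```python
# Sketch of a weighted power mean, assuming the standard formula
# score = (sum_i w_i * s_i**p) ** (1/p). Weights follow the table
# above; p = 0.3 is the documented default for power_p.
def weighted_power_mean(scores, weights, p=0.3):
    # Each sub-score is raised to p, weighted, summed, then the
    # p-th root is taken. With p < 1 the mean leans toward the
    # lower sub-scores, penalizing any single weak component.
    total = sum(w * (s ** p) for s, w in zip(scores, weights))
    return total ** (1.0 / p)

weights = [0.25, 0.25, 0.20, 0.15, 0.15]

# Hypothetical sub-scores: no exact match, but high similarity.
sub_scores = [0.0, 0.8, 0.9, 1.0, 0.7]
print(round(weighted_power_mean(sub_scores, weights), 3))
```

Note how a failed exact match (0.0) lowers the result but does not zero it out, since the other components still contribute.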
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.6 | Minimum score to pass |
| token_stopwords | set | English stopwords | Words to ignore in token comparison |
| numeric_tolerance_abs | float | 0.01 | Absolute tolerance for numeric comparison |
| numeric_tolerance_rel | float | 0.05 | Relative tolerance for numeric comparison |
| power_p | float | 0.3 | Power mean exponent |
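The two numeric tolerances behave like the arguments to Python's `math.isclose`: values agree if they fall within either the absolute or the relative tolerance. A minimal sketch, assuming that semantics (the library's actual comparison logic may differ):

```python
import math

# Sketch of a numeric agreement check using the documented defaults:
# numeric_tolerance_abs = 0.01, numeric_tolerance_rel = 0.05.
# Assumes math.isclose semantics: values agree if
# |a - b| <= max(rel_tol * max(|a|, |b|), abs_tol).
def numbers_agree(a, b, abs_tol=0.01, rel_tol=0.05):
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

print(numbers_agree(100.0, 104.0))  # True: within 5% relative tolerance
print(numbers_agree(0.001, 0.005))  # True: within 0.01 absolute tolerance
print(numbers_agree(100.0, 120.0))  # False: outside both tolerances
```

The absolute tolerance matters for values near zero, where a relative tolerance alone would be overly strict.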
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | Yes |
> **Expected output required:** unlike most metrics, Answer Precision requires `expected_output`, since it compares actual vs. expected outputs directly.
Usage¶
```python
from eval_lib import AnswerPrecisionMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What is 2 + 2?",
    actual_output="The answer is 4.",
    expected_output="4"
)

metric = AnswerPrecisionMetric(threshold=0.6)
results = asyncio.run(evaluate([test_case], [metric]))
```
Cost¶
0 LLM API calls — this metric is entirely algorithmic, making it the fastest and cheapest metric available.
Best Use Cases¶
- Factual Q&A where answers have clear expected values
- Numeric outputs (financial calculations, statistics)
- Classification tasks where output should match a label
- Any evaluation where you need a fast, deterministic score