Answer Precision

The Answer Precision metric scores how closely the actual output matches the expected output using purely algorithmic methods, so no LLM calls are required.

How It Works

The metric combines five sub-scores using a weighted power mean:

graph TD
    A[Actual Output] --> C[1. Exact Match]
    B[Expected Output] --> C
    A --> D[2. Character Similarity]
    B --> D
    A --> E[3. Token Precision]
    B --> E
    A --> F[4. Numeric Agreement]
    B --> F
    A --> G[5. Token Containment]
    B --> G
    C --> H[Weighted Power Mean]
    D --> H
    E --> H
    F --> H
    G --> H
    H --> I[Final Score 0.0-1.0]
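
The aggregation step at the bottom of the diagram can be sketched as follows. This is a minimal illustration of a weighted power mean, not the library's actual implementation: the function name `weighted_power_mean` and the `WEIGHTS` constant are hypothetical, and the weights and default exponent are taken from the tables on this page.

```python
from typing import Sequence

# Weights as listed in the Sub-Score Components table, in diagram order
# (exact match, character similarity, token precision,
#  numeric agreement, token containment).
WEIGHTS = (0.25, 0.25, 0.20, 0.15, 0.15)


def weighted_power_mean(scores: Sequence[float],
                        weights: Sequence[float] = WEIGHTS,
                        p: float = 0.3) -> float:
    """Return (sum(w_i * s_i**p) / sum(w_i)) ** (1/p) for scores in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if p <= 0:
        # Assumption: this sketch only covers positive exponents.
        raise ValueError("this sketch only handles p > 0")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    powered = sum(w * (s ** p) for s, w in zip(scores, weights))
    return (powered / total) ** (1.0 / p)


# With p = 1 the result is the ordinary weighted average.
average = weighted_power_mean([1.0, 0.2, 1.0, 1.0, 1.0], p=1.0)
# With p = 0.3 the same sub-scores give a lower combined score.
combined = weighted_power_mean([1.0, 0.2, 1.0, 1.0, 1.0])
```

For any exponent below 1, the power mean is never larger than the weighted arithmetic mean, so a single weak sub-score lowers the final score more than a plain average would.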

Sub-Score Components

| Component | Weight | Description |
|---|---|---|
| Exact Match | 0.25 | Binary: 1.0 if the outputs are identical (case-insensitive), otherwise 0.0 |
| Character Similarity | 0.25 | SequenceMatcher ratio between the two strings |
| Token Precision | 0.20 | Overlap coefficient of the token sets |
| Numeric Agreement | 0.15 | Agreement of numeric values (within tolerance) |
| Token Containment | 0.15 | Share of expected tokens that appear in the actual output |
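
The string and token components above can be approximated with the Python standard library. The sketch below is illustrative only: the helper names, the regex-based `tokenize` function, the whitespace stripping in `exact_match`, and the choice to return 0.0 for empty token sets are all assumptions, not the library's confirmed behavior.

```python
import re
from difflib import SequenceMatcher


def tokenize(text: str, stopwords: frozenset = frozenset()) -> set:
    """Lowercase the text and return its set of word tokens (assumed tokenizer)."""
    return {t for t in re.findall(r"\w+", text.lower()) if t not in stopwords}


def exact_match(actual: str, expected: str) -> float:
    """1.0 if the strings are identical ignoring case (and outer whitespace)."""
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0


def character_similarity(actual: str, expected: str) -> float:
    """difflib's ratio: 2 * matched characters / total characters."""
    return SequenceMatcher(None, actual, expected).ratio()


def token_overlap(actual: str, expected: str,
                  stopwords: frozenset = frozenset()) -> float:
    """Overlap coefficient: |A & E| / min(|A|, |E|)."""
    a, e = tokenize(actual, stopwords), tokenize(expected, stopwords)
    if not a or not e:
        return 0.0  # assumption: empty sets score zero
    return len(a & e) / min(len(a), len(e))


def token_containment(actual: str, expected: str,
                      stopwords: frozenset = frozenset()) -> float:
    """Share of expected tokens that also occur in the actual output."""
    a, e = tokenize(actual, stopwords), tokenize(expected, stopwords)
    if not e:
        return 0.0  # assumption: nothing expected scores zero
    return len(a & e) / len(e)


actual, expected = "The answer is 4.", "4"
scores = {
    "exact_match": exact_match(actual, expected),                # 0.0
    "character_similarity": character_similarity(actual, expected),
    "token_overlap": token_overlap(actual, expected),            # 1.0
    "token_containment": token_containment(actual, expected),    # 1.0
}
```

In this example the verbose answer fails the exact match and has low character similarity (one matching character out of 17 in total), while both token-based scores are perfect because the single expected token appears in the actual output.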

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.6 | Minimum score to pass |
| token_stopwords | set | English stopwords | Words to ignore in token comparison |
| numeric_tolerance_abs | float | 0.01 | Absolute tolerance for numeric comparison |
| numeric_tolerance_rel | float | 0.05 | Relative tolerance for numeric comparison |
| power_p | float | 0.3 | Power mean exponent |
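
To illustrate how the two numeric tolerances might interact, here is a hedged sketch of a numeric agreement check. It assumes that two numbers agree when they fall within either tolerance, which is how `math.isclose` combines `rel_tol` and `abs_tol`; the library may combine them differently. The function names, the deliberately simple number regex, and the score for outputs without numbers are hypothetical.

```python
import math
import re

# Deliberately simple pattern: optional sign, digits, optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_numbers(text: str) -> list:
    """Return every number found in the text as a float."""
    return [float(m) for m in _NUMBER.findall(text)]


def numeric_agreement(actual: str, expected: str,
                      abs_tol: float = 0.01, rel_tol: float = 0.05) -> float:
    """Share of expected numbers that have a close match in the actual output."""
    expected_nums = extract_numbers(expected)
    if not expected_nums:
        return 1.0  # assumption: nothing numeric to disagree about
    actual_nums = extract_numbers(actual)
    matched = sum(
        1
        for e in expected_nums
        if any(math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
               for a in actual_nums)
    )
    return matched / len(expected_nums)


within = numeric_agreement("about 104", "100")   # 4 is inside the 5% band
outside = numeric_agreement("about 110", "100")  # 10 is outside the 5% band
```

Under these assumptions the relative tolerance governs large values, while the absolute tolerance keeps values near zero from being rejected over negligible differences.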

Required Fields

| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | Yes |

Expected output required

Unlike most metrics, Answer Precision requires expected_output since it compares actual vs. expected outputs directly.

Usage

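import asyncio

from eval_lib import AnswerPrecisionMetric, EvalTestCase, evaluate

# expected_output is mandatory for this metric.
test_case = EvalTestCase(
    input="What is 2 + 2?",
    actual_output="The answer is 4.",
    expected_output="4"
)

# The test case passes when the combined score reaches the threshold.
metric = AnswerPrecisionMetric(threshold=0.6)

# evaluate() is awaited here through asyncio.run.
results = asyncio.run(evaluate([test_case], [metric]))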

Cost

0 LLM API calls. This metric is entirely algorithmic, so it incurs no API cost and is among the fastest metrics to run.

Best Use Cases

  • Factual Q&A where answers have clear expected values
  • Numeric outputs (financial calculations, statistics)
  • Classification tasks where output should match a label
  • Any evaluation where you need a fast, deterministic score