Answer Precision¶
The Answer Precision metric measures how precisely the actual output matches the expected output using purely algorithmic methods — no LLM calls required.
How It Works¶
The metric combines five sub-scores using a weighted power mean:
```mermaid
graph TD
    A[Actual Output] --> C[1. Exact Match]
    B[Expected Output] --> C
    A --> D[2. Character Similarity]
    B --> D
    A --> E[3. Token Precision]
    B --> E
    A --> F[4. Numeric Agreement]
    B --> F
    A --> G[5. Token Containment]
    B --> G
    C --> H[Weighted Power Mean]
    D --> H
    E --> H
    F --> H
    G --> H
    H --> I[Final Score 0.0-1.0]
```
Sub-Score Components¶
| Component | Weight | Description |
|---|---|---|
| Exact Match | 0.25 | Binary: 1.0 if outputs are identical (case-insensitive) |
| Character Similarity | 0.25 | SequenceMatcher ratio between strings |
| Token Precision | 0.20 | Overlap coefficient of token sets |
| Numeric Agreement | 0.15 | Agreement of numeric values (with tolerance) |
| Token Containment | 0.15 | How many expected tokens appear in actual output |
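The weights above can be combined as a weighted power mean. The sketch below is a minimal illustration of the standard formula, score = (Σ wᵢ · sᵢᵖ)^(1/p), using the weights from the table and the default exponent p = 0.3; the library's internal implementation may differ in details (e.g. edge-case handling for zero sub-scores).

```python
# Sketch of a weighted power mean, assuming the standard formula
# score = (sum_i w_i * s_i**p) ** (1/p). Weights follow the table
# above; p = 0.3 is the documented default for power_p.
def weighted_power_mean(scores, weights, p=0.3):
    # Each sub-score is raised to p, weighted, summed, then the
    # p-th root is taken. With p < 1 the mean leans toward the
    # lower sub-scores, penalizing any single weak component.
    total = sum(w * (s ** p) for s, w in zip(scores, weights))
    return total ** (1.0 / p)

weights = [0.25, 0.25, 0.20, 0.15, 0.15]

# Hypothetical sub-scores: no exact match, but high similarity.
sub_scores = [0.0, 0.8, 0.9, 1.0, 0.7]
print(round(weighted_power_mean(sub_scores, weights), 3))
```

Note how a failed exact match (0.0) lowers the result but does not zero it out, since the other components still contribute.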
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | float | 0.6 | Minimum score to pass |
| token_stopwords | set | English stopwords | Words to ignore in token comparison |
| numeric_tolerance_abs | float | 0.01 | Absolute tolerance for numeric comparison |
| numeric_tolerance_rel | float | 0.05 | Relative tolerance for numeric comparison |
| power_p | float | 0.3 | Power mean exponent |
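The two numeric tolerances behave like the arguments to Python's `math.isclose`: values agree if they fall within either the absolute or the relative tolerance. A minimal sketch, assuming that semantics (the library's actual comparison logic may differ):

```python
import math

# Sketch of a numeric agreement check using the documented defaults:
# numeric_tolerance_abs = 0.01, numeric_tolerance_rel = 0.05.
# Assumes math.isclose semantics: values agree if
# |a - b| <= max(rel_tol * max(|a|, |b|), abs_tol).
def numbers_agree(a, b, abs_tol=0.01, rel_tol=0.05):
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)

print(numbers_agree(100.0, 104.0))  # True: within 5% relative tolerance
print(numbers_agree(0.001, 0.005))  # True: within 0.01 absolute tolerance
print(numbers_agree(100.0, 120.0))  # False: outside both tolerances
```

The absolute tolerance matters for values near zero, where a relative tolerance alone would be overly strict.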
Required Fields¶
| Field | Required |
|---|---|
| input | Yes |
| actual_output | Yes |
| expected_output | Yes |
> **Expected output required:** unlike most metrics, Answer Precision requires `expected_output`, since it compares actual vs. expected outputs directly.
Usage¶
```python
from eval_lib import AnswerPrecisionMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="What is 2 + 2?",
    actual_output="The answer is 4.",
    expected_output="4"
)

metric = AnswerPrecisionMetric(threshold=0.6)
results = asyncio.run(evaluate([test_case], [metric]))
```
Cost¶
0 LLM API calls — this metric is entirely algorithmic, making it the fastest and cheapest metric available.
Best Use Cases¶
- Factual Q&A where answers have clear expected values
- Numeric outputs (financial calculations, statistics)
- Classification tasks where output should match a label
- Any evaluation where you need a fast, deterministic score