# Reference Match

The Reference Match metric computes the embedding similarity between `actual_output` and a list of reference texts. The final score is aggregated using either the maximum or the mean similarity across all references. No LLM is needed.
## How It Works
```mermaid
graph TD
    A[actual_output] --> B[1. Generate Embedding]
    C[references] --> D[2. Generate Embeddings]
    B --> E[3. Cosine Similarity per Reference]
    D --> E
    E --> F[4. Aggregate: max or mean]
    F --> G[Final Score 0.0-1.0]
```

- **Embed actual output** — converts `actual_output` to a vector representation
- **Embed references** — converts each reference text to a vector representation
- **Pairwise similarity** — computes cosine similarity between `actual_output` and each reference
- **Aggregation** — takes the `max` or `mean` of all similarity scores
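The steps above can be sketched with plain NumPy. The `reference_match_score` helper below is illustrative, not the library's actual implementation, and it operates on precomputed vectors rather than calling an embedding provider:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def reference_match_score(output_vec, reference_vecs, aggregation="max"):
    """Compute pairwise similarities, then aggregate with max or mean."""
    sims = [cosine_similarity(output_vec, ref) for ref in reference_vecs]
    return max(sims) if aggregation == "max" else sum(sims) / len(sims)

# Toy 3-dimensional "embeddings" for illustration only
out = np.array([1.0, 0.0, 0.0])
refs = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
print(reference_match_score(out, refs, "max"))   # 1.0
print(reference_match_score(out, refs, "mean"))  # 0.5
```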
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `threshold` | `float` | `0.7` | Minimum score to pass |
| `references` | `list[str]` | required | List of reference texts to compare against |
| `aggregation` | `str` | `"max"` | Aggregation strategy (`"max"` or `"mean"`) |
| `embedding_provider` | `str` | `"openai"` | Embedding provider (`"openai"` or `"local"`) |
| `model_name` | `str` | provider default | Embedding model name |
## Required Fields

| Field | Required |
|---|---|
| `actual_output` | Yes |
| `references` | Yes (via parameter) |
| `input` | No |
| `expected_output` | No |
## Usage

```python
import asyncio

from eval_lib import EvalTestCase, evaluate
from eval_lib.metrics.vector_metrics import ReferenceMatchMetric

test_case = EvalTestCase(
    actual_output="The capital of France is Paris, located along the Seine River."
)

metric = ReferenceMatchMetric(
    threshold=0.75,
    references=[
        "Paris is the capital and largest city of France.",
        "France's capital city is Paris.",
        "Paris, situated on the Seine, serves as France's capital."
    ],
    aggregation="max",
    embedding_provider="openai"
)

results = asyncio.run(evaluate([test_case], [metric]))
```
## Aggregation Strategies

### `"max"` (default)
Returns the highest similarity score across all references. Use this when any single matching reference is sufficient.
### `"mean"`
Returns the average similarity score across all references. Use this when the output should be consistent with all references.
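A quick numeric illustration of how the two strategies diverge. The similarity values are hypothetical, standing in for the cosine scores the metric would compute:

```python
# Similarity of actual_output to each of three references (made-up values)
sims = [0.92, 0.35, 0.40]

max_score = max(sims)               # 0.92: one strong match is enough to pass a 0.75 threshold
mean_score = sum(sims) / len(sims)  # ~0.557: weak matches drag the average below the threshold
print(max_score, round(mean_score, 3))
```

So with the same output and references, `"max"` rewards matching any one reference, while `"mean"` demands agreement with all of them.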
## Cost
1 embedding API call per evaluation (all texts are batched into a single request).
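Batching keeps the cost to one request: the actual output and all references go into a single input list, and the returned vectors are split back apart by position. A sketch of that bookkeeping, with the actual embedding call left out (the helper names are illustrative, not the library's API):

```python
def batch_texts(actual_output: str, references: list[str]) -> list[str]:
    # One request: actual_output first, then the references in order
    return [actual_output] + references

def split_embeddings(embeddings: list) -> tuple:
    # First vector belongs to actual_output; the rest map to references by position
    return embeddings[0], embeddings[1:]

batch = batch_texts(
    "Paris is the capital.",
    ["France's capital is Paris.", "Berlin is in Germany."],
)
print(len(batch))  # 3 texts, still a single embedding API call
```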
## Example Scenarios

### High Score (0.90+)

```python
metric = ReferenceMatchMetric(
    references=[
        "To reset your password, navigate to Settings.",
        "Go to Settings > Security to change your password."
    ],
    aggregation="max"
)

EvalTestCase(
    actual_output="Navigate to Settings > Security to reset your password."
)
# High similarity with at least one reference
```
### Low Score (< 0.5)

```python
metric = ReferenceMatchMetric(
    references=[
        "To reset your password, navigate to Settings.",
        "Go to Settings > Security to change your password."
    ],
    aggregation="mean"
)

EvalTestCase(
    actual_output="Our company was founded in 2020 and is based in San Francisco."
)
# No semantic overlap with any reference
```