Reference Match

The Reference Match metric computes the embedding similarity between actual_output and a list of reference texts. The final score is aggregated across all references using either the maximum or the mean similarity. No LLM calls are required.

How It Works

graph TD
    A[actual_output] --> B[1. Generate Embedding]
    C[references] --> D[2. Generate Embeddings]
    B --> E[3. Cosine Similarity per Reference]
    D --> E
    E --> F[4. Aggregate: max or mean]
    F --> G[Final Score 0.0-1.0]
  1. Embed actual output — converts actual_output to a vector representation
  2. Embed references — converts each reference text to a vector representation
  3. Pairwise similarity — computes cosine similarity between actual_output and each reference
  4. Aggregation — takes the max or mean of all similarity scores
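The steps above can be sketched in plain Python. This is an illustrative re-implementation, not the library's actual code; `cosine_similarity` and `reference_match_score` are hypothetical names, and the tiny three-dimensional vectors stand in for real embedding-model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (step 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def reference_match_score(output_vec, reference_vecs, aggregation="max"):
    """Score the output embedding against each reference embedding,
    then aggregate with max or mean (step 4)."""
    scores = [cosine_similarity(output_vec, ref) for ref in reference_vecs]
    return max(scores) if aggregation == "max" else sum(scores) / len(scores)

# Toy vectors in place of real embeddings (steps 1-2)
output = [0.9, 0.1, 0.0]
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(round(reference_match_score(output, refs, "max"), 3))   # ≈ 0.994
```

With `aggregation="mean"` the same inputs score roughly 0.55, since the low similarity to the second reference drags the average down.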

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `threshold` | `float` | `0.7` | Minimum score to pass |
| `references` | `list[str]` | required | List of reference texts to compare against |
| `aggregation` | `str` | `"max"` | Aggregation strategy (`"max"` or `"mean"`) |
| `embedding_provider` | `str` | `"openai"` | Embedding provider (`"openai"` or `"local"`) |
| `model_name` | `str` | provider default | Embedding model name |

Required Fields

| Field | Required |
|---|---|
| `actual_output` | Yes |
| `references` | Yes (via parameter) |
| `input` | No |
| `expected_output` | No |

Usage

from eval_lib.metrics.vector_metrics import ReferenceMatchMetric
from eval_lib import EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    actual_output="The capital of France is Paris, located along the Seine River."
)

metric = ReferenceMatchMetric(
    threshold=0.75,
    references=[
        "Paris is the capital and largest city of France.",
        "France's capital city is Paris.",
        "Paris, situated on the Seine, serves as France's capital."
    ],
    aggregation="max",
    embedding_provider="openai"
)

results = asyncio.run(evaluate([test_case], [metric]))

Aggregation Strategies

"max" (default)

Returns the highest similarity score across all references. Use this when any single matching reference is sufficient.

"mean"

Returns the average similarity score across all references. Use this when the output should be consistent with all references.
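The two strategies can diverge on the same per-reference scores. A toy illustration with made-up similarity values (not produced by any real embedding model):

```python
threshold = 0.75
scores = [0.92, 0.55, 0.60]  # hypothetical per-reference similarities

max_score = max(scores)                 # 0.92 -> passes the threshold
mean_score = sum(scores) / len(scores)  # 0.69 -> fails the threshold
print(max_score >= threshold, mean_score >= threshold)
```

Here `"max"` passes because one reference matches closely, while `"mean"` fails because the output is not consistent with all of them.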

Cost

1 embedding API call per evaluation (all texts are batched into a single request).

Example Scenarios

High Score (0.90+)

metric = ReferenceMatchMetric(
    references=[
        "To reset your password, navigate to Settings.",
        "Go to Settings > Security to change your password."
    ],
    aggregation="max"
)
EvalTestCase(
    actual_output="Navigate to Settings > Security to reset your password."
)
# High similarity with at least one reference

Low Score (< 0.5)

metric = ReferenceMatchMetric(
    references=[
        "To reset your password, navigate to Settings.",
        "Go to Settings > Security to change your password."
    ],
    aggregation="mean"
)
EvalTestCase(
    actual_output="Our company was founded in 2020 and is based in San Francisco."
)
# No semantic overlap with any reference