Reference Match

The Reference Match metric computes the embedding similarity between actual_output and a list of reference texts. The final score is aggregated across all references using either the maximum or the mean similarity. No LLM calls are required.

How It Works

graph TD
    A[actual_output] --> B[1. Generate Embedding]
    C[references] --> D[2. Generate Embeddings]
    B --> E[3. Cosine Similarity per Reference]
    D --> E
    E --> F[4. Aggregate: max or mean]
    F --> G[Final Score 0.0-1.0]
  1. Embed actual output — converts actual_output to a vector representation
  2. Embed references — converts each reference text to a vector representation
  3. Pairwise similarity — computes cosine similarity between actual_output and each reference
  4. Aggregation — takes the max or mean of all similarity scores
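The steps above can be sketched in plain Python. This is an illustrative re-implementation, not the library's actual code; `cosine_similarity` and `reference_match_score` are hypothetical names, and the tiny three-dimensional vectors stand in for real embedding-model output:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (step 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def reference_match_score(output_vec, reference_vecs, aggregation="max"):
    """Score the output embedding against each reference embedding,
    then aggregate with max or mean (step 4)."""
    scores = [cosine_similarity(output_vec, ref) for ref in reference_vecs]
    return max(scores) if aggregation == "max" else sum(scores) / len(scores)

# Toy vectors in place of real embeddings (steps 1-2)
output = [0.9, 0.1, 0.0]
refs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(round(reference_match_score(output, refs, "max"), 3))   # ≈ 0.994
```

With `aggregation="mean"` the same inputs score roughly 0.55, since the low similarity to the second reference drags the average down.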

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `threshold` | `float` | `0.7` | Minimum score to pass |
| `references` | `list[str]` | required | List of reference texts to compare against |
| `aggregation` | `str` | `"max"` | Aggregation strategy (`"max"` or `"mean"`) |
| `embedding_provider` | `str` | `"openai"` | Embedding provider (`"openai"` or `"local"`) |
| `model_name` | `str` | provider default | Embedding model name |

Required Fields

| Field | Required |
|---|---|
| `actual_output` | Yes |
| `references` | Yes (via parameter) |
| `input` | No |
| `expected_output` | No |

Usage

from eval_lib.metrics.vector_metrics import ReferenceMatchMetric
from eval_lib import EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    actual_output="The capital of France is Paris, located along the Seine River."
)

metric = ReferenceMatchMetric(
    threshold=0.75,
    references=[
        "Paris is the capital and largest city of France.",
        "France's capital city is Paris.",
        "Paris, situated on the Seine, serves as France's capital."
    ],
    aggregation="max",
    embedding_provider="openai"
)

results = asyncio.run(evaluate([test_case], [metric]))

Aggregation Strategies

"max" (default)

Returns the highest similarity score across all references. Use this when any single matching reference is sufficient.

"mean"

Returns the average similarity score across all references. Use this when the output should be consistent with all references.
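The two strategies can diverge on the same per-reference scores. A toy illustration with made-up similarity values (not produced by any real embedding model):

```python
threshold = 0.75
scores = [0.92, 0.55, 0.60]  # hypothetical per-reference similarities

max_score = max(scores)                 # 0.92 -> passes the threshold
mean_score = sum(scores) / len(scores)  # 0.69 -> fails the threshold
print(max_score >= threshold, mean_score >= threshold)
```

Here `"max"` passes because one reference matches closely, while `"mean"` fails because the output is not consistent with all of them.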

Cost

1 embedding API call per evaluation (all texts are batched into a single request).

Example Scenarios

High Score (0.90+)

metric = ReferenceMatchMetric(
    references=[
        "To reset your password, navigate to Settings.",
        "Go to Settings > Security to change your password."
    ],
    aggregation="max"
)
EvalTestCase(
    actual_output="Navigate to Settings > Security to reset your password."
)
# High similarity with at least one reference

Low Score (< 0.5)

metric = ReferenceMatchMetric(
    references=[
        "To reset your password, navigate to Settings.",
        "Go to Settings > Security to change your password."
    ],
    aggregation="mean"
)
EvalTestCase(
    actual_output="Our company was founded in 2020 and is based in San Francisco."
)
# No semantic overlap with any reference