G-Eval¶
G-Eval is an evaluation framework that uses Chain-of-Thought (CoT) reasoning and probability-weighted scoring to produce evaluations that align closely with human judgment. Based on the paper G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023).
How It Works¶
```mermaid
graph TD
    A[Criteria] --> B["1. Generate Eval Steps (CoT)"]
    B --> C[2. Sample N Evaluations]
    C --> D[3. Extract Probabilities]
    D --> E["4. Weighted Score: Σ p(s) × s"]
    E --> F[5. Explanation]
    F --> G["Final Score 0.0–1.0"]
```

- CoT Step Generation — automatically generates detailed evaluation steps from your criteria
- Multi-Sampling — samples N evaluations with high temperature for diversity
- Probability Weighting — uses token probabilities to weight each score
- Final Score — probability-weighted average across all samples
- Explanation — generates a human-readable explanation using the evaluation steps
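The weighted score in step 4 can be sketched as follows. This is an illustrative reimplementation, not the library's internal code: `weighted_score`, `sampled_scores`, and the 1–10 raw score scale are assumptions, with p(s) estimated from sample frequencies rather than raw token probabilities.

```python
from collections import Counter

def weighted_score(sampled_scores: list[int], max_score: int = 10) -> float:
    """Estimate p(s) from sample frequencies, then return sum p(s) * s,
    normalized to the 0.0-1.0 range."""
    counts = Counter(sampled_scores)
    n = len(sampled_scores)
    raw = sum(score * (count / n) for score, count in counts.items())
    return raw / max_score

# 20 high-temperature samples produce a spread of scores rather than one value
samples = [8] * 12 + [7] * 5 + [9] * 3
score = weighted_score(samples)  # (0.6*8 + 0.25*7 + 0.15*9) / 10
print(round(score, 4))           # 0.79
```

Because the final score averages over all N samples, a single outlier judgment shifts the result far less than it would in a single-shot evaluation.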
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | required | Minimum score to pass |
| criteria | str | required | What to evaluate |
| evaluation_steps | list[str] | None | Custom steps (auto-generated if not provided) |
| n_samples | int | 20 | Number of evaluation samples |
| sampling_temperature | float | 2.0 | Temperature for sampling (high = diverse) |
Usage¶
Basic Usage¶
```python
from eval_lib import GEval, EvalTestCase, evaluate
import asyncio

metric = GEval(
    model="gpt-4o",
    threshold=0.7,
    criteria="Evaluate the coherence and logical flow of the text.",
    n_samples=20
)

test_case = EvalTestCase(
    input="Write a summary of climate change.",
    actual_output=(
        "Climate change refers to long-term shifts in temperatures and "
        "weather patterns. Human activities have been the main driver since "
        "the 1800s, primarily due to burning fossil fuels. This produces "
        "greenhouse gases that trap heat in the atmosphere, leading to "
        "global warming."
    )
)

results = asyncio.run(evaluate([test_case], [metric]))
```
With Custom Steps¶
```python
metric = GEval(
    model="gpt-4o",
    threshold=0.7,
    criteria="Evaluate the quality of a product description for an e-commerce listing.",
    evaluation_steps=[
        "Check if key product features are mentioned",
        "Assess clarity and readability for a general audience",
        "Verify the description highlights unique selling points",
        "Check for appropriate length (not too short, not too long)",
        "Assess persuasiveness and call-to-action"
    ],
    n_samples=15,
    sampling_temperature=1.5
)
```
G-Eval vs Custom Metric¶
| Feature | G-Eval | Custom Metric |
|---|---|---|
| Scoring Method | Probability-weighted sampling | Verdict aggregation |
| Samples | N samples (default 20) | Single evaluation |
| Temperature | High (2.0) for diversity | Configurable (0.1-1.0) |
| Accuracy | Higher (statistical approach) | Good (verdict-based) |
| Cost | Higher (2 + N API calls) | Lower (1-2 API calls) |
| Best For | Critical evaluations, benchmarking | General-purpose, cost-sensitive |
Cost¶
`2 + n_samples` LLM API calls per evaluation. With default settings (`n_samples=20`), this is 22 API calls.

**Cost Optimization:** Reduce `n_samples` to 5–10 for faster, cheaper evaluations while maintaining reasonable accuracy. Use the full 20 samples for critical benchmarks.
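As a quick back-of-the-envelope check, the call count follows directly from the formula above. The helper name is hypothetical, and the breakdown of the two fixed calls (step generation plus explanation) is an assumption inferred from the pipeline diagram:

```python
def geval_api_calls(n_samples: int) -> int:
    # Assumed breakdown: 1 call to generate CoT steps, 1 call for the
    # explanation, plus one scoring call per sample.
    return 2 + n_samples

print(geval_api_calls(20))  # 22 (the default)
print(geval_api_calls(5))   # 7 (a cheaper configuration)
```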
Score Interpretation¶
G-Eval scores tend to be more stable and reliable than single-shot evaluations due to the statistical nature of probability-weighted sampling:
| Score | Interpretation |
|---|---|
| 0.9-1.0 | Excellent — meets criteria comprehensively |
| 0.7-0.9 | Good — meets most criteria with minor gaps |
| 0.5-0.7 | Moderate — partially meets criteria |
| 0.3-0.5 | Below average — significant gaps |
| 0.0-0.3 | Poor — fails to meet criteria |
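The bands above can be turned into a small reporting helper. This is an illustrative sketch: the function name is hypothetical, and treating each lower bound as inclusive (so 0.9 maps to "Excellent") is an assumption, since the table's adjacent ranges share boundary values.

```python
def interpret_score(score: float) -> str:
    """Map a G-Eval score in [0.0, 1.0] to an interpretation band."""
    bands = [
        (0.9, "Excellent"),
        (0.7, "Good"),
        (0.5, "Moderate"),
        (0.3, "Below average"),
        (0.0, "Poor"),
    ]
    for lower, label in bands:
        if score >= lower:
            return label
    return "Poor"

print(interpret_score(0.79))  # Good
```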