
G-Eval

G-Eval is a state-of-the-art evaluation framework that uses Chain-of-Thought (CoT) reasoning and probability-weighted scoring for highly accurate evaluations. Based on the paper G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment.

How It Works

```mermaid
graph TD
    A[Criteria] --> B["1. Generate Eval Steps (CoT)"]
    B --> C[2. Sample N Evaluations]
    C --> D[3. Extract Probabilities]
    D --> E["4. Weighted Score: Σ p(s) × s"]
    E --> F[5. Explanation]
    F --> G["Final Score 0.0–1.0"]
```

  1. CoT Step Generation — automatically generates detailed evaluation steps from your criteria
  2. Multi-Sampling — samples N evaluations with high temperature for diversity
  3. Probability Weighting — uses token probabilities to weight each score
  4. Final Score — probability-weighted average across all samples
  5. Explanation — generates a human-readable explanation using the evaluation steps
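The probability-weighted scoring in steps 2–4 can be sketched in plain Python. This is an illustrative approximation, not the library's implementation: here p(s) is taken as the empirical frequency of each score across the N high-temperature samples (the paper weights by token probabilities, which frequency over many samples approximates), and the raw 1–5 score is normalized to the 0.0–1.0 range.

```python
from collections import Counter

def weighted_score(sampled_scores, score_range=(1, 5)):
    """Probability-weighted score: Σ p(s) × s, where p(s) is the empirical
    frequency of score s across the samples, normalized to 0.0-1.0."""
    counts = Counter(sampled_scores)
    n = len(sampled_scores)
    expected = sum((count / n) * s for s, count in counts.items())
    lo, hi = score_range
    return (expected - lo) / (hi - lo)

# 20 hypothetical samples drawn at high temperature
samples = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4, 4, 3, 4, 5, 4, 4, 4, 5, 4, 4]
print(round(weighted_score(samples), 3))
```

Because the final score is an expectation over many samples rather than a single draw, one outlier sample shifts the result only slightly.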

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | required | LLM model (`"gpt-4o"`, `"anthropic:claude-3-5-sonnet-latest"`, `"google:gemini-2.0-flash"`, `"ollama:llama3"`, or a `CustomLLMClient`) |
| `threshold` | `float` | required | Minimum score to pass |
| `criteria` | `str` | required | What to evaluate |
| `evaluation_steps` | `list[str]` | `None` | Custom steps (auto-generated if not provided) |
| `n_samples` | `int` | `20` | Number of evaluation samples |
| `sampling_temperature` | `float` | `2.0` | Temperature for sampling (high = diverse) |

Usage

Basic Usage

```python
import asyncio

from eval_lib import EvalTestCase, GEval, evaluate

metric = GEval(
    model="gpt-4o",
    threshold=0.7,
    criteria="Evaluate the coherence and logical flow of the text.",
    n_samples=20
)

test_case = EvalTestCase(
    input="Write a summary of climate change.",
    actual_output="Climate change refers to long-term shifts in temperatures and weather patterns. Human activities have been the main driver since the 1800s, primarily due to burning fossil fuels. This produces greenhouse gases that trap heat in the atmosphere, leading to global warming."
)

results = asyncio.run(evaluate([test_case], [metric]))
```

With Custom Steps

```python
metric = GEval(
    model="gpt-4o",
    threshold=0.7,
    criteria="Evaluate the quality of a product description for an e-commerce listing.",
    evaluation_steps=[
        "Check if key product features are mentioned",
        "Assess clarity and readability for a general audience",
        "Verify the description highlights unique selling points",
        "Check for appropriate length (not too short, not too long)",
        "Assess persuasiveness and call-to-action"
    ],
    n_samples=15,
    sampling_temperature=1.5
)
```

G-Eval vs Custom Metric

| Feature | G-Eval | Custom Metric |
|---|---|---|
| Scoring Method | Probability-weighted sampling | Verdict aggregation |
| Samples | N samples (default 20) | Single evaluation |
| Temperature | High (2.0) for diversity | Configurable (0.1–1.0) |
| Accuracy | Higher (statistical approach) | Good (verdict-based) |
| Cost | Higher (2 + N API calls) | Lower (1–2 API calls) |
| Best For | Critical evaluations, benchmarking | General-purpose, cost-sensitive |

Cost

Each evaluation makes 2 + `n_samples` LLM API calls: one to generate the evaluation steps, one per sample, and one for the explanation. With default settings (`n_samples=20`), this is 22 API calls.

Cost Optimization

Reduce n_samples to 5-10 for faster, cheaper evaluations while maintaining reasonable accuracy. Use the full 20 samples for critical benchmarks.
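The call-count arithmetic above is simple enough to sketch directly. This hypothetical helper is not part of `eval_lib`; it just makes the 2 + `n_samples` formula concrete for a few sample counts.

```python
def api_calls(n_samples: int) -> int:
    """Total LLM API calls per G-Eval evaluation:
    1 to generate evaluation steps + n_samples scoring calls
    + 1 to generate the explanation."""
    return 2 + n_samples

for n in (5, 10, 20):
    print(f"n_samples={n}: {api_calls(n)} API calls")
```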

Score Interpretation

G-Eval scores tend to be more stable and reliable than single-shot evaluations due to the statistical nature of probability-weighted sampling:

| Score | Interpretation |
|---|---|
| 0.9-1.0 | Excellent — meets criteria comprehensively |
| 0.7-0.9 | Good — meets most criteria with minor gaps |
| 0.5-0.7 | Moderate — partially meets criteria |
| 0.3-0.5 | Below average — significant gaps |
| 0.0-0.3 | Poor — fails to meet criteria |
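The bands above can be turned into a small helper for reporting. This function is a hypothetical convenience, not part of `eval_lib`; boundary scores are assigned to the higher band, an assumption since the table's ranges overlap at the edges.

```python
def interpret(score: float) -> str:
    """Map a G-Eval score (0.0-1.0) to its interpretation band."""
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Moderate"
    if score >= 0.3:
        return "Below average"
    return "Poor"

print(interpret(0.78))
```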