G-Eval¶
G-Eval is an evaluation framework that uses Chain-of-Thought (CoT) reasoning and probability-weighted scoring to produce evaluations that align closely with human judgment. Based on the paper G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (Liu et al., 2023).
How It Works¶
```mermaid
graph TD
    A[Criteria] --> B["1. Generate Eval Steps (CoT)"]
    B --> C[2. Sample N Evaluations]
    C --> D[3. Extract Probabilities]
    D --> E["4. Weighted Score: Σ p(s) × s"]
    E --> F[5. Explanation]
    F --> G["Final Score 0.0–1.0"]
```

- CoT Step Generation — automatically generates detailed evaluation steps from your criteria
- Multi-Sampling — samples N evaluations with high temperature for diversity
- Probability Weighting — uses token probabilities to weight each score
- Final Score — probability-weighted average across all samples
- Explanation — generates a human-readable explanation using the evaluation steps
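The weighted score in step 4 can be sketched as follows. This is an illustrative reimplementation, not the library's internal code: `weighted_score`, `sampled_scores`, and the 1–10 raw score scale are assumptions, with p(s) estimated from sample frequencies rather than raw token probabilities.

```python
from collections import Counter

def weighted_score(sampled_scores: list[int], max_score: int = 10) -> float:
    """Estimate p(s) from sample frequencies, then return sum p(s) * s,
    normalized to the 0.0-1.0 range."""
    counts = Counter(sampled_scores)
    n = len(sampled_scores)
    raw = sum(score * (count / n) for score, count in counts.items())
    return raw / max_score

# 20 high-temperature samples produce a spread of scores rather than one value
samples = [8] * 12 + [7] * 5 + [9] * 3
score = weighted_score(samples)  # (0.6*8 + 0.25*7 + 0.15*9) / 10
print(round(score, 4))           # 0.79
```

Because the final score averages over all N samples, a single outlier judgment shifts the result far less than it would in a single-shot evaluation.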
Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient) |
| threshold | float | required | Minimum score to pass |
| criteria | str | required | What to evaluate |
| evaluation_steps | list[str] | None | Custom steps (auto-generated if not provided) |
| n_samples | int | 20 | Number of evaluation samples |
| sampling_temperature | float | 2.0 | Temperature for sampling (high = diverse) |
Usage¶
Basic Usage¶
```python
from eval_lib import GEval, EvalTestCase, evaluate
import asyncio

metric = GEval(
    model="gpt-4o",
    threshold=0.7,
    criteria="Evaluate the coherence and logical flow of the text.",
    n_samples=20
)

test_case = EvalTestCase(
    input="Write a summary of climate change.",
    actual_output=(
        "Climate change refers to long-term shifts in temperatures and "
        "weather patterns. Human activities have been the main driver since "
        "the 1800s, primarily due to burning fossil fuels. This produces "
        "greenhouse gases that trap heat in the atmosphere, leading to "
        "global warming."
    )
)

results = asyncio.run(evaluate([test_case], [metric]))
```
With Custom Steps¶
```python
metric = GEval(
    model="gpt-4o",
    threshold=0.7,
    criteria="Evaluate the quality of a product description for an e-commerce listing.",
    evaluation_steps=[
        "Check if key product features are mentioned",
        "Assess clarity and readability for a general audience",
        "Verify the description highlights unique selling points",
        "Check for appropriate length (not too short, not too long)",
        "Assess persuasiveness and call-to-action"
    ],
    n_samples=15,
    sampling_temperature=1.5
)
```
G-Eval vs Custom Metric¶
| Feature | G-Eval | Custom Metric |
|---|---|---|
| Scoring Method | Probability-weighted sampling | Verdict aggregation |
| Samples | N samples (default 20) | Single evaluation |
| Temperature | High (2.0) for diversity | Configurable (0.1-1.0) |
| Accuracy | Higher (statistical approach) | Good (verdict-based) |
| Cost | Higher (2 + N API calls) | Lower (1-2 API calls) |
| Best For | Critical evaluations, benchmarking | General-purpose, cost-sensitive |
Cost¶
`2 + n_samples` LLM API calls per evaluation. With default settings (`n_samples=20`), this is 22 API calls.

**Cost Optimization:** Reduce `n_samples` to 5–10 for faster, cheaper evaluations while maintaining reasonable accuracy. Use the full 20 samples for critical benchmarks.
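As a quick back-of-the-envelope check, the call count follows directly from the formula above. The helper name is hypothetical, and the breakdown of the two fixed calls (step generation plus explanation) is an assumption inferred from the pipeline diagram:

```python
def geval_api_calls(n_samples: int) -> int:
    # Assumed breakdown: 1 call to generate CoT steps, 1 call for the
    # explanation, plus one scoring call per sample.
    return 2 + n_samples

print(geval_api_calls(20))  # 22 (the default)
print(geval_api_calls(5))   # 7 (a cheaper configuration)
```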
Score Interpretation¶
G-Eval scores tend to be more stable and reliable than single-shot evaluations due to the statistical nature of probability-weighted sampling:
| Score | Interpretation |
|---|---|
| 0.9-1.0 | Excellent — meets criteria comprehensively |
| 0.7-0.9 | Good — meets most criteria with minor gaps |
| 0.5-0.7 | Moderate — partially meets criteria |
| 0.3-0.5 | Below average — significant gaps |
| 0.0-0.3 | Poor — fails to meet criteria |
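The bands above can be turned into a small reporting helper. This is an illustrative sketch: the function name is hypothetical, and treating each lower bound as inclusive (so 0.9 maps to "Excellent") is an assumption, since the table's adjacent ranges share boundary values.

```python
def interpret_score(score: float) -> str:
    """Map a G-Eval score in [0.0, 1.0] to an interpretation band."""
    bands = [
        (0.9, "Excellent"),
        (0.7, "Good"),
        (0.5, "Moderate"),
        (0.3, "Below average"),
        (0.0, "Poor"),
    ]
    for lower, label in bands:
        if score >= lower:
            return label
    return "Poor"

print(interpret_score(0.79))  # Good
```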