Pricing

The Eval AI Library tracks API costs for every LLM call it makes. The tables below list prices for the supported models.

OpenAI

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
| --- | --- | --- |
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4 | $30.00 | $60.00 |
| gpt-3.5-turbo | $0.50 | $1.50 |
| o1 | $15.00 | $60.00 |
| o3-mini | $1.10 | $4.40 |

Embedding Models

| Model | Input ($/1M tokens) |
| --- | --- |
| text-embedding-3-small | $0.02 |
| text-embedding-3-large | $0.13 |

Google Gemini

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
| --- | --- | --- |
| gemini-2.5-pro-preview | $1.25 | $10.00 |
| gemini-2.5-flash-preview | $0.15 | $0.60 |
| gemini-2.0-flash | $0.10 | $0.40 |
| gemini-2.0-flash-lite | $0.075 | $0.30 |
| gemini-1.5-pro | $1.25 | $5.00 |
| gemini-1.5-flash | $0.075 | $0.30 |
| gemini-1.5-flash-8b | $0.0375 | $0.15 |

Anthropic Claude

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
| --- | --- | --- |
| claude-sonnet-4-0 | $3.00 | $15.00 |
| claude-3-7-sonnet-latest | $3.00 | $15.00 |
| claude-3-5-sonnet-latest | $3.00 | $15.00 |
| claude-3-5-haiku-latest | $0.80 | $4.00 |
| claude-3-haiku-20240307 | $0.25 | $1.25 |
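As a sketch of how the per-million-token rates above turn into dollar amounts (the `PRICES` dict and `estimate_cost` helper are illustrative, not part of the library's API):

```python
# Illustrative helper: converts per-million-token rates into a dollar cost.
# The rates mirror a few rows from the tables above.
PRICES = {  # model -> (input $/1M tokens, output $/1M tokens)
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-haiku-latest": (0.80, 4.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one call."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. 1,000 input tokens and 500 output tokens on gpt-4o:
# 1000 * $2.50/1M + 500 * $10.00/1M = $0.0075
```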

Cost Estimation

Per-Metric Cost

| Metric | LLM Calls | Approximate Cost (gpt-4o) |
| --- | --- | --- |
| Answer Relevancy | 4 | ~$0.003 |
| Faithfulness | 3 | ~$0.002 |
| Contextual Relevancy | 3 | ~$0.002 |
| Contextual Recall | 2 | ~$0.001 |
| Bias / Toxicity | 1 | ~$0.001 |
| G-Eval (20 samples) | 22 | ~$0.015 |
| Answer Precision | 0 | $0.00 |
| Tool Correctness | 0 | $0.00 |

Example: Full RAG Evaluation

Evaluating 100 test cases with 4 metrics (Answer Relevancy + Faithfulness + Contextual Relevancy + Contextual Recall):

  • LLM calls: 100 × (4 + 3 + 3 + 2) = 1,200 calls
  • Estimated cost: ~$0.80 with gpt-4o
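The arithmetic behind this estimate can be sketched as follows, using the per-metric call counts and approximate gpt-4o costs from the table above:

```python
# Back-of-envelope estimate for the 100-case RAG run described above.
test_cases = 100

# Calls per test case: Answer Relevancy (4) + Faithfulness (3)
# + Contextual Relevancy (3) + Contextual Recall (2)
calls_per_case = 4 + 3 + 3 + 2
total_calls = test_cases * calls_per_case  # 1,200 calls

# Approximate per-metric gpt-4o costs from the table above:
cost_per_case = 0.003 + 0.002 + 0.002 + 0.001
estimated_cost = test_cases * cost_per_case  # ~$0.80
print(total_calls, round(estimated_cost, 2))
```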

Cost Tracking in Results

When `verbose=True` (the default), the evaluation summary includes cost information:

```
======================================================================
                     📋 EVALUATION SUMMARY
======================================================================

Overall Results:
  ✅ Passed: 3 / 3
  ❌ Failed: 0 / 3
  📊 Success Rate: 100.0%

Resource Usage:
  💰 Total Cost: $0.034200
  ⏱️  Total Time: 12.45s
  📈 Avg Time per Test: 4.15s

======================================================================
```

You can also access the cost of each metric programmatically:

```python
# Each item in `results` pairs an identifier with that test's results;
# `metrics_data` holds per-metric scores and evaluation costs.
for _, test_results in results:
    for result in test_results:
        for metric in result.metrics_data:
            print(f"{metric.name}: ${metric.evaluation_cost:.4f}")
```

Cost Optimization

  • Use gpt-4o-mini or gemini-2.0-flash for development/testing
  • Use gpt-4o or claude-3-5-sonnet-latest for final evaluations
  • AnswerPrecisionMetric and ToolCorrectnessMetric are free (no LLM calls)
  • Reduce G-Eval `n_samples` from 20 to 5-10 for cheaper runs
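The per-metric table above lists 22 LLM calls for G-Eval with 20 samples, which suggests roughly `n_samples + 2` calls per evaluation. Under that assumption (not confirmed by the library's documentation), the savings from reducing the sample count can be sketched as:

```python
# Assumption: G-Eval makes roughly n_samples + 2 LLM calls,
# consistent with the 22 calls at 20 samples shown in the table above.
def geval_calls(n_samples: int) -> int:
    return n_samples + 2

# Dropping from 20 to 5 samples cuts calls from 22 to 7,
# i.e. roughly a third of the baseline cost.
print(geval_calls(20), geval_calls(5))
```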