Pricing

The Eval AI Library tracks API costs for every LLM call it makes. The tables below list prices for the supported models.

OpenAI

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
| --- | --- | --- |
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4 | $30.00 | $60.00 |
| gpt-3.5-turbo | $0.50 | $1.50 |
| o1 | $15.00 | $60.00 |
| o3-mini | $1.10 | $4.40 |

Embedding Models

| Model | Input ($/1M tokens) |
| --- | --- |
| text-embedding-3-small | $0.02 |
| text-embedding-3-large | $0.13 |

Google Gemini

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
| --- | --- | --- |
| gemini-2.5-pro-preview | $1.25 | $10.00 |
| gemini-2.5-flash-preview | $0.15 | $0.60 |
| gemini-2.0-flash | $0.10 | $0.40 |
| gemini-2.0-flash-lite | $0.075 | $0.30 |
| gemini-1.5-pro | $1.25 | $5.00 |
| gemini-1.5-flash | $0.075 | $0.30 |
| gemini-1.5-flash-8b | $0.0375 | $0.15 |

Anthropic Claude

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
| --- | --- | --- |
| claude-sonnet-4-0 | $3.00 | $15.00 |
| claude-3-7-sonnet-latest | $3.00 | $15.00 |
| claude-3-5-sonnet-latest | $3.00 | $15.00 |
| claude-3-5-haiku-latest | $0.80 | $4.00 |
| claude-3-haiku-20240307 | $0.25 | $1.25 |
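As a sketch of how the per-million-token rates above turn into dollar amounts (the `PRICES` dict and `estimate_cost` helper are illustrative, not part of the library's API):

```python
# Illustrative helper: converts per-million-token rates into a dollar cost.
# The rates mirror a few rows from the tables above.
PRICES = {  # model -> (input $/1M tokens, output $/1M tokens)
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-haiku-latest": (0.80, 4.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated dollar cost of one call."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# e.g. 1,000 input tokens and 500 output tokens on gpt-4o:
# 1000 * $2.50/1M + 500 * $10.00/1M = $0.0075
```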

Cost Estimation

Per-Metric Cost

| Metric | LLM Calls | Approximate Cost (gpt-4o) |
| --- | --- | --- |
| Answer Relevancy | 4 | ~$0.003 |
| Faithfulness | 3 | ~$0.002 |
| Contextual Relevancy | 3 | ~$0.002 |
| Contextual Recall | 2 | ~$0.001 |
| Bias / Toxicity | 1 | ~$0.001 |
| G-Eval (20 samples) | 22 | ~$0.015 |
| Answer Precision | 0 | $0.00 |
| Tool Correctness | 0 | $0.00 |

Example: Full RAG Evaluation

Evaluating 100 test cases with 4 metrics (Answer Relevancy + Faithfulness + Contextual Relevancy + Contextual Recall):

  • LLM calls: 100 × (4 + 3 + 3 + 2) = 1,200 calls
  • Estimated cost: ~$0.80 with gpt-4o
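The arithmetic behind this estimate can be sketched as follows, using the per-metric call counts and approximate gpt-4o costs from the table above:

```python
# Back-of-envelope estimate for the 100-case RAG run described above.
test_cases = 100

# Calls per test case: Answer Relevancy (4) + Faithfulness (3)
# + Contextual Relevancy (3) + Contextual Recall (2)
calls_per_case = 4 + 3 + 3 + 2
total_calls = test_cases * calls_per_case  # 1,200 calls

# Approximate per-metric gpt-4o costs from the table above:
cost_per_case = 0.003 + 0.002 + 0.002 + 0.001
estimated_cost = test_cases * cost_per_case  # ~$0.80
print(total_calls, round(estimated_cost, 2))
```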

Cost Tracking in Results

When `verbose=True` (the default), the evaluation summary includes cost information:

```
======================================================================
                     📋 EVALUATION SUMMARY
======================================================================

Overall Results:
  ✅ Passed: 3 / 3
  ❌ Failed: 0 / 3
  📊 Success Rate: 100.0%

Resource Usage:
  💰 Total Cost: $0.034200
  ⏱️  Total Time: 12.45s
  📈 Avg Time per Test: 4.15s

======================================================================
```

You can also access the cost of each metric programmatically:

```python
# Each item in `results` pairs an identifier with that test's results;
# `metrics_data` holds per-metric scores and evaluation costs.
for _, test_results in results:
    for result in test_results:
        for metric in result.metrics_data:
            print(f"{metric.name}: ${metric.evaluation_cost:.4f}")
```

Cost Optimization

  • Use gpt-4o-mini or gemini-2.0-flash for development/testing
  • Use gpt-4o or claude-3-5-sonnet-latest for final evaluations
  • AnswerPrecisionMetric and ToolCorrectnessMetric are free (no LLM calls)
  • Reduce G-Eval `n_samples` from 20 to 5-10 for cheaper runs
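The per-metric table above lists 22 LLM calls for G-Eval with 20 samples, which suggests roughly `n_samples + 2` calls per evaluation. Under that assumption (not confirmed by the library's documentation), the savings from reducing the sample count can be sketched as:

```python
# Assumption: G-Eval makes roughly n_samples + 2 LLM calls,
# consistent with the 22 calls at 20 samples shown in the table above.
def geval_calls(n_samples: int) -> int:
    return n_samples + 2

# Dropping from 20 to 5 samples cuts calls from 22 to 7,
# i.e. roughly a third of the baseline cost.
print(geval_calls(20), geval_calls(5))
```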