# Pricing

Eval AI Library tracks API costs for every LLM call it makes. The tables below list the per-token prices used for each supported model.
## OpenAI
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4 | $30.00 | $60.00 |
| gpt-3.5-turbo | $0.50 | $1.50 |
| o1 | $15.00 | $60.00 |
| o3-mini | $1.10 | $4.40 |
### Embedding Models
| Model | Input ($/1M tokens) |
|---|---|
| text-embedding-3-small | $0.02 |
| text-embedding-3-large | $0.13 |
## Google Gemini
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| gemini-2.5-pro-preview | $1.25 | $10.00 |
| gemini-2.5-flash-preview | $0.15 | $0.60 |
| gemini-2.0-flash | $0.10 | $0.40 |
| gemini-2.0-flash-lite | $0.075 | $0.30 |
| gemini-1.5-pro | $1.25 | $5.00 |
| gemini-1.5-flash | $0.075 | $0.30 |
| gemini-1.5-flash-8b | $0.0375 | $0.15 |
## Anthropic Claude
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| claude-sonnet-4-0 | $3.00 | $15.00 |
| claude-3-7-sonnet-latest | $3.00 | $15.00 |
| claude-3-5-sonnet-latest | $3.00 | $15.00 |
| claude-3-5-haiku-latest | $0.80 | $4.00 |
| claude-3-haiku-20240307 | $0.25 | $1.25 |
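All providers bill per token, so a call's cost follows directly from the tables above. As a rough illustration only (the price map and helper below are assumptions for this sketch, not the library's internal API):

```python
# Illustrative price map: $ per 1M tokens as (input, output), taken from the
# tables above. Extend with any other model you use.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "gemini-2.0-flash": (0.10, 0.40),
    "claude-3-5-haiku-latest": (0.80, 4.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call at the listed per-1M-token rates."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# A 1,500-token prompt with a 300-token completion on gpt-4o:
print(call_cost("gpt-4o", 1_500, 300))  # 0.00675
```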
## Cost Estimation

### Per-Metric Cost
| Metric | LLM Calls | Approximate Cost (gpt-4o) |
|---|---|---|
| Answer Relevancy | 4 | ~$0.003 |
| Faithfulness | 3 | ~$0.002 |
| Contextual Relevancy | 3 | ~$0.002 |
| Contextual Recall | 2 | ~$0.001 |
| Bias / Toxicity | 1 | ~$0.001 |
| G-Eval (20 samples) | 22 | ~$0.015 |
| Answer Precision | 0 | $0.00 |
| Tool Correctness | 0 | $0.00 |
### Example: Full RAG Evaluation
Evaluating 100 test cases with 4 metrics (Answer Relevancy + Faithfulness + Contextual Relevancy + Contextual Recall):
- LLM calls: 100 × (4 + 3 + 3 + 2) = 1,200 calls
- Estimated cost: ~$0.80 with gpt-4o
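The same estimate can be scripted. A minimal sketch, using the call counts from the table above and a rough average cost per gpt-4o call implied by the ~$0.80 figure (the average is an assumption, not a measured value):

```python
# Per-metric LLM call counts, from the Per-Metric Cost table above.
CALLS_PER_METRIC = {
    "answer_relevancy": 4,
    "faithfulness": 3,
    "contextual_relevancy": 3,
    "contextual_recall": 2,
}

N_TEST_CASES = 100
AVG_COST_PER_CALL = 0.00067  # rough gpt-4o average; an assumption for the sketch

total_calls = N_TEST_CASES * sum(CALLS_PER_METRIC.values())
print(f"{total_calls} LLM calls, ~${total_calls * AVG_COST_PER_CALL:.2f}")
# -> 1200 LLM calls, ~$0.80
```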
## Cost Tracking in Results

When `verbose=True` (the default), the evaluation summary includes cost information:
```
======================================================================
📋 EVALUATION SUMMARY
======================================================================

Overall Results:
  ✅ Passed: 3 / 3
  ❌ Failed: 0 / 3
  📊 Success Rate: 100.0%

Resource Usage:
  💰 Total Cost: $0.034200
  ⏱️ Total Time: 12.45s
  📈 Avg Time per Test: 4.15s
======================================================================
```
You can also access cost per metric programmatically:
```python
for _, test_results in results:
    for result in test_results:
        for metric in result.metrics_data:
            print(f"{metric.name}: ${metric.evaluation_cost:.4f}")
```
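The same loop can be collapsed into an overall figure; a sketch, assuming the result structure shown above:

```python
# Sum the per-metric costs across all test results into one total.
total_cost = sum(
    metric.evaluation_cost
    for _, test_results in results
    for result in test_results
    for metric in result.metrics_data
)
print(f"Total evaluation cost: ${total_cost:.4f}")
```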
## Cost Optimization

- Use `gpt-4o-mini` or `gemini-2.0-flash` for development/testing (see the sketch below)
- Use `gpt-4o` or `claude-3-5-sonnet-latest` for final evaluations
- `AnswerPrecisionMetric` and `ToolCorrectnessMetric` are free (no LLM calls)
- Reduce G-Eval `n_samples` from 20 to 5-10 for cheaper runs
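One way to apply the first two points is to pick the model by environment. A sketch only; the environment variable and the idea of threading a model name through your own configuration are assumptions, not library API:

```python
import os

# Cheap model while iterating on test cases, stronger model for the final run.
# EVAL_ENV is a hypothetical variable name; use whatever convention fits your setup.
EVAL_MODEL = "gpt-4o-mini" if os.getenv("EVAL_ENV", "dev") == "dev" else "gpt-4o"
```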