Eval AI Library¶
Comprehensive AI Model Evaluation Framework
Evaluate RAG systems, AI agents, and LLM security with 15+ metrics, multi-provider support, and advanced scoring techniques.
Key Features¶

- **15+ Evaluation Metrics**: RAG metrics, agent evaluations, security checks, and custom metrics — everything you need to evaluate AI systems comprehensively.
- **Agent Evaluation**: Tool correctness, task success rate, role adherence, knowledge retention, and error detection for AI agents.
- **Security Testing**: Prompt injection & jailbreak detection/resistance, PII leakage, harmful content detection, and policy compliance.
- **Multi-Provider Support**: OpenAI, Azure OpenAI, Google Gemini, Anthropic Claude, Ollama, and custom LLM providers through a unified interface.
- **G-Eval & Custom Metrics**: State-of-the-art G-Eval with probability-weighted scoring, plus fully customizable evaluation criteria.
- **Test Data Generation**: Generate test cases from documents in 15+ formats, including PDF, DOCX, CSV, JSON, HTML, and images with OCR.
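The probability-weighted scoring behind G-Eval can be illustrated with a small self-contained sketch (the helper below is not part of the library's API; it only shows the idea): instead of taking the single score token the judge model emits, the final score is the expectation of the candidate scores under the judge's token probabilities.

```python
def probability_weighted_score(token_probs: dict[int, float]) -> float:
    """Expected score under the judge model's distribution over score
    tokens, as in G-Eval: score = sum_i p(s_i) * s_i."""
    total = sum(token_probs.values())
    if total == 0:
        raise ValueError("no probability mass on score tokens")
    # Normalize in case the score tokens don't carry all of the mass.
    return sum(score * p / total for score, p in token_probs.items())

# A judge asked to rate 1-5 puts most mass on 4, some on 5:
probs = {3: 0.1, 4: 0.6, 5: 0.3}
print(round(probability_weighted_score(probs), 2))  # 4.2
```

This is why G-Eval scores are continuous rather than integer-valued: a judge torn between 4 and 5 yields a score in between, which is more informative than forcing a single bucket.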
Quick Example¶
```python
import asyncio

from eval_lib import evaluate, EvalTestCase, AnswerRelevancyMetric, FaithfulnessMetric

test_case = EvalTestCase(
    input="What is machine learning?",
    actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
    expected_output="Machine learning is a branch of artificial intelligence focused on building systems that learn from data.",
    retrieval_context=[
        "Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed."
    ],
)

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```
Running the example prints a per-test report and an overall summary:

```
======================================================================
📝 Test Case 1/1
======================================================================
Input: What is machine learning?
Test Case Summary:
✅ Overall: PASSED
💰 Cost: $0.003400
Metrics Breakdown:
✅ Answer Relevancy: 0.92
✅ Faithfulness: 0.95
======================================================================
📋 EVALUATION SUMMARY
======================================================================
Overall Results:
✅ Passed: 1 / 1
📊 Success Rate: 100.0%
Resource Usage:
💰 Total Cost: $0.003400
⏱️ Total Time: 4.12s
======================================================================
```
Architecture Overview¶
```mermaid
graph TD
    A[Test Cases] --> B[Evaluation Engine]
    B --> C{Metrics}
    C --> D[RAG Metrics]
    C --> E[Agent Metrics]
    C --> F[Security Metrics]
    C --> G[Custom / G-Eval]
    D --> H[LLM Provider]
    E --> H
    F --> H
    G --> H
    H --> I[OpenAI]
    H --> J[Azure]
    H --> K[Gemini]
    H --> L[Claude]
    H --> M[Ollama]
    H --> N[Custom]
    B --> O[Results & Dashboard]
```

Metric Categories at a Glance¶
| Category | Metrics | Use Case |
|---|---|---|
| RAG | Answer Relevancy, Precision, Faithfulness, Contextual Relevancy/Precision/Recall, Bias, Toxicity | Evaluating retrieval-augmented generation pipelines |
| Agent | Tool Correctness, Task Success, Role Adherence, Knowledge Retention, Tool Errors | Evaluating AI agents and assistants |
| Security | Prompt Injection, Jailbreak, PII Leakage, Harmful Content, Policy Compliance | Security and safety testing |
| Custom | Custom Metric, G-Eval | Domain-specific evaluations with custom criteria |
Score Aggregation¶
Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean — a novel approach that provides nuanced scoring through configurable strictness levels.
| Temperature | Behavior | Best For |
|---|---|---|
| 0.1 | Strict (close to minimum) | Safety-critical applications |
| 0.5 | Balanced (arithmetic mean) | General evaluation |
| 1.0 | Lenient (close to maximum) | Creative tasks |
Learn more about Score Aggregation
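The strict/balanced/lenient behavior in the table can be reproduced with a small self-contained sketch of a generalized power mean over per-verdict scores. The tan-based mapping from temperature to the power-mean exponent below (and the clamp) are illustrative assumptions, not the library's exact formula; they are chosen only so that 0.5 yields the arithmetic mean while the extremes approach min and max.

```python
import math

def power_mean(scores: list[float], p: float) -> float:
    """Generalized power mean M_p(x) = (mean(x_i ** p)) ** (1 / p).
    p -> -inf approaches min, p = 1 is the arithmetic mean,
    p -> +inf approaches max; p = 0 is the geometric-mean limit."""
    if p == 0:
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

def aggregate(scores: list[float], temperature: float) -> float:
    """Map temperature in (0, 1) to a power-mean exponent:
    0.5 -> p = 1 (arithmetic mean), lower -> large negative p (strict),
    higher -> large positive p (lenient). Illustrative mapping only."""
    p = 1.0 + math.tan(math.pi * (temperature - 0.5))
    p = max(-50.0, min(50.0, p))  # clamp to avoid float under/overflow
    return power_mean(scores, p)

verdicts = [0.9, 0.6, 0.3]
print(round(aggregate(verdicts, 0.1), 3))  # strict: pulled toward the minimum
print(round(aggregate(verdicts, 0.5), 3))  # balanced: the arithmetic mean, 0.6
print(round(aggregate(verdicts, 0.9), 3))  # lenient: pulled toward the maximum
```

The power mean is a natural fit here: a large negative exponent lets a single failing verdict dominate the aggregate, which is exactly the behavior you want for safety-critical checks, while a positive exponent forgives isolated weak verdicts in creative tasks.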
Installation¶
Requires Python 3.9+. See Installation Guide for detailed instructions.