# Eval AI Library

Comprehensive AI System Evaluation Framework

Evaluate RAG systems, AI agents, and LLM security with 30+ metrics, multi-provider support, and advanced scoring techniques.
## Key Features

- **RAG Metrics**: Answer relevancy, precision, faithfulness, contextual relevancy/precision/recall, and bias and toxicity detection.
- **Agent Metrics**: Tool correctness, task success rate, role adherence, knowledge retention, and error detection for AI agents.
- **Security Metrics**: Prompt injection and jailbreak detection/resistance, PII leakage, harmful content, policy compliance, and restricted refusal.
- **Deterministic Metrics**: Exact match, contains, regex, JSON schema validation, format/length checks, and language detection; no LLM needed.
- **Vector Metrics**: Semantic similarity and reference match using embedding models for meaning-based evaluation.
- **12 LLM Providers**: OpenAI, Azure, Gemini, Claude, DeepSeek, Qwen, Mistral, Groq, Grok, Zhipu, Ollama, and custom providers.
- **G-Eval & Custom Metrics**: State-of-the-art G-Eval with probability-weighted scoring, plus fully customizable evaluation criteria.
- **Test Data Generation**: Generate test cases from documents in 15+ formats, including PDF, DOCX, CSV, JSON, HTML, and images with OCR.
## Quick Example

```python
import asyncio
from eval_lib import evaluate, EvalTestCase, AnswerRelevancyMetric, FaithfulnessMetric

test_case = EvalTestCase(
    input="What is machine learning?",
    actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
    expected_output="Machine learning is a branch of artificial intelligence focused on building systems that learn from data.",
    retrieval_context=[
        "Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed."
    ]
)

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```
Example output:

```text
======================================================================
📝 Test Case 1/1
======================================================================
Input: What is machine learning?

Test Case Summary:
  ✅ Overall: PASSED
  💰 Cost: $0.003400

Metrics Breakdown:
  ✅ Answer Relevancy: 0.92
  ✅ Faithfulness: 0.95

======================================================================
📋 EVALUATION SUMMARY
======================================================================
Overall Results:
  ✅ Passed: 1 / 1
  📊 Success Rate: 100.0%

Resource Usage:
  💰 Total Cost: $0.003400
  ⏱️ Total Time: 4.12s
======================================================================
```
## Architecture Overview

```mermaid
graph TB
    subgraph Core["🔧 Core"]
        direction LR
        TC[EvalTestCase]
        DS[EvalDataset]
        API["evaluate()"]
        AGG[Score Aggregation]
    end
    subgraph Metrics["📊 Metrics Layer"]
        direction LR
        RAG["RAG\nRelevancy · Faithfulness\nPrecision · Recall"]
        AGENT["Agent\nTool Correctness\nTask Success · Role"]
        SEC["Security\nInjection · Jailbreak\nPII · Harmful"]
        DET["Deterministic\nRegex · JSON Schema\nExact Match"]
        VEC["Vector\nSemantic Similarity\nReference Match"]
        CUST["Custom\nG-Eval\nCustom Metric"]
    end
    subgraph Providers["☁️ Provider Interface"]
        direction LR
        BASE["BaseLLMProvider"]
        IMPL["OpenAI · Azure · Gemini · Claude\nDeepSeek · Qwen · Mistral · Groq\nGrok · Zhipu · Ollama · Custom"]
    end
    subgraph Data["📁 Data Generation"]
        direction LR
        DG[DataGenerator]
        FMT["PDF · DOCX · CSV\nJSON · HTML · Images"]
    end
    subgraph Output["📈 Output"]
        direction LR
        RES[EvalResult]
        DASH[Dashboard]
    end
    Core --- Metrics
    Metrics --- Providers
    Core --- Data
    Core --- Output
    BASE --- IMPL
```

## Metric Categories at a Glance
| Category | Metrics | Use Case |
|---|---|---|
| RAG | Answer Relevancy, Precision, Faithfulness, Contextual Relevancy/Precision/Recall, Bias, Toxicity | Evaluating retrieval-augmented generation pipelines |
| Agent | Tool Correctness, Task Success, Role Adherence, Knowledge Retention, Tool Errors | Evaluating AI agents and assistants |
| Security | Prompt Injection, Jailbreak, PII Leakage, Harmful Content, Policy Compliance, Restricted Refusal | Security and safety testing |
| Deterministic | Exact Match, Contains, Starts/Ends With, Regex, JSON Schema, Format/Length Check, Language Detection, Non-Empty | Rule-based checks without LLM |
| Vector | Semantic Similarity, Reference Match | Embedding-based semantic evaluation |
| Custom | Custom Metric, G-Eval | Domain-specific evaluations with custom criteria |
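The Deterministic row deserves emphasis: these checks run as plain string and structure comparisons, with no model calls and therefore zero cost. A minimal, library-independent sketch of what such checks do (the function names here are illustrative, not the library's API):

```python
import json
import re

def exact_match(actual: str, expected: str) -> bool:
    # Strict string equality after trimming surrounding whitespace
    return actual.strip() == expected.strip()

def regex_match(actual: str, pattern: str) -> bool:
    # Passes if the pattern occurs anywhere in the output
    return re.search(pattern, actual) is not None

def valid_json_with_keys(actual: str, required: set) -> bool:
    # Parses the output as JSON and checks required top-level keys
    try:
        data = json.loads(actual)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required <= data.keys()

output = '{"answer": "42", "confidence": 0.9}'
print(valid_json_with_keys(output, {"answer", "confidence"}))  # True
```

Because these checks are deterministic, they are ideal for CI gates where flaky, LLM-judged scores would be unacceptable.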
## Score Aggregation
Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean — a novel approach that provides nuanced scoring through configurable strictness levels.
| Temperature | Behavior | Best For |
|---|---|---|
| 0.1 | Strict (close to minimum) | Safety-critical applications |
| 0.5 | Balanced (arithmetic mean) | General evaluation |
| 1.0 | Lenient (close to maximum) | Creative tasks |
Learn more about Score Aggregation
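The math behind this aggregation is the generalized power mean, which interpolates smoothly between the minimum and maximum of the per-verdict scores. The sketch below illustrates the effect; the exact mapping from temperature to the exponent `p` is internal to the library, so the exponents chosen here are illustrative only:

```python
import math

def power_mean(scores, p):
    """Generalized power mean M_p(x) = (mean of x_i**p) ** (1/p).

    p -> -inf approaches the minimum, p = 1 is the arithmetic mean,
    and p -> +inf approaches the maximum.
    """
    if p == 0:
        # Limit case p -> 0: the geometric mean
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1 / p)

verdicts = [0.9, 0.8, 0.3]            # per-verdict scores for one metric
strict = power_mean(verdicts, -10)    # pulled toward the worst verdict
balanced = power_mean(verdicts, 1)    # plain arithmetic average
lenient = power_mean(verdicts, 10)    # pulled toward the best verdict
```

With these exponents the same verdicts score roughly 0.33, 0.67, and 0.83: a strict setting lets one bad verdict sink the metric, while a lenient one rewards the strongest evidence.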
## Installation
Requires Python 3.9+. See Installation Guide for detailed instructions.