# Eval AI Library

Comprehensive AI System Evaluation Framework

Evaluate RAG systems, AI agents, and LLM security with 30+ metrics, multi-provider support, and advanced scoring techniques.
## Key Features

- **RAG Metrics**: Answer relevancy, precision, faithfulness, contextual relevancy/precision/recall, and bias and toxicity detection.
- **Agent Metrics**: Tool correctness, task success rate, role adherence, knowledge retention, and error detection for AI agents.
- **Security Metrics**: Prompt injection and jailbreak detection/resistance, PII leakage, harmful content, policy compliance, and restricted refusal.
- **Deterministic Metrics**: Exact match, contains, regex, JSON schema validation, format/length checks, and language detection; no LLM needed.
- **Vector Metrics**: Semantic similarity and reference match using embedding models for meaning-based evaluation.
- **12 LLM Providers**: OpenAI, Azure, Gemini, Claude, DeepSeek, Qwen, Mistral, Groq, Grok, Zhipu, Ollama, and custom providers.
- **G-Eval & Custom Metrics**: State-of-the-art G-Eval with probability-weighted scoring, plus fully customizable evaluation criteria.
- **Test Data Generation**: Generate test cases from documents in 15+ formats, including PDF, DOCX, CSV, JSON, HTML, and images with OCR.
## Quick Example

```python
import asyncio
from eval_lib import evaluate, EvalTestCase, AnswerRelevancyMetric, FaithfulnessMetric

test_case = EvalTestCase(
    input="What is machine learning?",
    actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
    expected_output="Machine learning is a branch of artificial intelligence focused on building systems that learn from data.",
    retrieval_context=[
        "Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed."
    ]
)

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```
Example output:

```text
======================================================================
📝 Test Case 1/1
======================================================================
Input: What is machine learning?

Test Case Summary:
  ✅ Overall: PASSED
  💰 Cost: $0.003400

Metrics Breakdown:
  ✅ Answer Relevancy: 0.92
  ✅ Faithfulness: 0.95

======================================================================
📋 EVALUATION SUMMARY
======================================================================
Overall Results:
  ✅ Passed: 1 / 1
  📊 Success Rate: 100.0%

Resource Usage:
  💰 Total Cost: $0.003400
  ⏱️ Total Time: 4.12s
======================================================================
```
## Architecture Overview

```mermaid
graph TB
    subgraph Core["🔧 Core"]
        direction LR
        TC[EvalTestCase]
        DS[EvalDataset]
        API["evaluate()"]
        AGG[Score Aggregation]
    end
    subgraph Metrics["📊 Metrics Layer"]
        direction LR
        RAG["RAG\nRelevancy · Faithfulness\nPrecision · Recall"]
        AGENT["Agent\nTool Correctness\nTask Success · Role"]
        SEC["Security\nInjection · Jailbreak\nPII · Harmful"]
        DET["Deterministic\nRegex · JSON Schema\nExact Match"]
        VEC["Vector\nSemantic Similarity\nReference Match"]
        CUST["Custom\nG-Eval\nCustom Metric"]
    end
    subgraph Providers["☁️ Provider Interface"]
        direction LR
        BASE["BaseLLMProvider"]
        IMPL["OpenAI · Azure · Gemini · Claude\nDeepSeek · Qwen · Mistral · Groq\nGrok · Zhipu · Ollama · Custom"]
    end
    subgraph Data["📁 Data Generation"]
        direction LR
        DG[DataGenerator]
        FMT["PDF · DOCX · CSV\nJSON · HTML · Images"]
    end
    subgraph Output["📈 Output"]
        direction LR
        RES[EvalResult]
        DASH[Dashboard]
    end
    Core --- Metrics
    Metrics --- Providers
    Core --- Data
    Core --- Output
    BASE --- IMPL
```

## Metric Categories at a Glance
| Category | Metrics | Use Case |
|---|---|---|
| RAG | Answer Relevancy, Precision, Faithfulness, Contextual Relevancy/Precision/Recall, Bias, Toxicity | Evaluating retrieval-augmented generation pipelines |
| Agent | Tool Correctness, Task Success, Role Adherence, Knowledge Retention, Tool Errors | Evaluating AI agents and assistants |
| Security | Prompt Injection, Jailbreak, PII Leakage, Harmful Content, Policy Compliance, Restricted Refusal | Security and safety testing |
| Deterministic | Exact Match, Contains, Starts/Ends With, Regex, JSON Schema, Format/Length Check, Language Detection, Non-Empty | Rule-based checks without LLM |
| Vector | Semantic Similarity, Reference Match | Embedding-based semantic evaluation |
| Custom | Custom Metric, G-Eval | Domain-specific evaluations with custom criteria |
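The Deterministic row deserves emphasis: these checks run as plain string and structure comparisons, with no model calls and therefore zero cost. A minimal, library-independent sketch of what such checks do (the function names here are illustrative, not the library's API):

```python
import json
import re

def exact_match(actual: str, expected: str) -> bool:
    # Strict string equality after trimming surrounding whitespace
    return actual.strip() == expected.strip()

def regex_match(actual: str, pattern: str) -> bool:
    # Passes if the pattern occurs anywhere in the output
    return re.search(pattern, actual) is not None

def valid_json_with_keys(actual: str, required: set) -> bool:
    # Parses the output as JSON and checks required top-level keys
    try:
        data = json.loads(actual)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required <= data.keys()

output = '{"answer": "42", "confidence": 0.9}'
print(valid_json_with_keys(output, {"answer", "confidence"}))  # True
```

Because these checks are deterministic, they are ideal for CI gates where flaky, LLM-judged scores would be unacceptable.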
## Score Aggregation
Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean — a novel approach that provides nuanced scoring through configurable strictness levels.
| Temperature | Behavior | Best For |
|---|---|---|
| 0.1 | Strict (close to minimum) | Safety-critical applications |
| 0.5 | Balanced (arithmetic mean) | General evaluation |
| 1.0 | Lenient (close to maximum) | Creative tasks |
Learn more about Score Aggregation
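The math behind this aggregation is the generalized power mean, which interpolates smoothly between the minimum and maximum of the per-verdict scores. The sketch below illustrates the effect; the exact mapping from temperature to the exponent `p` is internal to the library, so the exponents chosen here are illustrative only:

```python
import math

def power_mean(scores, p):
    """Generalized power mean M_p(x) = (mean of x_i**p) ** (1/p).

    p -> -inf approaches the minimum, p = 1 is the arithmetic mean,
    and p -> +inf approaches the maximum.
    """
    if p == 0:
        # Limit case p -> 0: the geometric mean
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1 / p)

verdicts = [0.9, 0.8, 0.3]            # per-verdict scores for one metric
strict = power_mean(verdicts, -10)    # pulled toward the worst verdict
balanced = power_mean(verdicts, 1)    # plain arithmetic average
lenient = power_mean(verdicts, 10)    # pulled toward the best verdict
```

With these exponents the same verdicts score roughly 0.33, 0.67, and 0.83: a strict setting lets one bad verdict sink the metric, while a lenient one rewards the strongest evidence.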
## Installation
Requires Python 3.9+. See Installation Guide for detailed instructions.