
Eval AI Library

Comprehensive AI System Evaluation Framework

Evaluate RAG systems, AI agents, and LLM security with 30+ metrics, multi-provider support, and advanced scoring techniques.



Key Features

  • RAG Metrics

    Answer relevancy, precision, faithfulness, contextual relevancy/precision/recall, bias and toxicity detection.

  • Agent Metrics

    Tool correctness, task success rate, role adherence, knowledge retention, and error detection for AI agents.

  • Security Metrics

    Prompt injection & jailbreak detection/resistance, PII leakage, harmful content, policy compliance, restricted refusal.

  • Deterministic Metrics

    Exact match, contains, regex, JSON schema validation, format/length checks, language detection — no LLM needed.

  • Vector Metrics

    Semantic similarity and reference match using embedding models for meaning-based evaluation.

  • 12 LLM Providers

    OpenAI, Azure, Gemini, Claude, DeepSeek, Qwen, Mistral, Groq, Grok, Zhipu, Ollama, and custom providers.

  • G-Eval & Custom Metrics

    State-of-the-art G-Eval with probability-weighted scoring, plus fully customizable evaluation criteria.

  • Test Data Generation

    Generate test cases from documents in 15+ formats including PDF, DOCX, CSV, JSON, HTML, and images with OCR.
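
The Vector metrics compare texts by embedding them and measuring how close the vectors are. As a plain-Python illustration of that idea, the sketch below computes cosine similarity between two toy vectors; the `cosine_similarity` helper and the hard-coded vectors are illustrative stand-ins, not the library's API, and real metrics would call an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; vectors pointing in nearly the same
# direction score close to 1.0, unrelated texts score near 0.
answer_vec = [0.2, 0.8, 0.1, 0.4]
reference_vec = [0.25, 0.75, 0.05, 0.5]
print(f"semantic similarity: {cosine_similarity(answer_vec, reference_vec):.3f}")
```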


Quick Example

import asyncio
from eval_lib import evaluate, EvalTestCase, AnswerRelevancyMetric, FaithfulnessMetric

test_case = EvalTestCase(
    input="What is machine learning?",
    actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
    expected_output="Machine learning is a branch of artificial intelligence focused on building systems that learn from data.",
    retrieval_context=[
        "Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed."
    ]
)

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))

Example output:

======================================================================
📝 Test Case 1/1
======================================================================
Input: What is machine learning?

Test Case Summary:
  ✅ Overall: PASSED
  💰 Cost: $0.003400

  Metrics Breakdown:
    ✅ Answer Relevancy: 0.92
    ✅ Faithfulness: 0.95

======================================================================
                     📋 EVALUATION SUMMARY
======================================================================

Overall Results:
  ✅ Passed: 1 / 1
  📊 Success Rate: 100.0%

Resource Usage:
  💰 Total Cost: $0.003400
  ⏱️  Total Time: 4.12s

======================================================================

Architecture Overview

graph TB
    subgraph Core["🔧 Core"]
        direction LR
        TC[EvalTestCase]
        DS[EvalDataset]
        API["evaluate()"]
        AGG[Score Aggregation]
    end

    subgraph Metrics["📊 Metrics Layer"]
        direction LR
        RAG["RAG\nRelevancy · Faithfulness\nPrecision · Recall"]
        AGENT["Agent\nTool Correctness\nTask Success · Role"]
        SEC["Security\nInjection · Jailbreak\nPII · Harmful"]
        DET["Deterministic\nRegex · JSON Schema\nExact Match"]
        VEC["Vector\nSemantic Similarity\nReference Match"]
        CUST["Custom\nG-Eval\nCustom Metric"]
    end

    subgraph Providers["☁️ Provider Interface"]
        direction LR
        BASE["BaseLLMProvider"]
        IMPL["OpenAI · Azure · Gemini · Claude\nDeepSeek · Qwen · Mistral · Groq\nGrok · Zhipu · Ollama · Custom"]
    end

    subgraph Data["📁 Data Generation"]
        direction LR
        DG[DataGenerator]
        FMT["PDF · DOCX · CSV\nJSON · HTML · Images"]
    end

    subgraph Output["📈 Output"]
        direction LR
        RES[EvalResult]
        DASH[Dashboard]
    end

    Core --- Metrics
    Metrics --- Providers
    Core --- Data
    Core --- Output
    BASE --- IMPL

Metric Categories at a Glance

| Category | Metrics | Use Case |
|---|---|---|
| RAG | Answer Relevancy, Precision, Faithfulness, Contextual Relevancy/Precision/Recall, Bias, Toxicity | Evaluating retrieval-augmented generation pipelines |
| Agent | Tool Correctness, Task Success, Role Adherence, Knowledge Retention, Tool Errors | Evaluating AI agents and assistants |
| Security | Prompt Injection, Jailbreak, PII Leakage, Harmful Content, Policy Compliance, Restricted Refusal | Security and safety testing |
| Deterministic | Exact Match, Contains, Starts/Ends With, Regex, JSON Schema, Format/Length Check, Language Detection, Non-Empty | Rule-based checks without LLM |
| Vector | Semantic Similarity, Reference Match | Embedding-based semantic evaluation |
| Custom | Custom Metric, G-Eval | Domain-specific evaluations with custom criteria |
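
The Deterministic category needs no model calls at all. A minimal plain-Python sketch of what such rule-based checks amount to (the function names here are illustrative, not the library's metric classes):

```python
import json
import re

def exact_match(actual: str, expected: str) -> bool:
    """Strict string equality after trimming whitespace."""
    return actual.strip() == expected.strip()

def matches_regex(actual: str, pattern: str) -> bool:
    """True if the pattern occurs anywhere in the output."""
    return re.search(pattern, actual) is not None

def is_valid_json(actual: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(actual)
        return True
    except json.JSONDecodeError:
        return False

output = '{"answer": "42"}'
print(exact_match(output, '{"answer": "42"}'))       # True
print(matches_regex(output, r'"answer":\s*"\d+"'))   # True
print(is_valid_json(output))                         # True
```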

Score Aggregation

Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean — a novel approach that provides nuanced scoring through configurable strictness levels.

| Temperature | Behavior | Best For |
|---|---|---|
| 0.1 | Strict (close to minimum) | Safety-critical applications |
| 0.5 | Balanced (arithmetic mean) | General evaluation |
| 1.0 | Lenient (close to maximum) | Creative tasks |

Learn more about Score Aggregation
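
The strict/balanced/lenient behavior can be sketched with a generalized power mean. The library's exact temperature-to-exponent mapping is not documented here, so the exponents below are chosen by hand to mimic the three regimes: a large negative exponent pulls the aggregate toward the worst verdict, exponent 1 is the arithmetic mean, and a large positive exponent pulls it toward the best verdict.

```python
import math

def power_mean(scores, p, eps=1e-9):
    """Generalized power mean M_p(x) = (mean(x_i ** p)) ** (1/p)."""
    scores = [max(s, eps) for s in scores]  # guard against 0 ** negative-p
    if p == 0:  # limit case p -> 0 is the geometric mean
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1 / p)

verdicts = [1.0, 0.9, 0.3]  # per-verdict scores from one evaluation
print(round(power_mean(verdicts, p=-10), 3))  # strict: pulled toward the 0.3 minimum
print(round(power_mean(verdicts, p=1), 3))    # balanced: arithmetic mean, 0.733
print(round(power_mean(verdicts, p=10), 3))   # lenient: pulled toward the 1.0 maximum
```

Because the power mean is monotone in its exponent, sweeping the strictness knob moves the aggregate smoothly between the minimum and the maximum verdict.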


Installation

pip install eval-ai-library

Requires Python 3.9+. See Installation Guide for detailed instructions.