Eval AI Library

Comprehensive AI Model Evaluation Framework

Evaluate RAG systems, AI agents, and LLM security with 15+ metrics, multi-provider support, and advanced scoring techniques.


Key Features

  • 15+ Evaluation Metrics


    RAG metrics, agent evaluations, security checks, and custom metrics — everything you need to evaluate AI systems comprehensively.

    RAG Metrics

  • Agent Evaluation


    Tool correctness, task success rate, role adherence, knowledge retention, and error detection for AI agents.

    Agent Metrics

  • Security Testing


    Prompt injection & jailbreak detection/resistance, PII leakage, harmful content detection, and policy compliance.

    Security Metrics

  • Multi-Provider Support


    OpenAI, Azure OpenAI, Google Gemini, Anthropic Claude, Ollama, and custom LLM providers through a unified interface.

    LLM Providers

  • G-Eval & Custom Metrics


    State-of-the-art G-Eval with probability-weighted scoring, plus fully customizable evaluation criteria.

    Custom Evaluation

  • Test Data Generation


    Generate test cases from documents in 15+ formats including PDF, DOCX, CSV, JSON, HTML, and images with OCR.

    Data Generation


Quick Example

```python
import asyncio
from eval_lib import evaluate, EvalTestCase, AnswerRelevancyMetric, FaithfulnessMetric

test_case = EvalTestCase(
    input="What is machine learning?",
    actual_output="Machine learning is a subset of AI that enables systems to learn from data.",
    expected_output="Machine learning is a branch of artificial intelligence focused on building systems that learn from data.",
    retrieval_context=[
        "Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed."
    ]
)

metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate([test_case], metrics))
```

Expected output:

```text
======================================================================
📝 Test Case 1/1
======================================================================
Input: What is machine learning?

Test Case Summary:
  ✅ Overall: PASSED
  💰 Cost: $0.003400

  Metrics Breakdown:
    ✅ Answer Relevancy: 0.92
    ✅ Faithfulness: 0.95

======================================================================
                     📋 EVALUATION SUMMARY
======================================================================

Overall Results:
  ✅ Passed: 1 / 1
  📊 Success Rate: 100.0%

Resource Usage:
  💰 Total Cost: $0.003400
  ⏱️  Total Time: 4.12s

======================================================================
```
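The pass/fail logic in the summary follows from the metric thresholds: a metric passes when its score is at or above its threshold, and a test case passes when all of its metrics pass. A minimal sketch of that logic (illustrative only, not the library's internals):

```python
# Illustrative pass/fail logic: a metric passes when score >= threshold,
# and a test case passes when every metric on it passes.
scores = {"Answer Relevancy": 0.92, "Faithfulness": 0.95}
threshold = 0.7

metric_passed = {name: score >= threshold for name, score in scores.items()}
case_passed = all(metric_passed.values())

passed_cases = int(case_passed)          # 1 of 1 test cases
success_rate = 100.0 * passed_cases / 1  # 100.0, matching the summary above
```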

Architecture Overview

```mermaid
graph TD
    A[Test Cases] --> B[Evaluation Engine]
    B --> C{Metrics}
    C --> D[RAG Metrics]
    C --> E[Agent Metrics]
    C --> F[Security Metrics]
    C --> G[Custom / G-Eval]

    D --> H[LLM Provider]
    E --> H
    F --> H
    G --> H

    H --> I[OpenAI]
    H --> J[Azure]
    H --> K[Gemini]
    H --> L[Claude]
    H --> M[Ollama]
    H --> N[Custom]

    B --> O[Results & Dashboard]
```

Metric Categories at a Glance

| Category | Metrics | Use Case |
|----------|---------|----------|
| RAG | Answer Relevancy, Precision, Faithfulness, Contextual Relevancy/Precision/Recall, Bias, Toxicity | Evaluating retrieval-augmented generation pipelines |
| Agent | Tool Correctness, Task Success, Role Adherence, Knowledge Retention, Tool Errors | Evaluating AI agents and assistants |
| Security | Prompt Injection, Jailbreak, PII Leakage, Harmful Content, Policy Compliance | Security and safety testing |
| Custom | Custom Metric, G-Eval | Domain-specific evaluations with custom criteria |

Score Aggregation

Eval AI Library uses Temperature-Controlled Verdict Aggregation via Generalized Power Mean: per-verdict scores are combined with a power mean whose strictness is tuned by a single temperature parameter.

| Temperature | Behavior | Best For |
|-------------|----------|----------|
| 0.1 | Strict (close to minimum) | Safety-critical applications |
| 0.5 | Balanced (arithmetic mean) | General evaluation |
| 1.0 | Lenient (close to maximum) | Creative tasks |
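The generalized power mean of scores x₁…xₙ with exponent p is Mₚ = (mean of xᵢᵖ)^(1/p): large negative p approaches the minimum verdict (strict), p = 1 is the arithmetic mean (balanced), and large positive p approaches the maximum (lenient). A standalone sketch of this behavior; the library's exact mapping from temperature to exponent is internal, so the exponents below are illustrative:

```python
import math

def power_mean(scores, p):
    """Generalized power mean M_p(x) = (mean(x_i ** p)) ** (1 / p).

    p -> -inf approaches min (strict), p = 1 is the arithmetic mean
    (balanced), p -> +inf approaches max (lenient).
    """
    if p == 0:
        # The p -> 0 limit is the geometric mean.
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)

verdicts = [0.2, 0.8, 1.0]
strict   = power_mean(verdicts, -50)  # close to min(verdicts) = 0.2
balanced = power_mean(verdicts, 1)    # arithmetic mean, about 0.667
lenient  = power_mean(verdicts, 50)   # close to max(verdicts) = 1.0
```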

Learn more about Score Aggregation


Installation

```bash
pip install eval-ai-library
```

Requires Python 3.9+. See Installation Guide for detailed instructions.