Test Case Generation

Eval AI Library includes a powerful test case generator that creates evaluation datasets from your documents. It supports 15+ document formats including PDF, DOCX, CSV, JSON, HTML, and images with OCR.

Supported Formats

| Category | Formats |
| --- | --- |
| Text | `.txt`, `.md`, `.rtf`, `.xml`, `.json`, `.yaml`, `.html` |
| Office | `.pdf`, `.docx`, `.docm`, `.xlsx`, `.pptx` |
| Data | `.csv`, `.tsv` |
| Images | `.png`, `.jpg`, `.jpeg` (with OCR via Tesseract) |
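The table above can be turned into a simple pre-flight check before calling the generator. This is a sketch: `is_supported` is a hypothetical helper, not part of the library, and the extension set is copied directly from the table:

```python
from pathlib import Path

# Extensions from the supported-formats table
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".rtf", ".xml", ".json", ".yaml", ".html",  # text
    ".pdf", ".docx", ".docm", ".xlsx", ".pptx",                # office
    ".csv", ".tsv",                                            # data
    ".png", ".jpg", ".jpeg",                                   # images (OCR)
}

def is_supported(path: str) -> bool:
    """Return True if the file's extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Checking extensions up front lets you skip or report unsupported files before spending LLM tokens on generation.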

Quick Start

from eval_lib import DatasetGenerator

generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="answer",
    agent_description="You are a helpful assistant that answers questions about machine learning.",
    max_rows=20,
    language="en"
)

test_cases = generator.generate(file_path="./knowledge_base.pdf")

# Returns a list of EvalTestCase objects
for tc in test_cases:
    print(f"Q: {tc.input}")
    print(f"A: {tc.expected_output}")
    print(f"Context: {tc.retrieval_context[:1]}")
    print("---")

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | required | LLM used for generation |
| `input_format` | `str` | required | How inputs are formatted (e.g., `"question"`) |
| `expected_output_format` | `str` | required | Expected answer format (e.g., `"answer"`) |
| `agent_description` | `str` | `None` | System role/context for generation |
| `test_types` | `list[str]` | `None` | Types of test cases to generate |
| `question_length` | `str` | `"mixed"` | `"short"`, `"medium"`, `"long"`, or `"mixed"` |
| `question_openness` | `str` | `"mixed"` | `"open"`, `"closed"`, or `"mixed"` |
| `chunk_size` | `int` | `1024` | Document chunk size in characters |
| `chunk_overlap` | `int` | `100` | Overlap between consecutive chunks (characters) |
| `max_rows` | `int` | `10` | Number of test cases to generate |
| `temperature` | `float` | `0.3` | Generation temperature |
| `trap_density` | `float` | `0.1` | Proportion of trap/adversarial questions |
| `language` | `str` | `"en"` | Language of generated test cases |
| `embedding_model` | `str` | `"openai:text-embedding-3-small"` | Model for semantic similarity |
| `relevance_margin` | `float` | `1.5` | Threshold for context relevance |
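To make the defaults and valid ranges above concrete, here is a sketch of how the enumerated and numeric parameters could be validated before constructing a generator. `validate_config` is a hypothetical helper for illustration, not library code; the defaults mirror the table:

```python
VALID_LENGTHS = {"short", "medium", "long", "mixed"}
VALID_OPENNESS = {"open", "closed", "mixed"}

def validate_config(overrides: dict) -> dict:
    """Apply the documented defaults, then reject out-of-range values."""
    cfg = {
        "question_length": "mixed",
        "question_openness": "mixed",
        "chunk_size": 1024,
        "chunk_overlap": 100,
        "max_rows": 10,
        "temperature": 0.3,
        "trap_density": 0.1,
        "language": "en",
        **overrides,
    }
    if cfg["question_length"] not in VALID_LENGTHS:
        raise ValueError(f"question_length must be one of {VALID_LENGTHS}")
    if cfg["question_openness"] not in VALID_OPENNESS:
        raise ValueError(f"question_openness must be one of {VALID_OPENNESS}")
    if not 0.0 <= cfg["trap_density"] <= 1.0:
        raise ValueError("trap_density must be between 0 and 1")
    return cfg
```

For example, `validate_config({"trap_density": 0.2, "max_rows": 50})` keeps the default `chunk_size` of 1024 while overriding only the two supplied values.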

Document Loading

You can also use the document loader directly:

from eval_lib import DocumentLoader

# Load documents
docs = DocumentLoader.load_documents("./data/report.pdf")

# Chunk for RAG evaluation
chunks = DocumentLoader.chunk_documents(docs, chunk_size=1024, overlap=100)
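The character-based chunking with overlap can be approximated in plain Python. This is a sketch of the documented semantics (fixed-size windows of `chunk_size` characters, each sharing `overlap` characters with the previous one), not the library's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Overlap matters for RAG evaluation because facts that straddle a chunk boundary would otherwise be split across two contexts and recoverable from neither.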

Advanced Examples

Generate from Multiple Sources

generator = DatasetGenerator(
    model="gpt-4o",
    input_format="technical question",
    expected_output_format="detailed technical answer",
    max_rows=50,
    language="en"
)

# Generate from different document types
test_cases_pdf = generator.generate(file_path="./docs/api_reference.pdf")
test_cases_md = generator.generate(file_path="./docs/user_guide.md")
test_cases_csv = generator.generate(file_path="./data/faq.csv")

all_test_cases = test_cases_pdf + test_cases_md + test_cases_csv

Customize Question Types

# Short, factual questions
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="brief factual answer",
    question_length="short",
    question_openness="closed",
    max_rows=30
)

# Open-ended, detailed questions
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="comprehensive answer with examples",
    question_length="long",
    question_openness="open",
    max_rows=15
)

With Trap Questions

Trap questions test a model's ability to say "I don't know" when the answer is not present in the source context:

generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="answer",
    trap_density=0.2,  # 20% of questions will be traps
    max_rows=50
)

Multilingual Generation

# Russian
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="вопрос",
    expected_output_format="подробный ответ",
    language="ru",
    max_rows=20
)

# Spanish
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="pregunta",
    expected_output_format="respuesta detallada",
    language="es",
    max_rows=20
)

Use Generated Test Cases

import asyncio

from eval_lib import evaluate, AnswerRelevancyMetric, FaithfulnessMetric

# Generate test cases
test_cases = generator.generate(file_path="./knowledge_base.pdf")

# Evaluate your RAG system
metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate(test_cases, metrics))

OCR for Images

For image-based documents, ensure Tesseract is installed:

# Extracts text from images using OCR
test_cases = generator.generate(file_path="./scanned_document.png")

See Installation for Tesseract setup instructions.