Test Case Generation

Eval AI Library includes a powerful test case generator that creates evaluation datasets from your documents. It supports 15+ document formats including PDF, DOCX, CSV, JSON, HTML, and images with OCR.

Supported Formats

| Category | Formats |
| --- | --- |
| Text | `.txt`, `.md`, `.rtf`, `.xml`, `.json`, `.yaml`, `.html` |
| Office | `.pdf`, `.docx`, `.docm`, `.xlsx`, `.pptx` |
| Data | `.csv`, `.tsv` |
| Images | `.png`, `.jpg`, `.jpeg` (with OCR via Tesseract) |
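The table above can be turned into a simple pre-flight check before calling the generator. This is a sketch: `is_supported` is a hypothetical helper, not part of the library, and the extension set is copied directly from the table:

```python
from pathlib import Path

# Extensions from the supported-formats table
SUPPORTED_EXTENSIONS = {
    ".txt", ".md", ".rtf", ".xml", ".json", ".yaml", ".html",  # text
    ".pdf", ".docx", ".docm", ".xlsx", ".pptx",                # office
    ".csv", ".tsv",                                            # data
    ".png", ".jpg", ".jpeg",                                   # images (OCR)
}

def is_supported(path: str) -> bool:
    """Return True if the file's extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

Checking extensions up front lets you skip or report unsupported files before spending LLM tokens on generation.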

Quick Start

from eval_lib import DatasetGenerator

generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="answer",
    agent_description="You are a helpful assistant that answers questions about machine learning.",
    max_rows=20,
    language="en"
)

test_cases = generator.generate(file_path="./knowledge_base.pdf")

# Returns a list of EvalTestCase objects
for tc in test_cases:
    print(f"Q: {tc.input}")
    print(f"A: {tc.expected_output}")
    print(f"Context: {tc.retrieval_context[:1]}")
    print("---")

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | required | LLM used for generation |
| `input_format` | `str` | required | How inputs are formatted (e.g., `"question"`) |
| `expected_output_format` | `str` | required | Expected answer format (e.g., `"answer"`) |
| `agent_description` | `str` | `None` | System role/context for generation |
| `test_types` | `list[str]` | `None` | Types of test cases to generate |
| `question_length` | `str` | `"mixed"` | `"short"`, `"medium"`, `"long"`, or `"mixed"` |
| `question_openness` | `str` | `"mixed"` | `"open"`, `"closed"`, or `"mixed"` |
| `chunk_size` | `int` | `1024` | Document chunk size in characters |
| `chunk_overlap` | `int` | `100` | Overlap between consecutive chunks (characters) |
| `max_rows` | `int` | `10` | Number of test cases to generate |
| `temperature` | `float` | `0.3` | Generation temperature |
| `trap_density` | `float` | `0.1` | Proportion of trap/adversarial questions |
| `language` | `str` | `"en"` | Language of generated test cases |
| `embedding_model` | `str` | `"openai:text-embedding-3-small"` | Model for semantic similarity |
| `relevance_margin` | `float` | `1.5` | Threshold for context relevance |
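To make the defaults and valid ranges above concrete, here is a sketch of how the enumerated and numeric parameters could be validated before constructing a generator. `validate_config` is a hypothetical helper for illustration, not library code; the defaults mirror the table:

```python
VALID_LENGTHS = {"short", "medium", "long", "mixed"}
VALID_OPENNESS = {"open", "closed", "mixed"}

def validate_config(overrides: dict) -> dict:
    """Apply the documented defaults, then reject out-of-range values."""
    cfg = {
        "question_length": "mixed",
        "question_openness": "mixed",
        "chunk_size": 1024,
        "chunk_overlap": 100,
        "max_rows": 10,
        "temperature": 0.3,
        "trap_density": 0.1,
        "language": "en",
        **overrides,
    }
    if cfg["question_length"] not in VALID_LENGTHS:
        raise ValueError(f"question_length must be one of {VALID_LENGTHS}")
    if cfg["question_openness"] not in VALID_OPENNESS:
        raise ValueError(f"question_openness must be one of {VALID_OPENNESS}")
    if not 0.0 <= cfg["trap_density"] <= 1.0:
        raise ValueError("trap_density must be between 0 and 1")
    return cfg
```

For example, `validate_config({"trap_density": 0.2, "max_rows": 50})` keeps the default `chunk_size` of 1024 while overriding only the two supplied values.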

Document Loading

You can also use the document loader directly:

from eval_lib import DocumentLoader

# Load documents
docs = DocumentLoader.load_documents("./data/report.pdf")

# Chunk for RAG evaluation
chunks = DocumentLoader.chunk_documents(docs, chunk_size=1024, overlap=100)
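The character-based chunking with overlap can be approximated in plain Python. This is a sketch of the documented semantics (fixed-size windows of `chunk_size` characters, each sharing `overlap` characters with the previous one), not the library's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Overlap matters for RAG evaluation because facts that straddle a chunk boundary would otherwise be split across two contexts and recoverable from neither.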

Advanced Examples

Generate from Multiple Sources

generator = DatasetGenerator(
    model="gpt-4o",
    input_format="technical question",
    expected_output_format="detailed technical answer",
    max_rows=50,
    language="en"
)

# Generate from different document types
test_cases_pdf = generator.generate(file_path="./docs/api_reference.pdf")
test_cases_md = generator.generate(file_path="./docs/user_guide.md")
test_cases_csv = generator.generate(file_path="./data/faq.csv")

all_test_cases = test_cases_pdf + test_cases_md + test_cases_csv

Customize Question Types

# Short, factual questions
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="brief factual answer",
    question_length="short",
    question_openness="closed",
    max_rows=30
)

# Open-ended, detailed questions
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="comprehensive answer with examples",
    question_length="long",
    question_openness="open",
    max_rows=15
)

With Trap Questions

Trap questions test a model's ability to say "I don't know" when the answer is not present in the source context:

generator = DatasetGenerator(
    model="gpt-4o",
    input_format="question",
    expected_output_format="answer",
    trap_density=0.2,  # 20% of questions will be traps
    max_rows=50
)

Multilingual Generation

# Russian
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="вопрос",
    expected_output_format="подробный ответ",
    language="ru",
    max_rows=20
)

# Spanish
generator = DatasetGenerator(
    model="gpt-4o",
    input_format="pregunta",
    expected_output_format="respuesta detallada",
    language="es",
    max_rows=20
)

Use Generated Test Cases

import asyncio

from eval_lib import evaluate, AnswerRelevancyMetric, FaithfulnessMetric

# Generate test cases
test_cases = generator.generate(file_path="./knowledge_base.pdf")

# Evaluate your RAG system
metrics = [
    AnswerRelevancyMetric(model="gpt-4o", threshold=0.7),
    FaithfulnessMetric(model="gpt-4o", threshold=0.7),
]

results = asyncio.run(evaluate(test_cases, metrics))

OCR for Images

For image-based documents, ensure Tesseract is installed:

# Extracts text from images using OCR
test_cases = generator.generate(file_path="./scanned_document.png")

See Installation for Tesseract setup instructions.