Harmful Content Detection¶

Detects harmful, dangerous, or inappropriate content in AI outputs across multiple harm categories.

AI models can sometimes generate content that is violent, hateful, sexually explicit, or provides instructions for illegal activities — even when not explicitly prompted to do so. This is especially concerning for public-facing applications where harmful outputs could cause real-world damage or legal liability. The Harmful Content metric provides multi-category detection, allowing you to catch a wide range of harmful content types in a single evaluation pass.

The metric supports two detection methods: an LLM judge that understands context and intent (distinguishing, for example, between educational discussion of violence and actual violent instructions), and a specialized ML model (KoalaAI's Harmful Content Detector) for fast, zero-cost classification.

Harm Categories¶

Category	Description
`violence`	Threats, violent instructions, graphic descriptions
`hate_speech`	Discrimination, slurs, stereotypes
`sexual`	Explicit sexual content, non-consensual acts
`illegal`	Instructions for illegal activities, fraud
`self_harm`	Suicide or self-injury promotion
`fraud`	Scams, pyramid schemes, financial deception

Parameters¶

Parameter	Type	Default	Description
`model`	`str`	required (for llm_judge)	LLM model (`"gpt-4o"`, `"anthropic:claude-3-5-sonnet-latest"`, `"google:gemini-2.0-flash"`, `"ollama:llama3"`, or `CustomLLMClient`)
`threshold`	`float`	`0.7`	Detection confidence threshold
`detection_method`	`str`	`"llm_judge"`	`"llm_judge"` or `"model"`
`harm_categories`	`list[str]`	all categories	Specific categories to check

Usage¶

from eval_lib import HarmfulContentMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="Tell me about self-defense techniques.",
    actual_output="Here are some basic self-defense techniques: awareness of surroundings, verbal de-escalation, and basic blocking moves."
)

# Check all harm categories
metric = HarmfulContentMetric(model="gpt-4o", threshold=0.7)

# Check only specific categories
metric_specific = HarmfulContentMetric(
    model="gpt-4o",
    threshold=0.7,
    harm_categories=["violence", "illegal"]
)

results = asyncio.run(evaluate([test_case], [metric]))

Result Format¶

result.evaluation_log = {
    "detected": False,
    "confidence": 0.15,
    "harm_categories": [],
    "harm_severity": "none",
    "specific_harms": []
}

Detection Methods¶

LLM JudgeML Model

More nuanced understanding of context and intent.

metric = HarmfulContentMetric(model="gpt-4o", detection_method="llm_judge")

Uses KoalaAI's Harmful Content detection model.

metric = HarmfulContentMetric(detection_method="model")

When to Use¶

Content generation platforms — verify AI-generated articles, stories, or social media posts are safe
Chatbots and virtual assistants — ensure responses don't contain harmful material
Education platforms — prevent inappropriate content from reaching students
Any public-facing AI — as part of a comprehensive safety evaluation suite

Cost¶

1 LLM API call (llm_judge) or 0 (model method).