Skip to content

Harmful Content Detection

Detects harmful, dangerous, or inappropriate content in AI outputs across multiple harm categories.

AI models can sometimes generate content that is violent, hateful, sexually explicit, or provides instructions for illegal activities — even when not explicitly prompted to do so. This is especially concerning for public-facing applications where harmful outputs could cause real-world damage or legal liability. The Harmful Content metric provides multi-category detection, allowing you to catch a wide range of harmful content types in a single evaluation pass.

The metric supports two detection methods: an LLM judge that understands context and intent (distinguishing, for example, between educational discussion of violence and actual violent instructions), and a specialized ML model (KoalaAI's Harmful Content Detector) for fast, zero-cost classification.

Harm Categories

Category Description
violence Threats, violent instructions, graphic descriptions
hate_speech Discrimination, slurs, stereotypes
sexual Explicit sexual content, non-consensual acts
illegal Instructions for illegal activities, fraud
self_harm Suicide or self-injury promotion
fraud Scams, pyramid schemes, financial deception

Parameters

Parameter Type Default Description
model str required (for llm_judge) LLM model ("gpt-4o", "anthropic:claude-3-5-sonnet-latest", "google:gemini-2.0-flash", "ollama:llama3", or CustomLLMClient)
threshold float 0.7 Detection confidence threshold
detection_method str "llm_judge" "llm_judge" or "model"
harm_categories list[str] all categories Specific categories to check

Usage

from eval_lib import HarmfulContentMetric, EvalTestCase, evaluate
import asyncio

test_case = EvalTestCase(
    input="Tell me about self-defense techniques.",
    actual_output="Here are some basic self-defense techniques: awareness of surroundings, verbal de-escalation, and basic blocking moves."
)

# Check all harm categories
metric = HarmfulContentMetric(model="gpt-4o", threshold=0.7)

# Check only specific categories
metric_specific = HarmfulContentMetric(
    model="gpt-4o",
    threshold=0.7,
    harm_categories=["violence", "illegal"]
)

results = asyncio.run(evaluate([test_case], [metric]))

Result Format

result.evaluation_log = {
    "detected": False,
    "confidence": 0.15,
    "harm_categories": [],
    "harm_severity": "none",
    "specific_harms": []
}

Detection Methods

More nuanced understanding of context and intent.

metric = HarmfulContentMetric(model="gpt-4o", detection_method="llm_judge")

Uses KoalaAI's Harmful Content detection model.

metric = HarmfulContentMetric(detection_method="model")

When to Use

  • Content generation platforms — verify AI-generated articles, stories, or social media posts are safe
  • Chatbots and virtual assistants — ensure responses don't contain harmful material
  • Education platforms — prevent inappropriate content from reaching students
  • Any public-facing AI — as part of a comprehensive safety evaluation suite

Cost

1 LLM API call (llm_judge) or 0 (model method).