content-moderation-api
Install: npx machina-cli add skill a5c-ai/babysitter/content-moderation-api --openclaw
Content Moderation API Skill
Capabilities
- Integrate OpenAI Moderation API
- Set up Perspective API for toxicity detection
- Configure moderation thresholds
- Implement content filtering pipelines
- Design moderation response handling
- Create moderation logging and reporting
Target Processes
- content-moderation-safety
- system-prompt-guardrails
Implementation Details
Moderation APIs
- OpenAI Moderation: Hate, violence, self-harm, sexual content
- Perspective API: Toxicity, insult, profanity, threat
- Azure Content Safety: Text and image moderation
- LlamaGuard: Open-source safety classifier
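As an illustration of what a provider call looks like, here is a hedged sketch of querying the Perspective API through `google-api-python-client`'s Comment Analyzer discovery service. The attribute names follow Perspective's conventions; the `summary_score` helper and the sample response are illustrative, not part of any published schema.

```python
# Sample of the shape Perspective returns, used by the helper below.
SAMPLE_RESPONSE = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.82}}
    }
}

def summary_score(response: dict, attribute: str) -> float:
    """Extract an attribute's summary score from a Perspective response."""
    return response["attributeScores"][attribute]["summaryScore"]["value"]

def analyze_with_perspective(text: str, api_key: str) -> dict:
    """Send one comment to Perspective; requires a valid API key."""
    from googleapiclient import discovery
    client = discovery.build(
        "commentanalyzer", "v1alpha1",
        developerKey=api_key,
        discoveryServiceUrl=(
            "https://commentanalyzer.googleapis.com/"
            "$discovery/rest?version=v1alpha1"),
    )
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}, "THREAT": {}},
    }
    return client.comments().analyze(body=body).execute()
```

The other providers follow the same pattern: send text, get back per-category scores to compare against your thresholds.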
Configuration Options
- API credentials and endpoints
- Category thresholds
- Action policies (block, warn, flag)
- Logging configuration
- Fallback behavior
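The options above might be collected into a single configuration object along these lines; every key name here is illustrative, not a fixed schema.

```python
# Hypothetical moderation configuration -- names and values are
# illustrative assumptions, not part of any published schema.
MODERATION_CONFIG = {
    "providers": {
        "openai": {"endpoint": "https://api.openai.com/v1/moderations"},
        "perspective": {"endpoint": "https://commentanalyzer.googleapis.com"},
    },
    "thresholds": {        # category score above which an action fires
        "hate": 0.40,
        "violence": 0.50,
        "self_harm": 0.30,
        "toxicity": 0.70,
    },
    "actions": {           # policy applied when a threshold is crossed
        "hate": "block",
        "violence": "block",
        "self_harm": "flag",
        "toxicity": "warn",
    },
    "logging": {"level": "INFO", "destination": "moderation.log"},
    "fallback": "flag",    # action taken when a provider call fails
}
```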
Best Practices
- Set appropriate thresholds for each category
- Handle edge cases gracefully
- Log every moderation decision
- Review thresholds regularly
- Apply multi-layer moderation
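Multi-layer moderation can be sketched as chaining several classifiers and letting the most severe verdict win. The layers below are stand-ins for real provider calls; the function name and the borderline-review rule are illustrative assumptions.

```python
from typing import Callable, Dict, List

# Each layer maps text to category -> score in [0, 1]; real layers would
# wrap OpenAI Moderation, Perspective, LlamaGuard, etc.
Classifier = Callable[[str], Dict[str, float]]

def multi_layer_moderate(text: str,
                         layers: List[Classifier],
                         thresholds: Dict[str, float]) -> str:
    """Run every layer; any hard hit blocks, borderline scores flag."""
    verdict = "allow"
    for classify in layers:
        for category, score in classify(text).items():
            limit = thresholds.get(category, 1.0)
            if score >= limit:
                return "block"      # any hard hit short-circuits
            if score >= 0.5 * limit:
                verdict = "flag"    # borderline: keep for human review
    return verdict

# Stub layers standing in for real providers.
layer_a = lambda t: {"hate": 0.9 if "slur" in t else 0.0}
layer_b = lambda t: {"toxicity": 0.3}
```

Running layers in sequence also gives you a natural fallback: if one provider is down, the remaining layers still produce a verdict.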
Dependencies
- openai
- google-api-python-client (Perspective API)
- azure-ai-contentsafety
Source
https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/ai-agents-conversational/skills/content-moderation-api/SKILL.md
Overview
This skill integrates multiple moderation APIs (OpenAI Moderation, Perspective, Azure Content Safety, and LlamaGuard) to detect and filter unsafe content. It supports configurable thresholds and action policies, plus logging and reporting for compliance and audits, making it easy to enforce safe interactions across chat, forums, and user-generated content.
How This Skill Works
Content is sent to a suite of moderation APIs that classify it for hate, violence, self-harm, toxicity, profanity, and more. Based on per-category thresholds, the system applies an action (block, warn, or flag) and routes the decision to logs and dashboards. Fallback behavior and centralized policy management keep enforcement consistent.
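The threshold-to-action routing described above can be sketched as follows; category names, policies, and the severity ordering are illustrative assumptions, and the `print` stands in for the logging/reporting sink.

```python
import json
from typing import Dict

# Illustrative severity ordering: pick the strongest triggered policy.
ACTION_SEVERITY = {"allow": 0, "warn": 1, "flag": 2, "block": 3}

def route(scores: Dict[str, float],
          thresholds: Dict[str, float],
          policies: Dict[str, str]) -> dict:
    """Apply the most severe policy among categories that crossed a threshold."""
    action = "allow"
    triggered = []
    for category, score in scores.items():
        if score >= thresholds.get(category, 1.0):
            triggered.append(category)
            candidate = policies.get(category, "flag")
            if ACTION_SEVERITY[candidate] > ACTION_SEVERITY[action]:
                action = candidate
    decision = {"action": action, "triggered": triggered, "scores": scores}
    print(json.dumps(decision))  # stand-in for the logging/reporting sink
    return decision

decision = route({"hate": 0.8, "profanity": 0.2},
                 {"hate": 0.4, "profanity": 0.6},
                 {"hate": "block", "profanity": "warn"})
```

Emitting the full decision record, not just the action, is what makes later audits and threshold reviews possible.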
When to Use It
- Moderating user-generated messages in chat applications to block hate speech and self-harm content.
- Enforcing guardrails around system prompts and assistant outputs to prevent unsafe instructions.
- Filtering text and image content with Azure Content Safety to handle multimedia submissions.
- Tuning thresholds per product or community to balance safety and user experience.
- Auditing, logging, and reporting moderation decisions for compliance and reviews.
Quick Start
- Step 1: Acquire API keys and endpoints for OpenAI Moderation, Perspective API, Azure Content Safety, and (optionally) LlamaGuard.
- Step 2: Configure category thresholds and action policies (block, warn, flag), logging, and fallback behavior.
- Step 3: Integrate into your moderation pipeline, enable multi-layer moderation, and test with edge-case content samples.
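The steps above might be wired together along these lines, using the official `openai` v1 Python client for the provider call; the function names and the local threshold helper are illustrative.

```python
# Minimal sketch of the Quick Start wiring. `check_with_openai` needs a
# real key and network access; `crosses_threshold` is a pure local check.
def check_with_openai(text: str) -> bool:
    """Return True if the OpenAI Moderation endpoint flags the text."""
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment
    client = OpenAI()
    resp = client.moderations.create(
        model="omni-moderation-latest", input=text)
    return resp.results[0].flagged

def crosses_threshold(scores: dict, thresholds: dict) -> bool:
    """Did any provider category score cross its configured threshold?"""
    return any(scores.get(c, 0.0) >= t for c, t in thresholds.items())

# Usage (needs network + a real key):
#   if check_with_openai(message) or crosses_threshold(
#           perspective_scores, {"toxicity": 0.7}):
#       apply_action("block")   # hypothetical action handler
```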
Example Use Cases
- Moderating comments on a social app or forum
- Moderation in a customer support chatbot
- Detecting self-harm, violence, or hate in messages
- Reviewing user reviews or feedback for policy violations
- Image and text moderation for user-submitted media via Azure Content Safety