AI Safety Alignment
Install: npx machina-cli add skill omer-metin/skills-for-antigravity/ai-safety-alignment --openclaw
Identity
Principles
- Defense in Depth: No single guardrail is foolproof. Layer multiple defenses: input validation → content moderation → output filtering → human review. Each layer catches what others miss.
- Validate Both Inputs AND Outputs: User input can be malicious (injection). Model output can be harmful (hallucination, toxic content). Check both sides of every LLM call.
- Fail Closed, Not Open: When guardrails fail or time out, reject the request rather than passing potentially harmful content. Security > availability.
- Keep Humans in the Loop: For high-risk actions (sending emails, executing code, accessing sensitive data), require human approval. Automated systems can be manipulated.
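The first three principles can be sketched as a single pipeline. This is a minimal illustration, not part of the skill itself; all names (`Verdict`, `run_guardrails`, `block_injection`) are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def block_injection(text: str) -> Verdict:
    # Toy input layer; a production system would use a trained classifier.
    if "ignore previous instructions" in text.lower():
        return Verdict(False, "possible prompt injection")
    return Verdict(True)

def run_guardrails(text: str, layers: list[Callable[[str], Verdict]]) -> Verdict:
    """Defense in depth: every layer must pass; any error fails closed."""
    for layer in layers:
        try:
            verdict = layer(text)
        except Exception as exc:
            # Fail closed: a broken or timed-out guardrail rejects the request.
            return Verdict(False, f"guardrail error: {exc}")
        if not verdict.allowed:
            return verdict
    return Verdict(True)
```

Note that an exception inside any layer produces a rejection, not a pass-through: that is the fail-closed behavior the third principle calls for.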
Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
- For Creation: Always consult references/patterns.md. This file dictates how things should be built. Ignore generic approaches if a specific pattern exists here.
- For Diagnosis: Always consult references/sharp_edges.md. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
- For Review: Always consult references/validations.md. This contains the strict rules and constraints. Use it to validate user inputs objectively.
Note: If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
Source
git clone https://github.com/omer-metin/skills-for-antigravity
View on GitHub: https://github.com/omer-metin/skills-for-antigravity/blob/main/skills/ai-safety-alignment/SKILL.md
Overview
AI Safety Alignment provides comprehensive guardrails for LLM applications, covering content moderation (OpenAI Moderation API), jailbreak prevention, prompt injection defense, PII detection, topic guardrails, and output validation. These layers are designed for production AI systems handling user-generated content, reducing risk and improving compliance. Grounded in defense-in-depth and fail-closed principles, it promotes human review for high-risk actions.
How This Skill Works
It applies multiple layers of protection to every LLM call: input validation, content moderation via the OpenAI Moderation API, and output filtering with strict validation. Both inputs and outputs are checked to catch prompt injection, jailbreak attempts, or PII exposure. When a risk is detected or a guardrail times out, the system fails closed and optionally escalates to human review.
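In outline, that means wrapping every model call between an input check and an output check. A minimal sketch, assuming a stand-in `call_llm` and illustrative `input_ok`/`output_ok` checks (none of these names come from the skill):

```python
BLOCKED = "Sorry, I can't help with that."

def input_ok(text: str) -> bool:
    # Illustrative input-side check; real systems layer several detectors.
    return "system prompt" not in text.lower()

def output_ok(text: str) -> bool:
    # Illustrative output-side check (e.g. a length cap or schema validation).
    return len(text) < 2000

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"echo: {prompt}"

def safe_call(user_input: str) -> str:
    if not input_ok(user_input):   # validate the input side
        return BLOCKED
    output = call_llm(user_input)
    if not output_ok(output):      # validate the output side
        return BLOCKED             # fail closed on harmful output
    return output
```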
When to Use It
- Processing user-generated content in chatbots or forums
- When you need enforceable content moderation and safety guarantees
- To defend against jailbreaks and prompt injection attempts
- To detect and redact PII in user messages
- When high-risk actions require human review or escalation
Quick Start
- Step 1: Map guardrails to your LLM workflow, including input validation, moderation, and output validation
- Step 2: Integrate the OpenAI Moderation API plus custom detectors for PII and sensitive topics
- Step 3: Implement fail-closed controls and set up monitoring with a human review funnel
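Step 2 might look like the sketch below: a custom PII detector combined with a moderation gate. The real moderation call is commented out so the sketch runs without credentials; the `client.moderations.create(...)` shape follows the openai-python v1 SDK as I understand it, and `detect_pii`/`moderate` are illustrative names.

```python
import re

# Simple email pattern for the custom detector; production PII detection
# needs broader, locale-aware rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def detect_pii(text: str) -> list[str]:
    """Custom detector: return any email addresses found in the text."""
    return EMAIL_RE.findall(text)

def moderate(text: str) -> bool:
    """Return True if the text should be blocked."""
    if detect_pii(text):
        return True
    # Real moderation call (requires OPENAI_API_KEY), per the v1 SDK:
    # from openai import OpenAI
    # resp = OpenAI().moderations.create(
    #     model="omni-moderation-latest", input=text)
    # return resp.results[0].flagged
    return False
```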
Best Practices
- Build defense in depth with layered guardrails from input to output
- Validate both inputs and outputs for every LLM call
- Leverage the OpenAI Moderation API and custom detectors for PII and sensitive topics
- Define explicit topic guardrails and clear output schemas
- Design fail-closed behavior and a robust human review workflow
Example Use Cases
- Moderating user comments on a product Q&A site to remove toxic content
- Preventing jailbreak attempts in a customer support chatbot
- Detecting and redacting PII such as emails or IDs in user messages
- Enforcing topic restrictions to keep conversations within allowed domains
- Routing high-risk requests to human reviewers for approval
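For the PII redaction use case, one possible pass is a labeled substitution over a pattern table. The patterns here (email, US-style SSN) are illustrative only; production systems need broader, locale-aware detectors.

```python
import re

# Illustrative PII patterns mapped to redaction labels.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with its bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane@example.com, SSN 123-45-6789."))
# → Reach me at [EMAIL], SSN [SSN].
```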