AI Safety Alignment
Install: npx machina-cli add skill omer-metin/skills-for-antigravity/ai-safety-alignment --openclaw
Identity
Principles
- Defense in Depth: No single guardrail is foolproof. Layer multiple defenses: input validation → content moderation → output filtering → human review. Each layer catches what others miss.
- Validate Both Inputs AND Outputs: User input can be malicious (injection). Model output can be harmful (hallucination, toxic content). Check both sides of every LLM call.
- Fail Closed, Not Open: When guardrails fail or time out, reject the request rather than passing potentially harmful content. Security > availability.
- Keep Humans in the Loop: For high-risk actions (sending emails, executing code, accessing sensitive data), require human approval. Automated systems can be manipulated.
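The first three principles can be sketched as a single pipeline. This is a minimal illustration, not part of the skill itself; all names (`Verdict`, `run_guardrails`, `block_injection`) are made up for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    reason: str = ""

def block_injection(text: str) -> Verdict:
    # Toy input layer; a production system would use a trained classifier.
    if "ignore previous instructions" in text.lower():
        return Verdict(False, "possible prompt injection")
    return Verdict(True)

def run_guardrails(text: str, layers: list[Callable[[str], Verdict]]) -> Verdict:
    """Defense in depth: every layer must pass; any error fails closed."""
    for layer in layers:
        try:
            verdict = layer(text)
        except Exception as exc:
            # Fail closed: a broken or timed-out guardrail rejects the request.
            return Verdict(False, f"guardrail error: {exc}")
        if not verdict.allowed:
            return verdict
    return Verdict(True)
```

Note that an exception inside any layer produces a rejection, not a pass-through: that is the fail-closed behavior the third principle calls for.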
Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
- For Creation: Always consult references/patterns.md. This file dictates how things should be built. Ignore generic approaches if a specific pattern exists here.
- For Diagnosis: Always consult references/sharp_edges.md. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
- For Review: Always consult references/validations.md. This contains the strict rules and constraints. Use it to validate user inputs objectively.
Note: If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
Source
git clone https://github.com/omer-metin/skills-for-antigravity
View on GitHub: https://github.com/omer-metin/skills-for-antigravity/blob/main/skills/ai-safety-alignment/SKILL.md
Overview
AI Safety Alignment provides comprehensive guardrails for LLM applications, covering content moderation (OpenAI Moderation API), jailbreak prevention, prompt injection defense, PII detection, topic guardrails, and output validation. These layers are designed for production AI systems handling user-generated content, reducing risk and improving compliance. Grounded in defense-in-depth and fail-closed principles, it promotes human review for high-risk actions.
How This Skill Works
It applies multiple layers of protection to every LLM call: input validation, content moderation via the OpenAI Moderation API, and output filtering with strict validation. Both inputs and outputs are checked to catch prompt injection, jailbreak attempts, or PII exposure. When a risk is detected or a guardrail times out, the system fails closed and optionally escalates to human review.
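In outline, that means wrapping every model call between an input check and an output check. A minimal sketch, assuming a stand-in `call_llm` and illustrative `input_ok`/`output_ok` checks (none of these names come from the skill):

```python
BLOCKED = "Sorry, I can't help with that."

def input_ok(text: str) -> bool:
    # Illustrative input-side check; real systems layer several detectors.
    return "system prompt" not in text.lower()

def output_ok(text: str) -> bool:
    # Illustrative output-side check (e.g. a length cap or schema validation).
    return len(text) < 2000

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"echo: {prompt}"

def safe_call(user_input: str) -> str:
    if not input_ok(user_input):   # validate the input side
        return BLOCKED
    output = call_llm(user_input)
    if not output_ok(output):      # validate the output side
        return BLOCKED             # fail closed on harmful output
    return output
```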
When to Use It
- Processing user-generated content in chatbots or forums
- When you need enforceable content moderation and safety guarantees
- To defend against jailbreaks and prompt injection attempts
- To detect and redact PII in user messages
- When high-risk actions require human review or escalation
Quick Start
- Step 1: Map guardrails to your LLM workflow, including input validation, moderation, and output validation
- Step 2: Integrate the OpenAI Moderation API plus custom detectors for PII and sensitive topics
- Step 3: Implement fail-closed controls and set up monitoring with a human review funnel
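Step 2 might look like the sketch below: a custom PII detector combined with a moderation gate. The real moderation call is commented out so the sketch runs without credentials; the `client.moderations.create(...)` shape follows the openai-python v1 SDK as I understand it, and `detect_pii`/`moderate` are illustrative names.

```python
import re

# Simple email pattern for the custom detector; production PII detection
# needs broader, locale-aware rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def detect_pii(text: str) -> list[str]:
    """Custom detector: return any email addresses found in the text."""
    return EMAIL_RE.findall(text)

def moderate(text: str) -> bool:
    """Return True if the text should be blocked."""
    if detect_pii(text):
        return True
    # Real moderation call (requires OPENAI_API_KEY), per the v1 SDK:
    # from openai import OpenAI
    # resp = OpenAI().moderations.create(
    #     model="omni-moderation-latest", input=text)
    # return resp.results[0].flagged
    return False
```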
Best Practices
- Build defense in depth with layered guardrails from input to output
- Validate both inputs and outputs for every LLM call
- Leverage the OpenAI Moderation API and custom detectors for PII and sensitive topics
- Define explicit topic guardrails and clear output schemas
- Design fail-closed behavior and a robust human review workflow
Example Use Cases
- Moderating user comments on a product Q&A site to remove toxic content
- Preventing jailbreak attempts in a customer support chatbot
- Detecting and redacting PII such as emails or IDs in user messages
- Enforcing topic restrictions to keep conversations within allowed domains
- Routing high-risk requests to human reviewers for approval
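For the PII redaction use case, one possible pass is a labeled substitution over a pattern table. The patterns here (email, US-style SSN) are illustrative only; production systems need broader, locale-aware detectors.

```python
import re

# Illustrative PII patterns mapped to redaction labels.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected PII span with its bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane@example.com, SSN 123-45-6789."))
# → Reach me at [EMAIL], SSN [SSN].
```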