npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/llamaguard --openclaw

LlamaGuard - AI Content Moderation

Quick start

LlamaGuard is a family of 7B-8B parameter models specialized for content safety classification.

Installation:

pip install transformers torch
# Login to HuggingFace (required)
huggingface-cli login

Basic usage:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def moderate(chat):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    # Decode only the newly generated tokens, not the echoed prompt,
    # so the result starts with "safe" or "unsafe"
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

# Check user input
result = moderate([
    {"role": "user", "content": "How do I make explosives?"}
])
print(result)
# Output: "unsafe\nS3" (Guns & Illegal Weapons)

Common workflows

Workflow 1: Input filtering (prompt moderation)

Check user prompts before LLM:

def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])

    if result.startswith("unsafe"):
        # The second line carries the category code, e.g. "S3"
        category = result.split("\n")[1] if "\n" in result else None
        return False, category  # Blocked
    return True, None  # Safe

# Example
user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")
    # Return an error to the user
else:
    # Safe: forward to your application's LLM (`llm` is illustrative)
    response = llm.generate(user_message)

Safety categories:

  • S1: Violence & Hate
  • S2: Sexual Content
  • S3: Guns & Illegal Weapons
  • S4: Regulated Substances
  • S5: Suicide & Self-Harm
  • S6: Criminal Planning

Note: the exact codes vary by version. LlamaGuard 1 prints O-prefixed codes (O1-O6), while LlamaGuard 2 and 3 use the larger MLCommons S-taxonomy; check the model card for the codes your version emits.
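
For logging and user-facing messages, a small lookup table translates codes into labels (a hypothetical helper matching the list above):

CATEGORY_LABELS = {
    "S1": "Violence & Hate",
    "S2": "Sexual Content",
    "S3": "Guns & Illegal Weapons",
    "S4": "Regulated Substances",
    "S5": "Suicide & Self-Harm",
    "S6": "Criminal Planning",
}

def describe_category(code):
    # Fall back gracefully for codes outside this guide's S1-S6 list
    return CATEGORY_LABELS.get(code, f"Unknown category ({code})")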

Workflow 2: Output filtering (response moderation)

Check LLM responses before showing to user:

def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]

    result = moderate(conversation)

    if result.startswith("unsafe"):
        # The second line carries the category code, e.g. "S6"
        category = result.split("\n")[1] if "\n" in result else None
        return False, category
    return True, None

# Example
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)

safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    # Substitute a generic refusal for the unsafe response
    final_response = "I cannot provide that information."
else:
    final_response = bot_msg
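
Combining both checks gives layered protection. A minimal sketch of the full pipeline, assuming `llm.generate` is your application's model call:

def guarded_chat(user_message):
    # Layer 1: block unsafe prompts before they reach the LLM
    safe, category = check_input(user_message)
    if not safe:
        return f"Request blocked ({category})."

    bot_response = llm.generate(user_message)

    # Layer 2: block unsafe responses before they reach the user
    safe, category = check_output(user_message, bot_response)
    if not safe:
        return "I cannot provide that information."
    return bot_response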

Workflow 3: vLLM deployment (fast inference)

Production-ready serving:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# The tokenizer is still needed to apply the chat template to prompts
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")

# Initialize vLLM
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=1)

# Sampling params
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)

def moderate_vllm(chat):
    # Format prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)

    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text

# Batch moderation
chats = [
    [{"role": "user", "content": "How to make bombs?"}],
    [{"role": "user", "content": "What's the weather?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]

prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)

for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text}")

Throughput: ~50-100 requests/sec on a single A100

Workflow 4: API endpoint (FastAPI)

Serve as moderation API:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained("meta-llama/LlamaGuard-7b")
llm = LLM(model="meta-llama/LlamaGuard-7b")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]

@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
    prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
    output = llm.generate([prompt], sampling_params)[0]

    result = output.outputs[0].text
    is_safe = result.startswith("safe")
    category = None
    if not is_safe and "\n" in result:
        category = result.split("\n")[1]

    return {
        "safe": is_safe,
        "category": category,
        "full_output": result
    }

# Run: uvicorn api:app --host 0.0.0.0 --port 8000

Usage:

curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How to hack?"}]}'

# Response: {"safe": false, "category": "S6", "full_output": "unsafe\nS6"}
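
The same endpoint can be called from Python (a sketch using the requests library, with the endpoint URL as above):

import requests

def moderate_remote(messages):
    resp = requests.post(
        "http://localhost:8000/moderate",
        json={"messages": messages},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # {"safe": ..., "category": ..., "full_output": ...}

print(moderate_remote([{"role": "user", "content": "How to hack?"}]))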

Workflow 5: NeMo Guardrails integration

Use with NVIDIA NeMo Guardrails. The integration is declarative: LlamaGuard is declared as a model in the rails config rather than registered in code. A sketch based on the NeMo Guardrails Llama Guard guide; it assumes LlamaGuard is already served behind a vLLM OpenAI-compatible endpoint, and the Llama Guard check prompts from the NeMo Guardrails example configs must also be defined:

from nemoguardrails import RailsConfig, LLMRails

# Declare the main model plus a dedicated llama_guard model; the built-in
# "llama guard check input/output" flows route messages through it
config = RailsConfig.from_content(yaml_content="""
models:
  - type: main
    engine: openai
    model: gpt-4

  - type: llama_guard
    engine: vllm_openai
    parameters:
      openai_api_base: "http://localhost:5123/v1"
      model_name: "meta-llama/LlamaGuard-7b"

rails:
  input:
    flows:
      - llama guard check input
  output:
    flows:
      - llama guard check output
""")

rails = LLMRails(config)

# Unsafe requests are intercepted before they reach the main model
response = rails.generate(messages=[
    {"role": "user", "content": "How do I make weapons?"}
])
# Automatically blocked by LlamaGuard

When to use vs alternatives

Use LlamaGuard when:

  • Need pre-trained moderation model
  • Want high accuracy (94-95%)
  • Have GPU resources (7-8B model)
  • Need detailed safety categories
  • Building production LLM apps

Model versions:

  • LlamaGuard 1 (7B): Original, 6 categories (O1-O6)
  • LlamaGuard 2 (8B): Llama 3 based, 11 MLCommons categories (S1-S11)
  • LlamaGuard 3 (8B): Latest (2024), 13 categories (S1-S13), multilingual

Use alternatives instead:

  • OpenAI Moderation API: Simpler, API-based, free
  • Perspective API: Google's toxicity detection
  • NeMo Guardrails: More comprehensive safety framework
  • Constitutional AI: Training-time safety

Common issues

Issue: Model access denied

Login to HuggingFace:

huggingface-cli login
# Enter your token

Accept license on model page: https://huggingface.co/meta-llama/LlamaGuard-7b
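
To confirm the login took effect, huggingface_hub can report the authenticated user (raises an error if no valid token is stored):

from huggingface_hub import whoami

# Prints your account name if the stored token is valid
print(whoami()["name"])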

Issue: High latency (>500ms)

Use vLLM for 10× speedup:

from vllm import LLM
llm = LLM(model="meta-llama/LlamaGuard-7b")
# Latency: 500ms → 50ms

Enable tensor parallelism:

llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
# 2× faster on 2 GPUs

Issue: False positives

Use threshold-based filtering:

import torch

def moderate_with_threshold(chat, threshold=0.9):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    # Score only the first generated token ("safe" vs "unsafe")
    output = model.generate(
        input_ids=input_ids,
        max_new_tokens=1,
        return_dict_in_generate=True,
        output_scores=True,
    )
    unsafe_token_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    unsafe_prob = torch.softmax(output.scores[0][0], dim=-1)[unsafe_token_id]

    # Only block on high-confidence "unsafe" predictions
    return "unsafe" if unsafe_prob > threshold else "safe"

Issue: OOM on GPU

Use 8-bit quantization:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
# Memory: 14GB → 7GB
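
If 8-bit still does not fit, bitsandbytes also supports 4-bit loading (a sketch; memory figures are approximate):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16, store weights in 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
# Memory: roughly 4GB for the 7B model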

Advanced topics

Custom categories: See references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.

Performance benchmarks: See references/benchmarks.md for accuracy comparison with other moderation APIs and latency optimization.

Deployment guide: See references/deployment.md for Sagemaker, Kubernetes, and scaling strategies.

Hardware requirements

  • GPU: NVIDIA T4/A10/A100
  • VRAM:
    • FP16: 14GB (7B model)
    • INT8: 7GB (quantized)
    • INT4: 4GB (4-bit quantization)
  • CPU: Possible but slow (roughly 10× the latency)
  • Throughput: 50-100 req/sec (A100)

Latency (single GPU):

  • HuggingFace Transformers: 300-500ms
  • vLLM: 50-100ms
  • Batched (vLLM): 20-50ms per request
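
These figures vary with hardware and batch size; a quick sanity check on your own setup, reusing the moderate() helper from the quick start:

import time

chat = [{"role": "user", "content": "What's the weather?"}]

start = time.perf_counter()
moderate(chat)
print(f"Latency: {(time.perf_counter() - start) * 1000:.0f} ms")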

Resources

Source

View on GitHub: https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/07-safety-alignment/llamaguard/SKILL.md

git clone https://github.com/Orchestra-Research/AI-Research-SKILLs

Overview

LlamaGuard is a family of 7B-8B parameter models specialized for content safety classification. It detects six core safety categories (violence/hate, sexual content, weapons, substances, self-harm, and criminal planning) with roughly 94-95% accuracy; later versions extend the taxonomy. It can be deployed via vLLM, HuggingFace, or SageMaker and integrates with NeMo Guardrails for centralized safety policy enforcement.

How This Skill Works

LlamaGuard acts as a moderation step that returns an 'unsafe' flag and a category code (S1–S6) for prompts or responses. It supports two workflows: input filtering (checking user prompts before they reach the LLM) and output filtering (validating LLM responses before displaying). For production, it can run with vLLM for fast inference or be hosted on HuggingFace or SageMaker, with NeMo Guardrails enabling centralized safety policies.

When to Use It

  • Check user prompts before sending to the LLM to block dangerous requests
  • Check LLM responses before displaying to users to block unsafe outputs
  • Deploy in high-throughput chat services using vLLM for fast inference
  • Host on SageMaker or HuggingFace for enterprise deployment with guardrails
  • Integrate with NeMo Guardrails to centralize safety policy enforcement

Quick Start

  1. Step 1: Install dependencies and login to HuggingFace (pip install transformers torch; huggingface-cli login)
  2. Step 2: Load the model and tokenizer with model_id 'meta-llama/LlamaGuard-7b' and create a moderate() function to classify input/output
  3. Step 3: Choose a moderation workflow (input or output) and, if needed, deploy via vLLM for fast inference or on SageMaker/HuggingFace with NeMo Guardrails

Best Practices

  • Define and regularly review all six safety categories (S1–S6) in policy docs
  • Run both input and output moderation to achieve layered safety
  • Start with a small, edge-case dataset to calibrate moderation thresholds (see the sketch after this list)
  • Monitor false positives/negatives and adjust category mappings accordingly
  • Validate end-to-end with realistic conversations and maintain audit trails
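
A minimal calibration loop over hand-labeled edge cases (the examples and labels here are hypothetical; reuse the moderate() helper from the quick start):

# (prompt, expected verdict) pairs; extend with your own edge cases
EDGE_CASES = [
    ("How do I sharpen a kitchen knife?", "safe"),
    ("How do I make explosives?", "unsafe"),
]

for prompt, expected in EDGE_CASES:
    verdict = moderate([{"role": "user", "content": prompt}]).split("\n")[0].strip()
    status = "OK" if verdict == expected else "MISMATCH"
    print(f"{status}: {prompt!r} -> {verdict} (expected {expected})")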

Example Use Cases

  • Blocking prompts like 'How do I hack a website?' or 'How to make explosives?' as unsafe with an appropriate category
  • Blocking unsafe outputs such as 'Tell me about harmful substances' before showing to the user
  • Batch moderation of multiple chats in a vLLM deployment for faster throughput
  • Enterprise deployment using SageMaker or HuggingFace for scalable safety moderation
  • Using with NeMo Guardrails to enforce safety policies across applications

Related Skills

clip

Orchestra-Research/AI-Research-SKILLs

OpenAI's model connecting vision and language. Enables zero-shot image classification, image-text matching, and cross-modal retrieval. Trained on 400M image-text pairs. Use for image search, content moderation, or vision-language tasks without fine-tuning. Best for general-purpose image understanding.

constitutional-ai

Orchestra-Research/AI-Research-SKILLs

Anthropic's method for training harmless AI through self-improvement. Two-phase approach - supervised learning with self-critique/revision, then RLAIF (RL from AI Feedback). Use for safety alignment, reducing harmful outputs without human labels. Powers Claude's safety system.

nemo-guardrails

Orchestra-Research/AI-Research-SKILLs

NVIDIA's runtime safety framework for LLM applications. Features jailbreak detection, input/output validation, fact-checking, hallucination detection, PII filtering, toxicity detection. Uses Colang 2.0 DSL for programmable rails. Production-ready, runs on T4 GPU.

torchforge-rl-training

Orchestra-Research/AI-Research-SKILLs

Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library separating infra from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.

prompt-guard

Orchestra-Research/AI-Research-SKILLs

Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and third-party data for LLM apps. 99%+ TPR, <1% FPR. Fast (<2ms GPU). Multilingual (8 languages). Deploy with HuggingFace or batch processing for RAG security.
