llamaguard
npx machina-cli add skill Orchestra-Research/AI-Research-SKILLs/llamaguard --openclaw
LlamaGuard - AI Content Moderation
Quick start
LlamaGuard is a 7-8B parameter model specialized for content safety classification.
Installation:
pip install transformers torch accelerate  # accelerate is required for device_map="auto"
# Login to HuggingFace (required)
huggingface-cli login
Basic usage:
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "meta-llama/LlamaGuard-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
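def moderate(chat):
    # Render the conversation with LlamaGuard's chat template and move it to the model's device
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=100)
    # Decode only the newly generated tokens so the result starts with "safe" or "unsafe"
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()
# Check user input
result = moderate([
    {"role": "user", "content": "How do I make explosives?"}
])
print(result)
# Example output: "unsafe\nO3" (Criminal Planning; LlamaGuard-7b uses codes O1-O6)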
Common workflows
Workflow 1: Input filtering (prompt moderation)
Check user prompts before they reach the LLM:
def check_input(user_message):
    result = moderate([{"role": "user", "content": user_message}])
    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category  # Blocked
    else:
        return True, None  # Safe
# Example
user_message = "How do I hack a website?"
safe, category = check_input(user_message)
if not safe:
    print(f"Request blocked: {category}")
    # Return error to user
else:
    # Send to LLM (llm is a placeholder for your generation client)
    response = llm.generate(user_message)
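Safety categories (LlamaGuard-7b taxonomy; this model emits codes O1-O6, while Llama Guard 2 and 3 use S-prefixed codes from larger taxonomies):
- O1: Violence & Hate
- O2: Sexual Content
- O3: Criminal Planning
- O4: Guns & Illegal Weapons
- O5: Regulated or Controlled Substances
- O6: Self-Harm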
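The workflow functions index `result.split("\n")[1]` directly, which fails if the model returns only `unsafe` with no category line. The sketch below is one possible, more defensive parser. `Verdict` and `parse_verdict` are hypothetical names, not part of LlamaGuard or Transformers, and treating anything other than an explicit `safe` as unsafe is a design assumption.

```python
from typing import List, NamedTuple


class Verdict(NamedTuple):
    safe: bool
    categories: List[str]


def parse_verdict(raw: str) -> Verdict:
    """Parse LlamaGuard-style output such as 'safe' or 'unsafe' plus a code line."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        # Fail closed: empty output is treated as unsafe
        return Verdict(False, [])
    if lines[0].lower() == "safe":
        return Verdict(True, [])
    # The second line, when present, holds one or more comma-separated codes
    codes = lines[1].split(",") if len(lines) > 1 else []
    return Verdict(False, [c.strip() for c in codes if c.strip()])
```

Because the parser only splits on lines and commas, it does not depend on which category prefix a given model version emits.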
Workflow 2: Output filtering (response moderation)
Check LLM responses before showing them to the user:
def check_output(user_message, bot_response):
    conversation = [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response}
    ]
    result = moderate(conversation)
    if result.startswith("unsafe"):
        category = result.split("\n")[1]
        return False, category
    else:
        return True, None
# Example (llm is a placeholder for your generation client)
user_msg = "Tell me about harmful substances"
bot_msg = llm.generate(user_msg)
safe, category = check_output(user_msg, bot_msg)
if not safe:
    print(f"Response blocked: {category}")
    # Fall back to a generic response
    final_response = "I cannot provide that information."
else:
    final_response = bot_msg
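Workflows 1 and 2 can be chained so that generation only runs once the prompt has passed. The following is a minimal sketch of that layering, not a definitive implementation. `guarded_reply` and `REFUSAL` are hypothetical names, and the `moderate` and `generate` callables are passed in as parameters so the flow can be exercised with stubs before a model is loaded.

```python
from typing import Callable, Dict, List

Chat = List[Dict[str, str]]
REFUSAL = "I cannot provide that information."


def guarded_reply(
    user_message: str,
    moderate: Callable[[Chat], str],
    generate: Callable[[str], str],
) -> str:
    """Run input moderation, then generation, then output moderation."""
    # Stage 1: check the prompt on its own
    prompt_verdict = moderate([{"role": "user", "content": user_message}])
    if prompt_verdict.strip().startswith("unsafe"):
        return REFUSAL
    # Stage 2: generate only after the prompt has passed
    bot_response = generate(user_message)
    # Stage 3: check the response in the context of the prompt
    response_verdict = moderate([
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": bot_response},
    ])
    return REFUSAL if response_verdict.strip().startswith("unsafe") else bot_response
```

In a real deployment, `moderate` would be the function from the quick start and `generate` a wrapper around your LLM client.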
Workflow 3: vLLM deployment (fast inference)
Production-ready serving:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_id = "meta-llama/LlamaGuard-7b"
# The tokenizer is only used to render the LlamaGuard chat template
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Initialize vLLM
llm = LLM(model=model_id, tensor_parallel_size=1)
# Sampling params
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic
    max_tokens=100
)
def moderate_vllm(chat):
    # Format prompt
    prompt = tokenizer.apply_chat_template(chat, tokenize=False)
    # Generate
    output = llm.generate([prompt], sampling_params)
    return output[0].outputs[0].text.strip()
# Batch moderation
chats = [
    [{"role": "user", "content": "How to make bombs?"}],
    [{"role": "user", "content": "What's the weather?"}],
    [{"role": "user", "content": "Tell me about drugs"}]
]
prompts = [tokenizer.apply_chat_template(c, tokenize=False) for c in chats]
results = llm.generate(prompts, sampling_params)
for i, result in enumerate(results):
    print(f"Chat {i}: {result.outputs[0].text.strip()}")
Throughput: ~50-100 requests/sec on single A100
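Throughput figures like the one above depend on hardware, batch size, and prompt length, so it is worth measuring on your own setup. The sketch below times any batch moderation function. `measure_throughput` is a hypothetical helper, and the `moderate_batch` callable is an assumption standing in for a wrapper around `llm.generate`.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(
    moderate_batch: Callable[[Sequence[list]], List[str]],
    chats: Sequence[list],
    batch_size: int = 32,
) -> float:
    """Return moderated chats per second for the given batch function."""
    start = time.perf_counter()
    done = 0
    for i in range(0, len(chats), batch_size):
        # Send one slice of conversations at a time
        batch = chats[i:i + batch_size]
        results = moderate_batch(batch)
        done += len(results)
    elapsed = time.perf_counter() - start
    return done / elapsed if elapsed > 0 else float("inf")
```

With the vLLM setup from this workflow, `moderate_batch` could render each chat with `tokenizer.apply_chat_template(c, tokenize=False)`, call `llm.generate(prompts, sampling_params)`, and return the `outputs[0].text` of each result.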
Workflow 4: API endpoint (FastAPI)
Serve as moderation API:
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_id = "meta-llama/LlamaGuard-7b"
app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(model_id)  # renders the chat template
llm = LLM(model=model_id)
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)
class ModerationRequest(BaseModel):
    messages: list  # [{"role": "user", "content": "..."}]
@app.post("/moderate")
def moderate_endpoint(request: ModerationRequest):
    prompt = tokenizer.apply_chat_template(request.messages, tokenize=False)
    output = llm.generate([prompt], sampling_params)[0]
    result = output.outputs[0].text.strip()
    is_safe = result.startswith("safe")
    # The category code, when present, is on the second line of the output
    lines = result.split("\n")
    category = lines[1] if not is_safe and len(lines) > 1 else None
    return {
        "safe": is_safe,
        "category": category,
        "full_output": result
    }
# Run: uvicorn api:app --host 0.0.0.0 --port 8000
Usage:
curl -X POST http://localhost:8000/moderate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "How to hack?"}]}'
# Example response: {"safe": false, "category": "O3", "full_output": "unsafe\nO3"}
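The same call can be made from application code. The sketch below uses only the Python standard library and assumes the endpoint and JSON shape shown in the curl example. `build_moderation_request`, `is_allowed`, and `moderate_remote` are hypothetical helper names, and blocking on any malformed response is a design assumption.

```python
import json
import urllib.request
from typing import Dict, List


def build_moderation_request(
    messages: List[Dict[str, str]],
    url: str = "http://localhost:8000/moderate",
) -> urllib.request.Request:
    """Build the POST request the /moderate endpoint expects."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(response_body: bytes) -> bool:
    """Allow content only when the response explicitly says safe=true."""
    try:
        payload = json.loads(response_body)
    except ValueError:
        # Malformed or undecodable response: fail closed
        return False
    return isinstance(payload, dict) and payload.get("safe") is True


def moderate_remote(messages: List[Dict[str, str]], timeout: float = 5.0) -> bool:
    """Send the conversation to the moderation service and return the decision."""
    request = build_moderation_request(messages)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return is_allowed(response.read())
```

Splitting request construction from response handling keeps both pieces testable without a running server.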
Workflow 5: NeMo Guardrails integration
Use with NVIDIA Guardrails:
from nemoguardrails import RailsConfig, LLMRails
# Check this import path against your installed nemoguardrails version;
# the library also documents a config-only Llama Guard integration.
from nemoguardrails.integrations.llama_guard import LlamaGuard
# Configure NeMo Guardrails (the YAML must be passed as yaml_content)
config = RailsConfig.from_content(yaml_content="""
models:
  - type: main
    engine: openai
    model: gpt-4
rails:
  input:
    flows:
      - llamaguard check input
  output:
    flows:
      - llamaguard check output
""")
# Add LlamaGuard integration
llama_guard = LlamaGuard(model_path="meta-llama/LlamaGuard-7b")
rails = LLMRails(config)
rails.register_action(llama_guard.check_input, name="llamaguard check input")
rails.register_action(llama_guard.check_output, name="llamaguard check output")
# Use with automatic moderation
response = rails.generate(messages=[
    {"role": "user", "content": "How do I make weapons?"}
])
# Automatically blocked by LlamaGuard
When to use vs alternatives
Use LlamaGuard when:
- Need pre-trained moderation model
- Want high accuracy (94-95%)
- Have GPU resources (7-8B model)
- Need detailed safety categories
- Building production LLM apps
Model versions:
- LlamaGuard 1 (7B): Original, 6 categories
- LlamaGuard 2 (8B): Improved, 11 categories
- LlamaGuard 3 (8B): Latest (2024), 14 categories, multilingual
Use alternatives instead:
- OpenAI Moderation API: Simpler, API-based, free
- Perspective API: Google's toxicity detection
- NeMo Guardrails: More comprehensive safety framework
- Constitutional AI: Training-time safety
Common issues
Issue: Model access denied
Login to HuggingFace:
huggingface-cli login
# Enter your token
Accept license on model page: https://huggingface.co/meta-llama/LlamaGuard-7b
Issue: High latency (>500ms)
Use vLLM for 10× speedup:
from vllm import LLM
llm = LLM(model="meta-llama/LlamaGuard-7b")
# Latency: 500ms → 50ms
Enable tensor parallelism:
llm = LLM(model="meta-llama/LlamaGuard-7b", tensor_parallel_size=2)
# 2× faster on 2 GPUs
Issue: False positives
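Use threshold-based filtering:
import torch
def classify_with_threshold(chat, threshold=0.9):
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    # Generate a single token and keep its scores to read the probability of "unsafe"
    out = model.generate(input_ids=input_ids, max_new_tokens=1,
                         return_dict_in_generate=True, output_scores=True)
    # Assumes "unsafe" begins with a token that "safe" does not share; verify with your tokenizer
    unsafe_token_id = tokenizer.encode("unsafe", add_special_tokens=False)[0]
    unsafe_prob = torch.softmax(out.scores[0][0], dim=-1)[unsafe_token_id].item()
    # Only block when the model is highly confident
    return "unsafe" if unsafe_prob > threshold else "safe"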
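The 0.9 cutoff above is a starting point. One way to choose it is to calibrate on a small labeled set so that the false positive rate stays under a target. The sketch below is an assumption about how that might be done, not a method from the LlamaGuard paper. `pick_threshold` and `false_positive_rate` are hypothetical helpers, and `scores` are P(unsafe) values collected from your own data.

```python
from typing import Sequence


def false_positive_rate(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """Share of benign examples (label False) flagged at this threshold."""
    benign = [s for s, unsafe in zip(scores, labels) if not unsafe]
    if not benign:
        return 0.0
    return sum(s > threshold for s in benign) / len(benign)


def pick_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    max_false_positive_rate: float = 0.05,
) -> float:
    """Return the lowest threshold whose false positive rate stays within the limit.

    An example is flagged when its score is strictly greater than the threshold.
    """
    benign = sorted(s for s, unsafe in zip(scores, labels) if not unsafe)
    if not benign:
        return 0.5  # nothing to calibrate against; neutral default
    # Number of benign examples we can afford to flag
    allowed = int(max_false_positive_rate * len(benign))
    if allowed >= len(benign):
        return 0.0
    # Using the (allowed + 1)-th highest benign score flags at most `allowed` of them
    return benign[len(benign) - allowed - 1]
```

Lowering the threshold catches more unsafe content at the cost of more false positives, so the target rate should reflect how costly a wrongly blocked request is in your application.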
Issue: OOM on GPU
Use 8-bit quantization:
# Requires the bitsandbytes package (pip install bitsandbytes)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)
# Memory: 14GB → 7GB
Advanced topics
Custom categories: See references/custom-categories.md for fine-tuning LlamaGuard with domain-specific safety categories.
Performance benchmarks: See references/benchmarks.md for accuracy comparison with other moderation APIs and latency optimization.
Deployment guide: See references/deployment.md for Sagemaker, Kubernetes, and scaling strategies.
Hardware requirements
- GPU: NVIDIA T4/A10/A100
- VRAM:
  - FP16: 14GB (7B model)
  - INT8: 7GB (quantized)
  - INT4: 4GB (4-bit quantized)
- CPU: Possible but slow (10× latency)
- Throughput: 50-100 req/sec (A100)
Latency (single GPU):
- HuggingFace Transformers: 300-500ms
- vLLM: 50-100ms
- Batched (vLLM): 20-50ms per request
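The VRAM figures above follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them for a 7B model. It counts weights only, which is an assumption that leaves out the KV cache, activations, and quantization overhead, so real usage is somewhat higher (hence 4GB rather than 3.5GB for INT4).

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
# fp16: 14.0 GB
# int8: 7.0 GB
# int4: 3.5 GB
```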
Resources
- HuggingFace:
- Paper: https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/
- Integration: vLLM, Sagemaker, NeMo Guardrails
- Accuracy: 94.5% (prompts), 95.3% (responses); the paper reports these figures as AUPRC (0.945 and 0.953) on Meta's own test set
Source
https://github.com/Orchestra-Research/AI-Research-SKILLs/blob/main/07-safety-alignment/llamaguard/SKILL.md
Overview
LlamaGuard is a 7-8B parameter model specialized for content safety classification. The original 7B model detects six safety categories (violence/hate, sexual content, weapons, substances, self-harm, and criminal planning) with a reported accuracy of around 94-95%. It can be deployed via vLLM, HuggingFace, or SageMaker and integrates with NeMo Guardrails for safety policy enforcement.
How This Skill Works
LlamaGuard acts as a moderation step that returns a 'safe' or 'unsafe' verdict and, when unsafe, a category code (O1-O6 for LlamaGuard-7b) for prompts or responses. It supports two workflows: input filtering (checking user prompts before they reach the LLM) and output filtering (validating LLM responses before they are displayed). For production, it can run with vLLM for fast inference or be hosted on HuggingFace or SageMaker, with NeMo Guardrails enabling centralized safety policies.
When to Use It
- Check user prompts before sending to the LLM to block dangerous requests
- Check LLM responses before displaying to users to block unsafe outputs
- Deploy in high-throughput chat services using vLLM for fast inference
- Host on SageMaker or HuggingFace for enterprise deployment with guardrails
- Integrate with NeMo Guardrails to centralize safety policy enforcement
Quick Start
- Step 1: Install dependencies and log in to HuggingFace (pip install transformers torch accelerate; huggingface-cli login)
- Step 2: Load the model and tokenizer with model_id 'meta-llama/LlamaGuard-7b' and create a moderate() function to classify input/output
- Step 3: Choose a moderation workflow (input or output) and, if needed, deploy via vLLM for fast inference or on SageMaker/HuggingFace with NeMo Guardrails
Best Practices
- Define and regularly review all six safety categories (O1-O6 for LlamaGuard-7b) in policy docs
- Run both input and output moderation to achieve layered safety
- Start with a small, edge-case dataset to calibrate moderation thresholds
- Monitor false positives/negatives and adjust category mappings accordingly
- Validate end-to-end with realistic conversations and maintain audit trails
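For the audit-trail practice above, one option is to append a JSON line per moderation decision. The sketch below is a minimal illustration under stated assumptions: `audit_record` and `append_audit` are hypothetical helpers, the field names are not a standard schema, and storing a SHA-256 hash instead of the raw text is a privacy-motivated design choice you may not want.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(stage: str, text: str, raw_verdict: str,
                 model_id: str = "meta-llama/LlamaGuard-7b") -> dict:
    """Build one audit entry for a moderation decision."""
    verdict = raw_verdict.strip()
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,  # "input" or "output"
        "model": model_id,
        # Hash the content so the log can be joined to stored data without holding raw text
        "content_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "safe": verdict.startswith("safe"),
        "raw_verdict": verdict,
    }


def append_audit(path: str, record: dict) -> None:
    """Append the record as one JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Logged decisions can later be compared against human review to count false positives and false negatives.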
Example Use Cases
- Blocking prompts like 'How do I hack a website?' or 'How to make explosives?' as unsafe with an appropriate category
- Blocking unsafe responses to prompts such as 'Tell me about harmful substances' before they are shown to the user
- Batch moderation of multiple chats in a vLLM deployment for faster throughput
- Enterprise deployment using SageMaker or HuggingFace for scalable safety moderation
- Using with NeMo Guardrails to enforce safety policies across applications