Get the FREE Ultimate OpenClaw Setup Guide →

prompt-injection-detector

npx machina-cli add skill a5c-ai/babysitter/prompt-injection-detector --openclaw
Files (1)
SKILL.md
1.2 KB

Prompt Injection Detector Skill

Capabilities

  • Detect prompt injection attempts
  • Implement input sanitization
  • Configure detection classifiers
  • Design defense layers
  • Implement canary token detection
  • Create injection logging and alerting

Target Processes

  • prompt-injection-defense
  • tool-safety-validation

Implementation Details

Detection Methods

  1. Pattern Matching: Known injection patterns
  2. ML Classifiers: Trained injection detectors
  3. Canary Tokens: Detect instruction override
  4. LLM-Based: Use LLM to detect manipulation
  5. Perplexity Analysis: Unusual input patterns

Defense Strategies

  • Input preprocessing
  • Prompt structure design
  • Output validation
  • Sandboxed execution
  • Multi-layer defense

Configuration Options

  • Detection threshold
  • Pattern rules
  • Classifier model
  • Action policies
  • Alerting settings

Best Practices

  • Defense in depth
  • Regular pattern updates
  • Monitor false positives
  • Test with red-team inputs

Dependencies

  • rebuff (optional)
  • transformers
  • Custom classifiers

Source

git clone https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/ai-agents-conversational/skills/prompt-injection-detector/SKILL.mdView on GitHub

Overview

Prompt injection detector secures LLM apps by identifying and blocking injection attempts. It combines pattern matching, ML classifiers, canary tokens, and perplexity analysis to catch adversarial prompts. With defense layers and logging, it helps maintain safe tool usage and prompt integrity.

How This Skill Works

It monitors prompts through multiple detection methods: pattern matching for known attacks, ML-based classifiers, canary tokens, LLM-based assessments, and perplexity checks. When a risk is detected, it triggers input preprocessing, prompt restructuring, output validation, and sandboxed execution as part of a multi-layer defense.

When to Use It

  • When accepting user prompts in chatbots and assistants that access sensitive tools
  • When integrating LLMs with external tools or actions (tool-safety validation)
  • When you need to prevent instruction overrides and manipulation via prompts
  • When auditing for compliance and maintaining an auditable injection log
  • When conducting security testing or red-team exercises to validate defenses

Quick Start

  1. Step 1: Enable prompt-injection-detector in the LLM pipeline and set initial detection threshold
  2. Step 2: Configure pattern rules, classifier model, and action policies
  3. Step 3: Run red-team prompts and review canary token alerts and logs

Best Practices

  • Defense in depth across input, prompt, and output stages
  • Regularly update pattern rules and retrain classifiers
  • Monitor and tune false positives and alert fatigue
  • Test with red-team prompts and adversarial inputs
  • Configure clear action policies and alerting settings

Example Use Cases

  • A customer support bot detects and blocks injection attempts that try to override its instructions
  • A data toolchain sanitizes user prompts before passing to a search or data extraction tool
  • Canary tokens alert when a prompt tries to override system messages
  • LLM-based detector flags manipulated prompts and logs incidents for audit
  • Prompt injection defenses are integrated with sandboxed execution to prevent leakage

Frequently Asked Questions

Add this skill to your agents
Sponsor this space

Reach thousands of developers