prompt-injection-detector
npx machina-cli add skill a5c-ai/babysitter/prompt-injection-detector --openclawFiles (1)
SKILL.md
1.2 KB
Prompt Injection Detector Skill
Capabilities
- Detect prompt injection attempts
- Implement input sanitization
- Configure detection classifiers
- Design defense layers
- Implement canary token detection
- Create injection logging and alerting
Target Processes
- prompt-injection-defense
- tool-safety-validation
Implementation Details
Detection Methods
- Pattern Matching: Known injection patterns
- ML Classifiers: Trained injection detectors
- Canary Tokens: Detect instruction override
- LLM-Based: Use LLM to detect manipulation
- Perplexity Analysis: Unusual input patterns
Defense Strategies
- Input preprocessing
- Prompt structure design
- Output validation
- Sandboxed execution
- Multi-layer defense
Configuration Options
- Detection threshold
- Pattern rules
- Classifier model
- Action policies
- Alerting settings
Best Practices
- Defense in depth
- Regular pattern updates
- Monitor false positives
- Test with red-team inputs
Dependencies
- rebuff (optional)
- transformers
- Custom classifiers
Source
git clone https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/ai-agents-conversational/skills/prompt-injection-detector/SKILL.mdView on GitHub Overview
Prompt injection detector secures LLM apps by identifying and blocking injection attempts. It combines pattern matching, ML classifiers, canary tokens, and perplexity analysis to catch adversarial prompts. With defense layers and logging, it helps maintain safe tool usage and prompt integrity.
How This Skill Works
It monitors prompts through multiple detection methods: pattern matching for known attacks, ML-based classifiers, canary tokens, LLM-based assessments, and perplexity checks. When a risk is detected, it triggers input preprocessing, prompt restructuring, output validation, and sandboxed execution as part of a multi-layer defense.
When to Use It
- When accepting user prompts in chatbots and assistants that access sensitive tools
- When integrating LLMs with external tools or actions (tool-safety validation)
- When you need to prevent instruction overrides and manipulation via prompts
- When auditing for compliance and maintaining an auditable injection log
- When conducting security testing or red-team exercises to validate defenses
Quick Start
- Step 1: Enable prompt-injection-detector in the LLM pipeline and set initial detection threshold
- Step 2: Configure pattern rules, classifier model, and action policies
- Step 3: Run red-team prompts and review canary token alerts and logs
Best Practices
- Defense in depth across input, prompt, and output stages
- Regularly update pattern rules and retrain classifiers
- Monitor and tune false positives and alert fatigue
- Test with red-team prompts and adversarial inputs
- Configure clear action policies and alerting settings
Example Use Cases
- A customer support bot detects and blocks injection attempts that try to override its instructions
- A data toolchain sanitizes user prompts before passing to a search or data extraction tool
- Canary tokens alert when a prompt tries to override system messages
- LLM-based detector flags manipulated prompts and logs incidents for audit
- Prompt injection defenses are integrated with sandboxed execution to prevent leakage
Frequently Asked Questions
Add this skill to your agents