What is prompt injection and why is it dangerous?

Prompt injection is when an attacker crafts input to manipulate an LLM's behavior, potentially overriding system prompts or triggering unsafe actions. It can lead to data leakage, unsafe tool access, or policy violations.

What dependencies does this skill require?

Dependencies include optional rebuff, transformers, and custom classifiers. The setup can also leverage canary tokens and multiple detector methods for robust coverage.

How do I test and validate the detector in production?

Integrate the detector into the pipeline, configure reasonable detection thresholds, run red-team prompts, monitor alerts and logs, and refine pattern rules and classifiers based on false positives and new attack patterns.

prompt-injection-detector

npx machina-cli add skill a5c-ai/babysitter/prompt-injection-detector --openclaw

Files (1)

SKILL.md

1.2 KB

Prompt Injection Detector Skill

Capabilities

Detect prompt injection attempts
Implement input sanitization
Configure detection classifiers
Design defense layers
Implement canary token detection
Create injection logging and alerting

Target Processes

prompt-injection-defense
tool-safety-validation

Implementation Details

Detection Methods

Pattern Matching: Known injection patterns
ML Classifiers: Trained injection detectors
Canary Tokens: Detect instruction override
LLM-Based: Use LLM to detect manipulation
Perplexity Analysis: Unusual input patterns

Defense Strategies

Input preprocessing
Prompt structure design
Output validation
Sandboxed execution
Multi-layer defense

Configuration Options

Detection threshold
Pattern rules
Classifier model
Action policies
Alerting settings

Best Practices

Defense in depth
Regular pattern updates
Monitor false positives
Test with red-team inputs

Dependencies

rebuff (optional)
transformers
Custom classifiers

Source

git clone https://github.com/a5c-ai/babysitter/blob/main/plugins/babysitter/skills/babysit/process/specializations/ai-agents-conversational/skills/prompt-injection-detector/SKILL.md

View on GitHub

Overview

Prompt injection detector secures LLM apps by identifying and blocking injection attempts. It combines pattern matching, ML classifiers, canary tokens, and perplexity analysis to catch adversarial prompts. With defense layers and logging, it helps maintain safe tool usage and prompt integrity.

How This Skill Works

It monitors prompts through multiple detection methods: pattern matching for known attacks, ML-based classifiers, canary tokens, LLM-based assessments, and perplexity checks. When a risk is detected, it triggers input preprocessing, prompt restructuring, output validation, and sandboxed execution as part of a multi-layer defense.

When to Use It

When accepting user prompts in chatbots and assistants that access sensitive tools
When integrating LLMs with external tools or actions (tool-safety validation)
When you need to prevent instruction overrides and manipulation via prompts
When auditing for compliance and maintaining an auditable injection log
When conducting security testing or red-team exercises to validate defenses

Quick Start

Step 1: Enable prompt-injection-detector in the LLM pipeline and set initial detection threshold
Step 2: Configure pattern rules, classifier model, and action policies
Step 3: Run red-team prompts and review canary token alerts and logs

Best Practices

Defense in depth across input, prompt, and output stages
Regularly update pattern rules and retrain classifiers
Monitor and tune false positives and alert fatigue
Test with red-team prompts and adversarial inputs
Configure clear action policies and alerting settings

Example Use Cases

A customer support bot detects and blocks injection attempts that try to override its instructions
A data toolchain sanitizes user prompts before passing to a search or data extraction tool
Canary tokens alert when a prompt tries to override system messages
LLM-based detector flags manipulated prompts and logs incidents for audit
Prompt injection defenses are integrated with sandboxed execution to prevent leakage

Frequently Asked Questions

Add this skill to your agents