incident-management
npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/incident-management --openclaw
Incident Management
Implement effective incident management processes.
Incident Severity
| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV1 | Total outage | Immediate, all-hands | Site down |
| SEV2 | Major degradation | Urgent, on-call | Feature broken |
| SEV3 | Minor impact | Standard | Slow performance |
| SEV4 | Minimal | Next business day | Cosmetic issue |
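The severity table above can be encoded as a lookup so tooling and people share one definition. This is a minimal sketch, not part of the skill itself: the `Severity` enum, `ResponsePolicy` class, and the `page_on_call` flag are hypothetical names introduced for illustration, and the paging choice per level is an assumption the table does not state.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the table; lower number means more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class ResponsePolicy:
    impact: str
    response: str
    page_on_call: bool  # assumption: not stated in the table


# Impact and response strings are copied from the severity table.
POLICIES = {
    Severity.SEV1: ResponsePolicy("Total outage", "Immediate, all-hands", True),
    Severity.SEV2: ResponsePolicy("Major degradation", "Urgent, on-call", True),
    Severity.SEV3: ResponsePolicy("Minor impact", "Standard", False),
    Severity.SEV4: ResponsePolicy("Minimal", "Next business day", False),
}


def policy_for(severity: Severity) -> ResponsePolicy:
    """Return the response policy for a given severity level."""
    return POLICIES[severity]
```

Using an `IntEnum` keeps comparisons such as `Severity.SEV1 < Severity.SEV3` meaningful, which is convenient when deciding whether an incident has been upgraded.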
Incident Process
incident_workflow:
  1_detect:
    - Alerting triggers
    - Customer reports
    - Monitoring anomalies
  2_triage:
    - Severity assessment
    - Impact determination
    - Team notification
  3_respond:
    - Incident commander assigned
    - Communication established
    - Mitigation started
  4_resolve:
    - Root cause addressed
    - Service restored
    - Customer notified
  5_review:
    - Timeline documented
    - Root cause analysis
    - Action items created
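The five workflow stages run in a fixed order, which can be modelled as a small state machine. The sketch below is one possible shape under that assumption; the `Stage` enum and `Incident` class are hypothetical and do not come from the skill.

```python
from enum import Enum


class Stage(Enum):
    """Workflow stages, declared in the order they occur."""
    DETECT = "detect"
    TRIAGE = "triage"
    RESPOND = "respond"
    RESOLVE = "resolve"
    REVIEW = "review"


ORDER = list(Stage)  # Enum iteration preserves declaration order


class Incident:
    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self) -> Stage:
        """Move to the next stage; stages cannot be skipped or repeated."""
        idx = ORDER.index(self.stage)
        if idx == len(ORDER) - 1:
            raise ValueError("incident is already in review; no further stage")
        self.stage = ORDER[idx + 1]
        self.history.append(self.stage)
        return self.stage
```

Recording `history` gives the review stage a record of which stages the incident passed through.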
Incident Commander
ic_responsibilities:
  - Own incident resolution
  - Coordinate response teams
  - Manage communication
  - Make escalation decisions
  - Schedule post-mortem
Post-Incident Review
## Incident Summary
- Duration:
- Impact:
- Severity:
## Timeline
## Root Cause
## What Went Well
## What Could Be Improved
## Action Items
| Item | Owner | Due Date |
|---|---|---|
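The review template above can be filled in programmatically so every post-incident document has the same sections. This is a hedged sketch: `render_review` and its parameters are hypothetical names, and the narrative sections are left empty for the reviewers to write.

```python
def render_review(duration, impact, severity, action_items):
    """Render the post-incident review template as markdown text.

    action_items is a list of (item, owner, due_date) tuples.
    """
    lines = [
        "## Incident Summary",
        f"- Duration: {duration}",
        f"- Impact: {impact}",
        f"- Severity: {severity}",
        "## Timeline",
        "## Root Cause",
        "## What Went Well",
        "## What Could Be Improved",
        "## Action Items",
        "| Item | Owner | Due Date |",
        "|---|---|---|",
    ]
    for item, owner, due in action_items:
        lines.append(f"| {item} | {owner} | {due} |")
    return "\n".join(lines)
```

For example, `render_review("45 min", "Checkout unavailable", "SEV1", [("Add DB failover alert", "alice", "2024-02-01")])` produces the template with one action-item row; the values shown are placeholders, not real incident data.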
Best Practices
- Clear severity definitions
- Defined escalation paths
- Blameless post-mortems
- Action item tracking
- Regular training
Source
https://github.com/BagelHole/DevOps-Security-Agent-Skills/blob/main/compliance/continuity/incident-management/SKILL.md
Overview
Incident management defines how to detect, triage, respond, resolve, and review production incidents. It aligns on-call scheduling, escalation, and blameless post-mortems to shorten downtime and prevent recurrences.
How This Skill Works
You detect incidents via alerts, customer reports, or monitoring anomalies. Triage assigns severity and impact, then the incident commander coordinates the response, communication, and mitigation; after resolution, a post-incident review captures timeline, root cause, and action items.
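Because the post-incident review depends on an accurate timeline, it helps to record events as they happen rather than reconstruct them afterwards. The following is a minimal sketch of such a log; the `Timeline` class and its methods are hypothetical and assume timestamps are supplied by the caller.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Timeline:
    """Append-only log of (timestamp, note) pairs for one incident."""
    events: list = field(default_factory=list)

    def record(self, when: datetime, note: str) -> None:
        self.events.append((when, note))

    def ordered(self) -> list:
        """Events sorted by timestamp, for the review document."""
        return sorted(self.events, key=lambda event: event[0])

    def duration(self) -> timedelta:
        """Time between the earliest and latest recorded events."""
        if not self.events:
            return timedelta(0)
        times = [when for when, _ in self.events]
        return max(times) - min(times)
```

The value returned by `duration()` can feed the "Duration" field of the review, assuming the first and last events mark detection and resolution.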
When to Use It
- Managing a production outage (SEV1) requiring immediate escalation
- Major degradation affecting user experience (SEV2)
- On-call shift coverage and escalation path setup
- Post-incident review and RCA documentation
- Blameless lessons-learned and action-item tracking
Quick Start
- Step 1: Define incident_workflow with 5 stages (detect, triage, respond, resolve, review) and trigger criteria
- Step 2: Assign an Incident Commander, establish escalation paths, and set on-call schedules
- Step 3: After resolution, run a Post-Incident Review (RCA, timeline, action items) and update runbooks
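The escalation path mentioned in Step 2 can be expressed as a list of time thresholds. This sketch is illustrative only: the roles and the 15- and 30-minute thresholds are assumptions, not values from the skill, and should be replaced with your organisation's own.

```python
# Hypothetical escalation path: (minutes without acknowledgement, role to page).
ESCALATION_PATH = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]


def who_to_page(minutes_unacknowledged: int) -> list:
    """Return every role that should have been paged by this point."""
    return [
        role
        for threshold, role in ESCALATION_PATH
        if minutes_unacknowledged >= threshold
    ]
```

Keeping the path as data rather than branching logic makes it easy to review and to adjust per severity level.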
Example Use Cases
- Site outage caused by database failure requiring an all-hands on-call response
- API latency spike impacting multiple features and requiring rapid triage
- Payment gateway outage affecting checkout flow
- Deployment introduced a service outage requiring rollback and RCA
- On-call rotation misconfiguration delaying incident response