incident-management

npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/incident-management --openclaw

Incident Management

Implement effective incident management processes.

Incident Severity

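| Severity | Impact | Response | Example |
|----------|--------|----------|---------|
| SEV1 | Total outage | Immediate, all-hands | Site down |
| SEV2 | Major degradation | Urgent, on-call | Feature broken |
| SEV3 | Minor impact | Standard | Slow performance |
| SEV4 | Minimal | Next business day | Cosmetic issue |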
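
The severity levels can also be kept as data so that tooling and people read from the same definitions. The following is a minimal sketch, not part of the skill itself: the `Severity`, `SeverityPolicy`, and `POLICIES` names are illustrative, and `requires_paging` encodes an assumed rule (page immediately for SEV1 and SEV2 only) that a team would need to adjust to its own policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means higher urgency, matching SEV1..SEV4."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    response: str
    example: str


# Values copied from the severity levels above; the structure is illustrative.
POLICIES = {
    Severity.SEV1: SeverityPolicy("Total outage", "Immediate, all-hands", "Site down"),
    Severity.SEV2: SeverityPolicy("Major degradation", "Urgent, on-call", "Feature broken"),
    Severity.SEV3: SeverityPolicy("Minor impact", "Standard", "Slow performance"),
    Severity.SEV4: SeverityPolicy("Minimal", "Next business day", "Cosmetic issue"),
}


def requires_paging(severity: Severity) -> bool:
    """Assumed rule (not from the skill): page right away for SEV1 and SEV2."""
    return severity <= Severity.SEV2
```

With this layout, `POLICIES[Severity.SEV2].response` gives the expected response for a SEV2 incident, and the paging threshold lives in one place.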

Incident Process

incident_workflow:
  1_detect:
    - Alerting triggers
    - Customer reports
    - Monitoring anomalies
    
  2_triage:
    - Severity assessment
    - Impact determination
    - Team notification
    
  3_respond:
    - Incident commander assigned
    - Communication established
    - Mitigation started
    
  4_resolve:
    - Root cause addressed
    - Service restored
    - Customer notified
    
  5_review:
    - Timeline documented
    - Root cause analysis
    - Action items created
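
The five stages above run in a fixed order, which a small state tracker can enforce. This is a hedged sketch under that assumption: the `Stage` and `Incident` names are hypothetical, and a real process may allow moving backwards (for example, re-triaging after new information), which this version deliberately does not model.

```python
from enum import Enum


class Stage(Enum):
    """The five workflow stages, numbered in the order they occur."""
    DETECT = 1
    TRIAGE = 2
    RESPOND = 3
    RESOLVE = 4
    REVIEW = 5


class Incident:
    """Tracks which workflow stage an incident is in (illustrative only)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECT
        self.history = [Stage.DETECT]

    def advance(self) -> Stage:
        """Move to the next stage; refuse to go past the review stage."""
        if self.stage is Stage.REVIEW:
            raise ValueError("incident is already in its final stage")
        self.stage = Stage(self.stage.value + 1)
        self.history.append(self.stage)
        return self.stage
```

Keeping `history` makes it easy to attach timestamps later, which feeds the timeline documented during review.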

Incident Commander

ic_responsibilities:
  - Own incident resolution
  - Coordinate response teams
  - Manage communication
  - Make escalation decisions
  - Schedule post-mortem

Post-Incident Review

## Incident Summary
- Duration:
- Impact:
- Severity:

## Timeline

## Root Cause

## What Went Well

## What Could Be Improved

## Action Items
| Item | Owner | Due Date |
|------|-------|----------|
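
The review template above can be pre-filled from incident data so that responders start from a consistent document. The sketch below is one possible approach, not something the skill prescribes: `render_review` and its parameters are hypothetical, duration is assumed to be reported in whole minutes, and a separator row is emitted under the table header so the action items render as a markdown table.

```python
from datetime import datetime


def render_review(started: datetime, resolved: datetime, impact: str,
                  severity: str, action_items: list[tuple[str, str, str]]) -> str:
    """Fill the post-incident review template; free-text sections stay empty."""
    # Assumption: duration is rounded down to whole minutes.
    minutes = int((resolved - started).total_seconds() // 60)
    lines = [
        "## Incident Summary",
        f"- Duration: {minutes} minutes",
        f"- Impact: {impact}",
        f"- Severity: {severity}",
        "",
        "## Timeline",
        "",
        "## Root Cause",
        "",
        "## What Went Well",
        "",
        "## What Could Be Improved",
        "",
        "## Action Items",
        "| Item | Owner | Due Date |",
        "|------|-------|----------|",
    ]
    # One table row per (item, owner, due date) tuple.
    lines += [f"| {item} | {owner} | {due} |" for item, owner, due in action_items]
    return "\n".join(lines)
```

Only the summary and action items are filled in automatically; the timeline, root cause, and retrospective sections are left for the review meeting.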

Best Practices

  • Clear severity definitions
  • Defined escalation paths
  • Blameless post-mortems
  • Action item tracking
  • Regular training

Source

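git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills

The skill file is at compliance/continuity/incident-management/SKILL.md in that repository:
https://github.com/BagelHole/DevOps-Security-Agent-Skills/blob/main/compliance/continuity/incident-management/SKILL.md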

Overview

Incident management defines how to detect, triage, respond, resolve, and review production incidents. It aligns on-call scheduling, escalation, and blameless post-mortems to shorten downtime and prevent recurrences.

How This Skill Works

Incidents are detected through alerts, customer reports, or monitoring anomalies. Triage assigns severity and determines impact, then the incident commander coordinates response, communication, and mitigation. After resolution, a post-incident review captures the timeline, root cause, and action items.

When to Use It

  • Managing a production outage (SEV1) requiring immediate escalation
  • Major degradation affecting user experience (SEV2)
  • On-call shift coverage and escalation path setup
  • Post-incident review and RCA documentation
  • Blameless lessons-learned and action-item tracking

Quick Start

  1. Define incident_workflow with its five stages (detect, triage, respond, resolve, review) and trigger criteria
  2. Assign an Incident Commander, establish escalation paths, and set on-call schedules
  3. After resolution, run a Post-Incident Review (RCA, timeline, action items) and update runbooks
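
The escalation paths mentioned in the second step can be written down as an ordered list with time thresholds. This is a minimal sketch under stated assumptions: the contact names and the 15 and 30 minute thresholds are hypothetical placeholders, and `contacts_to_page` is an illustrative helper rather than part of any paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    contact: str       # hypothetical role name, not a real pager target
    wait_minutes: int  # minutes without acknowledgement before paging this contact


# Assumed example path; replace roles and thresholds with your own.
ESCALATION_PATH = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 15),
    EscalationStep("engineering-manager", 30),
]


def contacts_to_page(minutes_unacknowledged: int) -> list[str]:
    """Return every contact who should have been paged by this point."""
    return [
        step.contact
        for step in ESCALATION_PATH
        if step.wait_minutes <= minutes_unacknowledged
    ]
```

For example, `contacts_to_page(20)` returns the primary and secondary on-call contacts, since the manager threshold has not yet been reached.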

Example Use Cases

  • Site outage caused by database failure requiring an all-hands on-call response
  • API latency spike impacting multiple features and requiring rapid triage
  • Payment gateway outage affecting checkout flow
  • Deployment introduced a service outage requiring rollback and RCA
  • On-call rotation misconfiguration delaying incident response
