What is incident severity and how is it defined?

Severity levels SEV1 (total outage) through SEV4 (minimal impact) map an incident's impact to the expected response. Use the Incident Severity table to guide prioritization and escalation.

Who leads the incident response?

An Incident Commander owns incident resolution, coordinates response teams, manages communication, makes escalation decisions, and schedules the post-mortem.

What should be included in a Post-Incident Review?

A Post-Incident Review should include an Incident Summary, Timeline, Root Cause, What Went Well, What Could Be Improved, and Action Items with owners and due dates.

incident-management

npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/incident-management --openclaw

Incident Management

Implement effective incident management processes.

Incident Severity

Severity	Impact	Response	Example
SEV1	Total outage	Immediate, all-hands	Site down
SEV2	Major degradation	Urgent, on-call	Feature broken
SEV3	Minor impact	Standard	Slow performance
SEV4	Minimal	Next business day	Cosmetic issue
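The table above can also be expressed as a lookup so that tooling and responders share one definition. The following is a minimal sketch, not part of the original skill: the names `Severity`, `SeverityPolicy`, `response_for`, and `most_severe` are hypothetical, and only the table values come from the source.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Severity levels from the table; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    impact: str
    response: str
    example: str


# Rows copied from the severity table; the structure itself is an assumption.
SEVERITY_POLICIES = {
    Severity.SEV1: SeverityPolicy("Total outage", "Immediate, all-hands", "Site down"),
    Severity.SEV2: SeverityPolicy("Major degradation", "Urgent, on-call", "Feature broken"),
    Severity.SEV3: SeverityPolicy("Minor impact", "Standard", "Slow performance"),
    Severity.SEV4: SeverityPolicy("Minimal", "Next business day", "Cosmetic issue"),
}


def response_for(severity: Severity) -> str:
    """Return the expected response for a severity level."""
    return SEVERITY_POLICIES[severity].response


def most_severe(a: Severity, b: Severity) -> Severity:
    """Pick the more severe of two levels (hypothetical helper)."""
    return a if a.value <= b.value else b
```

For example, `response_for(Severity.SEV2)` returns `"Urgent, on-call"`. When two reports of the same incident disagree, `most_severe` keeps the higher-priority level.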

Incident Process

incident_workflow:
  1_detect:
    - Alerting triggers
    - Customer reports
    - Monitoring anomalies
    
  2_triage:
    - Severity assessment
    - Impact determination
    - Team notification
    
  3_respond:
    - Incident commander assigned
    - Communication established
    - Mitigation started
    
  4_resolve:
    - Root cause addressed
    - Service restored
    - Customer notified
    
  5_review:
    - Timeline documented
    - Root cause analysis
    - Action items created
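The five stages above run in a fixed order, which can be enforced in code. The following is a minimal sketch under that assumption: the `Incident` class and its `advance` method are hypothetical names, and real processes may allow moving backwards (for example, re-triaging after new information).

```python
# Stage names taken from incident_workflow, without the numeric prefixes.
STAGES = ["detect", "triage", "respond", "resolve", "review"]


class Incident:
    """Tracks which workflow stage an incident is in (hypothetical model)."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = STAGES[0]
        self.history = [STAGES[0]]

    def advance(self) -> str:
        """Move to the next stage; raise if the incident is already in review."""
        index = STAGES.index(self.stage)
        if index == len(STAGES) - 1:
            raise ValueError("incident is already in the final stage")
        self.stage = STAGES[index + 1]
        self.history.append(self.stage)
        return self.stage
```

Calling `advance()` four times on a new incident walks it from `detect` to `review`, and `history` records each stage it passed through.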

Incident Commander

ic_responsibilities:
  - Own incident resolution
  - Coordinate response teams
  - Manage communication
  - Make escalation decisions
  - Schedule post-mortem

Post-Incident Review

## Incident Summary
- Duration:
- Impact:
- Severity:

## Timeline

## Root Cause

## What Went Well

## What Could Be Improved

## Action Items
| Item | Owner | Due Date |
|------|-------|----------|
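The review template can be filled in programmatically so every review starts with the same sections. This is a minimal sketch, assuming the review is stored as Markdown: `ActionItem` and `render_review` are hypothetical names, and the table separator row is added here only so the action items render as a Markdown table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActionItem:
    item: str
    owner: str
    due_date: str  # free-form text; the template does not fix a date format


def render_review(duration: str, impact: str, severity: str,
                  action_items: List[ActionItem]) -> str:
    """Render the Post-Incident Review template with the summary filled in."""
    lines = [
        "## Incident Summary",
        f"- Duration: {duration}",
        f"- Impact: {impact}",
        f"- Severity: {severity}",
        "",
        "## Timeline",
        "",
        "## Root Cause",
        "",
        "## What Went Well",
        "",
        "## What Could Be Improved",
        "",
        "## Action Items",
        "| Item | Owner | Due Date |",
        "|------|-------|----------|",  # separator row: an assumption, see lead-in
    ]
    # One table row per action item, each with an owner and a due date.
    for action in action_items:
        lines.append(f"| {action.item} | {action.owner} | {action.due_date} |")
    return "\n".join(lines)
```

The Timeline, Root Cause, and retrospective sections are left empty on purpose, since they are written by the responders during the review.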

Best Practices

Clear severity definitions
Defined escalation paths
Blameless post-mortems
Action item tracking
Regular training

Source

git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills

The skill file is at compliance/continuity/incident-management/SKILL.md in that repository.

Overview

Incident management defines how to detect, triage, respond, resolve, and review production incidents. It aligns on-call scheduling, escalation, and blameless post-mortems to shorten downtime and prevent recurrences.

How This Skill Works

You detect incidents via alerts, customer reports, or monitoring anomalies. Triage assigns severity and impact, then the incident commander coordinates the response, communication, and mitigation; after resolution, a post-incident review captures timeline, root cause, and action items.

When to Use It

Managing a production outage (SEV1) requiring immediate escalation
Major degradation affecting user experience (SEV2)
On-call shift coverage and escalation path setup
Post-incident review and RCA documentation
Blameless lessons-learned and action-item tracking

Quick Start

Step 1: Define incident_workflow with 5 stages (detect, triage, respond, resolve, review) and trigger criteria
Step 2: Assign an Incident Commander, establish escalation paths, and set on-call schedules
Step 3: After resolution, run a Post-Incident Review (RCA, timeline, action items) and update runbooks
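An escalation path from Step 2 can be written down as data rather than left as tribal knowledge. The following is a minimal sketch only: the roles, the 15 and 30 minute thresholds, and the `who_to_page` function are all assumptions chosen for illustration, not values from the skill.

```python
# (minutes without acknowledgement, who gets paged) -- hypothetical values.
ESCALATION_PATH = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]


def who_to_page(minutes_unacknowledged: int) -> str:
    """Return the contact to page after a page has gone unacknowledged."""
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    contact = ESCALATION_PATH[0][1]
    # Thresholds are sorted ascending, so the last match is the current tier.
    for threshold, name in ESCALATION_PATH:
        if minutes_unacknowledged >= threshold:
            contact = name
    return contact
```

With these assumed thresholds, a page that is still unacknowledged after 20 minutes goes to the secondary on-call. Keeping the path in one table makes it easy to review during regular training.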

Example Use Cases

Site outage caused by database failure requiring an all-hands on-call response
API latency spike impacting multiple features and requiring rapid triage
Payment gateway outage affecting checkout flow
Deployment that introduced a service outage, requiring rollback and RCA
On-call rotation misconfiguration delaying incident response
