incident-response

npx machina-cli add skill anthropics/knowledge-work-plugins/incident-response --openclaw

Incident Response

Guide incident response from detection through resolution and postmortem.

Severity Classification

Level  Criteria                                     Response Time
SEV1   Service down, all users affected             Immediate, all-hands
SEV2   Major feature degraded, many users affected  Within 15 min
SEV3   Minor feature issue, some users affected     Within 1 hour
SEV4   Cosmetic or low-impact issue                 Next business day
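
The table above can be encoded as a small lookup, sketched here in Python. The function name `classify`, its parameters, and the tie-breaking rules (for example, treating a minor issue as SEV3 regardless of user scope) are assumptions for illustration and are not part of the skill itself.

```python
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response targets taken from the table. None stands for "next business day",
# which would need a business calendar and is left out of this sketch.
RESPONSE_TARGET = {
    Severity.SEV1: timedelta(0),           # immediate, all-hands
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
    Severity.SEV4: None,
}


def classify(service_down: bool, feature_impact: str, user_scope: str) -> Severity:
    """Map observed impact to a severity level (hypothetical helper).

    feature_impact: "major", "minor", or "cosmetic"
    user_scope: "all", "many", or "some"
    """
    if service_down and user_scope == "all":
        return Severity.SEV1
    if feature_impact == "major" and user_scope in ("all", "many"):
        return Severity.SEV2
    if feature_impact == "minor":
        return Severity.SEV3
    return Severity.SEV4
```

Real incidents rarely fit one row cleanly, so a function like this is best treated as a starting point that the incident commander can override.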

Response Framework

  1. Triage: Classify severity, identify scope, assign incident commander
  2. Communicate: Status page, internal updates, customer comms if needed
  3. Mitigate: Stop the bleeding first, root cause later
  4. Resolve: Implement fix, verify, confirm resolution
  5. Postmortem: Blameless review, 5 whys, action items
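
As a rough sketch, the five steps can be modeled as ordered phases that an incident record moves through. The `Incident` class, its fields, and the `advance` method are hypothetical names chosen for this example. In practice communication continues throughout an incident, so strict sequencing is a simplification.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Phase(IntEnum):
    TRIAGE = 1
    COMMUNICATE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


@dataclass
class Incident:
    title: str
    commander: str  # assigned during triage
    phase: Phase = Phase.TRIAGE
    log: list = field(default_factory=list)

    def advance(self, note: str) -> Phase:
        """Record a note for the current phase, then move to the next one.

        The incident stays in POSTMORTEM once it gets there.
        """
        self.log.append((self.phase.name, note))
        if self.phase is not Phase.POSTMORTEM:
            self.phase = Phase(self.phase + 1)
        return self.phase
```

Keeping a per-phase log like this also gives the postmortem a ready-made timeline.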

Communication Templates

Provide clear, factual updates at regular cadence. Include: what's happening, who's affected, what we're doing, when the next update is.
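
One way to keep updates consistent is to render them from the four fields listed above. The following is a minimal sketch; the `StatusUpdate` class and its field names are assumptions, and the timestamp is assumed to already be in UTC.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StatusUpdate:
    what_is_happening: str
    who_is_affected: str
    what_we_are_doing: str
    next_update_at: datetime  # assumed to be expressed in UTC

    def render(self) -> str:
        """Format the update as four labeled lines."""
        return "\n".join([
            f"What's happening: {self.what_is_happening}",
            f"Who's affected: {self.who_is_affected}",
            f"What we're doing: {self.what_we_are_doing}",
            f"Next update: {self.next_update_at:%Y-%m-%d %H:%M} UTC",
        ])
```

The same rendered text can be reused for the status page, internal updates, and customer communications, adjusting detail for each audience.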

Postmortem Format

Blameless. Focus on systems and processes. Include timeline, root cause analysis (5 whys), what went well, what went poorly, and action items with owners and due dates.
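
The postmortem format above could be captured as a simple record with a completeness check. This is a sketch under assumed names (`Postmortem`, `ActionItem`, `problems`); the skill does not prescribe a data model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    timeline: List[str]
    five_whys: List[str]  # successive "why?" answers; the last is the root cause
    went_well: List[str]
    went_poorly: List[str]
    action_items: List[ActionItem] = field(default_factory=list)

    def problems(self) -> List[str]:
        """Return a list of gaps; an empty list means the draft looks complete."""
        issues = []
        if not self.timeline:
            issues.append("timeline is empty")
        if not self.five_whys:
            issues.append("root cause analysis is missing")
        if not self.action_items:
            issues.append("no action items")
        for item in self.action_items:
            if not item.owner.strip():
                issues.append(f"action item without owner: {item.description}")
        return issues
```

The check only covers structure. Whether the review is actually blameless depends on how the content is written, which code cannot verify.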

Source

https://github.com/anthropics/knowledge-work-plugins/blob/main/engineering/skills/incident-response/SKILL.md

Overview

Incident Response guides you from detection to resolution and postmortem. It defines severity levels (SEV1-SEV4), outlines a 5-step response framework, and provides communication templates to keep stakeholders informed.

How This Skill Works

Start with triage: classify severity, identify scope, and assign an incident commander. Then work through the rest of the framework (Communicate, Mitigate, Resolve, Postmortem), ending with a blameless 5 whys analysis to drive improvements.

When to Use It

  • You detect production downtime affecting all users (SEV1).
  • A major feature is degraded and impacts many users (SEV2).
  • A minor feature issue affects some users (SEV3).
  • Cosmetic or low-impact issues require monitoring and a scheduled fix (SEV4).
  • Any reported incident requires triage and an incident commander to coordinate response.

Quick Start

  1. Detect the incident, classify its severity and scope, and appoint the incident commander.
  2. Communicate clearly (status page, internal updates, customer updates if needed) and implement immediate mitigations.
  3. Resolve the issue, verify the fix, confirm resolution, and conduct a blameless postmortem with action items.

Best Practices

  • Classify severity and scope early and assign an incident commander to own the response.
  • Provide clear, regular communications (status page, internal updates, customer updates if needed).
  • Mitigate first to stop the bleeding; investigate the root cause afterward.
  • Follow the 5-step response framework: Triage, Communicate, Mitigate, Resolve, Postmortem.
  • Conduct a blameless postmortem with a 5 whys analysis, a timeline, what went well, what went poorly, and action items with owners and due dates.

Example Use Cases

  • E-commerce checkout outage (SEV1): all users affected; incident commander appointed; rapid mitigation followed by root-cause analysis and postmortem.
  • Search indexing slowdown (SEV2): degraded feature quality for many users; internal updates and customer communications maintained throughout.
  • Payment gateway downstream failure (SEV2/SEV1): failed transactions across regions; immediate mitigation and a postmortem with action items.
  • Daily batch job delay due to database lock (SEV3): partial user impact; scheduled fix with blameless review afterward.
  • Frontend feature toggle causes UI jitter (SEV4): cosmetic impact; monitor and plan a targeted fix with minimal disruption.
