incident-response
npx machina-cli add skill anthropics/knowledge-work-plugins/incident-response --openclaw
Incident Response
Guide incident response from detection through resolution and postmortem.
Severity Classification
| Level | Criteria | Response Time |
|---|---|---|
| SEV1 | Service down, all users affected | Immediate, all-hands |
| SEV2 | Major feature degraded, many users affected | Within 15 min |
| SEV3 | Minor feature issue, some users affected | Within 1 hour |
| SEV4 | Cosmetic or low-impact issue | Next business day |
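The response times in the table can be turned into a concrete deadline. The sketch below is one possible encoding, not part of the skill itself: the names `Severity`, `RESPONSE_WINDOW`, `next_business_day`, and `respond_by` are hypothetical, "Immediate" is modeled as a zero-length window, and "Next business day" is assumed to mean the same clock time on the next weekday, with holidays ignored.

```python
from datetime import datetime, timedelta
from enum import Enum


class Severity(Enum):
    SEV1 = 1  # service down, all users affected
    SEV2 = 2  # major feature degraded, many users affected
    SEV3 = 3  # minor feature issue, some users affected
    SEV4 = 4  # cosmetic or low-impact issue


# Fixed response windows taken from the table; SEV4 is handled separately.
RESPONSE_WINDOW = {
    Severity.SEV1: timedelta(0),           # immediate, all-hands
    Severity.SEV2: timedelta(minutes=15),
    Severity.SEV3: timedelta(hours=1),
}


def next_business_day(moment: datetime) -> datetime:
    """Same clock time on the next weekday (assumption: Mon-Fri, no holidays)."""
    candidate = moment + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def respond_by(severity: Severity, detected_at: datetime) -> datetime:
    """Latest time by which a response should have started."""
    if severity is Severity.SEV4:
        return next_business_day(detected_at)
    return detected_at + RESPONSE_WINDOW[severity]
```

A team with its own business calendar or on-call hours would replace `next_business_day` with its own rule.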
Response Framework
- Triage: Classify severity, identify scope, assign incident commander
- Communicate: Status page, internal updates, customer comms if needed
- Mitigate: Stop the bleeding first, root cause later
- Resolve: Implement fix, verify, confirm resolution
- Postmortem: Blameless review, 5 whys, action items
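The five steps above can be tracked as an ordered set of phases. The following is a minimal sketch under stated assumptions: `Phase`, `Incident`, and `advance` are hypothetical names, and treating the phases as strictly sequential is a simplification, since communication usually continues while mitigation and resolution are under way.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Phase(IntEnum):
    TRIAGE = 1
    COMMUNICATE = 2
    MITIGATE = 3
    RESOLVE = 4
    POSTMORTEM = 5


@dataclass
class Incident:
    title: str
    commander: str  # the incident commander assigned during triage
    phase: Phase = Phase.TRIAGE
    log: list[str] = field(default_factory=list)

    def advance(self, note: str) -> Phase:
        """Record a note for the current phase, then move to the next one.

        The incident stays in POSTMORTEM once it gets there.
        """
        self.log.append(f"{self.phase.name}: {note}")
        if self.phase is not Phase.POSTMORTEM:
            self.phase = Phase(self.phase + 1)
        return self.phase


incident = Incident(title="Checkout outage", commander="on-call lead")
incident.advance("classified as SEV1, all users affected")
incident.advance("status page updated")
```

The log doubles as a rough timeline, which the postmortem needs later.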
Communication Templates
Provide clear, factual updates at regular cadence. Include: what's happening, who's affected, what we're doing, when the next update is.
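One way to keep updates consistent is to make the four required fields explicit, so an update cannot be sent with one missing. This is an illustrative sketch only; the `StatusUpdate` class, its field names, and the sample text are assumptions, not a template shipped with the skill.

```python
from dataclasses import dataclass


@dataclass
class StatusUpdate:
    what_is_happening: str
    who_is_affected: str
    what_we_are_doing: str
    next_update: str

    def render(self) -> str:
        """Format the update as four labeled lines, in a fixed order."""
        return "\n".join([
            f"What's happening: {self.what_is_happening}",
            f"Who's affected: {self.who_is_affected}",
            f"What we're doing: {self.what_we_are_doing}",
            f"Next update: {self.next_update}",
        ])


update = StatusUpdate(
    what_is_happening="Checkout requests are failing",
    who_is_affected="All users attempting to pay",
    what_we_are_doing="Rolling back the latest deploy",
    next_update="In 30 minutes",
)
print(update.render())
```

Because every field is required by the constructor, an incomplete update fails early instead of reaching the status page.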
Postmortem Format
Blameless. Focus on systems and processes. Include timeline, root cause analysis (5 whys), what went well, what went poorly, and action items with owners and due dates.
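The postmortem sections listed above can be captured in a small record that reports which parts are still empty. This is a sketch under assumptions: `ActionItem`, `Postmortem`, and `missing_sections` are hypothetical names, and each section is modeled as a plain list of strings.

```python
from dataclasses import dataclass, field, fields
from datetime import date


@dataclass
class ActionItem:
    description: str
    owner: str  # every action item needs an owner...
    due: date   # ...and a due date


@dataclass
class Postmortem:
    timeline: list[str] = field(default_factory=list)
    five_whys: list[str] = field(default_factory=list)  # root cause analysis
    went_well: list[str] = field(default_factory=list)
    went_poorly: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


draft = Postmortem(
    timeline=["14:02 alert fired", "14:05 incident commander assigned"],
    five_whys=["Why did checkout fail? A config change removed a required key."],
)
print(draft.missing_sections())
```

Making `owner` and `due` required fields on `ActionItem` enforces the "owners and due dates" rule when the item is created.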
Source
https://github.com/anthropics/knowledge-work-plugins/blob/main/engineering/skills/incident-response/SKILL.md
Overview
Incident Response guides you from detection to resolution and postmortem. It defines severity levels (SEV1-SEV4), outlines a 5-step response framework, and provides communication templates to keep stakeholders informed.
How This Skill Works
Start with triage: classify severity, identify scope, and assign an incident commander. Then work through the remaining steps of the framework (Communicate, Mitigate, Resolve, Postmortem), ending with a blameless postmortem that uses a 5 whys analysis to drive improvements.
When to Use It
- You detect production downtime affecting all users (SEV1).
- A major feature is degraded and impacts many users (SEV2).
- A minor feature issue affects some users (SEV3).
- Cosmetic or low-impact issues require monitoring and a scheduled fix (SEV4).
- Any reported incident requires triage and an incident commander to coordinate response.
Quick Start
- Step 1: Detect the incident, classify its severity, identify its scope, and appoint the incident commander.
- Step 2: Communicate clearly (status page, internal updates, customer updates if needed) and implement immediate mitigations.
- Step 3: Resolve the issue, verify the fix, confirm resolution, and conduct a blameless postmortem with action items.
Best Practices
- Classify severity and scope early and assign an incident commander to own the response.
- Provide clear, regular communications (status page, internal updates, customer updates if needed).
- Mitigate first to stop the bleeding; investigate the root cause afterward.
- Follow the 5-step response framework: Triage, Communicate, Mitigate, Resolve, Postmortem.
- Conduct a blameless postmortem that includes a timeline, a 5 whys root cause analysis, what went well, what went poorly, and action items with owners and due dates.
Example Use Cases
- E-commerce checkout outage (SEV1): all users affected; incident commander appointed; rapid mitigation followed by root-cause analysis and postmortem.
- Search indexing slowdown (SEV2): degraded feature quality for many users; internal updates and customer communications maintained throughout.
- Payment gateway downstream failure (SEV2/SEV1): failed transactions across regions; immediate mitigation and a postmortem with action items.
- Daily batch job delay due to database lock (SEV3): partial user impact; scheduled fix with blameless review afterward.
- Frontend feature toggle causes UI jitter (SEV4): cosmetic impact; monitor and plan a targeted fix with minimal disruption.