Incident Response Helper

Install:

npx machina-cli add skill dcs-soni/awesome-claude-skills/incident-response-helper --openclaw

Accelerate incident response with structured workflows, log analysis scripts, and automated postmortem generation.

Quick Start

When responding to an incident, copy this checklist:

Incident Response Progress:
- [ ] Step 1: Initial Assessment
- [ ] Step 2: Collect & Analyze Logs
- [ ] Step 3: Build Timeline
- [ ] Step 4: Assess Impact
- [ ] Step 5: Root Cause Analysis
- [ ] Step 6: Resolve & Verify
- [ ] Step 7: Generate Postmortem

Workflow

Step 1: Initial Assessment

Gather information quickly:

  1. What's broken? — Identify affected services/endpoints
  2. Severity? — P1 (total outage), P2 (major degradation), P3 (partial impact)
  3. When did it start? — Approximate start time
  4. Who's affected? — Users, regions, features

Run quick health check if URL known:

python scripts/check_health.py <url> --timeout 10

Step 2: Collect & Analyze Logs

Gather logs from affected services and analyze:

python scripts/analyze_logs.py <log_file> --format json

Output includes:

  • Error patterns and frequency
  • Exception stack traces
  • Latency anomalies
  • HTTP status code distribution

For multiple log files, run against each and compare patterns.
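
The skill's analyze_logs.py is not shown here, but its core idea (counting normalized error messages and status codes) can be sketched as follows. The log line format, field names, and normalization rule are assumptions chosen for illustration:

```python
import re
from collections import Counter

# Assumed log shape: "<ISO timestamp> <LEVEL> <message> status=<code>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*?)(?:\s+status=(?P<status>\d{3}))?$"
)

def analyze(lines):
    """Count error messages and HTTP status codes across log lines."""
    errors, statuses = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        if m["level"] in ("ERROR", "CRITICAL"):
            # Collapse digits so "timeout after 31s" and "timeout after 32s"
            # count as one pattern rather than many unique messages.
            errors[re.sub(r"\d+", "N", m["msg"])] += 1
        if m["status"]:
            statuses[m["status"]] += 1
    return {"top_errors": errors.most_common(5), "status_codes": dict(statuses)}

logs = [
    "2024-05-01T10:00:01 ERROR connection pool exhausted after 31s status=500",
    "2024-05-01T10:00:02 ERROR connection pool exhausted after 32s status=500",
    "2024-05-01T10:00:03 INFO request ok status=200",
]
report = analyze(logs)
```

Normalizing away variable parts (numbers, IDs) is what turns thousands of raw error lines into a handful of comparable patterns across files.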

Step 3: Build Timeline

Generate chronological incident timeline:

python scripts/generate_timeline.py <log_dir> --start "YYYY-MM-DDTHH:MM:SS" --end "YYYY-MM-DDTHH:MM:SS"

Key events to identify:

  • First error occurrence
  • Deployment or config changes
  • Traffic patterns
  • External dependency failures
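
Timeline building is essentially a merge-and-sort over timestamped events from several sources. A sketch of the idea (the input shape and source names are hypothetical, not generate_timeline.py's real interface):

```python
from datetime import datetime

def build_timeline(sources):
    """Merge timestamped lines from several logs into one chronological list.

    `sources` maps a source name to lines of the form "<ISO timestamp> <event>".
    """
    events = []
    for source, lines in sources.items():
        for line in lines:
            ts_str, _, message = line.partition(" ")
            events.append((datetime.fromisoformat(ts_str), source, message))
    return sorted(events)  # tuples sort by timestamp first

sources = {
    "deploy.log": ["2024-05-01T09:58:00 deploy v2.3.1 rolled out"],
    "api.log": [
        "2024-05-01T10:00:01 first 500 observed",
        "2024-05-01T09:59:30 latency p99 above SLO",
    ],
}
timeline = build_timeline(sources)
```

Interleaving deploy events with error events is the point: the "what changed?" question in Step 5 often answers itself once both streams sit in one ordered list.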

Step 4: Assess Impact

Quantify the damage:

Metric         | How to measure
Duration       | End time - Start time
Users affected | Error logs, support tickets
Revenue impact | Failed transactions
Data loss      | Check persistence layer
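
The duration row is simple timestamp arithmetic; the timestamps below are made up for illustration:

```python
from datetime import datetime

# First error vs. verified recovery, from the timeline in Step 3
start = datetime.fromisoformat("2024-05-01T09:58:00")
end = datetime.fromisoformat("2024-05-01T10:42:30")

duration = end - start                   # a timedelta: 0:44:30
minutes = duration.total_seconds() / 60  # 44.5
```

Quoting duration in minutes (rather than a raw timedelta) keeps the impact table readable for non-engineering stakeholders.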

Step 5: Root Cause Analysis

Apply systematic analysis:

  1. What changed? — Deployments, configs, dependencies
  2. 5 Whys — Keep asking "why" until root cause found
  3. Contributing factors — List all factors, not just primary cause

For common issues, see RUNBOOKS.md.

Step 6: Resolve & Verify

  1. Implement fix — Code change, rollback, or config update
  2. Verify resolution — Run health checks, monitor metrics
  3. Communicate — Update stakeholders

Verification:

python scripts/check_health.py <url> --timeout 10
# Verify logs show no new errors
python scripts/analyze_logs.py <new_logs> --format text

Step 7: Generate Postmortem

Create blameless postmortem document:

python scripts/create_postmortem.py --title "Incident Title" --severity P1 --output postmortem.md
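
At its core, create_postmortem.py just renders a skeleton document from the incident metadata. A sketch of that idea, with section headings and field names assumed for illustration rather than taken from the actual script:

```python
from datetime import date

TEMPLATE = """\
# Postmortem: {title}

- Severity: {severity}
- Date: {date}
- Status: Draft

## Summary
## Timeline
## Root Cause
## Impact
## Action Items
"""

def create_postmortem(title: str, severity: str) -> str:
    """Render a blameless-postmortem skeleton ready to fill in."""
    return TEMPLATE.format(title=title, severity=severity,
                           date=date.today().isoformat())

doc = create_postmortem("API 500s from pool exhaustion", "P1")
```

Generating the skeleton during the incident, not after, makes it far more likely the timeline and action items actually get written down.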

See POSTMORTEM.md for template and guidelines.


Utility Scripts

Script               | Purpose
analyze_logs.py      | Parse logs, find error patterns
generate_timeline.py | Create timeline from logs
create_postmortem.py | Generate postmortem template
check_health.py      | Quick endpoint health check

Examples

Example 1: Database Connection Exhaustion

User: "Our API is returning 500s, help me investigate"

  1. Run check_health.py → Confirms 500 errors
  2. Run analyze_logs.py on API logs → Finds "connection pool exhausted"
  3. Run generate_timeline.py → Shows spike after traffic increase
  4. Check RUNBOOKS.md → Database section has resolution
  5. Fix: Increase pool size, verify with health check
  6. Generate postmortem

Example 2: Deployment Caused Regression

User: "Users reporting slow responses after deploy"

  1. Initial assessment: P2, latency issue
  2. Analyze logs → High latency in specific endpoint
  3. Timeline → Correlates with deployment time
  4. Root cause → N+1 query introduced in new code
  5. Resolution → Rollback deployment
  6. Create postmortem with action items

Related Skills

  • codebase-onboarding — Understand service architecture first
  • api-docs-generator — Document API for better debugging

Source

View on GitHub: https://github.com/dcs-soni/awesome-claude-skills/blob/main/incident-response-helper/SKILL.md

Overview

Incident Response Helper accelerates incident response with structured workflows, log analysis, and automated postmortem generation. It guides teams through seven steps—from initial assessment to postmortem—while leveraging scripts like analyze_logs.py, generate_timeline.py, and create_postmortem.py to speed resolution and learning.

How This Skill Works

The skill walks you through a seven-step workflow: initial assessment, collect & analyze logs, build a timeline, assess impact, root cause analysis, resolve & verify, and generate a postmortem. It relies on Python scripts (check_health.py, analyze_logs.py, generate_timeline.py, create_postmortem.py) to perform quick health checks, parse logs, assemble timelines, and create blameless postmortems for shareable learnings.

When to Use It

  • When a service is down or experiencing an outage (P1) and you need rapid triage.
  • When debugging production error spikes or degraded performance.
  • When investigating API errors (e.g., 500s) and identifying failure points from logs.
  • When you need to build a precise, chronological incident timeline for stakeholders.
  • When preparing a blameless postmortem to document learning and prevent recurrence.

Quick Start

  1. Step 1: Initial Assessment
  2. Step 2: Collect & Analyze Logs
  3. Step 3: Build Timeline

Best Practices

  • Start with Initial Assessment by identifying what's broken, severity, start time, and affected users/regions.
  • Run a quick health check with check_health.py to validate endpoint status early.
  • Collect and analyze logs with analyze_logs.py, comparing patterns across multiple log files.
  • Build a timeline focusing on first errors, deployments/config changes, traffic changes, and external dependency failures.
  • Generate and share a blameless postmortem with create_postmortem.py, incorporating learnings and actions.

Example Use Cases

  • Example 1: Database connection exhaustion causes API 500s; analyze_logs.py surfaces pool exhaustion and generate_timeline.py shows a spike after a traffic increase.
  • Example 2: External dependency outage leads to cascading latency; initial assessment flags dependency as root cause after log correlation.
  • Example 3: Cache eviction triggers higher latency and error rate; timeline highlights cache misses post-deploy.
  • Example 4: Deployment drift or misconfigured feature flag causes degraded functionality; RCA traces to recent config changes.
  • Example 5: Network partition results in sporadic service degradation; health checks and log patterns identify partition window.
