Incident Response Helper

Install:

npx machina-cli add skill dcs-soni/awesome-claude-skills/incident-response-helper --openclaw

Accelerate incident response with structured workflows, log analysis scripts, and automated postmortem generation.

Quick Start

When responding to an incident, copy this checklist:

Incident Response Progress:
- [ ] Step 1: Initial Assessment
- [ ] Step 2: Collect & Analyze Logs
- [ ] Step 3: Build Timeline
- [ ] Step 4: Assess Impact
- [ ] Step 5: Root Cause Analysis
- [ ] Step 6: Resolve & Verify
- [ ] Step 7: Generate Postmortem

Workflow

Step 1: Initial Assessment

Gather information quickly:

  1. What's broken? — Identify affected services/endpoints
  2. Severity? — P1 (total outage), P2 (major degradation), P3 (partial impact)
  3. When did it start? — Approximate start time
  4. Who's affected? — Users, regions, features

Run quick health check if URL known:

python scripts/check_health.py <url> --timeout 10

Step 2: Collect & Analyze Logs

Gather logs from affected services and analyze:

python scripts/analyze_logs.py <log_file> --format json

Output includes:

  • Error patterns and frequency
  • Exception stack traces
  • Latency anomalies
  • HTTP status code distribution

For multiple log files, run against each and compare patterns.
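
The skill's analyze_logs.py is not shown here, but its core idea (counting normalized error messages and status codes) can be sketched as follows. The log line format, field names, and normalization rule are assumptions chosen for illustration:

```python
import re
from collections import Counter

# Assumed log shape: "<ISO timestamp> <LEVEL> <message> status=<code>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<msg>.*?)(?:\s+status=(?P<status>\d{3}))?$"
)

def analyze(lines):
    """Count error messages and HTTP status codes across log lines."""
    errors, statuses = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        if m["level"] in ("ERROR", "CRITICAL"):
            # Collapse digits so "timeout after 31s" and "timeout after 32s"
            # count as one pattern rather than many unique messages.
            errors[re.sub(r"\d+", "N", m["msg"])] += 1
        if m["status"]:
            statuses[m["status"]] += 1
    return {"top_errors": errors.most_common(5), "status_codes": dict(statuses)}

logs = [
    "2024-05-01T10:00:01 ERROR connection pool exhausted after 31s status=500",
    "2024-05-01T10:00:02 ERROR connection pool exhausted after 32s status=500",
    "2024-05-01T10:00:03 INFO request ok status=200",
]
report = analyze(logs)
```

Normalizing away variable parts (numbers, IDs) is what turns thousands of raw error lines into a handful of comparable patterns across files.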

Step 3: Build Timeline

Generate chronological incident timeline:

python scripts/generate_timeline.py <log_dir> --start "YYYY-MM-DDTHH:MM:SS" --end "YYYY-MM-DDTHH:MM:SS"

Key events to identify:

  • First error occurrence
  • Deployment or config changes
  • Traffic patterns
  • External dependency failures
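
Timeline building is essentially a merge-and-sort over timestamped events from several sources. A sketch of the idea (the input shape and source names are hypothetical, not generate_timeline.py's real interface):

```python
from datetime import datetime

def build_timeline(sources):
    """Merge timestamped lines from several logs into one chronological list.

    `sources` maps a source name to lines of the form "<ISO timestamp> <event>".
    """
    events = []
    for source, lines in sources.items():
        for line in lines:
            ts_str, _, message = line.partition(" ")
            events.append((datetime.fromisoformat(ts_str), source, message))
    return sorted(events)  # tuples sort by timestamp first

sources = {
    "deploy.log": ["2024-05-01T09:58:00 deploy v2.3.1 rolled out"],
    "api.log": [
        "2024-05-01T10:00:01 first 500 observed",
        "2024-05-01T09:59:30 latency p99 above SLO",
    ],
}
timeline = build_timeline(sources)
```

Interleaving deploy events with error events is the point: the "what changed?" question in Step 5 often answers itself once both streams sit in one ordered list.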

Step 4: Assess Impact

Quantify the damage:

Metric         | How to measure
Duration       | End time - Start time
Users affected | Error logs, support tickets
Revenue impact | Failed transactions
Data loss      | Check persistence layer
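
The duration row is simple timestamp arithmetic; the timestamps below are made up for illustration:

```python
from datetime import datetime

# First error vs. verified recovery, from the timeline in Step 3
start = datetime.fromisoformat("2024-05-01T09:58:00")
end = datetime.fromisoformat("2024-05-01T10:42:30")

duration = end - start                   # a timedelta: 0:44:30
minutes = duration.total_seconds() / 60  # 44.5
```

Quoting duration in minutes (rather than a raw timedelta) keeps the impact table readable for non-engineering stakeholders.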

Step 5: Root Cause Analysis

Apply systematic analysis:

  1. What changed? — Deployments, configs, dependencies
  2. 5 Whys — Keep asking "why" until root cause found
  3. Contributing factors — List all factors, not just primary cause

For common issues, see RUNBOOKS.md.

Step 6: Resolve & Verify

  1. Implement fix — Code change, rollback, or config update
  2. Verify resolution — Run health checks, monitor metrics
  3. Communicate — Update stakeholders

Verification:

python scripts/check_health.py <url> --timeout 10
# Verify logs show no new errors
python scripts/analyze_logs.py <new_logs> --format text

Step 7: Generate Postmortem

Create blameless postmortem document:

python scripts/create_postmortem.py --title "Incident Title" --severity P1 --output postmortem.md
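
At its core, create_postmortem.py just renders a skeleton document from the incident metadata. A sketch of that idea, with section headings and field names assumed for illustration rather than taken from the actual script:

```python
from datetime import date

TEMPLATE = """\
# Postmortem: {title}

- Severity: {severity}
- Date: {date}
- Status: Draft

## Summary
## Timeline
## Root Cause
## Impact
## Action Items
"""

def create_postmortem(title: str, severity: str) -> str:
    """Render a blameless-postmortem skeleton ready to fill in."""
    return TEMPLATE.format(title=title, severity=severity,
                           date=date.today().isoformat())

doc = create_postmortem("API 500s from pool exhaustion", "P1")
```

Generating the skeleton during the incident, not after, makes it far more likely the timeline and action items actually get written down.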

See POSTMORTEM.md for template and guidelines.


Utility Scripts

Script               | Purpose
analyze_logs.py      | Parse logs, find error patterns
generate_timeline.py | Create timeline from logs
create_postmortem.py | Generate postmortem template
check_health.py      | Quick endpoint health check

Examples

Example 1: Database Connection Exhaustion

User: "Our API is returning 500s, help me investigate"

  1. Run check_health.py → Confirms 500 errors
  2. Run analyze_logs.py on API logs → Finds "connection pool exhausted"
  3. Run generate_timeline.py → Shows spike after traffic increase
  4. Check RUNBOOKS.md → Database section has resolution
  5. Fix: Increase pool size, verify with health check
  6. Generate postmortem

Example 2: Deployment Caused Regression

User: "Users reporting slow responses after deploy"

  1. Initial assessment: P2, latency issue
  2. Analyze logs → High latency in specific endpoint
  3. Timeline → Correlates with deployment time
  4. Root cause → N+1 query introduced in new code
  5. Resolution → Rollback deployment
  6. Create postmortem with action items

Related Skills

  • codebase-onboarding — Understand service architecture first
  • api-docs-generator — Document API for better debugging

Source

View on GitHub: https://github.com/dcs-soni/awesome-claude-skills/blob/main/incident-response-helper/SKILL.md

Overview

Incident Response Helper accelerates incident response with structured workflows, log analysis, and automated postmortem generation. It guides teams through seven steps—from initial assessment to postmortem—while leveraging scripts like analyze_logs.py, generate_timeline.py, and create_postmortem.py to speed resolution and learning.

How This Skill Works

The skill walks you through a seven-step workflow: initial assessment, collect & analyze logs, build a timeline, assess impact, root cause analysis, resolve & verify, and generate a postmortem. It relies on Python scripts (check_health.py, analyze_logs.py, generate_timeline.py, create_postmortem.py) to perform quick health checks, parse logs, assemble timelines, and create blameless postmortems for shareable learnings.

When to Use It

  • When a service is down or experiencing an outage (P1) and you need rapid triage.
  • When debugging production error spikes or degraded performance.
  • When investigating API errors (e.g., 500s) and identifying failure points from logs.
  • When you need to build a precise, chronological incident timeline for stakeholders.
  • When preparing a blameless postmortem to document learning and prevent recurrence.

Quick Start

  1. Step 1: Initial Assessment
  2. Step 2: Collect & Analyze Logs
  3. Step 3: Build Timeline

Best Practices

  • Start with Initial Assessment by identifying what's broken, severity, start time, and affected users/regions.
  • Run a quick health check with check_health.py to validate endpoint status early.
  • Collect and analyze logs with analyze_logs.py, comparing patterns across multiple log files.
  • Build a timeline focusing on first errors, deployments/config changes, traffic changes, and external dependency failures.
  • Generate and share a blameless postmortem with create_postmortem.py, incorporating learnings and actions.

Example Use Cases

  • Example 1: Database connection exhaustion causes API 500s; analyze_logs.py surfaces pool exhaustion and generate_timeline.py shows a spike after a traffic increase.
  • Example 2: External dependency outage leads to cascading latency; initial assessment flags dependency as root cause after log correlation.
  • Example 3: Cache eviction triggers higher latency and error rate; timeline highlights cache misses post-deploy.
  • Example 4: Deployment drift or misconfigured feature flag causes degraded functionality; RCA traces to recent config changes.
  • Example 5: Network partition results in sporadic service degradation; health checks and log patterns identify partition window.
