Incident Response Helper
Install with:
npx machina-cli add skill dcs-soni/awesome-claude-skills/incident-response-helper --openclaw
Accelerate incident response with structured workflows, log analysis scripts, and automated postmortem generation.
Quick Start
When responding to an incident, copy this checklist:
Incident Response Progress:
- [ ] Step 1: Initial Assessment
- [ ] Step 2: Collect & Analyze Logs
- [ ] Step 3: Build Timeline
- [ ] Step 4: Assess Impact
- [ ] Step 5: Root Cause Analysis
- [ ] Step 6: Resolve & Verify
- [ ] Step 7: Generate Postmortem
Workflow
Step 1: Initial Assessment
Gather information quickly:
- What's broken? — Identify affected services/endpoints
- Severity? — P1 (total outage), P2 (major degradation), P3 (partial impact)
- When did it start? — Approximate start time
- Who's affected? — Users, regions, features
Run quick health check if URL known:
python scripts/check_health.py <url> --timeout 10
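If the script isn't at hand, the same probe can be sketched in a few lines of standard-library Python. This is an illustrative stand-in, not the actual check_health.py implementation:

```python
import urllib.request
import urllib.error

def check_health(url, timeout=10):
    """Return (status_code, ok) for a single endpoint probe."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, 200 <= resp.status < 300
    except urllib.error.HTTPError as e:
        return e.code, False   # server answered, but with an error status
    except (urllib.error.URLError, TimeoutError):
        return None, False     # unreachable or timed out
```

A non-2xx or unreachable endpoint comes back as not-ok, which is usually enough signal for triage.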
Step 2: Collect & Analyze Logs
Gather logs from affected services and analyze:
python scripts/analyze_logs.py <log_file> --format json
Output includes:
- Error patterns and frequency
- Exception stack traces
- Latency anomalies
- HTTP status code distribution
For multiple log files, run against each and compare patterns.
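The core of this step is a frequency count over error lines and status codes. A simplified stand-in for analyze_logs.py, assuming a common access-log-style format, might look like:

```python
import re
from collections import Counter

ERROR_RE = re.compile(r"\b(ERROR|CRITICAL|Exception)\b")
STATUS_RE = re.compile(r'" (\d{3}) ')  # HTTP status in access-log style lines

def summarize(lines):
    """Count error patterns and HTTP status codes across log lines."""
    errors = Counter()
    statuses = Counter()
    for line in lines:
        if ERROR_RE.search(line):
            # Bucket by the text after the level marker to group repeats
            errors[line.split("ERROR")[-1].strip()[:80]] += 1
        m = STATUS_RE.search(line)
        if m:
            statuses[m.group(1)] += 1
    return errors, statuses
```

Running this over each file and diffing the top error buckets is a quick way to spot which service changed behavior first.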
Step 3: Build Timeline
Generate chronological incident timeline:
python scripts/generate_timeline.py <log_dir> --start "YYYY-MM-DDTHH:MM:SS" --end "YYYY-MM-DDTHH:MM:SS"
Key events to identify:
- First error occurrence
- Deployment or config changes
- Traffic patterns
- External dependency failures
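Conceptually, timeline generation is just timestamp extraction plus a sort. A minimal sketch (assuming ISO 8601 timestamps in the log lines, which may differ from what generate_timeline.py actually parses):

```python
import re
from datetime import datetime

TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

def build_timeline(lines):
    """Extract (timestamp, message) pairs and return them in time order."""
    events = []
    for line in lines:
        m = TS_RE.search(line)
        if m:
            ts = datetime.fromisoformat(m.group(1))
            events.append((ts, line.strip()))
    return sorted(events, key=lambda e: e[0])
```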
Step 4: Assess Impact
Quantify the damage:
| Metric | How to measure |
|---|---|
| Duration | End time - Start time |
| Users affected | Error logs, support tickets |
| Revenue impact | Failed transactions |
| Data loss | Check persistence layer |
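The duration metric from the table is simple datetime arithmetic; a small helper (illustrative, not part of the skill's scripts) makes it unambiguous:

```python
from datetime import datetime

def incident_duration(start_iso, end_iso):
    """Duration in minutes between incident start and end (ISO 8601 strings)."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds() / 60

# e.g. a 95-minute incident
minutes = incident_duration("2024-01-01T10:00:00", "2024-01-01T11:35:00")
```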
Step 5: Root Cause Analysis
Apply systematic analysis:
- What changed? — Deployments, configs, dependencies
- 5 Whys — Keep asking "why" until the root cause is found
- Contributing factors — List all factors, not just primary cause
For common issues, see RUNBOOKS.md.
Step 6: Resolve & Verify
- Implement fix — Code change, rollback, or config update
- Verify resolution — Run health checks, monitor metrics
- Communicate — Update stakeholders
Verification:
python scripts/check_health.py <url> --timeout 10
# Verify logs show no new errors
python scripts/analyze_logs.py <new_logs> --format text
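Verification is more trustworthy when the endpoint stays healthy for a full window rather than passing a single check. A sketch of that idea, where `probe` is a hypothetical stand-in for a check_health-style call:

```python
import time

def verify_stable(probe, checks=5, interval=2.0):
    """Return True only if `probe()` succeeds `checks` times in a row."""
    for _ in range(checks):
        if not probe():
            return False  # any failure means the fix isn't confirmed yet
        time.sleep(interval)
    return True
```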
Step 7: Generate Postmortem
Create blameless postmortem document:
python scripts/create_postmortem.py --title "Incident Title" --severity P1 --output postmortem.md
See POSTMORTEM.md for template and guidelines.
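The generated document is essentially a Markdown skeleton with the incident metadata filled in. A minimal sketch of the idea (section names here are assumptions, not the actual create_postmortem.py template):

```python
from datetime import date

def postmortem_skeleton(title, severity):
    """Render a minimal blameless postmortem outline as Markdown."""
    return "\n".join([
        f"# Postmortem: {title}",
        f"Date: {date.today().isoformat()}  |  Severity: {severity}",
        "",
        "## Summary",
        "## Timeline",
        "## Root Cause",
        "## Impact",
        "## Action Items",
        "",
    ])
```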
Utility Scripts
| Script | Purpose |
|---|---|
| analyze_logs.py | Parse logs, find error patterns |
| generate_timeline.py | Create timeline from logs |
| create_postmortem.py | Generate postmortem template |
| check_health.py | Quick endpoint health check |
Examples
Example 1: Database Connection Exhaustion
User: "Our API is returning 500s, help me investigate"
- Run check_health.py → Confirms 500 errors
- Run analyze_logs.py on API logs → Finds "connection pool exhausted"
- Run generate_timeline.py → Shows spike after traffic increase
- Check RUNBOOKS.md → Database section has resolution
- Fix: Increase pool size, verify with health check
- Generate postmortem
Example 2: Deployment Caused Regression
User: "Users reporting slow responses after deploy"
- Initial assessment: P2, latency issue
- Analyze logs → High latency in specific endpoint
- Timeline → Correlates with deployment time
- Root cause → N+1 query introduced in new code
- Resolution → Rollback deployment
- Create postmortem with action items
Related Skills
- codebase-onboarding — Understand service architecture first
- api-docs-generator — Document API for better debugging
Source
https://github.com/dcs-soni/awesome-claude-skills/blob/main/incident-response-helper/SKILL.md
Overview
Incident Response Helper accelerates incident response with structured workflows, log analysis, and automated postmortem generation. It guides teams through seven steps—from initial assessment to postmortem—while leveraging scripts like analyze_logs.py, generate_timeline.py, and create_postmortem.py to speed resolution and learning.
How This Skill Works
The skill walks you through a seven-step workflow: initial assessment, collect & analyze logs, build a timeline, assess impact, root cause analysis, resolve & verify, and generate a postmortem. It relies on Python scripts (check_health.py, analyze_logs.py, generate_timeline.py, create_postmortem.py) to perform quick health checks, parse logs, assemble timelines, and create blameless postmortems for shareable learnings.
When to Use It
- When a service is down or experiencing an outage (P1) and you need rapid triage.
- When debugging production error spikes or degraded performance.
- When investigating API errors (e.g., 500s) and identifying failure points from logs.
- When you need to build a precise, chronological incident timeline for stakeholders.
- When preparing a blameless postmortem to document learning and prevent recurrence.
Best Practices
- Start with Initial Assessment by identifying what's broken, severity, start time, and affected users/regions.
- Run a quick health check with check_health.py to validate endpoint status early.
- Collect and analyze logs with analyze_logs.py, comparing patterns across multiple log files.
- Build a timeline focusing on first errors, deployments/config changes, traffic changes, and external dependency failures.
- Generate and share a blameless postmortem with create_postmortem.py, incorporating learnings and actions.
Example Use Cases
- Example 1: Database connection exhaustion causes API 500s; analyze_logs.py surfaces pool exhaustion and generate_timeline.py shows a spike after a traffic increase.
- Example 2: External dependency outage leads to cascading latency; initial assessment flags dependency as root cause after log correlation.
- Example 3: Cache eviction triggers higher latency and error rate; timeline highlights cache misses post-deploy.
- Example 4: Deployment drift or misconfigured feature flag causes degraded functionality; RCA traces to recent config changes.
- Example 5: Network partition results in sporadic service degradation; health checks and log patterns identify partition window.