incident-response
Incident Response
Structured workflow for diagnosing and resolving production incidents.
When to Use This Skill
Use this skill when:
- An alert has fired
- Users report issues
- Monitoring shows anomalies
- System behavior is unexpected
- You're the on-call responder
Incident Response Phases
Phase 1: Triage (First 5 minutes)
Goal: Assess severity and impact.
Quick Assessment Questions
- What is broken? (service, feature, infrastructure)
- Who is affected? (all users, subset, internal only)
- When did it start? (correlate with deployments/changes)
- Is it getting worse? (check error rate trend)
Severity Classification
| Severity | Impact | Response |
|---|---|---|
| SEV1 | Complete outage, all users affected | All hands, exec notification |
| SEV2 | Major feature broken, many users affected | Primary + backup on-call |
| SEV3 | Minor feature broken, some users affected | Primary on-call |
| SEV4 | No user impact, potential issue | Next business day |
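The severity table can be encoded as a small helper for runbook automation; a minimal sketch, where the function name and the coarse "who is affected" inputs are illustrative, not part of any standard tooling:

```shell
# classify_severity: map the "who is affected" answer to a SEV level.
# Input scopes ("all"|"many"|"some"|"none") mirror the table above.
classify_severity() {
  case "$1" in
    all)  echo "SEV1" ;;   # complete outage, all users affected
    many) echo "SEV2" ;;   # major feature broken, many users affected
    some) echo "SEV3" ;;   # minor feature broken, some users affected
    none) echo "SEV4" ;;   # no user impact, potential issue
    *)    echo "UNKNOWN"; return 1 ;;
  esac
}

classify_severity all   # prints SEV1
```

A helper like this is useful mainly for keeping chat-ops bots and paging rules consistent with the table.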
Phase 2: Context Gathering (5-15 minutes)
Goal: Collect data to understand the problem.
Kubernetes Context
# Cluster health overview
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Check specific service
kubectl get pods -l app=<service-name> -o wide
kubectl logs -l app=<service-name> --tail=100
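The `grep -v Running | grep -v Completed` filter above can hide lines that merely contain those words. A stricter sketch reads from stdin so it works on live output or saved dumps; the function name is illustrative:

```shell
# unhealthy_pods: filter `kubectl get pods -A` output down to pods whose
# STATUS column (field 4 with -A) is neither Running nor Completed.
# Reads stdin, so it can be tested offline against captured output.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed"'
}

# Usage against a live cluster:
#   kubectl get pods -A | unhealthy_pods
```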
Recent Changes
# Recently created deployments (creationTimestamp is creation time,
# not the latest rollout; use `kubectl rollout history` for rollouts)
kubectl get deployments -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CREATED:.metadata.creationTimestamp' \
| sort -k3 -r | head -10
# Git history (if in repo)
git log --oneline --since="2 hours ago"
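Correlating change times with the incident start is the core of this step. A minimal sketch of that check, using epoch seconds so it is portable; the function name and the 2-hour default window are assumptions, not standard tooling:

```shell
# changed_recently: succeed if a change timestamp (epoch seconds) falls
# within WINDOW seconds before the incident start. Defaults to 2 hours.
changed_recently() {
  change_ts=$1; incident_ts=$2; window=${3:-7200}
  [ "$change_ts" -ge $((incident_ts - window)) ] && \
  [ "$change_ts" -le "$incident_ts" ]
}

# A deploy 30 minutes before the incident started is a prime suspect:
changed_recently 1700001800 1700003600 && echo "change correlates with incident"
```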
Metrics Check
# If Prometheus available
# Check error rates, latency, traffic
# If CloudWatch (note: -v-1H is BSD/macOS date syntax; on GNU/Linux use
# --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ))
aws cloudwatch get-metric-statistics \
--namespace <namespace> \
--metric-name <metric> \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Average
Phase 3: Diagnosis (15-30 minutes)
Goal: Identify root cause.
Common Root Causes
- Recent deployment - Check if timing correlates
- Resource exhaustion - CPU, memory, disk, connections
- External dependency failure - Database, API, DNS
- Configuration change - ConfigMaps, secrets, feature flags
- Traffic spike - Unexpected load
- Certificate/credential expiry - Auth failures
- Infrastructure issue - Node failure, network partition
Diagnosis Decision Tree
Is there a recent deployment?
├── Yes → Check deployment logs, consider rollback
└── No → Continue
Are pods crashing?
├── Yes → Check logs: kubectl logs <pod> --previous
└── No → Continue
Are resources exhausted?
├── Yes → Scale up or optimize
└── No → Continue
Is an external dependency failing?
├── Yes → Check dependency status, implement fallback
└── No → Continue
Is there a traffic spike?
├── Yes → Scale up, enable rate limiting
└── No → Escalate for deeper investigation
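The decision tree above can be sketched as a single function for chat-ops or runbook automation. The y/n arguments answer the five questions in order; the function name and output strings are illustrative:

```shell
# triage_next_step: walk the diagnosis decision tree. Arguments are y/n
# answers to: recent deploy? pods crashing? resources exhausted?
# dependency failing? traffic spike?
triage_next_step() {
  [ "$1" = y ] && { echo "Check deployment logs; consider rollback"; return; }
  [ "$2" = y ] && { echo "Check logs: kubectl logs <pod> --previous"; return; }
  [ "$3" = y ] && { echo "Scale up or optimize"; return; }
  [ "$4" = y ] && { echo "Check dependency status; implement fallback"; return; }
  [ "$5" = y ] && { echo "Scale up; enable rate limiting"; return; }
  echo "Escalate for deeper investigation"
}

triage_next_step n n y n n   # prints "Scale up or optimize"
```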
Phase 4: Mitigation (ASAP)
Goal: Restore service, even if root cause unknown.
Quick Mitigations
Rollback Deployment:
# Kubernetes rollback
kubectl rollout undo deployment/<name> -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>
Scale Up:
# Increase replicas
kubectl scale deployment/<name> --replicas=5 -n <namespace>
Restart Pods:
# Rolling restart
kubectl rollout restart deployment/<name> -n <namespace>
Toggle Feature Flag:
# If feature flags available, disable problematic feature
Redirect Traffic:
# If multiple regions, redirect away from affected region
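Under pressure it is easy to fat-finger a mitigation command. A minimal wrapper that builds the standard commands above and supports a dry-run mode, so the exact command can be pasted into the incident channel before execution; the wrapper name and `DRY_RUN` convention are assumptions, not a standard tool:

```shell
# mitigate: build (and optionally run) a standard mitigation command.
# DRY_RUN=1 prints the command instead of executing it.
mitigate() {
  action=$1; name=$2; ns=$3
  case "$action" in
    rollback) cmd="kubectl rollout undo deployment/$name -n $ns" ;;
    restart)  cmd="kubectl rollout restart deployment/$name -n $ns" ;;
    scale)    cmd="kubectl scale deployment/$name --replicas=$4 -n $ns" ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$cmd"; else $cmd; fi
}

DRY_RUN=1 mitigate rollback checkout prod
# prints: kubectl rollout undo deployment/checkout -n prod
```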
Phase 5: Communication
Goal: Keep stakeholders informed.
Status Update Template
**Incident Update - [SERVICE] - [SEV LEVEL]**
**Status:** Investigating / Identified / Mitigating / Resolved
**Impact:** [Who/what is affected]
**Start Time:** [When it started]
**Current Actions:** [What we're doing]
**Next Update:** [When to expect next update]
---
Incident Commander: [Name]
Communication Cadence
| Severity | Update Frequency |
|---|---|
| SEV1 | Every 15 minutes |
| SEV2 | Every 30 minutes |
| SEV3 | Every hour |
| SEV4 | End of day |
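The cadence table can back a reminder bot or timer script. A minimal lookup, with the function name being illustrative:

```shell
# update_interval: map a severity level to the cadence table above
# (minutes between updates, or "end-of-day" for SEV4).
update_interval() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 60 ;;
    SEV4) echo "end-of-day" ;;
    *) return 1 ;;
  esac
}

update_interval SEV1   # prints 15
```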
Phase 6: Resolution
Goal: Confirm service is restored.
Verification Checklist
- Error rates returned to baseline
- Latency returned to baseline
- All pods healthy
- Synthetic monitors passing
- User reports have stopped
Close Incident
**Incident Resolved - [SERVICE]**
**Duration:** [Start] to [End] ([X] minutes)
**Root Cause:** [Brief description]
**Resolution:** [What fixed it]
**Follow-ups:** [Links to action items]
Post-mortem scheduled: [Date/Time]
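The "Duration: [X] minutes" field in the resolution message is a frequent source of arithmetic slips during handoff. A minimal sketch using epoch seconds; the function name is illustrative:

```shell
# incident_duration: whole minutes between start and end (epoch seconds),
# for the "Duration" field in the close-incident message.
incident_duration() {
  echo $(( ($2 - $1) / 60 ))
}

incident_duration 1700000000 1700002700   # prints 45
```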
Post-Incident
Post-Mortem Template
# Incident Post-Mortem: [Title]
**Date:** [Date]
**Duration:** [Duration]
**Severity:** [SEV Level]
**Authors:** [Names]
## Summary
[2-3 sentence summary]
## Impact
- Users affected: [Number/percentage]
- Revenue impact: [If applicable]
- SLA impact: [If applicable]
## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Event]
## Root Cause
[Detailed explanation]
## Resolution
[How it was fixed]
## Lessons Learned
### What went well
- [Item]
### What could be improved
- [Item]
## Action Items
- [ ] [Action] - Owner: [Name] - Due: [Date]
- [ ] [Action] - Owner: [Name] - Due: [Date]
Quick Reference
Essential Commands
# Quick cluster health
kubectl get nodes && kubectl get pods -A | grep -v Running
# Service status
kubectl get pods,svc,endpoints -l app=<service>
# Recent events
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Quick logs
kubectl logs -l app=<service> --tail=50 --all-containers
Escalation Contacts
Document your escalation path:
- Primary on-call
- Secondary on-call
- Team lead
- Engineering manager
- VP Engineering (SEV1 only)
Related Skills
- k8s-debug: For Kubernetes-specific debugging
- log-analysis: For log pattern analysis
- argocd-gitops: For GitOps rollbacks
Source
https://github.com/agenticdevops/devops-execution-engine/blob/main/skills/incident-response/SKILL.md
Overview
Incident Response provides a structured workflow for diagnosing and resolving production incidents. It guides triage, context gathering, diagnosis, and rapid mitigation to restore services.
How This Skill Works
The skill divides incidents into six phases: Triage, Context Gathering, Diagnosis, Mitigation, Communication, and Resolution. It leverages real-time data from Kubernetes, logs, and metrics to identify root causes and applies safe mitigations such as rollbacks or scaling to restore functionality.
Quick Start
- Step 1: Identify incident and perform Phase 1 triage to classify severity (SEV1-SEV4).
- Step 2: Gather context with Kubernetes commands, recent changes, and metrics.
- Step 3: Mitigate quickly with rollback/scale/restart, then document and communicate.
Best Practices
- Start with Phase 1 quick assessment questions to gauge severity and impact.
- Gather context efficiently (pods, deployments, events, logs, metrics) using kubectl, logs, and monitoring tools.
- Apply the Severity table (SEV1-SEV4) to classify impact and align the response.
- Follow the Diagnosis decision tree to identify root causes and decide on safe mitigations.
- Communicate status updates and next steps to stakeholders; escalate as needed.
Example Use Cases
- Outage caused by a recent deployment; used Phase 1 triage and rollback to restore service.
- Memory exhaustion led to degraded performance; scaled deployment and restarted affected pods.
- External dependency failure (database/API) detected; implemented fallback and retries.
- Traffic spike from legitimate load (not an attack); enabled rate limiting and autoscaling to absorb it.
- Certificate expiry caused authentication failures; rotated the certificates and credentials to restore service.