
incident-response

npx machina-cli add skill agenticdevops/devops-execution-engine/incident-response --openclaw

Incident Response

Structured workflow for diagnosing and resolving production incidents.

When to Use This Skill

Use this skill when:

  • An alert has fired
  • Users report issues
  • Monitoring shows anomalies
  • System behavior is unexpected
  • You're the on-call responder

Incident Response Phases

Phase 1: Triage (First 5 minutes)

Goal: Assess severity and impact.

Quick Assessment Questions

  1. What is broken? (service, feature, infrastructure)
  2. Who is affected? (all users, subset, internal only)
  3. When did it start? (correlate with deployments/changes)
  4. Is it getting worse? (check error rate trend)

Severity Classification

| Severity | Impact                                    | Response                     |
|----------|-------------------------------------------|------------------------------|
| SEV1     | Complete outage, all users affected       | All hands, exec notification |
| SEV2     | Major feature broken, many users affected | Primary + backup on-call     |
| SEV3     | Minor feature broken, some users affected | Primary on-call              |
| SEV4     | No user impact, potential issue           | Next business day            |
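The mapping in this table can also be scripted for automation or chat-ops tooling; a minimal sketch (the function name and impact scopes are illustrative, not part of the skill):

```shell
# Illustrative mapping from user-impact scope to severity; adjust to your org
severity() {
  case "$1" in
    all)  echo SEV1 ;;   # complete outage
    many) echo SEV2 ;;   # major feature broken
    some) echo SEV3 ;;   # minor feature broken
    none) echo SEV4 ;;   # no user impact
    *)    echo UNKNOWN ;;
  esac
}

severity many   # prints SEV2
```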

Phase 2: Context Gathering (5-15 minutes)

Goal: Collect data to understand the problem.

Kubernetes Context

# Cluster health overview
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed

# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30

# Check specific service
kubectl get pods -l app=<service-name> -o wide
kubectl logs -l app=<service-name> --tail=100

Recent Changes

# Recently created deployments (creationTimestamp shows creation, not the last
# update; for rollout times use: kubectl rollout history deployment/<name>)
kubectl get deployments -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CREATED:.metadata.creationTimestamp' \
--sort-by=.metadata.creationTimestamp | tail -10

# Git history (if in repo)
git log --oneline --since="2 hours ago"

Metrics Check

# If Prometheus available
# Check error rates, latency, traffic

# If CloudWatch (note: date -u -v-1H is BSD/macOS syntax; on Linux use
# date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)
aws cloudwatch get-metric-statistics \
  --namespace <namespace> \
  --metric-name <metric> \
  --start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average
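For the Prometheus case, an instant query against the HTTP API is usually enough for a first read on error rates. A sketch, assuming the server is reachable at `$PROM_URL` and your services export a counter like `http_requests_total` (both are assumptions; adapt to your metric names):

```shell
# Hedged sketch: current 5xx rate across all services via the Prometheus
# instant-query API. PROM_URL and the metric name are assumptions.
PROM_URL="${PROM_URL:-http://localhost:9090}"
QUERY='sum(rate(http_requests_total{status=~"5.."}[5m]))'

curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY"
```

Swap the query for latency (`histogram_quantile` over a duration histogram) or traffic (`sum(rate(...))` without the status filter) as needed.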

Phase 3: Diagnosis (15-30 minutes)

Goal: Identify root cause.

Common Root Causes

  1. Recent deployment - Check if timing correlates
  2. Resource exhaustion - CPU, memory, disk, connections
  3. External dependency failure - Database, API, DNS
  4. Configuration change - ConfigMaps, secrets, feature flags
  5. Traffic spike - Unexpected load
  6. Certificate/credential expiry - Auth failures
  7. Infrastructure issue - Node failure, network partition
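Resource exhaustion (cause 2) is the quickest to rule in or out when metrics-server is installed in the cluster (an assumption; without it, fall back to your monitoring stack):

```shell
# Assumes metrics-server is running; surfaces the heaviest consumers
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -15
kubectl top pods -A --sort-by=cpu | head -15
```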

Diagnosis Decision Tree

Is there a recent deployment?
├── Yes → Check deployment logs, consider rollback
└── No → Continue

Are pods crashing?
├── Yes → Check logs: kubectl logs <pod> --previous
└── No → Continue

Are resources exhausted?
├── Yes → Scale up or optimize
└── No → Continue

Is an external dependency failing?
├── Yes → Check dependency status, implement fallback
└── No → Continue

Is there a traffic spike?
├── Yes → Scale up, enable rate limiting
└── No → Escalate for deeper investigation
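The first branch of the tree reduces to a timestamp comparison: did anything roll out inside the incident window? A small helper, assuming you have the rollout time as epoch seconds (e.g. converted with GNU `date -d` from a `creationTimestamp`):

```shell
# Returns success (0) if the given epoch timestamp is within the last N minutes
recent() {  # usage: recent <epoch_seconds> <window_minutes>
  local now
  now=$(date -u +%s)
  [ $(( now - $1 )) -le $(( $2 * 60 )) ]
}

deploy_ts=$(( $(date -u +%s) - 600 ))   # e.g. a rollout 10 minutes ago
if recent "$deploy_ts" 30; then
  echo "deployment within the last 30m - check its logs, consider rollback"
fi
```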

Phase 4: Mitigation (ASAP)

Goal: Restore service, even if root cause unknown.

Quick Mitigations

Rollback Deployment:

# Kubernetes rollback
kubectl rollout undo deployment/<name> -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>

Scale Up:

# Increase replicas
kubectl scale deployment/<name> --replicas=5 -n <namespace>

Restart Pods:

# Rolling restart
kubectl rollout restart deployment/<name> -n <namespace>

Toggle Feature Flag:

# If feature flags available, disable problematic feature
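One common pattern, sketched here under the assumption that flags live in a ConfigMap named `feature-flags` and the key `new-checkout` is hypothetical (adapt to your flag system, e.g. LaunchDarkly or Unleash, which have their own APIs):

```shell
# Assumption: flags are read from a ConfigMap named "feature-flags"
kubectl patch configmap feature-flags -n <namespace> \
  --type merge -p '{"data":{"new-checkout":"false"}}'

# If the app only reads flags at startup, follow with a rolling restart
kubectl rollout restart deployment/<name> -n <namespace>
```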

Redirect Traffic:

# If multiple regions, redirect away from affected region

Phase 5: Communication

Goal: Keep stakeholders informed.

Status Update Template

**Incident Update - [SERVICE] - [SEV LEVEL]**

**Status:** Investigating / Identified / Mitigating / Resolved
**Impact:** [Who/what is affected]
**Start Time:** [When it started]
**Current Actions:** [What we're doing]
**Next Update:** [When to expect next update]

---
Incident Commander: [Name]

Communication Cadence

| Severity | Update Frequency |
|----------|------------------|
| SEV1     | Every 15 minutes |
| SEV2     | Every 30 minutes |
| SEV3     | Every hour       |
| SEV4     | End of day       |
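The cadence table maps directly to a reminder helper for incident tooling; a sketch with illustrative names and an approximate value for "end of day":

```shell
# Illustrative: minutes until the next status update for a given severity
cadence_minutes() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 60 ;;
    SEV4) echo 480 ;;   # roughly end of business day
    *)    echo 0 ;;
  esac
}

cadence_minutes SEV1   # prints 15
```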

Phase 6: Resolution

Goal: Confirm service is restored.

Verification Checklist

  • Error rates returned to baseline
  • Latency returned to baseline
  • All pods healthy
  • Synthetic monitors passing
  • User reports have stopped

Close Incident

**Incident Resolved - [SERVICE]**

**Duration:** [Start] to [End] ([X] minutes)
**Root Cause:** [Brief description]
**Resolution:** [What fixed it]
**Follow-ups:** [Links to action items]

Post-mortem scheduled: [Date/Time]

Post-Incident

Post-Mortem Template

# Incident Post-Mortem: [Title]

**Date:** [Date]
**Duration:** [Duration]
**Severity:** [SEV Level]
**Authors:** [Names]

## Summary
[2-3 sentence summary]

## Impact
- Users affected: [Number/percentage]
- Revenue impact: [If applicable]
- SLA impact: [If applicable]

## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Event]

## Root Cause
[Detailed explanation]

## Resolution
[How it was fixed]

## Lessons Learned
### What went well
- [Item]

### What could be improved
- [Item]

## Action Items
- [ ] [Action] - Owner: [Name] - Due: [Date]
- [ ] [Action] - Owner: [Name] - Due: [Date]

Quick Reference

Essential Commands

# Quick cluster health
kubectl get nodes && kubectl get pods -A | grep -v Running

# Service status
kubectl get pods,svc,endpoints -l app=<service>

# Recent events
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Quick logs
kubectl logs -l app=<service> --tail=50 --all-containers

Escalation Contacts

Document your escalation path:

  1. Primary on-call
  2. Secondary on-call
  3. Team lead
  4. Engineering manager
  5. VP Engineering (SEV1 only)

Related Skills

  • k8s-debug: For Kubernetes-specific debugging
  • log-analysis: For log pattern analysis
  • argocd-gitops: For GitOps rollbacks

Source

git clone https://github.com/agenticdevops/devops-execution-engine.git
# Skill file: skills/incident-response/SKILL.md

Overview

Incident Response provides a structured workflow for diagnosing and resolving production incidents. It guides triage, context gathering, diagnosis, and rapid mitigation to restore services.

How This Skill Works

The skill divides incidents into six phases: Triage, Context Gathering, Diagnosis, Mitigation, Communication, and Resolution. It uses real-time data from Kubernetes, logs, and metrics to identify root causes, and applies safe mitigations such as rollbacks or scaling to restore functionality.


Quick Start

  1. Identify the incident and perform Phase 1 triage to classify severity (SEV1-SEV4).
  2. Gather context with Kubernetes commands, recent changes, and metrics.
  3. Mitigate quickly with rollback/scale/restart, then document and communicate.

Best Practices

  • Start with Phase 1 quick assessment questions to gauge severity and impact.
  • Gather context efficiently (pods, deployments, events, logs, metrics) using kubectl, logs, and monitoring tools.
  • Apply the Severity table (SEV1-SEV4) to classify impact and align the response.
  • Follow the Diagnosis decision tree to identify root causes and decide on safe mitigations.
  • Communicate status updates and next steps to stakeholders; escalate as needed.

Example Use Cases

  • Outage caused by a recent deployment; used Phase 1 triage and rollback to restore service.
  • Memory exhaustion led to degraded performance; scaled deployment and restarted affected pods.
  • External dependency failure (database/API) detected; implemented fallback and retries.
  • Legitimate traffic spike; enabled rate limiting and autoscaling to handle the load.
  • Certificate expiry caused authentication failures; rotated the certificate and restored service.
