incident-response
Incident Response
Structured workflow for diagnosing and resolving production incidents.
When to Use This Skill
Use this skill when:
- An alert has fired
- Users report issues
- Monitoring shows anomalies
- System behavior is unexpected
- You're the on-call responder
Incident Response Phases
Phase 1: Triage (First 5 minutes)
Goal: Assess severity and impact.
Quick Assessment Questions
- What is broken? (service, feature, infrastructure)
- Who is affected? (all users, subset, internal only)
- When did it start? (correlate with deployments/changes)
- Is it getting worse? (check error rate trend)
Severity Classification
| Severity | Impact | Response |
|---|---|---|
| SEV1 | Complete outage, all users affected | All hands, exec notification |
| SEV2 | Major feature broken, many users affected | Primary + backup on-call |
| SEV3 | Minor feature broken, some users affected | Primary on-call |
| SEV4 | No user impact, potential issue | Next business day |
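The severity table can be encoded as a small helper for runbook automation; a minimal sketch, where the function name and the coarse "who is affected" inputs are illustrative, not part of any standard tooling:

```shell
# classify_severity: map the "who is affected" answer to a SEV level.
# Input scopes ("all"|"many"|"some"|"none") mirror the table above.
classify_severity() {
  case "$1" in
    all)  echo "SEV1" ;;   # complete outage, all users affected
    many) echo "SEV2" ;;   # major feature broken, many users affected
    some) echo "SEV3" ;;   # minor feature broken, some users affected
    none) echo "SEV4" ;;   # no user impact, potential issue
    *)    echo "UNKNOWN"; return 1 ;;
  esac
}

classify_severity all   # prints SEV1
```

A helper like this is useful mainly for keeping chat-ops bots and paging rules consistent with the table.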
Phase 2: Context Gathering (5-15 minutes)
Goal: Collect data to understand the problem.
Kubernetes Context
# Cluster health overview
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
# Recent events
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
# Check specific service
kubectl get pods -l app=<service-name> -o wide
kubectl logs -l app=<service-name> --tail=100
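The `grep -v Running | grep -v Completed` filter above can hide lines that merely contain those words. A stricter sketch reads from stdin so it works on live output or saved dumps; the function name is illustrative:

```shell
# unhealthy_pods: filter `kubectl get pods -A` output down to pods whose
# STATUS column (field 4 with -A) is neither Running nor Completed.
# Reads stdin, so it can be tested offline against captured output.
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed"'
}

# Usage against a live cluster:
#   kubectl get pods -A | unhealthy_pods
```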
Recent Changes
# Recently created deployments (creationTimestamp is creation time,
# not the latest rollout; use `kubectl rollout history` for rollouts)
kubectl get deployments -A -o custom-columns=\
'NAMESPACE:.metadata.namespace,NAME:.metadata.name,CREATED:.metadata.creationTimestamp' \
| sort -k3 -r | head -10
# Git history (if in repo)
git log --oneline --since="2 hours ago"
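Correlating change times with the incident start is the core of this step. A minimal sketch of that check, using epoch seconds so it is portable; the function name and the 2-hour default window are assumptions, not standard tooling:

```shell
# changed_recently: succeed if a change timestamp (epoch seconds) falls
# within WINDOW seconds before the incident start. Defaults to 2 hours.
changed_recently() {
  change_ts=$1; incident_ts=$2; window=${3:-7200}
  [ "$change_ts" -ge $((incident_ts - window)) ] && \
  [ "$change_ts" -le "$incident_ts" ]
}

# A deploy 30 minutes before the incident started is a prime suspect:
changed_recently 1700001800 1700003600 && echo "change correlates with incident"
```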
Metrics Check
# If Prometheus available
# Check error rates, latency, traffic
# If CloudWatch (note: -v-1H is BSD/macOS date syntax; on GNU/Linux use
# --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ))
aws cloudwatch get-metric-statistics \
--namespace <namespace> \
--metric-name <metric> \
--start-time $(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
--period 60 \
--statistics Average
Phase 3: Diagnosis (15-30 minutes)
Goal: Identify root cause.
Common Root Causes
- Recent deployment - Check if timing correlates
- Resource exhaustion - CPU, memory, disk, connections
- External dependency failure - Database, API, DNS
- Configuration change - ConfigMaps, secrets, feature flags
- Traffic spike - Unexpected load
- Certificate/credential expiry - Auth failures
- Infrastructure issue - Node failure, network partition
Diagnosis Decision Tree
Is there a recent deployment?
├── Yes → Check deployment logs, consider rollback
└── No → Continue
Are pods crashing?
├── Yes → Check logs: kubectl logs <pod> --previous
└── No → Continue
Are resources exhausted?
├── Yes → Scale up or optimize
└── No → Continue
Is an external dependency failing?
├── Yes → Check dependency status, implement fallback
└── No → Continue
Is there a traffic spike?
├── Yes → Scale up, enable rate limiting
└── No → Escalate for deeper investigation
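The decision tree above can be sketched as a single function for chat-ops or runbook automation. The y/n arguments answer the five questions in order; the function name and output strings are illustrative:

```shell
# triage_next_step: walk the diagnosis decision tree. Arguments are y/n
# answers to: recent deploy? pods crashing? resources exhausted?
# dependency failing? traffic spike?
triage_next_step() {
  [ "$1" = y ] && { echo "Check deployment logs; consider rollback"; return; }
  [ "$2" = y ] && { echo "Check logs: kubectl logs <pod> --previous"; return; }
  [ "$3" = y ] && { echo "Scale up or optimize"; return; }
  [ "$4" = y ] && { echo "Check dependency status; implement fallback"; return; }
  [ "$5" = y ] && { echo "Scale up; enable rate limiting"; return; }
  echo "Escalate for deeper investigation"
}

triage_next_step n n y n n   # prints "Scale up or optimize"
```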
Phase 4: Mitigation (ASAP)
Goal: Restore service, even if root cause unknown.
Quick Mitigations
Rollback Deployment:
# Kubernetes rollback
kubectl rollout undo deployment/<name> -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>
Scale Up:
# Increase replicas
kubectl scale deployment/<name> --replicas=5 -n <namespace>
Restart Pods:
# Rolling restart
kubectl rollout restart deployment/<name> -n <namespace>
Toggle Feature Flag:
# If feature flags available, disable problematic feature
Redirect Traffic:
# If multiple regions, redirect away from affected region
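Under pressure it is easy to fat-finger a mitigation command. A minimal wrapper that builds the standard commands above and supports a dry-run mode, so the exact command can be pasted into the incident channel before execution; the wrapper name and `DRY_RUN` convention are assumptions, not a standard tool:

```shell
# mitigate: build (and optionally run) a standard mitigation command.
# DRY_RUN=1 prints the command instead of executing it.
mitigate() {
  action=$1; name=$2; ns=$3
  case "$action" in
    rollback) cmd="kubectl rollout undo deployment/$name -n $ns" ;;
    restart)  cmd="kubectl rollout restart deployment/$name -n $ns" ;;
    scale)    cmd="kubectl scale deployment/$name --replicas=$4 -n $ns" ;;
    *) echo "unknown action: $action" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-0}" = 1 ]; then echo "$cmd"; else $cmd; fi
}

DRY_RUN=1 mitigate rollback checkout prod
# prints: kubectl rollout undo deployment/checkout -n prod
```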
Phase 5: Communication
Goal: Keep stakeholders informed.
Status Update Template
**Incident Update - [SERVICE] - [SEV LEVEL]**
**Status:** Investigating / Identified / Mitigating / Resolved
**Impact:** [Who/what is affected]
**Start Time:** [When it started]
**Current Actions:** [What we're doing]
**Next Update:** [When to expect next update]
---
Incident Commander: [Name]
Communication Cadence
| Severity | Update Frequency |
|---|---|
| SEV1 | Every 15 minutes |
| SEV2 | Every 30 minutes |
| SEV3 | Every hour |
| SEV4 | End of day |
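The cadence table can back a reminder bot or timer script. A minimal lookup, with the function name being illustrative:

```shell
# update_interval: map a severity level to the cadence table above
# (minutes between updates, or "end-of-day" for SEV4).
update_interval() {
  case "$1" in
    SEV1) echo 15 ;;
    SEV2) echo 30 ;;
    SEV3) echo 60 ;;
    SEV4) echo "end-of-day" ;;
    *) return 1 ;;
  esac
}

update_interval SEV1   # prints 15
```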
Phase 6: Resolution
Goal: Confirm service is restored.
Verification Checklist
- Error rates returned to baseline
- Latency returned to baseline
- All pods healthy
- Synthetic monitors passing
- User reports have stopped
Close Incident
**Incident Resolved - [SERVICE]**
**Duration:** [Start] to [End] ([X] minutes)
**Root Cause:** [Brief description]
**Resolution:** [What fixed it]
**Follow-ups:** [Links to action items]
Post-mortem scheduled: [Date/Time]
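The "Duration: [X] minutes" field in the resolution message is a frequent source of arithmetic slips during handoff. A minimal sketch using epoch seconds; the function name is illustrative:

```shell
# incident_duration: whole minutes between start and end (epoch seconds),
# for the "Duration" field in the close-incident message.
incident_duration() {
  echo $(( ($2 - $1) / 60 ))
}

incident_duration 1700000000 1700002700   # prints 45
```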
Post-Incident
Post-Mortem Template
# Incident Post-Mortem: [Title]
**Date:** [Date]
**Duration:** [Duration]
**Severity:** [SEV Level]
**Authors:** [Names]
## Summary
[2-3 sentence summary]
## Impact
- Users affected: [Number/percentage]
- Revenue impact: [If applicable]
- SLA impact: [If applicable]
## Timeline
- HH:MM - [Event]
- HH:MM - [Event]
- HH:MM - [Event]
## Root Cause
[Detailed explanation]
## Resolution
[How it was fixed]
## Lessons Learned
### What went well
- [Item]
### What could be improved
- [Item]
## Action Items
- [ ] [Action] - Owner: [Name] - Due: [Date]
- [ ] [Action] - Owner: [Name] - Due: [Date]
Quick Reference
Essential Commands
# Quick cluster health
kubectl get nodes && kubectl get pods -A | grep -v Running
# Service status
kubectl get pods,svc,endpoints -l app=<service>
# Recent events
kubectl get events --sort-by='.lastTimestamp' | tail -20
# Quick logs
kubectl logs -l app=<service> --tail=50 --all-containers
Escalation Contacts
Document your escalation path:
- Primary on-call
- Secondary on-call
- Team lead
- Engineering manager
- VP Engineering (SEV1 only)
Related Skills
- k8s-debug: For Kubernetes-specific debugging
- log-analysis: For log pattern analysis
- argocd-gitops: For GitOps rollbacks
Source
https://github.com/agenticdevops/devops-execution-engine/blob/main/skills/incident-response/SKILL.md
Overview
Incident Response provides a structured workflow for diagnosing and resolving production incidents. It guides triage, context gathering, diagnosis, and rapid mitigation to restore services.
How This Skill Works
The skill divides incidents into six phases: Triage, Context Gathering, Diagnosis, Mitigation, Communication, and Resolution. It leverages real-time data from Kubernetes, logs, and metrics to identify root causes and applies safe mitigations such as rollbacks or scaling to restore functionality.
Quick Start
- Step 1: Identify incident and perform Phase 1 triage to classify severity (SEV1-SEV4).
- Step 2: Gather context with Kubernetes commands, recent changes, and metrics.
- Step 3: Mitigate quickly with rollback/scale/restart, then document and communicate.
Best Practices
- Start with Phase 1 quick assessment questions to gauge severity and impact.
- Gather context efficiently (pods, deployments, events, logs, metrics) using kubectl, logs, and monitoring tools.
- Apply the Severity table (SEV1-SEV4) to classify impact and align the response.
- Follow the Diagnosis decision tree to identify root causes and decide on safe mitigations.
- Communicate status updates and next steps to stakeholders; escalate as needed.
Example Use Cases
- Outage caused by a recent deployment; used Phase 1 triage and rollback to restore service.
- Memory exhaustion led to degraded performance; scaled deployment and restarted affected pods.
- External dependency failure (database/API) detected; implemented fallback and retries.
- Traffic spike from legitimate load (not an attack); enabled rate limiting and autoscaling to absorb it.
- Certificate expiry caused authentication failures; rotated the certificates and credentials to restore service.