incident-triage
npx machina-cli add skill tomkraaij/ai-skills-librarian/incident-triage --openclawFiles (1)
SKILL.md
359 B
Incident Triage
- Confirm impact and blast radius.
- Collect logs, metrics, and traces.
- Coordinate mitigation and communication.
Source
git clone https://github.com/tomkraaij/ai-skills-librarian/blob/main/examples/skill-pack-example/skills/incident-triage/SKILL.mdView on GitHub Overview
Incident-triage provides a repeatable first-response workflow for production incidents. By quickly confirming impact and blast radius, collecting logs, metrics, and traces, and coordinating mitigation and communication, teams reduce mean time to restore and improve reliability.
How This Skill Works
When an incident is detected, the skill guides responders through three core actions: confirm impact and blast radius, collect logs, metrics, and traces, then coordinate mitigation and stakeholder communication. This creates a traceable, auditable response and aligns teams on the next steps.
When to Use It
- During a new production outage affecting user-facing services
- When blast radius or impacted services are unclear
- When logs, metrics, or traces are scattered across systems
- During incident war rooms or escalation discussions
- When preparing for post-incident reviews and RCA
Quick Start
- Step 1: Detect incident and call out suspected impact and blast radius
- Step 2: Collect logs, metrics, and traces from affected components
- Step 3: Initiate mitigation actions and communicate status to stakeholders
Best Practices
- Confirm impact and blast radius before escalating
- Centralize and share logs, metrics, and traces
- Document actions taken and decisions made in real time
- Coordinate mitigation efforts with clear ownership
- Communicate status and next steps to stakeholders promptly
Example Use Cases
- API latency spike with elevated error rates across multiple services
- Outage in a critical payment path requiring rapid triage
- Partial outage where only a subset of regions is affected
- Sustained degradation detected via dashboards and on-call alerts
- Third-party dependency failure leading to cascading issues
Frequently Asked Questions
Add this skill to your agents