incident-triage

npx machina-cli add skill tomkraaij/ai-skills-librarian/incident-triage --openclaw
Files (1)
SKILL.md
359 B

Incident Triage

  1. Confirm impact and blast radius.
  2. Collect logs, metrics, and traces.
  3. Coordinate mitigation and communication.

Source

https://github.com/tomkraaij/ai-skills-librarian/blob/main/examples/skill-pack-example/skills/incident-triage/SKILL.md

Overview

The incident-triage skill provides a repeatable first-response workflow for production incidents. It has three steps: confirm impact and blast radius; collect logs, metrics, and traces; and coordinate mitigation and communication. Running the same steps on every incident is intended to shorten mean time to restore (MTTR) and make responses more consistent.

How This Skill Works

When an incident is detected, the skill guides responders through three core actions: confirm impact and blast radius; collect logs, metrics, and traces; then coordinate mitigation and stakeholder communication. Working through the steps in order leaves a traceable, auditable record of the response and aligns teams on the next steps.
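The three actions above can be sketched as an ordered checklist. This is a minimal illustration, not the skill's implementation: the step names come from the skill text, while `TriageChecklist`, its methods, and the incident ID are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Step names are taken from the skill text; everything else is hypothetical.
TRIAGE_STEPS = (
    "Confirm impact and blast radius",
    "Collect logs, metrics, and traces",
    "Coordinate mitigation and communication",
)


@dataclass
class TriageChecklist:
    """Tracks which triage steps are done and enforces the documented order."""

    incident_id: str
    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next pending step, or None once all steps are done."""
        if len(self.completed) < len(TRIAGE_STEPS):
            return TRIAGE_STEPS[len(self.completed)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step as done; reject steps attempted out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.completed.append(step)


checklist = TriageChecklist(incident_id="INC-0001")  # hypothetical incident ID
checklist.complete(TRIAGE_STEPS[0])
print(checklist.next_step())  # prints: Collect logs, metrics, and traces
```

Enforcing the order in code is one design choice among several; a team that collects telemetry while impact is still being confirmed might prefer to track the steps without ordering them.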

When to Use It

  • During a new production outage affecting user-facing services
  • When blast radius or impacted services are unclear
  • When logs, metrics, or traces are scattered across systems
  • During incident war rooms or escalation discussions
  • When preparing for post-incident reviews and root cause analysis (RCA)

Quick Start

  1. Detect the incident and call out suspected impact and blast radius
  2. Collect logs, metrics, and traces from affected components
  3. Initiate mitigation actions and communicate status to stakeholders

Best Practices

  • Confirm impact and blast radius before escalating
  • Centralize and share logs, metrics, and traces
  • Document actions taken and decisions made in real time
  • Coordinate mitigation efforts with clear ownership
  • Communicate status and next steps to stakeholders promptly
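Two of the practices above, documenting actions in real time and assigning clear ownership, can be combined in a small timeline log. The sketch below is one possible shape, assuming an in-memory list is enough; `TimelineEntry`, `IncidentTimeline`, and the sample owners and notes are hypothetical and not part of the skill.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class TimelineEntry:
    at: datetime   # when the action or decision happened (UTC)
    owner: str     # who owns the action
    note: str      # what was done or decided


@dataclass
class IncidentTimeline:
    """Append-only log of actions and decisions taken during an incident."""

    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, owner: str, note: str, at: Optional[datetime] = None) -> None:
        """Add an entry, timestamped now (UTC) unless a time is supplied."""
        when = at if at is not None else datetime.now(timezone.utc)
        self.entries.append(TimelineEntry(when, owner, note))

    def status_update(self) -> str:
        """Render the timeline as plain text suitable for a status message."""
        return "\n".join(
            f"{e.at:%H:%M} UTC [{e.owner}] {e.note}" for e in self.entries
        )


timeline = IncidentTimeline()
timeline.record(
    "alice",
    "Rolled back the latest deploy",
    at=datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
)
print(timeline.status_update())  # prints: 12:05 UTC [alice] Rolled back the latest deploy
```

The same entries can later feed a post-incident review, since each one already carries a timestamp and an owner.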

Example Use Cases

  • API latency spike with elevated error rates across multiple services
  • Outage in a critical payment path requiring rapid triage
  • Partial outage where only a subset of regions is affected
  • Sustained degradation detected via dashboards and on-call alerts
  • Third-party dependency failure leading to cascading issues
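For a case like the partial regional outage above, a first estimate of blast radius could come from per-region error rates. The sketch below assumes those rates have already been pulled from a metrics system; the function names, the 5% threshold, and the sample region names and values are hypothetical choices for illustration.

```python
from typing import Dict, List


def affected_regions(error_rates: Dict[str, float], threshold: float = 0.05) -> List[str]:
    """Return regions whose error rate exceeds the threshold, sorted by name."""
    return sorted(region for region, rate in error_rates.items() if rate > threshold)


def describe_blast_radius(error_rates: Dict[str, float], threshold: float = 0.05) -> str:
    """Summarize whether the incident looks global, partial, or below threshold."""
    hit = affected_regions(error_rates, threshold)
    if not hit:
        return "no regions above threshold"
    if len(hit) == len(error_rates):
        return "global: all regions affected"
    return f"partial: {len(hit)}/{len(error_rates)} regions affected ({', '.join(hit)})"


# Hypothetical sample data: fraction of failed requests per region.
rates = {"us-east": 0.21, "eu-west": 0.01, "ap-south": 0.08}
print(describe_blast_radius(rates))
# prints: partial: 2/3 regions affected (ap-south, us-east)
```

A fixed threshold is a simplification; comparing each region against its own baseline would likely be more robust in practice.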
