What is incident-triage?

Incident-triage is a structured, repeatable first-response workflow that helps on-call teams assess impact, gather data, and coordinate fixes during production incidents.

What should I collect during triage?

Collect logs, metrics, and traces from the affected services; together they help establish the cause and scope of the incident.

How does this aid communication?

By documenting actions and decisions in real time, it keeps stakeholders informed and aligns teams on next steps.

incident-triage

npx machina-cli add skill tomkraaij/ai-skills-librarian/incident-triage --openclaw

Files (1)

SKILL.md

359 B

Incident Triage

Confirm impact and blast radius.
Collect logs, metrics, and traces.
Coordinate mitigation and communication.

Source

git clone https://github.com/tomkraaij/ai-skills-librarian.git

SKILL.md: https://github.com/tomkraaij/ai-skills-librarian/blob/main/examples/skill-pack-example/skills/incident-triage/SKILL.md

Overview

Incident-triage provides a repeatable first-response workflow for production incidents. Teams quickly confirm impact and blast radius, collect logs, metrics, and traces, and coordinate mitigation and communication, which can shorten mean time to restore (MTTR) and improve reliability.

How This Skill Works

When an incident is detected, the skill guides responders through three core actions: confirm impact and blast radius; collect logs, metrics, and traces; then coordinate mitigation and stakeholder communication. Recording each action as it happens produces a traceable, auditable response and aligns teams on next steps.
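
The three actions and their audit trail can be sketched as a small Python record. This is only an illustration under stated assumptions, not part of the skill: the `TriageRecord` class, its field names, the step labels, and the incident ID are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical step labels mirroring the three actions listed in SKILL.md.
STEPS = ("confirm_impact", "collect_telemetry", "coordinate_mitigation")


@dataclass
class TriageRecord:
    """Illustrative audit trail for one incident (not part of the skill)."""
    incident_id: str
    entries: list = field(default_factory=list)

    def log(self, step: str, note: str, owner: str) -> None:
        # Reject labels outside the three-step workflow so the trail stays consistent.
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),  # UTC timestamp for the timeline
            "step": step,
            "note": note,
            "owner": owner,
        })

    def pending_steps(self) -> list:
        # Steps with no logged entry yet, in workflow order.
        done = {entry["step"] for entry in self.entries}
        return [step for step in STEPS if step not in done]


record = TriageRecord("INC-001")  # made-up incident ID
record.log("confirm_impact", "Checkout errors in two regions", owner="alice")
record.log("collect_telemetry", "Saved gateway logs and latency graphs", owner="bob")
print(record.pending_steps())  # ['coordinate_mitigation']
```

Keeping a timestamp and an owner on every entry is one way to make the timeline reusable in a later post-incident review.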

When to Use It

During a new production outage affecting user-facing services
When blast radius or impacted services are unclear
When logs, metrics, or traces are scattered across systems
During incident war rooms or escalation discussions
When preparing for post-incident reviews and root cause analysis (RCA)

Quick Start

Step 1: Detect the incident and state the suspected impact and blast radius
Step 2: Collect logs, metrics, and traces from affected components
Step 3: Initiate mitigation actions and communicate status to stakeholders
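
For Step 3, a consistent message shape can make status updates easier to scan. The helper below is a hypothetical sketch: the function name, its parameters, and the message layout are assumptions for illustration, not something the skill defines.

```python
def format_status_update(incident_id: str, impact: str,
                         mitigation: str, next_update_minutes: int) -> str:
    """Build a short plain-text status message for stakeholders (illustrative only)."""
    return (
        f"[{incident_id}] Impact: {impact}. "
        f"Mitigation: {mitigation}. "
        f"Next update in {next_update_minutes} min."
    )


# Example with made-up values.
message = format_status_update(
    "INC-001",
    "checkout errors for some users",
    "rolling back latest deploy",
    30,
)
print(message)
```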

Best Practices

Confirm impact and blast radius before escalating
Centralize and share logs, metrics, and traces
Document actions taken and decisions made in real time
Coordinate mitigation efforts with clear ownership
Communicate status and next steps to stakeholders promptly

Example Use Cases

API latency spike with elevated error rates across multiple services
Outage in a critical payment path requiring rapid triage
Partial outage where only a subset of regions is affected
Sustained degradation detected via dashboards and on-call alerts
Third-party dependency failure leading to cascading issues
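
For a case like the partial regional outage above, confirming blast radius can start with a simple threshold check over per-region error rates. The sketch below is an assumption-laden illustration: the function name, the 5% threshold, and the sample data are invented for this example and would need to match your own monitoring.

```python
def affected_regions(error_rates: dict, threshold: float = 0.05) -> list:
    """Return regions whose error rate exceeds the threshold, sorted by name.

    `error_rates` maps a region name to the fraction of failed requests (0.0 to 1.0).
    The 0.05 default is an arbitrary illustrative value, not a recommendation.
    """
    return sorted(region for region, rate in error_rates.items() if rate > threshold)


# Made-up sample: two of three regions are above the 5% threshold.
sample = {"eu-west": 0.21, "us-east": 0.01, "ap-south": 0.08}
print(affected_regions(sample))  # ['ap-south', 'eu-west']
```

A list like this gives responders a concrete first statement of scope, which they can then refine with logs and traces.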
