incident-ready
Install: `npx machina-cli add skill rana/yogananda-skills/incident-ready --openclaw`

Read all project markdown documents and examine the codebase to ground in the project's actual operational state: who's on call, what monitoring exists, what procedures are documented.
Incident Readiness Audit
1. Failure Scenario Coverage
For the specific system, identify the top failure scenarios and evaluate runbook coverage:
| Failure class | Examples | Covered? |
|---|---|---|
| Service outage | Primary database down, hosting provider outage, DNS failure | |
| Data corruption | Bad migration, ingestion error, cache poisoning | |
| Cost runaway | API abuse, unbounded LLM calls, storage explosion | |
| Security incident | Credential leak, unauthorized access, content tampering | |
| Dependency failure | Third-party API down, certificate expiry, rate limit hit | |
| Performance degradation | Slow queries, memory leak, cache miss storm | |
| Content integrity | Wrong text displayed, citations mismatched, search returning irrelevant results | |
| Human error | Deployed wrong branch, deleted production data, misconfigured environment | |
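A coverage check against a table like the one above is easy to script as a quick gap report. This is a sketch, not part of the skill itself; the runbook paths are hypothetical stand-ins for whatever the audit actually discovers in the repo:

```python
# Sketch: report which failure classes lack a documented runbook.
# Failure classes mirror the audit table; the runbook entries are
# hypothetical examples of what an audit might find.
failure_classes = [
    "service outage", "data corruption", "cost runaway",
    "security incident", "dependency failure",
    "performance degradation", "content integrity", "human error",
]

# Failure class -> runbook path, as discovered during the audit (illustrative).
runbooks = {
    "service outage": "runbooks/db-down.md",
    "dependency failure": "runbooks/third-party-api.md",
}

gaps = [fc for fc in failure_classes if fc not in runbooks]
print(f"{len(gaps)} of {len(failure_classes)} failure classes uncovered:")
for fc in gaps:
    print(f"  - {fc}")
```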
2. Severity Classification
Does the project have a severity framework? Evaluate or propose:
| Severity | Definition | Response time | Who responds |
|---|---|---|---|
| SEV-1 | System unavailable, data loss risk, security breach | Immediate | On-call + lead |
| SEV-2 | Major feature degraded, significant user impact | Within 1 hour | On-call |
| SEV-3 | Minor feature degraded, workaround exists | Within 1 business day | Assigned engineer |
| SEV-4 | Cosmetic issue, no functional impact | Next sprint | Backlog |
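Once adopted, a framework like this can be encoded so tooling (a pager integration, a Slack bot) acts on it consistently. The following Python sketch assumes the table above; the field names and the `page_targets` helper are illustrative, not a prescribed schema:

```python
# Sketch: machine-readable severity framework, mirroring the table above.
SEVERITY = {
    "SEV-1": {"response": "immediate", "responders": ["on-call", "lead"]},
    "SEV-2": {"response": "1 hour", "responders": ["on-call"]},
    "SEV-3": {"response": "1 business day", "responders": ["assigned engineer"]},
    "SEV-4": {"response": "next sprint", "responders": ["backlog"]},
}

def page_targets(sev: str) -> list[str]:
    """Who gets paged for a given severity; unknown severities fall back to on-call."""
    return SEVERITY.get(sev, {"responders": ["on-call"]})["responders"]
```

The fallback matters: an unrecognized severity label should still page someone rather than silently page no one.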
3. Escalation Path
- Who is the first responder? Is this documented?
- What's the escalation chain? First responder → team lead → engineering manager → vendor support?
- Are vendor support contacts documented with account numbers and SLAs?
- What happens outside business hours?
- What happens when the primary on-call is unavailable?
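The questions above, in particular what happens when the primary is unavailable, can be answered mechanically if the chain is encoded. A minimal sketch, assuming the example chain from the audit; in practice the rota would come from a scheduling tool (PagerDuty, Opsgenie, a shared calendar), not a hardcoded list:

```python
# Sketch: resolve the current escalation contact, skipping anyone unavailable.
# Chain order follows the example escalation path in the audit.
ESCALATION_CHAIN = ["primary-oncall", "team-lead", "eng-manager", "vendor-support"]

def next_responder(unavailable: set[str]) -> str:
    """First reachable contact in the chain; vendor support is the last resort."""
    for contact in ESCALATION_CHAIN:
        if contact not in unavailable:
            return contact
    return ESCALATION_CHAIN[-1]
```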
4. Runbook Quality
For each documented runbook, evaluate:
- Detection — How is this failure detected? Alert? User report? Monitoring gap?
- Diagnosis — What commands or dashboards confirm the problem?
- Remediation — Step-by-step fix. Can someone unfamiliar with the system follow it?
- Verification — How do you confirm the fix worked?
- Communication — Who needs to be informed? Users? Stakeholders? Vendors?
- Post-incident — What data should be preserved for review?
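A runbook that passes these six checks might follow a skeleton like this; the headings mirror the checks, and the alert names, links, and commands are placeholders to be filled per scenario:

```markdown
# Runbook: <failure scenario>

## Detection
- Alert: <alert name>, or user report, or known monitoring gap

## Diagnosis
- Dashboard: <link>
- Confirming command: `<command>`

## Remediation
1. <step written for someone unfamiliar with the system>

## Verification
- <check that confirms the fix worked>

## Communication
- Notify: users / stakeholders / vendors

## Post-incident
- Preserve: <logs, metrics, timeline>
```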
5. Communication Templates
Are templates ready for:
- Internal incident notification (Slack, email)
- External status update (status page, user communication)
- Post-incident summary (what happened, impact, remediation, prevention)
- Vendor communication (support ticket with relevant diagnostics)
6. Recovery Procedures
- Database restore — Can you restore from backup? What's the procedure? Has it been tested?
- Deployment rollback — One-command rollback? Or multi-step?
- Configuration rollback — Can Terraform state be reverted? Environment variables restored?
- Data reconciliation — After a partial failure, how do you reconcile inconsistent state?
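Data reconciliation is the least scriptable of these in advance, but the shape is usually the same: diff record checksums between the authoritative store and the degraded one, then repair, backfill, or delete. A sketch under assumed in-memory stand-ins; real stores would be the primary database and, say, a search index or cache:

```python
# Sketch: reconcile two stores after a partial failure by diffing
# per-record checksums. Store contents are illustrative stand-ins.
import hashlib

def checksum(record: str) -> str:
    return hashlib.sha256(record.encode()).hexdigest()

primary = {"doc-1": "text a", "doc-2": "text b", "doc-3": "text c"}
replica = {"doc-1": "text a", "doc-2": "stale text", "doc-4": "orphan"}

# Present in both but content differs -> repair from primary.
mismatched = sorted(
    k for k in primary.keys() & replica.keys()
    if checksum(primary[k]) != checksum(replica[k])
)
# In primary only -> backfill; in replica only -> delete.
missing = sorted(primary.keys() - replica.keys())
orphaned = sorted(replica.keys() - primary.keys())

print(f"repair: {mismatched}, backfill: {missing}, delete: {orphaned}")
```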
7. Post-Incident Review Process
- Is there a defined process for post-incident review?
- Who participates? Is it blameless?
- Where are findings documented?
- How are action items tracked to completion?
- Is there a pattern library of past incidents?
8. Operational Knowledge Distribution
- Bus factor — How many people can handle each failure scenario?
- Is operational knowledge documented or tribal?
- Can the on-call person resolve SEV-1 without calling someone else?
- Are vendor credentials and access shared securely (not in one person's head)?
Focus area: $ARGUMENTS
For every gap:
- What's missing — Specific scenario or procedure
- Risk if unaddressed — What happens when this failure occurs without a runbook?
- Proposed remediation — Draft the runbook outline, escalation path, or template
- Priority — Must-have before production / Should-have within 30 days / Nice-to-have
Present as a readiness assessment. No changes to files — document only.
Output Management
Hard constraints:
- Segment output into groups of up to 10 findings, with SEV-1 scenario gaps first.
- If no $ARGUMENTS focus area is given, evaluate only sections 1 (Failure Scenarios), 2 (Severity Classification), and 3 (Escalation Path) — the core triage chain.
- Write each segment incrementally. Do not accumulate a single large response.
- After completing each segment, continue immediately to the next. Do not wait for user input.
- Continue until ALL findings are reported. State the total count when complete.
- If the analysis surface is too large to complete in one session, state what was covered and what remains.
Document reading strategy:
- Read project documentation selectively. Start with README, deployment docs, and any existing runbooks.
- Only examine codebase sections relevant to the focused area (monitoring config, error handling, infrastructure-as-code).
What's the most likely first real incident — and is the team ready for it?
What operational knowledge exists only in someone's head?
Source
`git clone https://github.com/rana/yogananda-skills` (skill file: `skills/incident-ready/SKILL.md`)
Overview
Incident-ready evaluates runbook coverage, escalation paths, severity classification, communication templates, rollback procedures, and post-incident review process. It guides transitions from development to production and identifies gaps at phase boundaries or after real incidents.
How This Skill Works
It conducts a structured audit across eight areas: failure scenario coverage, severity framework, escalation path, runbook quality, communication templates, recovery procedures, post-incident review, and operational knowledge distribution. Teams collect evidence like on-call rosters, monitoring, and documented procedures, then produce a prioritized action plan to close gaps.
When to Use It
- During transitions from development to production (pre-prod cutover)
- At phase boundaries or major architecture changes
- After a real incident reveals gaps in readiness
- Before major releases or launches of critical features
- During on-call drills or game-day exercises to validate readiness
Quick Start
- Step 1: Define audit scope and collect runbooks, escalation contacts, and templates
- Step 2: Assess each failure scenario against runbook coverage and severity definitions
- Step 3: Identify gaps, assign owners, implement changes, and schedule a post-incident review
Best Practices
- Map top failure scenarios to concrete runbooks and coverage checklists
- Define a severity framework with explicit response times and responders
- Document escalation paths, vendor contacts, and after-hours procedures
- Vet runbooks with validation tests covering detection, diagnosis, remediation, and verification
- Capture lessons in a post-incident review and maintain a living knowledge base
Example Use Cases
- Before prod release, a gap in DNS failure runbooks is found and a rollback procedure is added
- An incident reveals missing escalation paths; a chain from on-call through to vendor support is documented
- A data pipeline outage uncovers the absence of a post-incident review process, leading to a new review template
- Deployment drift prompts the addition of deployment rollback and configuration rollback steps
- An incident drill highlights under-documented internal incident notifications and templated updates