
incident-ready

npx machina-cli add skill rana/yogananda-skills/incident-ready --openclaw

Read all project markdown documents and examine the codebase to ground the audit in the project's actual operational state — who's on call, what monitoring exists, what procedures are documented.

Incident Readiness Audit

1. Failure Scenario Coverage

For the system under audit, identify the top failure scenarios and evaluate runbook coverage:

| Failure class | Examples | Covered? |
| --- | --- | --- |
| Service outage | Primary database down, hosting provider outage, DNS failure | |
| Data corruption | Bad migration, ingestion error, cache poisoning | |
| Cost runaway | API abuse, unbounded LLM calls, storage explosion | |
| Security incident | Credential leak, unauthorized access, content tampering | |
| Dependency failure | Third-party API down, certificate expiry, rate limit hit | |
| Performance degradation | Slow queries, memory leak, cache miss storm | |
| Content integrity | Wrong text displayed, citations mismatched, search returning irrelevant results | |
| Human error | Deployed wrong branch, deleted production data, misconfigured environment | |
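
A coverage tally like the one this table asks for can be sketched as a small cross-reference. This is a minimal sketch with made-up data: the failure classes come from the table above, but the runbook filenames are hypothetical examples, not project files.

```python
# Sketch: cross-reference failure classes against documented runbooks.
# FAILURE_CLASSES mirrors the table above; runbook names are hypothetical.
FAILURE_CLASSES = [
    "Service outage", "Data corruption", "Cost runaway", "Security incident",
    "Dependency failure", "Performance degradation", "Content integrity",
    "Human error",
]

# Map each runbook to the failure class it covers (example data).
runbooks = {
    "restore-primary-db.md": "Service outage",
    "rollback-bad-migration.md": "Data corruption",
}

def coverage_report(runbooks: dict) -> dict:
    """Return failure class -> list of runbooks covering it ([] = gap)."""
    report = {cls: [] for cls in FAILURE_CLASSES}
    for name, cls in runbooks.items():
        report[cls].append(name)
    return report

gaps = [cls for cls, docs in coverage_report(runbooks).items() if not docs]
print(f"{len(gaps)} of {len(FAILURE_CLASSES)} failure classes have no runbook")
```

The uncovered classes in the report are exactly the rows the audit should flag as gaps.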

2. Severity Classification

Does the project have a severity framework? Evaluate or propose:

| Severity | Definition | Response time | Who responds |
| --- | --- | --- | --- |
| SEV-1 | System unavailable, data loss risk, security breach | Immediate | On-call + lead |
| SEV-2 | Major feature degraded, significant user impact | Within 1 hour | On-call |
| SEV-3 | Minor feature degraded, workaround exists | Within 1 business day | Assigned engineer |
| SEV-4 | Cosmetic issue, no functional impact | Next sprint | Backlog |
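
A framework like the table above becomes more useful when response-time targets can be checked mechanically. A minimal sketch, assuming the targets in the table (SEV-4's "next sprint" has no hard deadline and is modeled as no target):

```python
# Sketch: the severity framework above as data, so response-time targets
# can be checked programmatically. Timings mirror the table.
from datetime import timedelta

SEVERITIES = {
    "SEV-1": {"response": timedelta(0), "responders": ["on-call", "lead"]},
    "SEV-2": {"response": timedelta(hours=1), "responders": ["on-call"]},
    "SEV-3": {"response": timedelta(days=1), "responders": ["assigned engineer"]},
    "SEV-4": {"response": None, "responders": ["backlog"]},  # next sprint
}

def response_breached(sev: str, elapsed: timedelta) -> bool:
    """True if elapsed time since detection exceeds the severity's target."""
    target = SEVERITIES[sev]["response"]
    return target is not None and elapsed > target

print(response_breached("SEV-2", timedelta(minutes=90)))  # 90 min > 1 h target
```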

3. Escalation Path

  • Who is the first responder? Is this documented?
  • What's the escalation chain? First responder → team lead → engineering manager → vendor support?
  • Are vendor support contacts documented with account numbers and SLAs?
  • What happens outside business hours?
  • What happens when the primary on-call is unavailable?
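
The escalation questions above reduce to a fallback walk down the chain. A sketch under the four-step chain named earlier; the contact names are placeholders, not real roles in any particular project:

```python
# Sketch: walking an escalation chain when earlier contacts are unavailable.
# The chain mirrors the example above; availability data is hypothetical.
ESCALATION_CHAIN = [
    "first-responder", "team-lead", "engineering-manager", "vendor-support",
]

def next_responder(unavailable: set) -> str:
    """Return the first reachable contact in the escalation chain."""
    for contact in ESCALATION_CHAIN:
        if contact not in unavailable:
            return contact
    raise RuntimeError("No one in the escalation chain is reachable")

print(next_responder({"first-responder"}))  # → team-lead
```

The "what happens when everyone is unavailable" branch is exactly the case the audit should confirm is documented, not just assumed.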

4. Runbook Quality

For each documented runbook, evaluate:

  • Detection — How is this failure detected? Alert? User report? Monitoring gap?
  • Diagnosis — What commands or dashboards confirm the problem?
  • Remediation — Step-by-step fix. Can someone unfamiliar with the system follow it?
  • Verification — How do you confirm the fix worked?
  • Communication — Who needs to be informed? Users? Stakeholders? Vendors?
  • Post-incident — What data should be preserved for review?
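
The six criteria above lend themselves to a mechanical completeness check. A minimal sketch that scans a runbook draft for the section names; the draft text is a made-up example:

```python
# Sketch: check a runbook draft for the six sections the audit expects.
REQUIRED_SECTIONS = ["Detection", "Diagnosis", "Remediation",
                     "Verification", "Communication", "Post-incident"]

def missing_sections(runbook_text: str) -> list:
    """Return required section names absent from a runbook's text."""
    lowered = runbook_text.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]

draft = """## Detection
Alert fires on health-check failure...
## Remediation
Restart the service...
"""
print(missing_sections(draft))
```

A substring check like this only proves the headings exist, not that the steps can be followed; the "can someone unfamiliar follow it" test still requires a human walkthrough.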

5. Communication Templates

Are templates ready for:

  • Internal incident notification (Slack, email)
  • External status update (status page, user communication)
  • Post-incident summary (what happened, impact, remediation, prevention)
  • Vendor communication (support ticket with relevant diagnostics)

6. Recovery Procedures

  • Database restore — Can you restore from backup? What's the procedure? Has it been tested?
  • Deployment rollback — One-command rollback? Or multi-step?
  • Configuration rollback — Can Terraform state be reverted? Environment variables restored?
  • Data reconciliation — After a partial failure, how do you reconcile inconsistent state?
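
The reconciliation question can be framed as a diff between the source of truth and a downstream copy. A sketch with hypothetical record IDs:

```python
# Sketch: reconcile state after a partial failure by diffing record IDs
# between the source of truth and a replica. IDs are example data.
def reconcile(source_ids: set, replica_ids: set) -> dict:
    """Classify divergence: records to re-send and records to purge."""
    return {
        "missing_in_replica": source_ids - replica_ids,   # re-send these
        "orphaned_in_replica": replica_ids - source_ids,  # purge or audit these
    }

diff = reconcile({"a1", "a2", "a3"}, {"a2", "a3", "zz"})
print(diff)
```

The real work is deciding which side is authoritative for each divergent record — the diff only tells you where to look.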

7. Post-Incident Review Process

  • Is there a defined process for post-incident review?
  • Who participates? Is it blameless?
  • Where are findings documented?
  • How are action items tracked to completion?
  • Is there a pattern library of past incidents?

8. Operational Knowledge Distribution

  • Bus factor — How many people can handle each failure scenario?
  • Is operational knowledge documented or tribal?
  • Can the on-call person resolve SEV-1 without calling someone else?
  • Are vendor credentials and access shared securely (not in one person's head)?

Focus area: $ARGUMENTS

For every gap:

  1. What's missing — Specific scenario or procedure
  2. Risk if unaddressed — What happens when this failure occurs without a runbook?
  3. Proposed remediation — Draft the runbook outline, escalation path, or template
  4. Priority — Must-have before production / Should-have within 30 days / Nice-to-have
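
The priority tiers above give findings a natural sort order. A minimal sketch; the findings themselves are hypothetical examples:

```python
# Sketch: order audit findings by the three priority tiers above.
PRIORITY_ORDER = {"must-have": 0, "should-have": 1, "nice-to-have": 2}

findings = [  # hypothetical findings
    {"gap": "No status-page template", "priority": "nice-to-have"},
    {"gap": "No database restore runbook", "priority": "must-have"},
    {"gap": "Vendor SLAs undocumented", "priority": "should-have"},
]

ordered = sorted(findings, key=lambda f: PRIORITY_ORDER[f["priority"]])
print([f["gap"] for f in ordered])
```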

Present as a readiness assessment. No changes to files — document only.

Output Management

Hard constraints:

  • Segment output into groups of at most 10 findings, with SEV-1 scenario gaps reported first.
  • If no $ARGUMENTS focus area is given, evaluate only sections 1 (Failure Scenarios), 2 (Severity Classification), and 3 (Escalation Path) — the core triage chain.
  • Write each segment incrementally. Do not accumulate a single large response.
  • After completing each segment, continue immediately to the next. Do not wait for user input.
  • Continue until ALL findings are reported. State the total count when complete.
  • If the analysis surface is too large to complete in one session, state what was covered and what remains.

Document reading strategy:

  • Read project documentation selectively. Start with README, deployment docs, and any existing runbooks.
  • Only examine codebase sections relevant to the focused area (monitoring config, error handling, infrastructure-as-code).

What's the most likely first real incident — and is the team ready for it?

What operational knowledge exists only in someone's head?

Source

git clone https://github.com/rana/yogananda-skills

The skill file lives at skills/incident-ready/SKILL.md.

Overview

Incident-ready evaluates runbook coverage, escalation paths, severity classification, communication templates, rollback procedures, and post-incident review process. It guides transitions from development to production and identifies gaps at phase boundaries or after real incidents.

How This Skill Works

It conducts a structured audit across eight areas: failure scenario coverage, severity framework, escalation path, runbook quality, communication templates, recovery procedures, post-incident review, and operational knowledge distribution. It collects evidence such as on-call rosters, monitoring configuration, and documented procedures, then produces a prioritized action plan to close the gaps it finds.

When to Use It

  • During transitions from development to production (pre-prod cutover)
  • At phase boundaries or major architecture changes
  • After a real incident reveals gaps in readiness
  • Before major releases or launches of critical features
  • During on-call drills or game-day exercises to validate readiness

Quick Start

  1. Define the audit scope and collect runbooks, escalation contacts, and templates
  2. Assess each failure scenario against runbook coverage and severity definitions
  3. Identify gaps, assign owners, implement changes, and schedule a post-incident review

Best Practices

  • Map top failure scenarios to concrete runbooks and coverage checklists
  • Define a severity framework with explicit response times and responders
  • Document escalation paths, vendor contacts, and after-hours procedures
  • Vet runbooks with validation tests covering detection, diagnosis, remediation, and verification
  • Capture lessons in a post-incident review and maintain a living knowledge base

Example Use Cases

  • Before a production release, a gap in DNS failure runbooks is found and a rollback procedure is added
  • An incident reveals a missing escalation path; an on-call-to-vendor contact chain is documented
  • A data pipeline outage uncovers the absence of a post-incident review process, leading to a new review template
  • Deployment drift prompts the addition of deployment rollback and configuration rollback steps
  • An incident drill highlights under-documented internal incident notifications, leading to templated status updates
