incident-ready
Install: `npx machina-cli add skill rana/yogananda-skills/incident-ready --openclaw`

Read all project markdown documents and examine the codebase to ground in the project's actual operational state: who's on call, what monitoring exists, what procedures are documented.
Incident Readiness Audit
1. Failure Scenario Coverage
For the specific system, identify the top failure scenarios and evaluate runbook coverage:
| Failure class | Examples | Covered? |
|---|---|---|
| Service outage | Primary database down, hosting provider outage, DNS failure | |
| Data corruption | Bad migration, ingestion error, cache poisoning | |
| Cost runaway | API abuse, unbounded LLM calls, storage explosion | |
| Security incident | Credential leak, unauthorized access, content tampering | |
| Dependency failure | Third-party API down, certificate expiry, rate limit hit | |
| Performance degradation | Slow queries, memory leak, cache miss storm | |
| Content integrity | Wrong text displayed, citations mismatched, search returning irrelevant results | |
| Human error | Deployed wrong branch, deleted production data, misconfigured environment | |
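A coverage check against a table like the one above is easy to script as a quick gap report. This is a sketch, not part of the skill itself; the runbook paths are hypothetical stand-ins for whatever the audit actually discovers in the repo:

```python
# Sketch: report which failure classes lack a documented runbook.
# Failure classes mirror the audit table; the runbook entries are
# hypothetical examples of what an audit might find.
failure_classes = [
    "service outage", "data corruption", "cost runaway",
    "security incident", "dependency failure",
    "performance degradation", "content integrity", "human error",
]

# Failure class -> runbook path, as discovered during the audit (illustrative).
runbooks = {
    "service outage": "runbooks/db-down.md",
    "dependency failure": "runbooks/third-party-api.md",
}

gaps = [fc for fc in failure_classes if fc not in runbooks]
print(f"{len(gaps)} of {len(failure_classes)} failure classes uncovered:")
for fc in gaps:
    print(f"  - {fc}")
```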
2. Severity Classification
Does the project have a severity framework? Evaluate or propose:
| Severity | Definition | Response time | Who responds |
|---|---|---|---|
| SEV-1 | System unavailable, data loss risk, security breach | Immediate | On-call + lead |
| SEV-2 | Major feature degraded, significant user impact | Within 1 hour | On-call |
| SEV-3 | Minor feature degraded, workaround exists | Within 1 business day | Assigned engineer |
| SEV-4 | Cosmetic issue, no functional impact | Next sprint | Backlog |
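Once adopted, a framework like this can be encoded so tooling (a pager integration, a Slack bot) acts on it consistently. The following Python sketch assumes the table above; the field names and the `page_targets` helper are illustrative, not a prescribed schema:

```python
# Sketch: machine-readable severity framework, mirroring the table above.
SEVERITY = {
    "SEV-1": {"response": "immediate", "responders": ["on-call", "lead"]},
    "SEV-2": {"response": "1 hour", "responders": ["on-call"]},
    "SEV-3": {"response": "1 business day", "responders": ["assigned engineer"]},
    "SEV-4": {"response": "next sprint", "responders": ["backlog"]},
}

def page_targets(sev: str) -> list[str]:
    """Who gets paged for a given severity; unknown severities fall back to on-call."""
    return SEVERITY.get(sev, {"responders": ["on-call"]})["responders"]
```

The fallback matters: an unrecognized severity label should still page someone rather than silently page no one.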
3. Escalation Path
- Who is the first responder? Is this documented?
- What's the escalation chain? First responder → team lead → engineering manager → vendor support?
- Are vendor support contacts documented with account numbers and SLAs?
- What happens outside business hours?
- What happens when the primary on-call is unavailable?
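The questions above, in particular what happens when the primary is unavailable, can be answered mechanically if the chain is encoded. A minimal sketch, assuming the example chain from the audit; in practice the rota would come from a scheduling tool (PagerDuty, Opsgenie, a shared calendar), not a hardcoded list:

```python
# Sketch: resolve the current escalation contact, skipping anyone unavailable.
# Chain order follows the example escalation path in the audit.
ESCALATION_CHAIN = ["primary-oncall", "team-lead", "eng-manager", "vendor-support"]

def next_responder(unavailable: set[str]) -> str:
    """First reachable contact in the chain; vendor support is the last resort."""
    for contact in ESCALATION_CHAIN:
        if contact not in unavailable:
            return contact
    return ESCALATION_CHAIN[-1]
```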
4. Runbook Quality
For each documented runbook, evaluate:
- Detection — How is this failure detected? Alert? User report? Monitoring gap?
- Diagnosis — What commands or dashboards confirm the problem?
- Remediation — Step-by-step fix. Can someone unfamiliar with the system follow it?
- Verification — How do you confirm the fix worked?
- Communication — Who needs to be informed? Users? Stakeholders? Vendors?
- Post-incident — What data should be preserved for review?
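A runbook that passes these six checks might follow a skeleton like this; the headings mirror the checks, and the alert names, links, and commands are placeholders to be filled per scenario:

```markdown
# Runbook: <failure scenario>

## Detection
- Alert: <alert name>, or user report, or known monitoring gap

## Diagnosis
- Dashboard: <link>
- Confirming command: `<command>`

## Remediation
1. <step written for someone unfamiliar with the system>

## Verification
- <check that confirms the fix worked>

## Communication
- Notify: users / stakeholders / vendors

## Post-incident
- Preserve: <logs, metrics, timeline>
```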
5. Communication Templates
Are templates ready for:
- Internal incident notification (Slack, email)
- External status update (status page, user communication)
- Post-incident summary (what happened, impact, remediation, prevention)
- Vendor communication (support ticket with relevant diagnostics)
6. Recovery Procedures
- Database restore — Can you restore from backup? What's the procedure? Has it been tested?
- Deployment rollback — One-command rollback? Or multi-step?
- Configuration rollback — Can Terraform state be reverted? Environment variables restored?
- Data reconciliation — After a partial failure, how do you reconcile inconsistent state?
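Data reconciliation is the least scriptable of these in advance, but the shape is usually the same: diff record checksums between the authoritative store and the degraded one, then repair, backfill, or delete. A sketch under assumed in-memory stand-ins; real stores would be the primary database and, say, a search index or cache:

```python
# Sketch: reconcile two stores after a partial failure by diffing
# per-record checksums. Store contents are illustrative stand-ins.
import hashlib

def checksum(record: str) -> str:
    return hashlib.sha256(record.encode()).hexdigest()

primary = {"doc-1": "text a", "doc-2": "text b", "doc-3": "text c"}
replica = {"doc-1": "text a", "doc-2": "stale text", "doc-4": "orphan"}

# Present in both but content differs -> repair from primary.
mismatched = sorted(
    k for k in primary.keys() & replica.keys()
    if checksum(primary[k]) != checksum(replica[k])
)
# In primary only -> backfill; in replica only -> delete.
missing = sorted(primary.keys() - replica.keys())
orphaned = sorted(replica.keys() - primary.keys())

print(f"repair: {mismatched}, backfill: {missing}, delete: {orphaned}")
```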
7. Post-Incident Review Process
- Is there a defined process for post-incident review?
- Who participates? Is it blameless?
- Where are findings documented?
- How are action items tracked to completion?
- Is there a pattern library of past incidents?
8. Operational Knowledge Distribution
- Bus factor — How many people can handle each failure scenario?
- Is operational knowledge documented or tribal?
- Can the on-call person resolve SEV-1 without calling someone else?
- Are vendor credentials and access shared securely (not in one person's head)?
Focus area: $ARGUMENTS
For every gap:
- What's missing — Specific scenario or procedure
- Risk if unaddressed — What happens when this failure occurs without a runbook?
- Proposed remediation — Draft the runbook outline, escalation path, or template
- Priority — Must-have before production / Should-have within 30 days / Nice-to-have
Present as a readiness assessment. No changes to files — document only.
Output Management
Hard constraints:
- Segment output into groups of up to 10 findings, with SEV-1 scenario gaps first.
- If no $ARGUMENTS focus area is given, evaluate only sections 1 (Failure Scenarios), 2 (Severity Classification), and 3 (Escalation Path) — the core triage chain.
- Write each segment incrementally. Do not accumulate a single large response.
- After completing each segment, continue immediately to the next. Do not wait for user input.
- Continue until ALL findings are reported. State the total count when complete.
- If the analysis surface is too large to complete in one session, state what was covered and what remains.
Document reading strategy:
- Read project documentation selectively. Start with README, deployment docs, and any existing runbooks.
- Only examine codebase sections relevant to the focused area (monitoring config, error handling, infrastructure-as-code).
What's the most likely first real incident — and is the team ready for it?
What operational knowledge exists only in someone's head?
Source
`git clone https://github.com/rana/yogananda-skills` (skill file: `skills/incident-ready/SKILL.md`)
Overview
Incident-ready evaluates runbook coverage, escalation paths, severity classification, communication templates, rollback procedures, and post-incident review process. It guides transitions from development to production and identifies gaps at phase boundaries or after real incidents.
How This Skill Works
It conducts a structured audit across eight areas: failure scenario coverage, severity framework, escalation path, runbook quality, communication templates, recovery procedures, post-incident review, and operational knowledge distribution. Teams collect evidence like on-call rosters, monitoring, and documented procedures, then produce a prioritized action plan to close gaps.
When to Use It
- During transitions from development to production (pre-prod cutover)
- At phase boundaries or major architecture changes
- After a real incident reveals gaps in readiness
- Before major releases or launches of critical features
- During on-call drills or game-day exercises to validate readiness
Quick Start
- Step 1: Define audit scope and collect runbooks, escalation contacts, and templates
- Step 2: Assess each failure scenario against runbook coverage and severity definitions
- Step 3: Identify gaps, assign owners, implement changes, and schedule a post-incident review
Best Practices
- Map top failure scenarios to concrete runbooks and coverage checklists
- Define a severity framework with explicit response times and responders
- Document escalation paths, vendor contacts, and after-hours procedures
- Vet runbooks with validation tests covering detection, diagnosis, remediation, and verification
- Capture lessons in a post-incident review and maintain a living knowledge base
Example Use Cases
- Before prod release, a gap in DNS failure runbooks is found and a rollback procedure is added
- An incident reveals missing escalation paths; a chain from on-call through to vendor support is documented
- A data pipeline outage uncovers the absence of a post-incident review process, leading to a new review template
- Deployment drift prompts the addition of deployment rollback and configuration rollback steps
- An incident drill highlights under-documented internal incident notifications and templated updates