Disaster Recovery
Implement disaster recovery strategies and procedures.
DR Metrics
```yaml
recovery_metrics:
  RTO:  # Recovery Time Objective
    - Maximum acceptable downtime
    - How long restoring service may take
  RPO:  # Recovery Point Objective
    - Maximum acceptable data loss
    - How far back in time the last recoverable copy of the data may be
```
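To make the two metrics concrete, here is a minimal sketch that measures what one incident actually achieved and compares it with the targets. The `RecoveryTargets` class, the function names, and the sample timestamps are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    """Hypothetical container for the two targets defined above."""
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, expressed as a time window


def measure_recovery(outage_start: datetime,
                     service_restored: datetime,
                     last_recoverable_point: datetime) -> tuple[timedelta, timedelta]:
    """Return (downtime, data-loss window) for a single incident."""
    downtime = service_restored - outage_start
    data_loss = outage_start - last_recoverable_point
    return downtime, data_loss


def meets_targets(targets: RecoveryTargets,
                  downtime: timedelta,
                  data_loss: timedelta) -> dict[str, bool]:
    """Compare measured values against the RTO and RPO targets."""
    return {
        "rto_met": downtime <= targets.rto,
        "rpo_met": data_loss <= targets.rpo,
    }


# Example incident (made-up timestamps): 4 h RTO target, 1 h RPO target.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))
downtime, data_loss = measure_recovery(
    outage_start=datetime(2024, 1, 10, 12, 0),
    service_restored=datetime(2024, 1, 10, 14, 30),
    last_recoverable_point=datetime(2024, 1, 10, 10, 30),
)
result = meets_targets(targets, downtime, data_loss)
print(result)  # {'rto_met': True, 'rpo_met': False}
```

In this example service returned within the RTO, but the last recoverable copy was 90 minutes old, so the RPO was missed.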
DR Strategies
| Strategy | RTO | RPO | Cost |
|---|---|---|---|
| Backup & Restore | Hours | Hours | $ |
| Pilot Light | Minutes-Hours | Minutes | $$ |
| Warm Standby | Minutes | Seconds | $$$ |
| Multi-Site Active | Near-zero | Near-zero | $$$$ |
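One way to read the table is as a lookup: pick the cheapest strategy whose worst-case RTO and RPO still fit the targets. The sketch below does that. The numeric bounds are assumptions invented to stand in for the table's qualitative labels ("Hours", "Minutes", "Seconds"); real values depend on the workload and should be measured.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    worst_rto: timedelta
    worst_rpo: timedelta
    cost_tier: int  # number of "$" signs in the table


# ASSUMPTION: illustrative numbers only, not taken from the source table.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    Strategy("Pilot Light", timedelta(hours=4), timedelta(minutes=30), 2),
    Strategy("Warm Standby", timedelta(minutes=30), timedelta(seconds=60), 3),
    Strategy("Multi-Site Active", timedelta(seconds=60), timedelta(seconds=5), 4),
]


def cheapest_strategy(rto: timedelta, rpo: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy that satisfies both targets, if any."""
    candidates = [
        s for s in STRATEGIES
        if s.worst_rto <= rto and s.worst_rpo <= rpo
    ]
    return min(candidates, key=lambda s: s.cost_tier, default=None)


choice = cheapest_strategy(rto=timedelta(hours=8), rpo=timedelta(hours=1))
print(choice.name if choice else "no strategy meets these targets")
```

Under these assumed bounds, an 8-hour RTO with a 1-hour RPO rules out Backup & Restore on the RPO alone, so the sketch selects Pilot Light.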
AWS Multi-Region
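```bash
# Cross-region RDS read replica, created in the DR region (us-west-2).
# A cross-region source must be given as an ARN; replace the example account ID.
aws rds create-db-instance-read-replica \
  --db-instance-identifier dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-db \
  --source-region us-east-1 \
  --region us-west-2

# S3 cross-region replication. Versioning must be enabled on both buckets;
# replication.json holds the IAM role and the replication rules.
aws s3api put-bucket-replication \
  --bucket source-bucket \
  --replication-configuration file://replication.json
```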
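The S3 command above reads its rules from `replication.json`, which the source does not show. As a sketch of what that file might contain, the snippet below builds a single-rule configuration that replicates every object. The role ARN, account ID, destination bucket name, and rule ID are all hypothetical placeholders.

```python
import json


def build_replication_config(role_arn: str,
                             destination_bucket: str,
                             rule_id: str = "dr-replication") -> dict:
    """Build a one-rule S3 replication configuration that covers all objects."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},  # empty filter: the rule applies to every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/s3-replication-role",  # hypothetical
    destination_bucket="dr-bucket",  # hypothetical
)
replication_json = json.dumps(config, indent=2)
print(replication_json)
```

Saving the printed output as `replication.json` gives the CLI call a file to load. The destination bucket would live in the DR region, and the role needs permission to read from the source bucket and write to the destination.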
DR Testing
```yaml
dr_test_schedule:
  tabletop: Quarterly
  component_failover: Monthly
  full_failover: Annually
```
Test checklist:
- [ ] Verify backup integrity
- [ ] Test failover procedures
- [ ] Validate data consistency
- [ ] Measure actual RTO/RPO
- [ ] Document lessons learned
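A schedule like this is easy to let slip, so a small check can flag tests that are overdue. The sketch below assumes the cadences can be approximated as fixed day counts (91, 30, and 365 days) and that the last run dates are tracked somewhere; the function name and sample dates are hypothetical.

```python
from datetime import date, timedelta

# ASSUMPTION: calendar cadences approximated as fixed intervals in days.
INTERVALS = {
    "tabletop": timedelta(days=91),            # quarterly
    "component_failover": timedelta(days=30),  # monthly
    "full_failover": timedelta(days=365),      # annually
}


def overdue_tests(last_run: dict[str, date], today: date) -> list[str]:
    """Return the test types that have never run or are past their interval."""
    overdue = []
    for test, interval in INTERVALS.items():
        last = last_run.get(test)
        if last is None or today - last > interval:
            overdue.append(test)
    return overdue


# Made-up history: no full failover has been recorded yet.
last_run = {
    "tabletop": date(2024, 1, 15),
    "component_failover": date(2024, 5, 20),
}
due = overdue_tests(last_run, today=date(2024, 6, 1))
print(due)  # ['tabletop', 'full_failover']
```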
Source
https://github.com/BagelHole/DevOps-Security-Agent-Skills/blob/main/compliance/continuity/disaster-recovery/SKILL.md
Overview
This skill guides the implementation of disaster recovery strategies and runbooks to meet RTO and RPO targets. It covers DR metrics, strategy options, AWS multi-region options, testing, and maintaining up-to-date runbooks for business continuity.
How This Skill Works
Define DR metrics (RTO and RPO) to establish acceptable downtime and data loss. Choose a DR strategy (Backup & Restore, Pilot Light, Warm Standby, or Multi-Site Active) and configure backups, replication, and automated failover. Regularly run DR tests and update runbooks based on lessons learned.
When to Use It
- Planning for business continuity and resilience in case of outages.
- Assessing data loss risk and aligning RPO targets with business needs.
- Implementing cross-region resiliency using AWS multi-region patterns.
- Validating DR readiness through scheduled testing (tabletop, component failover, full failover).
- Documenting and updating DR runbooks after tests and incidents.
Quick Start
- Step 1: Define DR metrics (RTO, RPO) and select a DR strategy (Backup & Restore, Pilot Light, Warm Standby, or Multi-Site Active).
- Step 2: Implement backups/replication and automate failover for the chosen strategy.
- Step 3: Schedule regular DR tests (tabletop, component failover, full failover) and update runbooks after each test.
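For the "automate failover" part of Step 2, a common pattern is to trigger failover only after several consecutive failed health checks, so that a single blip does not move traffic. The sketch below shows that decision logic only. The class name and the threshold of three are assumptions, and the action of promoting the DR site is left out.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical trigger: fail over after N consecutive failed checks."""
    failure_threshold: int = 3
    consecutive_failures: int = 0
    failed_over: bool = False

    def record(self, healthy: bool) -> bool:
        """Record one health-check result; return True when failover should start."""
        if self.failed_over:
            return False  # already failed over; do not trigger again
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            return True
        return False


monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, False, True, False, False, False, False]
decisions = [monitor.record(ok) for ok in checks]
print(decisions)
# [False, False, False, False, False, False, True, False]
```

The healthy result in the fourth check resets the counter, so failover starts only at the seventh check, after three failures in a row.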
Best Practices
- Test DR on a fixed schedule (tabletop, component failover, full failover)
- Automate failover where possible
- Document all procedures
- Update runbooks after every test and incident
- Consider cross-region replication (AWS Multi-Region) to improve resilience
Example Use Cases
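- A company uses Backup & Restore and accepts an hours-long RTO in exchange for the lowest cost. Backup frequency sets the RPO: weekly backups can lose up to a week of data, so an hours-level RPO needs backups every few hours.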
- An application implements Pilot Light with a cross-region RDS replica to enable quick restoration in another region.
- A fintech app deploys Warm Standby to keep RTO to minutes and RPO to seconds, and conducts frequent failover checks.
- An enterprise runs Multi-Site Active across two Regions with near-zero RTO/RPO for critical services.
- DR testing schedules include tabletop (quarterly), component failover (monthly), and full failover (annually) with a lessons-learned process.