What are RTO and RPO?

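RTO (Recovery Time Objective) is the maximum acceptable downtime before service is restored. RPO (Recovery Point Objective) is the maximum acceptable data loss, expressed as the window of time before the incident whose data may be lost.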

What are the DR strategies covered?

Backup & Restore, Pilot Light, Warm Standby, and Multi-Site Active, each with different RTO/RPO profiles and costs.

How often should DR testing be performed?

Tabletop tests quarterly, component failover monthly, and full failover annually, with runbooks updated after each test.

disaster-recovery


npx machina-cli add skill BagelHole/DevOps-Security-Agent-Skills/disaster-recovery --openclaw


Disaster Recovery

Implement disaster recovery strategies and procedures.

DR Metrics

recovery_metrics:
  RTO:
    name: Recovery Time Objective
    meaning:
      - Maximum acceptable downtime
      - How long to restore service
  RPO:
    name: Recovery Point Objective
    meaning:
      - Maximum acceptable data loss
      - How much data can be lost
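As a rough illustration of how the two metrics are measured after an incident, the sketch below derives them from three timestamps. The helper name `elapsed_minutes` and the timeline values are made up for this example and are not part of the skill.

```shell
#!/usr/bin/env bash
# Hypothetical helper; the name and the timeline values below are illustrative.

# elapsed_minutes START_EPOCH END_EPOCH -> whole minutes between two Unix timestamps
elapsed_minutes() {
  echo $(( ($2 - $1) / 60 ))
}

# Example incident timeline in Unix epoch seconds (made-up values)
last_good_backup=1700000000   # last recoverable data point
outage_start=1700002700       # outage begins 45 minutes after that backup
service_restored=1700009900   # service is back 120 minutes after the outage began

# Measured RPO: data written between the last backup and the outage is lost
measured_rpo=$(elapsed_minutes "$last_good_backup" "$outage_start")
# Measured RTO: time from the start of the outage until service is restored
measured_rto=$(elapsed_minutes "$outage_start" "$service_restored")

echo "Measured RPO: ${measured_rpo} min, measured RTO: ${measured_rto} min"
```

The measured values only become meaningful once they are compared with the targets agreed with the business.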

DR Strategies

Strategy	RTO	RPO	Cost
Backup & Restore	Hours	Hours	$
Pilot Light	Minutes-Hours	Minutes	$$
Warm Standby	Minutes	Seconds	$$$
Multi-Site Active	Near-zero	Near-zero	$$$$
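One way to read the table is as a lookup from an RTO target to the cheapest strategy that can plausibly meet it. The sketch below does that for RTO only. The numeric cutoffs (240, 60 and 5 minutes) are arbitrary assumptions chosen to turn "Hours", "Minutes-Hours" and "Minutes" into numbers; they do not come from the table, and a real decision should also weigh RPO and cost.

```shell
#!/usr/bin/env bash
# suggest_strategy RTO_TARGET_MINUTES
# Hypothetical helper. The cutoffs are illustrative assumptions, not values
# from the strategy table.
suggest_strategy() {
  local rto_minutes=$1
  if   [ "$rto_minutes" -ge 240 ]; then echo "Backup & Restore"
  elif [ "$rto_minutes" -ge 60 ];  then echo "Pilot Light"
  elif [ "$rto_minutes" -ge 5 ];   then echo "Warm Standby"
  else                                  echo "Multi-Site Active"
  fi
}

suggest_strategy 480   # an 8-hour RTO target
suggest_strategy 30    # a 30-minute RTO target
```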

AWS Multi-Region

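# Cross-region RDS read replica, created in the destination Region.
# A source in another Region must be given as an ARN; 123456789012 is a placeholder account ID.
aws rds create-db-instance-read-replica \
  --db-instance-identifier dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-db \
  --source-region us-east-1 \
  --region us-west-2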

# S3 cross-region replication
aws s3api put-bucket-replication \
  --bucket source-bucket \
  --replication-configuration file://replication.json
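The `put-bucket-replication` call above reads its rules from `replication.json`, which the source does not show. A minimal sketch of such a file follows. The role name `replication-role`, the bucket `destination-bucket`, the rule ID and the account ID `123456789012` are placeholders, and the function name `write_replication_config` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes an example replication config to the given path
# (default: replication.json). All ARNs and names inside are placeholders.
write_replication_config() {
  cat > "${1:-replication.json}" <<'EOF'
{
  "Role": "arn:aws:iam::123456789012:role/replication-role",
  "Rules": [
    {
      "ID": "dr-replication",
      "Priority": 1,
      "Status": "Enabled",
      "Filter": {},
      "DeleteMarkerReplication": { "Status": "Disabled" },
      "Destination": { "Bucket": "arn:aws:s3:::destination-bucket" }
    }
  ]
}
EOF
}

write_replication_config replication.json

# Versioning must be enabled on both buckets before replication is configured,
# for example (not executed here):
#   aws s3api put-bucket-versioning --bucket source-bucket \
#     --versioning-configuration Status=Enabled
```

The empty `Filter` applies the rule to every object in the bucket. The IAM role also needs permission to read from the source bucket and write replicas to the destination.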

DR Testing

dr_test_schedule:
  tabletop: Quarterly
  component_failover: Monthly
  full_failover: Annually
  
test_checklist:
  - "[ ] Verify backup integrity"
  - "[ ] Test failover procedures"
  - "[ ] Validate data consistency"
  - "[ ] Measure actual RTO/RPO"
  - "[ ] Document lessons learned"
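The "Measure actual RTO/RPO" item can be turned into a simple pass/fail check at the end of a test. The sketch below assumes both the measured values and the targets are in minutes. The function name `check_objective` and the sample numbers are illustrative only.

```shell
#!/usr/bin/env bash
# check_objective NAME MEASURED_MINUTES TARGET_MINUTES
# Hypothetical helper: prints PASS or FAIL and returns non-zero on a miss.
check_objective() {
  local name=$1 measured=$2 target=$3
  if [ "$measured" -le "$target" ]; then
    echo "PASS ${name}: ${measured} min (target ${target} min)"
  else
    echo "FAIL ${name}: ${measured} min (target ${target} min)"
    return 1
  fi
}

# Sample results from a made-up failover test
check_objective RTO 42 60
check_objective RPO 20 15 || echo "Record the RPO miss in lessons learned"
```

Returning a non-zero status on a miss lets the check gate a pipeline step or trigger a follow-up task.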

Best Practices

Regular DR testing
Automate failover where possible
Document all procedures
Update runbooks after tests

Source

git clone https://github.com/BagelHole/DevOps-Security-Agent-Skills.git
# Skill file: compliance/continuity/disaster-recovery/SKILL.md


Overview

This skill guides the implementation of disaster recovery strategies and runbooks to meet RTO and RPO targets. It covers DR metrics, strategy options, AWS multi-region options, testing, and maintaining up-to-date runbooks for business continuity.

How This Skill Works

Define DR metrics (RTO and RPO) to establish acceptable downtime and data loss. Choose a DR strategy (Backup & Restore, Pilot Light, Warm Standby, or Multi-Site Active) and configure backups, replication, and automated failover. Regularly run DR tests and update runbooks based on lessons learned.

When to Use It

Planning for business continuity and resilience in case of outages.
Assessing data loss risk and aligning RPO targets with business needs.
Implementing cross-region resiliency using AWS multi-region patterns.
Validating DR readiness through scheduled testing (tabletop, component failover, full failover).
Documenting and updating DR runbooks after tests and incidents.

Quick Start

Step 1: Define DR metrics (RTO, RPO) and select a DR strategy (Backup & Restore, Pilot Light, Warm Standby, or Multi-Site Active).
Step 2: Implement backups/replication and automate failover for the chosen strategy.
Step 3: Schedule regular DR tests (tabletop, component failover, full failover) and update runbooks after each test.
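For Step 2, automated failover of a cross-region RDS read replica could be sketched as below. This is an illustration under assumptions, not the skill's own procedure: the `run` wrapper, the `DRY_RUN` variable and the function name `promote_dr_replica` are made up for this example, while `dr-replica` and `us-west-2` are the values used in the AWS Multi-Region section.

```shell
#!/usr/bin/env bash
# Hypothetical failover helper; function names and DRY_RUN are illustrative.
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

promote_dr_replica() {
  local replica=${1:-dr-replica} region=${2:-us-west-2}
  # Promote the read replica to a standalone, writable instance.
  run aws rds promote-read-replica \
    --db-instance-identifier "$replica" --region "$region"
  # Block until the promoted instance reports "available".
  run aws rds wait db-instance-available \
    --db-instance-identifier "$replica" --region "$region"
}

promote_dr_replica dr-replica us-west-2
```

Promotion stops replication from the source, so the step is one-way. Applications still have to be repointed at the promoted instance, and the runbook should say how.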

Best Practices

Regular DR testing
Automate failover where possible
Document all procedures
Update runbooks after tests
Consider cross-region replication (AWS Multi-Region) to improve resilience

Example Use Cases

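A company uses Backup & Restore with scheduled backups and an RTO measured in hours, accepting slower recovery in exchange for the lowest cost. The backup interval bounds the RPO: with weekly backups, up to a week of data can be lost.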
An application implements Pilot Light with a cross-region RDS replica to enable quick restoration in another region.
A fintech app deploys Warm Standby to achieve minimal downtime and conducts frequent failover checks.
An enterprise runs Multi-Site Active across two Regions with near-zero RTO/RPO for critical services.
DR testing schedules include tabletop (quarterly), component failover (monthly), and full failover (annually) with a lessons-learned process.

Frequently Asked Questions
