# 🛡️ Case Study: Bulletproofing Genesis Infrastructure with ChaosMonkey DR Drills

**Date:** May 10, 2025

**Organization:** Genesis Hosting Technologies

**Lead Engineer:** Doc (Genesis Radio, Infrastructure Director)

---

## 🎯 Objective

Design and validate a robust, automated disaster recovery (DR) system for Genesis infrastructure (including PostgreSQL, MinIO object storage, and ZFS-backed media) with an external testbed (Linode-hosted) named **ChaosMonkey**.

---

## 🧩 Infrastructure Overview

| Component   | Role                                 | Location                    |
|-------------|--------------------------------------|-----------------------------|
| PostgreSQL  | Primary/replica database nodes       | zcluster.technodrome1/2     |
| MinIO       | S3-compatible object storage         | shredder                    |
| ZFS         | Primary media storage backend        | minioraid5, thevault        |
| GenesisSync | Hybrid mirroring and integrity check | Deployed to all asset nodes |
| ChaosMonkey | DR simulation and restore target     | Linode                      |

---

## 🧰 Tools Developed

### `genesis_sync.sh`

- Mirrors local ZFS to MinIO and vice versa
- Supports verification, dry-run, and audit mode
- Alerts via KrangBot on error or drift

### `run_dr_failover.sh` & `run_dr_failback.sh`

- Safely fail over and restore PostgreSQL + GenesisSync
- Auto-promotes DB nodes
- Sends alerts via Telegram

### `genesis_clone_manager_multihost.sh`

- Clones live systems (DB, ZFS, MinIO) from prod to ChaosMonkey
- Runs with dry-run preview mode
- Multi-host orchestration via SSH

### `genesis_clone_validator.sh`

- Runs on ChaosMonkey
- Verifies PostgreSQL snapshot, ZFS datasets, and MinIO content
- Can optionally trigger a GenesisSync `--verify`

---

## 🧪 DR Drill Process (Stage 3 - Controlled Live Test)

1. 🔒 Freeze writes on production nodes
2. 📤 Snapshot and clone entire stack to ChaosMonkey
3. 🔁 Promote standby PostgreSQL and redirect test traffic
4. 🧪 Validate application behavior and data consistency
5. 📩 Alert via KrangBot with sync/report logs
6. ✅ Trigger safe failback using snapshot + delta sync

---

## 🚨 Results

- **Recovery time (RTO)**: PostgreSQL in 3 min, full app < 10 min
- **Zero data loss** using base backups and WAL
- **GenesisSync** completed with verified parity between ZFS and MinIO
- **Repeatable**: Same scripts reused weekly for validation

---

## 💡 Key Takeaways

- **Scripts are smarter than sleepy admins**: guardrails matter
- **ZFS + WAL + GitOps-style orchestration = rock solid DR**
- **Testing DR live on ChaosMonkey builds real confidence**
- **Failure Friday is not a risk; it’s a training ground**

---

## 🌟 Final Thoughts

By taking DR out of theory and into action, Genesis Hosting Technologies ensures that not only is data safe, it’s recoverable, testable, and fully verified on demand. With ChaosMonkey in the mix, Genesis now embraces disaster… on its own terms.
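
---

## 🧾 Appendix: Drill Sequence Sketch

The six-step drill sequence described above lends itself to an ordered runner with a dry-run mode and a stop-on-first-failure rule. The Python below is only a minimal sketch of that idea, not the Genesis scripts themselves: `DrillStep`, `run_drill`, and `placeholder` are hypothetical names introduced for illustration, and each step's action is a stub that a real drill would replace with a call to the corresponding tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class DrillStep:
    """One stage of the drill (hypothetical structure, for illustration)."""
    name: str
    action: Callable[[], bool]  # returns True on success


def run_drill(steps: List[DrillStep], dry_run: bool = True) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    In dry-run mode no action is called; the plan is only logged.
    Returns (overall_success, log_lines).
    """
    log: List[str] = []
    for number, step in enumerate(steps, start=1):
        if dry_run:
            log.append(f"[dry-run] {number}. {step.name}")
            continue
        ok = step.action()
        log.append(f"{number}. {step.name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # Later steps (e.g. failback) must not run on a broken state.
            return False, log
    return True, log


def placeholder() -> bool:
    """Stub action; a real drill would invoke the relevant script here."""
    return True


# Step names are taken from the drill process described in this case study.
DRILL_STEPS = [
    DrillStep("Freeze writes on production nodes", placeholder),
    DrillStep("Snapshot and clone entire stack to ChaosMonkey", placeholder),
    DrillStep("Promote standby PostgreSQL and redirect test traffic", placeholder),
    DrillStep("Validate application behavior and data consistency", placeholder),
    DrillStep("Alert via KrangBot with sync/report logs", placeholder),
    DrillStep("Trigger safe failback using snapshot + delta sync", placeholder),
]

if __name__ == "__main__":
    success, lines = run_drill(DRILL_STEPS, dry_run=True)
    print("\n".join(lines))
```

Keeping the dry-run path free of side effects means the plan can be previewed before every weekly run, and stopping at the first failed step avoids triggering failback against a stack that never validated.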