🛡️ Case Study: Bulletproofing Genesis Infrastructure with ChaosMonkey DR Drills
Date: May 10, 2025
Organization: Genesis Hosting Technologies
Lead Engineer: Doc (Genesis Radio, Infrastructure Director)
🎯 Objective
Design and validate a robust, automated disaster recovery (DR) system for Genesis infrastructure (PostgreSQL, MinIO object storage, and ZFS-backed media), using an external Linode-hosted testbed named ChaosMonkey as the restore target.
🧩 Infrastructure Overview
Component | Role | Location |
---|---|---|
PostgreSQL | Primary/replica database nodes | zcluster.technodrome1/2 |
MinIO | S3-compatible object storage | shredder |
ZFS | Primary media storage backend | minioraid5, thevault |
GenesisSync | Hybrid mirroring and integrity check | Deployed to all asset nodes |
ChaosMonkey | DR simulation and restore target | Linode |
🧰 Tools Developed
genesis_sync.sh
- Mirrors local ZFS to MinIO and vice versa
- Supports verification, dry-run, and audit mode
- Alerts via KrangBot on error or drift
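The verification idea behind genesis_sync.sh can be illustrated with a minimal sketch. This is not the actual script: it assumes both sides are reachable as local directory trees (for example a ZFS mountpoint and a locally mounted or mirrored copy of a bucket), and the function names `manifest` and `verify_parity` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: detect drift between two directory trees by checksum.

# manifest DIR: print "<sha256>  <relative path>" for every file, sorted by path.
manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# verify_parity SRC DST: return 0 when both manifests match, 1 on drift,
# 2 when either tree cannot be read.
verify_parity() {
  local src dst
  src=$(manifest "$1") || return 2
  dst=$(manifest "$2") || return 2
  if [ "$src" = "$dst" ]; then
    echo "parity OK"
  else
    echo "DRIFT between $1 and $2" >&2
    return 1
  fi
}
```

A non-zero return value is the natural place to hook in an alert. Comparing against a remote bucket would need an S3 client in place of `find`, which is outside this sketch.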
run_dr_failover.sh & run_dr_failback.sh
- Safely fail over and restore PostgreSQL + GenesisSync
- Auto-promotes DB nodes
- Sends alerts via Telegram
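As a rough illustration of the promote-and-alert flow, the sketch below wraps every side-effecting command in a dry-run guard. The standby host, the `postgres` user, and the `TELEGRAM_TOKEN` / `TELEGRAM_CHAT_ID` variable names are assumptions, not details from the real scripts. `pg_promote()` requires PostgreSQL 12 or later.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a guarded failover step.

# run CMD...: execute the command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

# alert TEXT: post a message through the Telegram Bot API sendMessage method.
alert() {
  run curl -fsS -X POST \
    "https://api.telegram.org/bot${TELEGRAM_TOKEN:-TOKEN}/sendMessage" \
    -d "chat_id=${TELEGRAM_CHAT_ID:-CHAT_ID}" \
    --data-urlencode "text=$1"
}

# failover STANDBY_HOST: promote the standby, then report the outcome.
failover() {
  local standby=$1
  if run psql -h "$standby" -U postgres -Atc 'SELECT pg_promote()'; then
    alert "DR failover: $standby promoted"
  else
    alert "DR failover: promotion of $standby FAILED"
    return 1
  fi
}
```

Running `DRY_RUN=1 failover standby.example` prints the commands that would be executed without touching the database, which makes the script safe to rehearse.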
genesis_clone_manager_multihost.sh
- Clones live systems (DB, ZFS, MinIO) from prod to ChaosMonkey
- Runs with dry-run preview mode
- Multi-host orchestration via SSH
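One way such multi-host cloning could look is sketched below for the ZFS part only. It assumes the orchestrating machine has SSH access to both source and target, and that input arrives as `host dataset` pairs on stdin. The names `clone_dataset`, `clone_all`, and `SNAP_TAG`, and the example hosts and datasets, are hypothetical. Dry-run is the default here so that nothing is sent unless `DRY_RUN=0` is set explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: snapshot a dataset and stream it to a DR target over SSH.

# clone_dataset SRC_HOST DATASET DST_HOST
clone_dataset() {
  local src_host=$1 dataset=$2 dst_host=$3
  local snap="${dataset}@${SNAP_TAG:-dr-$(date +%Y%m%d%H%M%S)}"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '[dry-run] ssh -n %s zfs snapshot %s\n' "$src_host" "$snap"
    printf '[dry-run] ssh -n %s zfs send %s | ssh %s zfs receive -F %s\n' \
      "$src_host" "$snap" "$dst_host" "$dataset"
    return 0
  fi
  # ssh -n keeps ssh from swallowing the host list that clone_all reads on stdin.
  # Without pipefail, only the receive side's exit status is reported.
  ssh -n "$src_host" zfs snapshot "$snap" &&
    ssh -n "$src_host" zfs send "$snap" | ssh "$dst_host" zfs receive -F "$dataset"
}

# clone_all DST_HOST: read "host dataset" pairs from stdin, stop on first failure.
clone_all() {
  local dst_host=$1 src_host dataset
  while read -r src_host dataset; do
    [ -n "$src_host" ] || continue
    clone_dataset "$src_host" "$dataset" "$dst_host" || return 1
  done
}
```

For example, `printf 'prod1 tank/media\n' | clone_all chaosmonkey` previews the snapshot and send/receive commands for one dataset.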
genesis_clone_validator.sh
- Runs on ChaosMonkey
- Verifies PostgreSQL snapshot, ZFS datasets, and MinIO content
- Can optionally trigger a GenesisSync --verify run
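A validator of this kind can be structured as a list of named checks that each pass or fail independently. The sketch below shows only that skeleton; the `check` helper and `FAILURES` counter are hypothetical names, and the example commands in the comments (host, port, dataset) are assumptions about the environment rather than the real validator's checks.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run named checks and count failures.

FAILURES=0

# check NAME CMD...: run CMD quietly, print PASS/FAIL, and tally failures.
check() {
  local name=$1
  shift
  if "$@" >/dev/null 2>&1; then
    printf 'PASS %s\n' "$name"
  else
    printf 'FAIL %s\n' "$name"
    FAILURES=$((FAILURES + 1))
  fi
}

# Example checks (assumed host, dataset, and port):
#   check "postgres accepts connections" pg_isready -h localhost
#   check "zfs dataset present"          zfs list tank/media
#   check "minio liveness"               curl -fsS http://localhost:9000/minio/health/live
#   [ "$FAILURES" -eq 0 ] || exit 1
```

Because every check runs even after an earlier one fails, a single validation pass reports all problems on the restore target at once.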
🧪 DR Drill Process (Stage 3 - Controlled Live Test)
- 🔒 Freeze writes on production nodes
- 📤 Snapshot and clone entire stack to ChaosMonkey
- 🔁 Promote standby PostgreSQL and redirect test traffic
- 🧪 Validate application behavior and data consistency
- 📩 Alert via KrangBot with sync/report logs
- ✅ Trigger safe failback using snapshot + delta sync
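The drill sequence above could be driven by a small stage runner like the sketch below, which stops at the first failing stage, records per-stage wall-clock time (useful for RTO figures), and always attempts failback. The six stage functions are placeholders that only echo; in a real drill each would call the corresponding script. All names here are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: fail-fast drill runner with per-stage timing.

# Placeholder stages; replace the bodies with real commands.
freeze_writes()   { echo "freezing writes on production"; }
clone_stack()     { echo "cloning stack to ChaosMonkey"; }
promote_standby() { echo "promoting standby PostgreSQL"; }
validate_app()    { echo "validating application and data"; }
send_report()     { echo "sending report"; }
failback()        { echo "failing back"; }

# run_stage LABEL CMD...: run one stage and print its status and duration.
run_stage() {
  local label=$1 start
  shift
  start=$(date +%s)
  if "$@"; then
    printf 'OK   %-22s %ss\n' "$label" "$(( $(date +%s) - start ))"
  else
    printf 'FAIL %-22s\n' "$label"
    return 1
  fi
}

# drill: run stages in order; failback runs even if an earlier stage failed.
drill() {
  local rc=0
  run_stage "freeze writes"    freeze_writes &&
  run_stage "clone stack"      clone_stack &&
  run_stage "promote standby"  promote_standby &&
  run_stage "validate"         validate_app &&
  run_stage "report"           send_report || rc=1
  run_stage "failback"         failback || rc=1
  return "$rc"
}
```

Calling `drill` prints one status line per stage; a non-zero exit status means at least one stage failed.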
🚨 Results
- Recovery time (RTO): PostgreSQL in 3 min, full app < 10 min
- Zero data loss, using PostgreSQL base backups plus WAL replay
- GenesisSync completed with verified parity between ZFS and MinIO
- Repeatable: Same scripts reused weekly for validation
💡 Key Takeaways
- Scripts are smarter than sleepy admins — guardrails matter
- ZFS + WAL + GitOps-style orchestration = rock solid DR
- Testing DR live on ChaosMonkey builds real confidence
- Failure Friday is not a risk — it’s a training ground
🌟 Final Thoughts
By taking DR out of theory and into action, Genesis Hosting Technologies ensures that not only is data safe — it’s recoverable, testable, and fully verified on demand. With ChaosMonkey in the mix, Genesis now embraces disaster… on its own terms.