🛡️ Case Study: Bulletproofing Genesis Infrastructure with ChaosMonkey DR Drills

Date: May 10, 2025
Organization: Genesis Hosting Technologies
Lead Engineer: Doc (Genesis Radio, Infrastructure Director)


🎯 Objective

Design and validate a robust, automated disaster recovery (DR) system for Genesis infrastructure (PostgreSQL, MinIO object storage, and ZFS-backed media), exercised against an external, Linode-hosted testbed named ChaosMonkey.


🧩 Infrastructure Overview

| Component   | Role                                | Location                    |
|-------------|-------------------------------------|-----------------------------|
| PostgreSQL  | Primary/replica database nodes      | zcluster.technodrome1/2     |
| MinIO       | S3-compatible object storage        | shredder                    |
| ZFS         | Primary media storage backend       | minioraid5, thevault        |
| GenesisSync | Hybrid mirroring and integrity check | Deployed to all asset nodes |
| ChaosMonkey | DR simulation and restore target    | Linode                      |

🧰 Tools Developed

genesis_sync.sh

  • Mirrors local ZFS to MinIO and vice versa
  • Supports verification, dry-run, and audit mode
  • Alerts via KrangBot on error or drift
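
To make the verification idea concrete, here is a minimal sketch of a parity check between two directory trees using SHA-256 manifests. It is not the actual genesis_sync.sh: the function names `manifest` and `verify_parity` are hypothetical, and it assumes both sides (the ZFS dataset and a local copy or mount of the MinIO bucket) are reachable as ordinary directories.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a verify pass; NOT the real genesis_sync.sh.

# Print "<sha256>  <relative path>" for every file under a directory, sorted by path.
manifest() {
  ( cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2 )
}

# Compare two trees. Returns 0 on parity, 1 on drift, 2 if a tree cannot be read.
verify_parity() {
  local src_manifest dst_manifest
  src_manifest=$(manifest "$1") || return 2
  dst_manifest=$(manifest "$2") || return 2
  if [ "$src_manifest" = "$dst_manifest" ]; then
    echo "OK: parity verified"
    return 0
  fi
  # A real script would raise its alert here (the write-up mentions KrangBot).
  echo "DRIFT: $1 and $2 differ" >&2
  return 1
}
```

Comparing manifests rather than file sizes or timestamps catches silent corruption, at the cost of reading every byte on both sides.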

run_dr_failover.sh & run_dr_failback.sh

  • Safely fail over and restore PostgreSQL + GenesisSync
  • Auto-promotes DB nodes
  • Sends alerts via Telegram
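
The promote-and-alert flow might look roughly like the sketch below. The real run_dr_failover.sh is not shown in this write-up, so everything here is an assumption: the helper names (`run`, `promote_standby`, `notify_telegram`, `failover`), the `DRY_RUN`, `TELEGRAM_TOKEN` and `TELEGRAM_CHAT_ID` variables, the `postgres` login role, and the use of `pg_promote()`, which requires PostgreSQL 12 or later.

```shell
#!/usr/bin/env bash
# Hypothetical failover sketch; NOT the real run_dr_failover.sh.

DRY_RUN="${DRY_RUN:-1}"   # default to preview mode so nothing runs by accident

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] $*"
  else
    "$@"
  fi
}

# Promote the standby on the given host (assumes PostgreSQL 12+ for pg_promote()).
promote_standby() {
  run psql -h "$1" -U postgres -Atc "SELECT pg_promote();"
}

# Send a message through the Telegram Bot API sendMessage method.
notify_telegram() {
  run curl -fsS "https://api.telegram.org/bot${TELEGRAM_TOKEN}/sendMessage" \
    -d "chat_id=${TELEGRAM_CHAT_ID}" --data-urlencode "text=$1"
}

# Promote first; only announce success if the promotion command succeeded.
failover() {
  promote_standby "$1" && notify_telegram "Failover: $1 promoted to primary"
}
```

Defaulting to dry-run means a mistyped invocation prints its plan instead of promoting a node; a live run would need `DRY_RUN=0` set explicitly.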

genesis_clone_manager_multihost.sh

  • Clones live systems (DB, ZFS, MinIO) from prod to ChaosMonkey
  • Runs with dry-run preview mode
  • Multi-host orchestration via SSH
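
A minimal sketch of SSH fan-out with a preview mode follows. It is illustrative only: `remote`, `for_each_host` and `DRY_RUN` are hypothetical names, and the host list and remote command in the trailing comment are placeholders rather than the real Genesis inventory.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of multi-host orchestration; NOT the real clone manager.

DRY_RUN="${DRY_RUN:-1}"   # preview by default

# Run a command on one host over SSH, or just print it in preview mode.
remote() {
  local host="$1"; shift
  if [ "$DRY_RUN" = "1" ]; then
    echo "[dry-run] ssh $host $*"
  else
    # BatchMode makes ssh fail instead of prompting for a password.
    ssh -o BatchMode=yes "$host" "$@"
  fi
}

# Run the same command on every host in a space-separated list.
# Stops at the first host that fails so a partial clone is noticed early.
for_each_host() {
  local hosts="$1" h; shift
  for h in $hosts; do
    remote "$h" "$@" || { echo "FAILED on $h" >&2; return 1; }
  done
}

# Example (placeholder hosts and dataset):
#   for_each_host "host-a host-b" zfs snapshot -r tank/media@dr-drill
```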

genesis_clone_validator.sh

  • Runs on ChaosMonkey
  • Verifies PostgreSQL snapshot, ZFS datasets, and MinIO content
  • Can optionally trigger a GenesisSync --verify
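
One way to structure such a validator is a small check runner that records pass/fail per probe and exits non-zero if anything failed. This is a sketch under assumptions, not the real genesis_clone_validator.sh: `check` and `summary` are hypothetical helpers, and the dataset name and `mc` alias in the commented examples are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical check runner; NOT the real genesis_clone_validator.sh.

PASS=0
FAIL=0

# check "<label>" <command...>: run a command quietly and record the outcome.
check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then
    PASS=$((PASS + 1)); echo "PASS  $label"
  else
    FAIL=$((FAIL + 1)); echo "FAIL  $label"
  fi
}

# Print totals and return non-zero if any check failed.
summary() {
  echo "checks passed: $PASS, failed: $FAIL"
  [ "$FAIL" -eq 0 ]
}

# Example probes (placeholder names), to be followed by `summary`:
#   check "PostgreSQL accepting connections" pg_isready -h localhost
#   check "ZFS dataset present"              zfs list -H -o name tank/media
#   check "MinIO bucket listable"            mc ls chaos/media
```

Running every probe before reporting, instead of stopping at the first failure, gives a complete picture of what the clone is missing.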

🧪 DR Drill Process (Stage 3 - Controlled Live Test)

  1. 🔒 Freeze writes on production nodes
  2. 📤 Snapshot and clone entire stack to ChaosMonkey
  3. 🔁 Promote standby PostgreSQL and redirect test traffic
  4. 🧪 Validate application behavior and data consistency
  5. 📩 Alert via KrangBot with sync/report logs
  6. Trigger safe failback using snapshot + delta sync
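
The ordering above matters: a failed freeze or clone should stop the drill before anything is promoted. A minimal sketch of a stage driver that enforces this is shown below. The `run_drill` function and the stage names are hypothetical, and the stage bodies are placeholders that only print what a real stage would do.

```shell
#!/usr/bin/env bash
# Hypothetical drill driver; stage bodies are placeholders, not the real scripts.

# Run each named stage function in order; abort at the first failure.
run_drill() {
  local stage
  for stage in "$@"; do
    echo "==> $stage"
    if ! "$stage"; then
      echo "ABORT: stage '$stage' failed; later stages skipped" >&2
      return 1
    fi
  done
  echo "Drill complete"
}

# Placeholder stages mirroring the six steps above.
freeze_writes()    { echo "freeze writes on production (placeholder)"; }
clone_stack()      { echo "snapshot and clone to ChaosMonkey (placeholder)"; }
promote_standby()  { echo "promote standby, redirect test traffic (placeholder)"; }
validate_app()     { echo "validate behavior and consistency (placeholder)"; }
send_report()      { echo "send sync/report logs (placeholder)"; }
failback()         { echo "failback via snapshot + delta sync (placeholder)"; }

# Example:
#   run_drill freeze_writes clone_stack promote_standby validate_app send_report failback
```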

🚨 Results

  • Recovery time (RTO): PostgreSQL restored in 3 min, full application in under 10 min
  • Zero data loss, using PostgreSQL base backups and WAL
  • GenesisSync completed with verified parity between ZFS and MinIO
  • Repeatable: Same scripts reused weekly for validation
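
For weekly runs, recovery time can be checked automatically rather than read off a log. The sketch below times a command and compares it to a budget in seconds. `time_step` is a hypothetical helper, and treating the figures above as budgets (for example 180 s for PostgreSQL, 600 s for the full application) is an assumption, not something the drill scripts are known to do.

```shell
#!/usr/bin/env bash
# Hypothetical RTO check; budgets are illustrative assumptions.

# time_step <budget_seconds> <command...>
# Returns 0 if the command succeeds within budget, 1 if it overruns,
# and 2 if the command itself fails.
time_step() {
  local budget="$1" start end elapsed; shift
  start=$(date +%s)
  "$@" || return 2
  end=$(date +%s)
  elapsed=$((end - start))
  echo "elapsed=${elapsed}s budget=${budget}s"
  [ "$elapsed" -le "$budget" ]
}

# Example (placeholder command):
#   time_step 180 ./restore_postgres_placeholder.sh
```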

💡 Key Takeaways

  • Scripts are smarter than sleepy admins — guardrails matter
  • ZFS + WAL + GitOps-style orchestration = rock solid DR
  • Testing DR live on ChaosMonkey builds real confidence
  • Failure Friday is not a risk; it's a training ground

🌟 Final Thoughts

By taking DR out of theory and into action, Genesis Hosting Technologies ensures that data is not only safe but also recoverable, testable, and fully verified on demand. With ChaosMonkey in the mix, Genesis now embraces disaster… on its own terms.