From 3dfb59ace83e8b841ec369314cdca48329219d4d Mon Sep 17 00:00:00 2001 From: DocTator Date: Sat, 10 May 2025 11:24:30 -0400 Subject: [PATCH] Auto-commit from giteapush.sh at 2025-05-10 11:24:30 --- documents/casestudies/chaosmonkey.md | 82 ++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) create mode 100644 documents/casestudies/chaosmonkey.md diff --git a/documents/casestudies/chaosmonkey.md b/documents/casestudies/chaosmonkey.md new file mode 100644 index 0000000..490d7fb --- /dev/null +++ b/documents/casestudies/chaosmonkey.md @@ -0,0 +1,82 @@ +# πŸ›‘οΈ Case Study: Bulletproofing Genesis Infrastructure with ChaosMonkey DR Drills + +**Date:** May 10, 2025 +**Organization:** Genesis Hosting Technologies +**Lead Engineer:** Doc (Genesis Radio, Infrastructure Director) + +--- + +## 🎯 Objective + +Design and validate a robust, automated disaster recovery (DR) system for Genesis infrastructure β€” including PostgreSQL, MinIO object storage, and ZFS-backed media β€” with an external testbed (Linode-hosted) named **ChaosMonkey**. + +--- + +## 🧩 Infrastructure Overview + +| Component | Role | Location | +|------------------|--------------------------------------|-----------------------------| +| PostgreSQL | Primary/replica database nodes | zcluster.technodrome1/2 | +| MinIO | S3-compatible object storage | shredder | +| ZFS | Primary media storage backend | minioraid5, thevault | +| GenesisSync | Hybrid mirroring and integrity check | Deployed to all asset nodes | +| ChaosMonkey | DR simulation and restore target | Linode | + +--- + +## 🧰 Tools Developed + +### `genesis_sync.sh` +- Mirrors local ZFS to MinIO and vice versa +- Supports verification, dry-run, and audit mode +- Alerts via KrangBot on error or drift + +### `run_dr_failover.sh` & `run_dr_failback.sh` +- Safely fail over and restore PostgreSQL + GenesisSync +- Auto-promotes DB nodes +- Sends alerts via Telegram + +### `genesis_clone_manager_multihost.sh` +- Clones live systems (DB, ZFS, MinIO) from prod to ChaosMonkey +- Runs with dry-run preview mode +- Multi-host orchestration via SSH + +### `genesis_clone_validator.sh` +- Runs on ChaosMonkey +- Verifies PostgreSQL snapshot, ZFS datasets, and MinIO content +- Can optionally trigger a GenesisSync `--verify` + +--- + +## πŸ§ͺ DR Drill Process (Stage 3 - Controlled Live Test) + +1. πŸ”’ Freeze writes on production nodes +2. πŸ“€ Snapshot and clone entire stack to ChaosMonkey +3. πŸ” Promote standby PostgreSQL and redirect test traffic +4. πŸ§ͺ Validate application behavior and data consistency +5. πŸ“© Alert via KrangBot with sync/report logs +6. βœ… Trigger safe failback using snapshot + delta sync + +--- + +## 🚨 Results + +- **Recovery time (RTO)**: PostgreSQL in 3 min, full app < 10 min +- **Zero data loss** using basebackups and WAL +- **GenesisSync** completed with verified parity between ZFS and MinIO +- **Repeatable**: Same scripts reused weekly for validation + +--- + +## πŸ’‘ Key Takeaways + +- **Scripts are smarter than sleepy admins** β€” guardrails matter +- **ZFS + WAL + GitOps-style orchestration = rock solid DR** +- **Testing DR live on ChaosMonkey builds real confidence** +- **Failure Friday is not a risk β€” it’s a training ground** + +--- + +## 🌟 Final Thoughts + +By taking DR out of theory and into action, Genesis Hosting Technologies ensures that not only is data safe β€” it’s recoverable, testable, and fully verified on demand. With ChaosMonkey in the mix, Genesis now embraces disaster… on its own terms.