# 🛡️ Case Study: Bulletproofing Genesis Infrastructure with ChaosMonkey DR Drills
**Date:** May 10, 2025
**Organization:** Genesis Hosting Technologies
**Lead Engineer:** Doc (Genesis Radio, Infrastructure Director)

---
## 🎯 Objective
Design and validate a robust, automated disaster recovery (DR) system for Genesis infrastructure, covering PostgreSQL, MinIO object storage, and ZFS-backed media, using an external Linode-hosted testbed named **ChaosMonkey**.

---
## 🧩 Infrastructure Overview
| Component | Role | Location |
|------------------|--------------------------------------|-----------------------------|
| PostgreSQL | Primary/replica database nodes | zcluster.technodrome1/2 |
| MinIO | S3-compatible object storage | shredder |
| ZFS | Primary media storage backend | minioraid5, thevault |
| GenesisSync | Hybrid mirroring and integrity check | Deployed to all asset nodes |
| ChaosMonkey | DR simulation and restore target | Linode |
---
## 🧰 Tools Developed
### `genesis_sync.sh`
- Mirrors local ZFS to MinIO and vice versa
- Supports verification, dry-run, and audit mode
- Alerts via KrangBot on error or drift
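
How `genesis_sync.sh` detects drift is not described in this case study. One common approach is to compare checksum manifests from both sides of the mirror. The sketch below illustrates that idea in Python; the function names are hypothetical, and it only builds a manifest for a local directory tree. A MinIO-side manifest would have to come from an object listing, which is not shown.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large media files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (as a relative POSIX path) to its SHA-256."""
    return {
        path.relative_to(root).as_posix(): sha256_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_drift(source: dict[str, str], mirror: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and report every key that differs between them."""
    return {
        "missing_in_mirror": sorted(source.keys() - mirror.keys()),
        "extra_in_mirror": sorted(mirror.keys() - source.keys()),
        "content_mismatch": sorted(
            key for key in source.keys() & mirror.keys() if source[key] != mirror[key]
        ),
    }
```

An empty result in all three lists would correspond to the "verified parity" state; anything else is the kind of drift that would warrant an alert.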
### `run_dr_failover.sh` & `run_dr_failback.sh`
- Safely fail over and restore PostgreSQL + GenesisSync
- Auto-promotes DB nodes
- Sends alerts via Telegram
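
The guardrails inside `run_dr_failover.sh` are not shown here, so the following is only a sketch of the kind of check an auto-promotion step might perform before touching a standby. It assumes the caller has already determined whether the standby is in recovery (for example via `SELECT pg_is_in_recovery()`) and whether the old primary is still reachable. The function name, the `force` flag, and the data directory path are hypothetical.

```python
from typing import Optional


def promotion_command(
    standby_in_recovery: bool,
    primary_reachable: bool,
    data_dir: str,
    force: bool = False,
) -> Optional[list[str]]:
    """Return the argv that promotes the standby, or None if promotion is unsafe."""
    if not standby_in_recovery:
        # The node is already a primary, so there is nothing to promote.
        return None
    if primary_reachable and not force:
        # Two writable primaries risk split-brain; require an explicit override.
        return None
    return ["pg_ctl", "promote", "-D", data_dir]


# Hypothetical data directory, for illustration only.
print(promotion_command(True, False, "/var/lib/postgresql/data"))
```

In a planned drill the production primary is still up (with writes frozen), so the override path would be the expected one; in an unplanned outage the default refusal guards against promoting while the primary is merely slow.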
### `genesis_clone_manager_multihost.sh`
- Clones live systems (DB, ZFS, MinIO) from prod to ChaosMonkey
- Runs with dry-run preview mode
- Multi-host orchestration via SSH
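
A dry-run preview combined with SSH orchestration can be sketched as a plan of `(host, command)` pairs that is always printed and only executed on request. This is an illustration in Python, not the contents of `genesis_clone_manager_multihost.sh`; the dataset name `tank/media` and snapshot label `dr-drill` are made up, and `shredder` is borrowed from the infrastructure table purely as an example host.

```python
import shlex
import subprocess


def build_ssh_command(host: str, remote_command: list[str]) -> list[str]:
    """Return the argv for running remote_command on host over SSH."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", host, shlex.join(remote_command)]


def run_plan(plan: list[tuple[str, list[str]]], dry_run: bool = True) -> list[str]:
    """Print every step of the plan; execute it only when dry_run is False."""
    rendered = []
    for host, remote_command in plan:
        argv = build_ssh_command(host, remote_command)
        line = shlex.join(argv)
        rendered.append(line)
        print(("[dry-run] " if dry_run else "[run] ") + line)
        if not dry_run:
            subprocess.run(argv, check=True)  # stop at the first failing host
    return rendered


# Hypothetical plan: recursively snapshot one dataset on one host.
plan = [("shredder", ["zfs", "snapshot", "-r", "tank/media@dr-drill"])]
run_plan(plan, dry_run=True)
```

Defaulting `dry_run` to `True` means a real run has to be requested explicitly, which fits the guardrail-first approach described in this case study.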
### `genesis_clone_validator.sh`
- Runs on ChaosMonkey
- Verifies PostgreSQL snapshot, ZFS datasets, and MinIO content
- Can optionally trigger a GenesisSync `--verify`
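
As one example of what a validator on ChaosMonkey might check, the sketch below compares the datasets reported by `zfs list -H -o name` against a list of datasets expected after a clone. It is an assumption about the approach, not the actual `genesis_clone_validator.sh`; the helper names and the `tank/...` dataset names are hypothetical.

```python
import subprocess


def parse_dataset_names(zfs_list_output: str) -> set[str]:
    """Parse `zfs list -H -o name` output, which has one dataset name per line."""
    return {line.strip() for line in zfs_list_output.splitlines() if line.strip()}


def missing_datasets(zfs_list_output: str, expected: set[str]) -> list[str]:
    """Return the expected datasets that are absent from the listing."""
    return sorted(expected - parse_dataset_names(zfs_list_output))


def list_local_datasets() -> str:
    """Run `zfs list` in scripted mode; requires the zfs CLI on this host."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-o", "name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Usage with canned output, so the example runs without ZFS installed.
sample = "tank\ntank/media\ntank/media/archive\n"
print(missing_datasets(sample, {"tank/media", "tank/db"}))
```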
---
## 🧪 DR Drill Process (Stage 3 - Controlled Live Test)
1. 🔒 Freeze writes on production nodes
2. 📤 Snapshot and clone entire stack to ChaosMonkey
3. 🔁 Promote standby PostgreSQL and redirect test traffic
4. 🧪 Validate application behavior and data consistency
5. 📩 Alert via KrangBot with sync/report logs
6. ✅ Trigger safe failback using snapshot + delta sync
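
The ordering above suggests a runner that executes steps 1 to 5 in sequence, stops at the first failure, and still performs the failback so production is never left frozen. The Python sketch below shows that control flow under those assumptions; the step names and callables are placeholders, and the real drill scripts may be structured differently.

```python
from typing import Callable

Step = tuple[str, Callable[[], None]]


def run_drill(steps: list[Step], failback: Callable[[], None]) -> list[str]:
    """Run steps in order, stop at the first error, and always run failback."""
    log: list[str] = []
    try:
        for name, action in steps:
            action()
            log.append(f"ok: {name}")
    except Exception as exc:
        # `name` is the step that raised; later steps are skipped.
        log.append(f"failed: {name}: {exc}")
    finally:
        failback()
        log.append("failback complete")
    return log


def failing_clone() -> None:
    raise RuntimeError("clone failed")


# Placeholder steps; the clone step fails, so promotion is never attempted.
report = run_drill(
    [
        ("freeze writes", lambda: None),
        ("clone to ChaosMonkey", failing_clone),
        ("promote standby", lambda: None),
    ],
    failback=lambda: None,
)
print("\n".join(report))
```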
---
## 🚨 Results
- **Recovery time (RTO)**: PostgreSQL in 3 min, full app < 10 min
- **Zero data loss** using basebackups and WAL
- **GenesisSync** completed with verified parity between ZFS and MinIO
- **Repeatable**: Same scripts reused weekly for validation
---
## 💡 Key Takeaways
- **Scripts are smarter than sleepy admins**: guardrails matter
- **ZFS + WAL + GitOps-style orchestration = rock solid DR**
- **Testing DR live on ChaosMonkey builds real confidence**
- **Failure Friday is not a risk; it's a training ground**
---
## 🌟 Final Thoughts
By taking DR out of theory and into action, Genesis Hosting Technologies ensures that data is not only safe but also recoverable, testable, and fully verified on demand. With ChaosMonkey in the mix, Genesis now embraces disaster on its own terms.