# 🔧 Post-Mortem: Genesis Radio Storage Migration
**Date:** April 30, 2025
**Prepared by:** Doc
**Systems Affected:** StationPlaylist (SPL), Voice Tracker, Genesis Radio media backend
---
## 🧠 Executive Summary
Genesis Radio's backend was migrated from a legacy MinIO instance using local disk (ext4) to a new **ZFS-based, encrypted MinIO deployment on `shredderv2`**. This change was driven by a need for more stable performance, improved security, and a cleaner storage architecture with proper bucket separation.
This migration was completed **without touching production** until final validation, and all critical services remained online throughout the transition. We also revamped the rclone caching strategy to reduce freeze-ups and playback hiccups.
---
## ✅ What We Did
- Created **three new secure buckets**: `genesislibrary-secure`, `genesisassets-secure`, and `genesisshows-secure`
- Migrated data from the backup server using `rclone sync` (see the sketch after this list):
  - `genesislibrary` came directly from backup
  - `genesisassets` and `genesisshows` were pulled from the same bucket, with de-duping and cleanup to be completed post-migration
- Retained **original SPL drive letters** (`Q:\`, `R:\`) to avoid changes to the playout config
- Switched rclone mounts to point to the new secure buckets, with **aggressive VFS caching** using SSD-backed cache directories
- Took a clean **ZFS snapshot** (`@pre-s3-switch`) before switching over (one-liner below)
- Confirmed no regression in SPL, VT Tracker, or streaming audio
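
For reference, a minimal sketch of the bucket setup and bulk sync. The `mc` alias `newminio`, the rclone remote names `backup` and `shredderv2`, and the source bucket for the assets/shows pull are all illustrative, not the actual production config:

```bash
# Create the three secure buckets on the new deployment
# ("newminio" is an illustrative mc alias for shredderv2)
mc mb newminio/genesislibrary-secure
mc mb newminio/genesisassets-secure
mc mb newminio/genesisshows-secure

# genesislibrary came straight from backup
rclone sync backup:genesislibrary shredderv2:genesislibrary-secure --progress

# assets and shows were pulled from the same source bucket
# (name assumed here); de-duping happens post-migration
rclone sync backup:genesisassets shredderv2:genesisassets-secure --progress
rclone sync backup:genesisassets shredderv2:genesisshows-secure --progress
```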
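
The pre-switch snapshot really is a one-liner; the pool/dataset name `tank/minio` below is an assumption:

```bash
# Snapshot the MinIO dataset before cutover (dataset name assumed)
zfs snapshot tank/minio@pre-s3-switch

# If the cutover goes sideways, rollback is just as quick:
# zfs rollback tank/minio@pre-s3-switch
```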
---
## ⚙️ Technical Improvements
- **VFS caching overhaul**:
  - Increased read-ahead (`1G`), lowered write-back wait
  - Split cache between `X:\librarycache` and `L:\assetcache`
  - No more rclone choking on large files or freezing during transitions
- **Encrypted S3 storage** with isolated buckets per functional role
- **TLS-secured** Console and MinIO endpoints with automated renewal
- Mounted buckets at startup via batch script, sketched after this list (future systemd equivalents to be implemented)
- Snapshot-based rollback in ZFS enabled post-deployment resilience
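
A rough sketch of that startup script. The remote name `genesis`, the bucket-to-drive mapping, and the write-back value are assumptions; `--vfs-read-ahead 1G` and the SSD-backed cache directories match what we actually ran:

```bat
@echo off
rem Startup mounts for SPL (requires rclone + WinFsp).
rem Remote name "genesis", the bucket-to-drive mapping, and the
rem 5s write-back value are assumptions for this sketch.
rem Q: and R: are the original SPL drive letters; caches live on
rem the SSD-backed X: and L: drives.

start "" rclone mount genesis:genesislibrary-secure Q: ^
  --vfs-cache-mode full ^
  --vfs-read-ahead 1G ^
  --vfs-write-back 5s ^
  --cache-dir X:\librarycache

start "" rclone mount genesis:genesisassets-secure R: ^
  --vfs-cache-mode full ^
  --vfs-read-ahead 1G ^
  --vfs-write-back 5s ^
  --cache-dir L:\assetcache
```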
---
## 🩹 What Went Weird (and We Fixed It)
- SPL froze during initial `mc mirror` attempts; switching to `rclone` fixed the freezes and ran dramatically faster on bulk transfers
- Early cache tuning hit some hiccups, including sparse-file support issues on ext4; switching to ZFS solved them
- Missing media files in Mastodon were traced to uploads during sync; resolved with staged sync + retry before final switch
- Certbot automation wasn't yet configured; resolved with a systemd timer that stops nginx, renews, and restarts nginx automatically (unit sketch below)
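
A minimal sketch of that timer pair, assuming the certs use certbot's standalone authenticator (which is why nginx has to be stopped during renewal); unit names, paths, and schedule are illustrative:

```ini
# /etc/systemd/system/certbot-renew.service (name/path illustrative)
[Unit]
Description=Renew TLS certs with nginx stopped

[Service]
Type=oneshot
ExecStartPre=/usr/bin/systemctl stop nginx
ExecStart=/usr/bin/certbot renew --quiet
# ExecStopPost runs even if renewal fails, so nginx always comes back
ExecStopPost=/usr/bin/systemctl start nginx

# /etc/systemd/system/certbot-renew.timer (name/path illustrative)
[Unit]
Description=Daily renewal window

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now certbot-renew.timer`.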
---
## 🧯 What We Learned
- MinIO is solid, but **rclone beats `mc mirror` for bulk sync performance**
- VFS cache settings **make or break** media-heavy workloads like SPL
- ZFS is a game-changer: no sparse file errors, reliable snapshots, clean rollback
- Planning matters: pre-syncing from backup avoided downtime
- Not touching prod until ready keeps stress and screwups to a minimum
---
## 📦 Next Steps
- [ ] Clean `genesisassets-secure` of misplaced show files
- [ ] Sync `azuracast` from the live system (no backup copy yet)
- [ ] Build automated snapshot send-to-backup workflow (`zfs send | ssh backup zfs recv`; roughed out below)
- [ ] Stage full failover simulation (optional but fun)
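
That replication step could look roughly like the following; the pool and dataset names on both ends, and the `backup` host alias, are assumptions:

```bash
# One-time full send to seed the backup pool (names assumed)
zfs send tank/minio@pre-s3-switch | ssh backup zfs recv -u backuppool/minio

# Nightly incremental: snapshot, then ship only the delta since
# the previous snapshot (here, the pre-switch one for simplicity)
SNAP="nightly-$(date +%F)"
zfs snapshot "tank/minio@${SNAP}"
zfs send -i @pre-s3-switch "tank/minio@${SNAP}" | ssh backup zfs recv -u backuppool/minio
```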