# 🔧 Post-Mortem: Genesis Radio Storage Migration
**Date:** April 30, 2025
**Prepared by:** Doc
**Systems Affected:** StationPlaylist (SPL), Voice Tracker, Genesis Radio media backend
---
## 🧠 Executive Summary
Genesis Radio's backend was migrated from a legacy MinIO instance on local disk (ext4) to a new **ZFS-based, encrypted MinIO deployment on `shredderv2`**. This change was driven by the need for more stable performance, improved security, and a cleaner storage architecture with proper bucket separation.
This migration was completed **without touching production** until final validation, and all critical services remained online throughout the transition. We also revamped the rclone caching strategy to reduce freeze-ups and playback hiccups.
---
## ✅ What We Did
- Created **three new secure buckets**: `genesislibrary-secure`, `genesisassets-secure`, and `genesisshows-secure`
- Migrated data from the backup server using `rclone sync` (see the command sketch after this list):
  - `genesislibrary` came directly from backup
  - `genesisassets` and `genesisshows` were pulled from the same source bucket, with de-duping and cleanup to be completed post-migration
- Retained **original SPL drive letters** (`Q:\`, `R:\`) to avoid changes to the playout config
- Switched rclone mounts to point to the new secure buckets, with **aggressive VFS caching** using SSD-backed cache directories
- Took a clean **ZFS snapshot** (`@pre-s3-switch`) before switching over
- Confirmed no regression in SPL, VT Tracker, or streaming audio
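
As a minimal sketch of the steps above, assuming an rclone remote named `backup` for the backup server and `miniosec` for the new encrypted MinIO on `shredderv2` (the remote names, source bucket, and ZFS dataset path are placeholders, not the production values):

```sh
# Pre-sync each secure bucket from the backup copy (remote and bucket names assumed)
rclone sync backup:genesislibrary miniosec:genesislibrary-secure --progress
rclone sync backup:genesisassets  miniosec:genesisassets-secure  --progress
rclone sync backup:genesisassets  miniosec:genesisshows-secure   --progress  # same source bucket; de-dupe later

# Snapshot the MinIO dataset on shredderv2 before flipping the mounts over
zfs snapshot tank/minio@pre-s3-switch
```

If the cutover misbehaves, `zfs rollback tank/minio@pre-s3-switch` undoes it in one command.
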
---
## ⚙️ Technical Improvements
- **VFS caching overhaul** (mount sketch after this list):
  - Increased read-ahead (`1G`) and lowered the write-back wait
  - Split the cache between `X:\librarycache` and `L:\assetcache`
  - No more rclone choking on large files or freezing during transitions
- **Encrypted S3 storage** with isolated buckets per functional role
- **TLS-secured** Console and MinIO endpoints with automated renewal
- Mounted buckets at startup via batch script (future systemd equivalents to be implemented)
- Snapshot-based rollback in ZFS enabled post-deployment resilience
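
To illustrate the cache settings above, here is a hedged sketch of one mount line from the startup batch script; the remote name, which drive letter pairs with which bucket and cache directory, and every flag value other than the `1G` read-ahead are assumptions:

```bat
:: Library mount on Q: with its VFS cache on the SSD at X:\librarycache (pairing assumed)
rclone mount miniosec:genesislibrary-secure Q: ^
  --vfs-cache-mode full ^
  --cache-dir X:\librarycache ^
  --vfs-read-ahead 1G ^
  --vfs-write-back 5s ^
  --vfs-cache-max-size 200G
:: The assets mount (R: with L:\assetcache) follows the same pattern with its own cache dir.
```

Keeping each mount on its own cache directory lets the two buckets warm and evict independently instead of fighting over one cache.
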
---
## 🩹 What Went Weird (and We Fixed It)
- SPL froze during initial `mc mirror` attempts; switching to `rclone` resolved it and proved far faster for bulk copies
- Early cache tuning hit some hiccups, including sparse-file support issues, which were resolved by the switch to ZFS
- Missing media files in Mastodon were traced to uploads during sync; resolved with staged sync + retry before final switch
- Certbot automation wasn't configured; resolved with a systemd timer that stops nginx, renews the certificate, and restarts nginx automatically (sketch below)
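
One way to express that stop/renew/start flow is certbot's built-in renewal hooks; a minimal sketch of the command the timer-triggered service runs (the unit itself and the exact hook commands are assumptions, not the deployed config):

```sh
# Stop nginx, renew certificates, then restart nginx -- the flow the timer automates
certbot renew \
  --pre-hook  "systemctl stop nginx" \
  --post-hook "systemctl start nginx"
```
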
---
## 🧯 What We Learned
- MinIO is solid, but **rclone wins for bulk sync performance**
- VFS cache settings **make or break** media-heavy workloads like SPL
- ZFS is a game-changer: no sparse file errors, reliable snapshots, clean rollback
- Planning matters: pre-syncing from backup avoided downtime
- Not touching prod until ready keeps stress and screwups to a minimum
---
## 📦 Next Steps
- [ ] Clean `genesisassets-secure` of misplaced show files
- [ ] Sync `azuracast` from live system (no backup copy yet)
- [ ] Build an automated snapshot send-to-backup workflow (`zfs send | ssh backup zfs recv`); see the sketch below
- [ ] Stage full failover simulation (optional but fun)
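
A minimal sketch of the planned send-to-backup workflow, assuming a dataset named `tank/minio` and an SSH alias `backup` (both placeholders):

```sh
# Take a dated snapshot and replicate it to the backup host (names are placeholders)
SNAP="tank/minio@$(date +%Y%m%d)"
zfs snapshot "$SNAP"
zfs send "$SNAP" | ssh backup zfs recv -u tank/minio-replica

# Later runs only need the delta since the previous snapshot, e.g.:
#   zfs send -i tank/minio@<previous> "$SNAP" | ssh backup zfs recv -u tank/minio-replica
```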