🔧 Post-Mortem: Genesis Radio Storage Migration
Date: April 30, 2025
Prepared by: Doc
Systems Affected: StationPlaylist (SPL), Voice Tracker, Genesis Radio media backend
🧠 Executive Summary
Genesis Radio’s backend was migrated from a legacy MinIO instance using local disk (ext4) to a new ZFS-based, encrypted MinIO deployment on shredderv2. This change was driven by a need for more stable performance, improved security, and a cleaner storage architecture with proper bucket separation.
This migration was completed without touching production until final validation, and all critical services remained online throughout the transition. We also revamped the rclone caching strategy to reduce freeze-ups and playback hiccups.
✅ What We Did
- Created three new secure buckets: `genesislibrary-secure`, `genesisassets-secure`, and `genesisshows-secure`
- Migrated data from the backup server using `rclone sync`: `genesislibrary` came directly from backup, while `genesisassets` and `genesisshows` were pulled from the same source bucket, with de-duping and cleanup to be completed post-migration (see the sketch after this list)
- Retained the original SPL drive letters (`Q:\`, `R:\`) to avoid changes to the playout config
- Switched rclone mounts to point at the new secure buckets, with aggressive VFS caching using SSD-backed cache directories
- Took a clean ZFS snapshot (`@pre-s3-switch`) before switching over
- Confirmed no regression in SPL, VT Tracker, or streaming audio
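The commands below are a minimal sketch of that sequence. The `backup` and `shredderv2` remote/alias names and the `tank/minio` dataset are illustrative placeholders, not the production values.

```
# Create the role-separated buckets on the new MinIO deployment
# ("shredderv2" is an illustrative mc alias / rclone remote name)
mc mb shredderv2/genesislibrary-secure
mc mb shredderv2/genesisassets-secure
mc mb shredderv2/genesisshows-secure

# Pre-sync data from the backup server; assets and shows come from the same source bucket
rclone sync backup:genesislibrary shredderv2:genesislibrary-secure --progress
rclone sync backup:genesisassets shredderv2:genesisassets-secure --progress
rclone sync backup:genesisassets shredderv2:genesisshows-secure --progress

# Snapshot the dataset backing MinIO before the cutover
# (pool/dataset name "tank/minio" is hypothetical)
zfs snapshot tank/minio@pre-s3-switch
```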
⚙️ Technical Improvements
- VFS caching overhaul (see the mount sketch after this list):
  - Increased read-ahead (`1G`), lowered write-back wait
  - Split cache between `X:\librarycache` and `L:\assetcache`
  - No more rclone choking on large files or freezing during transitions
- Encrypted S3 storage with isolated buckets per functional role
- TLS-secured Console and MinIO endpoints with automated renewal
- Mounted buckets at startup via batch script (future systemd equivalents to be implemented)
- Snapshot-based rollback in ZFS enabled post-deployment resilience
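For reference, here is a sketch of one mount line from the startup batch script, using the cache settings described above. The remote name and exact flag values are assumptions, not the production config.

```
:: Mount the library bucket at its original drive letter with aggressive VFS caching
:: (remote name and flag values are illustrative)
rclone mount shredderv2:genesislibrary-secure Q: ^
  --vfs-cache-mode full ^
  --vfs-read-ahead 1G ^
  --vfs-write-back 5s ^
  --cache-dir X:\librarycache
```

`--vfs-read-ahead` only takes effect with `--vfs-cache-mode full`, which is also what lets rclone serve large media files from the SSD cache instead of stalling on the network.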
🩹 What Went Weird (and We Fixed It)
- SPL froze during initial `mc mirror` attempts; solution: switched to `rclone`, which performed dramatically faster
- Some hiccups during early cache tuning, including sparse-file support issues; solved by switching to ZFS
- Missing media files in Mastodon were traced to uploads that landed mid-sync; resolved with a staged sync plus a retry pass before the final switch
- Certbot automation wasn’t configured; resolved with a systemd timer that stops nginx, renews, and restarts nginx automatically (sketched after this list)
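A minimal sketch of that renewal setup, assuming nginx holds port 80 and using hypothetical unit names:

```
# /etc/systemd/system/certbot-renew.service
[Unit]
Description=Renew TLS certificates for the MinIO and Console endpoints

[Service]
Type=oneshot
ExecStart=/usr/bin/certbot renew --pre-hook "systemctl stop nginx" --post-hook "systemctl start nginx"

# /etc/systemd/system/certbot-renew.timer
[Unit]
Description=Daily certbot renewal check

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now certbot-renew.timer`; certbot only restarts nginx when a certificate is actually due, so the daily check is cheap.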
🧯 What We Learned
- MinIO is solid, but rclone wins for bulk sync performance
- VFS cache settings make or break media-heavy workloads like SPL
- ZFS is a game-changer: no sparse file errors, reliable snapshots, clean rollback
- Planning matters: pre-syncing from backup avoided downtime
- Not touching prod until ready keeps stress and screwups to a minimum
📦 Next Steps
- Clean `genesisassets-secure` of misplaced show files
- Sync `azuracast` from the live system (no backup copy yet)
- Build an automated snapshot send-to-backup workflow (`zfs send | ssh backup zfs recv`; sketched after this list)
- Stage a full failover simulation (optional but fun)
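A sketch of what that workflow could look like, with illustrative dataset, snapshot, and host names:

```
# Baseline: full send of the MinIO dataset to the backup host
zfs snapshot tank/minio@backup-2025-04-30
zfs send tank/minio@backup-2025-04-30 | ssh backup zfs recv -u backup/minio

# Later runs send only the incremental delta between the last two snapshots
zfs snapshot tank/minio@backup-2025-05-01
zfs send -i tank/minio@backup-2025-04-30 tank/minio@backup-2025-05-01 | ssh backup zfs recv -u backup/minio
```

The `-u` on the receive side keeps the replicated dataset unmounted on the backup host, so it stays a passive copy until it is needed for a restore.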