2.1 KiB
2.1 KiB
📛 Case Study: Why RAID Is Not a Backup
Overview
On May 4, 2025, we experienced a production data loss incident involving the nexus
dataset on shredderv1
, a Linux RAID5 server. Despite no hardware failure, critical files were lost due to an unintended command affecting live data.
This incident serves as a clear, real-world illustration of the maxim:
RAID protects against hardware failure — not human error, data corruption, or bad automation.
🔍 What Happened
shredderv1
uses RAID5 for media storage.- The dataset
nexus/miniodata
(housinggenesisassets
,genesislibrary
, etc.) was accidentally destroyed. - No disks failed. The failure was logical, not physical.
🔥 Impact
- StationPlaylist (SPL) lost access to the Genesis media library.
- MinIO bucket data was instantly inaccessible.
- Temporary outage and scrambling to reconfigure mounts, media, and streaming.
✅ Recovery
Thanks to our disaster recovery stack:
- Nightly rsync backups were synced to The Vault (backup server).
- ZFS snapshots existed on The Vault for the affected datasets.
- We restored the latest snapshot from The Vault back to Shredder, effectively reversing the loss.
- No data corruption occurred; sync validation showed dataset integrity.
🎓 Takeaway
This is a live demonstration of why:
- RAID is not a backup
- Snapshots without off-host replication are not enough
- Real backups must be off-server and regularly tested
🔐 Current Protection Measures
- Production data (
genesisassets
,genesislibrary
) now replicated nightly to The Vault viarsync
. - ZFS snapshots are validated daily via a dry-run restore validator.
- Telegram alerts notify success/failure of backup verification jobs.
- Future goal: full ZFS storage on all production servers for native snapshot support.
🧠 Lessons Learned
- Always assume you'll delete the wrong thing eventually.
- Snapshots are amazing — if they're somewhere else.
- Automated restore testing should be part of every backup pipeline.