📛 Case Study: Why RAID Is Not a Backup

Overview

On May 4, 2025, we experienced a production data loss incident involving the nexus dataset on shredderv1, a Linux RAID5 server. No hardware failed; critical files were lost because an unintended command was run against live data.

This incident serves as a clear, real-world illustration of the maxim:

RAID protects against hardware failure — not human error, data corruption, or bad automation.

🔍 What Happened

shredderv1 uses RAID5 for media storage.
The dataset nexus/miniodata (housing genesisassets, genesislibrary, etc.) was accidentally destroyed.
No disks failed. The failure was logical, not physical.

🔥 Impact

StationPlaylist (SPL) lost access to the Genesis media library.
MinIO bucket data became inaccessible immediately.
A temporary outage followed while we scrambled to reconfigure mounts, media, and streaming.

✅ Recovery

Thanks to our disaster recovery stack:

Nightly rsync backups had been synced to The Vault (our backup server).
ZFS snapshots of the affected datasets existed on The Vault.
We restored the latest snapshot from The Vault back to shredderv1, effectively reversing the loss.
No data corruption occurred; sync validation confirmed that the restored dataset was intact.
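
The sync validation step is not described in detail here, so the following is only a minimal sketch of one way to confirm that a restored tree matches its backup: hash every file on both sides and report the differences. The function names `file_digest` and `compare_trees` are hypothetical, and the sketch assumes both trees are reachable as local paths (for example, with the backup mounted read-only).

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(backup: Path, restored: Path) -> dict[str, list[str]]:
    """Compare two directory trees by relative path and content hash.

    Hypothetical helper: only regular files are compared; permissions,
    ownership, and timestamps are ignored.
    """
    def index(root: Path) -> dict[str, str]:
        return {
            p.relative_to(root).as_posix(): file_digest(p)
            for p in root.rglob("*")
            if p.is_file()
        }

    src, dst = index(backup), index(restored)
    return {
        # Present in the backup but absent after the restore.
        "missing": sorted(set(src) - set(dst)),
        # Present after the restore but not in the backup.
        "extra": sorted(set(dst) - set(src)),
        # Present on both sides with different content.
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }
```

A restore would count as clean when all three lists come back empty. Hashing every file is slow on a large media library, so a real job might sample files or compare sizes first.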

🎓 Takeaway

This is a live demonstration of why:

RAID is not a backup
Snapshots without off-host replication are not enough
Real backups must be off-server and regularly tested

🔐 Current Protection Measures

Production data (genesisassets, genesislibrary) is now replicated nightly to The Vault via rsync.
ZFS snapshots are validated daily via a dry-run restore validator.
Telegram alerts report the success or failure of each backup verification job.
Future goal: full ZFS storage on all production servers for native snapshot support.
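
As an illustration of how a verification job and its alert might fit together, here is a hedged sketch. It is not the validator described above: it assumes a checksum-based rsync dry run as the check, and the function names, message wording, and the `token` and `chat_id` values are all hypothetical. The alert goes through the Telegram Bot API `sendMessage` method.

```python
import subprocess
import urllib.parse
import urllib.request


def rsync_check_command(source: str, dest: str) -> list[str]:
    """Build an rsync invocation that lists differences without copying."""
    return ["rsync", "--archive", "--checksum", "--dry-run",
            "--itemize-changes", source, dest]


def summarize(returncode: int, stdout: str) -> tuple[bool, str]:
    """Turn rsync's exit code and itemized output into (ok, message).

    Assumption: any itemized line counts as a difference, including
    metadata-only changes such as directory timestamps.
    """
    pending = [line for line in stdout.splitlines() if line.strip()]
    if returncode != 0:
        return False, f"Backup verification FAILED: rsync exited with {returncode}"
    if pending:
        return False, f"Backup verification FAILED: {len(pending)} item(s) differ"
    return True, "Backup verification OK: replica matches source"


def telegram_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(url, data=body)  # data present, so POST


def run_check(source: str, dest: str, token: str, chat_id: str) -> bool:
    """Run the dry-run comparison and alert on success as well as failure."""
    result = subprocess.run(rsync_check_command(source, dest),
                            capture_output=True, text=True)
    ok, message = summarize(result.returncode, result.stdout)
    urllib.request.urlopen(telegram_request(token, chat_id, message), timeout=10)
    return ok
```

Sending a message on success as well as failure means that silence itself becomes a signal: if no alert arrives, the job did not run.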

🧠 Lessons Learned

Always assume you'll delete the wrong thing eventually.
Snapshots are amazing — if they're somewhere else.
Automated restore testing should be part of every backup pipeline.
