65 lines
2.1 KiB
Markdown
65 lines
2.1 KiB
Markdown
# 📛 Case Study: Why RAID Is Not a Backup
|
|
|
|
## Overview
|
|
|
|
On May 4, 2025, we experienced a production data loss incident involving the `nexus` dataset on `shredderv1`, a Linux RAID5 server. Despite no hardware failure, critical files were lost due to an unintended command affecting live data.
|
|
|
|
This incident serves as a clear, real-world illustration of the maxim:
|
|
|
|
> **RAID protects against hardware failure — not human error, data corruption, or bad automation.**
|
|
|
|
---
|
|
|
|
## 🔍 What Happened
|
|
|
|
- `shredderv1` uses RAID5 for media storage.
|
|
- The dataset `nexus/miniodata` (housing `genesisassets`, `genesislibrary`, etc.) was accidentally destroyed.
|
|
- **No disks failed.** The failure was logical, not physical.
|
|
|
|
---
|
|
|
|
## 🔥 Impact
|
|
|
|
- StationPlaylist (SPL) lost access to the Genesis media library.
|
|
- MinIO bucket data was instantly inaccessible.
|
|
- Temporary outage and scrambling to reconfigure mounts, media, and streaming.
|
|
|
|
---
|
|
|
|
## ✅ Recovery
|
|
|
|
Thanks to our disaster recovery stack:
|
|
|
|
- Nightly **rsync backups** were synced to **The Vault** (backup server).
|
|
- **ZFS snapshots** existed on The Vault for the affected datasets.
|
|
- We restored the latest snapshot **from The Vault back to Shredder**, effectively reversing the loss.
|
|
- No data corruption occurred; sync validation showed dataset integrity.
|
|
|
|
---
|
|
|
|
## 🎓 Takeaway
|
|
|
|
This is a live demonstration of why:
|
|
|
|
- **RAID is not a backup**
|
|
- **Snapshots without off-host replication** are not enough
|
|
- **Real backups must be off-server and regularly tested**
|
|
|
|
---
|
|
|
|
## 🔐 Current Protection Measures
|
|
|
|
- Production data (`genesisassets`, `genesislibrary`) now replicated nightly to The Vault via `rsync`.
|
|
- ZFS snapshots are validated daily via a **dry-run restore validator**.
|
|
- Telegram alerts notify success/failure of backup verification jobs.
|
|
- Future goal: full ZFS storage on all production servers for native snapshot support.
|
|
|
|
---
|
|
|
|
## 🧠 Lessons Learned
|
|
|
|
- Always assume you'll delete the wrong thing eventually.
|
|
- Snapshots are amazing — **if** they're somewhere else.
|
|
- Automated restore testing should be part of every backup pipeline.
|
|
|