# 📛 Case Study: Why RAID Is Not a Backup
## Overview
On May 4, 2025, we experienced a production data loss incident involving the `nexus` dataset on `shredderv1`, a Linux RAID5 server. No hardware failed; critical files were lost because an unintended command ran against live data.
This incident serves as a clear, real-world illustration of the maxim:
> **RAID protects against hardware failure — not human error, data corruption, or bad automation.**
---
## 🔍 What Happened
- `shredderv1` uses RAID5 for media storage.
- The dataset `nexus/miniodata` (housing `genesisassets`, `genesislibrary`, etc.) was accidentally destroyed.
- **No disks failed.** The failure was logical, not physical.
---
## 🔥 Impact
- StationPlaylist (SPL) lost access to the Genesis media library.
- MinIO bucket data became inaccessible immediately.
- A temporary outage followed while we scrambled to reconfigure mounts, media, and streaming.
---
## ✅ Recovery
Recovery was possible thanks to our disaster recovery stack:
- Nightly **rsync backups** were synced to **The Vault** (backup server).
- **ZFS snapshots** existed on The Vault for the affected datasets.
- We restored the latest snapshot **from The Vault back to Shredder**, effectively reversing the loss.
- No data corruption occurred; sync validation confirmed the integrity of the restored dataset.
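
The restore steps above can be sketched roughly as follows. This is a minimal illustration, not the exact procedure we ran: the dataset name `vault/genesisassets`, the mountpoint, the snapshot names, and the target path on `shredderv1` are all hypothetical placeholders. It assumes the snapshot listing comes from `zfs list -H -p -t snapshot -o name,creation` (tab-separated, epoch timestamps) and that the snapshot contents are read through the dataset's `.zfs/snapshot/` directory.

```python
from __future__ import annotations


def latest_snapshot(listing: str, dataset: str) -> str | None:
    """Pick the newest snapshot of `dataset` from tab-separated
    `zfs list -H -p -t snapshot -o name,creation` output."""
    newest = None  # (creation_epoch, full_snapshot_name)
    for line in listing.splitlines():
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, creation = parts
        ds, _, snap = name.partition("@")
        if ds != dataset or not snap or not creation.isdigit():
            continue  # snapshot of another dataset, or unparsable row
        candidate = (int(creation), name)
        if newest is None or candidate > newest:
            newest = candidate
    return newest[1] if newest else None


def restore_command(snapshot: str, mountpoint: str, target: str,
                    dry_run: bool = True) -> list[str]:
    """Build an rsync command that copies a snapshot's contents to `target`.

    Defaults to --dry-run so nothing is written until explicitly requested.
    """
    snap_name = snapshot.split("@", 1)[1]
    source = f"{mountpoint.rstrip('/')}/.zfs/snapshot/{snap_name}/"
    cmd = ["rsync", "--archive", "--itemize-changes"]
    if dry_run:
        cmd.append("--dry-run")
    cmd += [source, target]
    return cmd
```

Defaulting to a dry run is deliberate: after an incident caused by an unintended command, the restore itself should not be able to overwrite anything until someone has reviewed what it would do.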
---
## 🎓 Takeaway
This is a live demonstration of why:
- **RAID is not a backup**
- **Snapshots without off-host replication** are not enough
- **Real backups must be off-server and regularly tested**
---
## 🔐 Current Protection Measures
- Production data (`genesisassets`, `genesislibrary`) is now replicated nightly to The Vault via `rsync`.
- ZFS snapshots are checked daily by a **dry-run restore validator**.
- Telegram alerts report the success or failure of backup verification jobs.
- Future goal: full ZFS storage on all production servers for native snapshot support.
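
As a rough idea of what a dry-run check with Telegram alerting could look like, here is a minimal sketch. It is not our actual validator, whose internals are not described here. It assumes the check runs `rsync --dry-run --itemize-changes` and inspects the output; the `max_changes` threshold and the function names are hypothetical. The alert uses the Telegram Bot API `sendMessage` method, with the bot token and chat ID supplied by the caller.

```python
from __future__ import annotations

import json
import urllib.parse
import urllib.request


def summarize_dry_run(itemized: str) -> dict[str, int]:
    """Count file transfers and deletions in `rsync --itemize-changes` output."""
    counts = {"transfers": 0, "deletions": 0}
    for line in itemized.splitlines():
        if line.startswith("*deleting"):
            counts["deletions"] += 1
        elif line[:2] in (">f", "<f"):
            # ">f" / "<f" mark a regular file that would be received / sent
            counts["transfers"] += 1
    return counts


def verification_message(dataset: str, counts: dict[str, int],
                         max_changes: int) -> tuple[bool, str]:
    """Decide pass/fail against a (hypothetical) change limit and build the alert text."""
    changed = counts["transfers"] + counts["deletions"]
    ok = changed <= max_changes
    status = "OK" if ok else "FAILED"
    text = (
        f"[{status}] backup verification for {dataset}: "
        f"{counts['transfers']} to transfer, {counts['deletions']} to delete "
        f"(limit {max_changes})"
    )
    return ok, text


def send_telegram(token: str, chat_id: str, text: str) -> bool:
    """POST the message to the Telegram Bot API; returns the API's `ok` flag."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return bool(json.load(resp).get("ok", False))
```

Sending an alert on success as well as failure matters: a verification job that silently stops running looks identical to one that never fails.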
---
## 🧠 Lessons Learned
- Always assume you'll delete the wrong thing eventually.
- Snapshots are amazing — **if** they're somewhere else.
- Automated restore testing should be part of every backup pipeline.