# 🧾 Postmortem: Mastodon Object Storage Migration to Secure S3 (MinIO)
**Date:** April 30, 2025
**Engineer:** Doc (Genesis Radio / Genesis Hosting)
---
## 🎯 Objective
Migrate Mastodon's object storage from an older MinIO bucket (`linodeassets`) to a new **ZFS-backed, encrypted** MinIO instance (`mastodonassets-secure`) on `shredderv2`, while maintaining uptime and improving storage performance and security.
---
## 🧱 Infrastructure Touched
- Mastodon (Docker-based, hosted on Linode)
- MinIO S3 Object Storage (`oldminio` → `secureminio`)
- Nginx (reverse proxy for Console + S3 endpoints)
- ZFS pool: `nexus/mastodonassets`
- Domains:
  - `shredderv2.sshjunkie.com` (S3 API)
  - `consolev2.sshjunkie.com` (MinIO Console UI)
---
## ⚠️ Issues Encountered
1. **403 Access Denied on Mastodon startup**
   - ✅ Root cause: the `genesisadminv2` MinIO user had no policy attached
   - 🔧 Fixed by attaching a policy in the Console UI once Console access was restored
2. **MinIO Console unreachable (`consolev2.sshjunkie.com`)**
   - The SSL certificate for the domain was missing
   - 🔧 Issued a new certificate with `certbot certonly --standalone`, then re-enabled the full HTTPS proxy
3. **Sync race conditions**
   - Some media files were uploaded to the old bucket while the long transfer was still running
   - 🔧 Mitigated by running an additional `rclone sync` pass before cutover
4. **Transfer performance bottlenecks**
   - The MinIO client (`mc mirror`) was too slow
   - ✅ Switched to `rclone`, which was drastically faster
5. **SPL (StationPlaylist) freezing during asset access**
   - Root cause: the rclone VFS cache choked on sparse-file writes under ext4
   - ✅ Fix: moved critical rclone mounts to ZFS-backed drives
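
The catch-up pass from issue 3 can be sketched as a short script. This is a minimal sketch, not the commands actually run during the migration: it assumes rclone remotes named `oldminio` and `secureminio` are defined in `rclone.conf`, and the `--transfers` and `--checkers` values are illustrative. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# ASSUMPTION: remote names match entries in rclone.conf; bucket names are from this postmortem.
SRC="oldminio:linodeassets"
DST="secureminio:mastodonassets-secure"

# Print the command unless DRY_RUN=0 is set, in which case run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# One sync pass; safe to repeat, since rclone only transfers new or changed objects.
sync_pass() {
  run rclone sync "$SRC" "$DST" --transfers 16 --checkers 32 --fast-list
}

# Confirm every object in the source exists and matches in the destination.
verify_copy() {
  run rclone check "$SRC" "$DST" --one-way
}

sync_pass     # pass 1: the long bulk transfer
sync_pass     # pass 2: catch-up for files uploaded during pass 1, just before cutover
verify_copy   # a non-zero exit here means the buckets still differ
```
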
---
## ✅ Success Criteria Met
- 🔒 All Mastodon assets are now stored in `mastodonassets-secure` with encryption
- 🪣 MinIO Console functional on `https://consolev2.sshjunkie.com`
- 🎯 Mastodon is running with zero visible user impact
- 💾 Snapshot (`nexus/mastodonassets@pre-s3-switch`) taken post-migration for rollback
- 🔁 Future syncs can now be performed cleanly from backup server instead of live system
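
For reference, the rollback snapshot can be created and inspected roughly as follows. This is a sketch under the assumption that the dataset and snapshot names are exactly those listed above; the `run` helper is hypothetical and only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
set -euo pipefail

DATASET="nexus/mastodonassets"
SNAP="${DATASET}@pre-s3-switch"

# Print the command unless DRY_RUN=0 is set, in which case run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

take_snapshot()  { run zfs snapshot "$SNAP"; }
list_snapshots() { run zfs list -t snapshot -r "$DATASET"; }

# Rolling back discards every change written to the dataset after the snapshot.
# If newer snapshots exist, zfs refuses unless -r is added, which destroys them.
rollback() { run zfs rollback "$SNAP"; }

take_snapshot
list_snapshots
```
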
---
## 🧠 Lessons Learned
- Always validate MinIO user policies before go-live
- Avoid redirects in the nginx `server` block for the domain during cert issuance
- ZFS dramatically improves caching performance with rclone VFS
- Catch-up sync passes around cutover are crucial for active-upload systems like Mastodon
- UI access to MinIO is a lifesaver for emergency fixes; keep it working
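
One way to validate a user's policy before go-live is a small preflight that exercises list, write, read, and delete against the new bucket. This is a sketch, not a procedure used in this migration: it assumes the `secureminio` rclone remote is configured with the same access keys Mastodon will use, and the probe object name is hypothetical. It prints the commands by default; set `DRY_RUN=0` to run them, in which case any 403 makes the script exit non-zero.

```shell
#!/usr/bin/env bash
set -euo pipefail

# ASSUMPTION: this remote uses the credentials of the user being validated.
BUCKET="secureminio:mastodonassets-secure"
PROBE="${BUCKET}/.preflight-probe"   # hypothetical object name

# Print the command unless DRY_RUN=0 is set, in which case run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

preflight() {
  run rclone lsd "$BUCKET"              # list permission
  run rclone rcat "$PROBE" <<< "ok"     # write permission (uploads stdin as an object)
  run rclone cat "$PROBE"               # read permission
  run rclone deletefile "$PROBE"        # delete permission
}

preflight
```
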
---
## 🔚 Follow-Up Actions
- [ ] Schedule `certbot renew --standalone` with a systemd timer
- [ ] Rotate MinIO user keys and audit access policies
- [ ] Monitor `/var/log/syslog` for VFS or sparse file errors
- [ ] Document the rclone mount and caching strategy for SPL and Mastodon
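
As a starting point for the first follow-up item, the script below writes a service and timer pair for renewal. It is a sketch under several assumptions: the unit names are hypothetical, certbot is assumed to live at `/usr/bin/certbot`, and nginx is assumed to hold port 80, which standalone mode needs free, hence the stop and start hooks. The units are written to a scratch directory for review rather than installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write to a scratch directory by default; copy to /etc/systemd/system after review.
UNIT_DIR="${UNIT_DIR:-$(mktemp -d)}"

cat > "$UNIT_DIR/certbot-renew.service" <<'EOF'
[Unit]
Description=Renew Let's Encrypt certificates with certbot

[Service]
Type=oneshot
# ASSUMPTION: certbot path, and nginx occupying port 80 during renewal.
ExecStart=/usr/bin/certbot renew --standalone --pre-hook "systemctl stop nginx" --post-hook "systemctl start nginx"
EOF

cat > "$UNIT_DIR/certbot-renew.timer" <<'EOF'
[Unit]
Description=Twice-daily certbot renewal check

[Timer]
OnCalendar=*-*-* 00,12:00:00
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
EOF

echo "Units written to $UNIT_DIR"
```

Once the files are in `/etc/systemd/system`, `systemctl daemon-reload` followed by `systemctl enable --now certbot-renew.timer` would activate the schedule. The pre and post hooks only run when a certificate is actually due for renewal, so nginx should not be restarted on every check.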