🧾 Postmortem: Mastodon Object Storage Migration to Secure S3 (MinIO)

Date: April 30, 2025
Engineer: Doc (Genesis Radio / Genesis Hosting)


🎯 Objective

Migrate Mastodon's object storage from an older MinIO bucket (linodeassets) to a new ZFS-backed, encrypted MinIO instance (mastodonassets-secure) on shredderv2, while maintaining uptime and improving storage performance and security.


🧱 Infrastructure Touched

  • Mastodon (Docker-based, hosted on Linode)
  • MinIO S3 Object Storage (oldminio → secureminio)
  • Nginx (reverse proxy for Console + S3 endpoints)
  • ZFS pool: nexus/mastodonassets
  • Domains:
    • shredderv2.sshjunkie.com (S3 API)
    • consolev2.sshjunkie.com (MinIO Console UI)

⚠️ Issues Encountered

  1. 403 Access Denied on Mastodon startup

    • Root cause: genesisadminv2 MinIO user had no attached policy
    • 🔧 Fixed via Console UI after re-enabling access
  2. MinIO Console unreachable (consolev2.sshjunkie.com)

    • SSL cert for the domain was missing
    • 🔧 Used certbot certonly --standalone to issue new cert, re-enabled full HTTPS proxy
  3. Sync race conditions

    • Some media files were uploaded to the old bucket during the long transfer
    • 🔧 Mitigated by running an additional rclone sync pass before cutover
  4. Transfer performance bottlenecks

    • MinIO client (mc mirror) was too slow
    • Switched to rclone, saw drastic speed improvement
  5. SPL (StationPlaylist) freezing during asset access

    • Root cause: the rclone VFS cache choked on sparse file writes under ext4
    • Fix: moved critical rclone mounts to ZFS-backed drives
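
The catch-up pass from issue 3 might look roughly like the sketch below. The rclone remote names `oldminio` and `secureminio` are assumptions (the postmortem does not confirm them), the `--transfers` and `--checkers` values are illustrative, and the script only prints each command unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch of a final catch-up sync before cutover.
# Assumed: rclone remotes named "oldminio" and "secureminio" (hypothetical names).
set -eu

SRC="${SRC:-oldminio:linodeassets}"
DST="${DST:-secureminio:mastodonassets-secure}"

# Print the command by default; execute it only when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Copy anything uploaded to the old bucket since the bulk transfer.
# Note: sync also deletes destination objects that are missing from the source.
catch_up_sync() {
  run rclone sync "$SRC" "$DST" --transfers 16 --checkers 32
}

# Confirm every source object also exists, unchanged, on the destination.
verify_copy() {
  run rclone check "$SRC" "$DST" --one-way
}

catch_up_sync
verify_copy
```

Running the check after the last sync, and only then switching Mastodon to the new endpoint, keeps the window for missed uploads as small as possible.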

✅ Success Criteria Met

  • 🔒 All Mastodon assets are now stored in mastodonassets-secure with encryption
  • 🪣 MinIO Console functional on https://consolev2.sshjunkie.com
  • 🎯 Mastodon is running with zero visible user impact
  • 💾 Snapshot (nexus/mastodonassets@pre-s3-switch) taken post-migration for rollback
  • 🔁 Future syncs can now be performed cleanly from the backup server instead of the live system
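
For reference, the rollback snapshot listed above can be created and inspected with standard ZFS commands. This is a minimal sketch using the dataset and snapshot names from this postmortem; the `run` helper and the `DATASET`/`SNAP` variables are illustrative, and nothing is executed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: create, list, and (if needed) roll back to the pre-switch snapshot.
set -eu

DATASET="${DATASET:-nexus/mastodonassets}"
SNAP="${SNAP:-pre-s3-switch}"

# Print the command by default; execute it only when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

take_snapshot() {
  run zfs snapshot "${DATASET}@${SNAP}"
}

list_snapshots() {
  run zfs list -t snapshot -r "${DATASET}"
}

# Rolling back discards every change made after the snapshot.
# A plain rollback only works for the most recent snapshot;
# adding -r would also destroy any newer snapshots.
roll_back() {
  run zfs rollback "${DATASET}@${SNAP}"
}

take_snapshot
list_snapshots
```

`roll_back` is defined but deliberately not called, since a rollback is destructive and should be a manual decision.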

🧠 Lessons Learned

  • Always validate MinIO user policies before go-live
  • Avoid redirects in the Nginx server block for the domain during cert issuance
  • ZFS dramatically improves caching performance with rclone VFS
  • Final catch-up syncs around cutover are crucial for active upload systems like Mastodon
  • UI access to MinIO is a lifesaver for emergency fixes — keep it working
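
The first lesson can be turned into a quick pre-go-live check from the command line. The sketch below assumes an `mc` alias named `secureminio` with admin credentials (a hypothetical name) and uses MinIO's built-in `readwrite` policy as a stand-in; the policy that was actually attached is not stated here, and a bucket-scoped custom policy may be the better choice. Commands are printed rather than executed unless `RUN=1` is set.

```shell
#!/bin/sh
# Sketch of a pre-go-live policy check for the MinIO user from issue 1.
# Assumed: mc alias "secureminio" (hypothetical) and the built-in "readwrite" policy.
set -eu

ALIAS="${ALIAS:-secureminio}"
MINIO_USER="${MINIO_USER:-genesisadminv2}"
POLICY="${POLICY:-readwrite}"

# Print the command by default; execute it only when RUN=1.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Show the user's status and attached policy.
# A user with no attached policy is what produced the 403 on startup.
show_user() {
  run mc admin user info "$ALIAS" "$MINIO_USER"
}

# Attach a policy to the user (syntax used by recent mc releases).
attach_policy() {
  run mc admin policy attach "$ALIAS" "$POLICY" --user "$MINIO_USER"
}

show_user
```

`attach_policy` is left uncalled so the script stays read-only by default; older `mc` releases used a different subcommand for attaching policies, so check `mc admin policy --help` on the installed version.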

🔚 Follow-Up Actions

  • Schedule certbot renew --standalone with a systemd timer
  • Rotate MinIO user keys and audit access policies
  • Monitor /var/log/syslog for VFS or sparse file errors
  • Document your rclone mount and caching strategy for SPL and Mastodon
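
One possible shape for the renewal timer in the first follow-up item is sketched below. The unit names, the `/usr/bin/certbot` path, the twice-daily schedule, and the Nginx stop/start hooks are all assumptions: standalone mode needs port 80 free, so the hooks briefly stop the proxy, and certbot runs them only when a certificate is actually due. The script writes the unit files to a local directory for review instead of installing them.

```shell
#!/bin/sh
# Sketch: generate a systemd service + timer for "certbot renew --standalone".
# Assumed: certbot at /usr/bin/certbot and Nginx managed by systemd.
set -eu

UNIT_DIR="${UNIT_DIR:-./systemd-units}"
mkdir -p "$UNIT_DIR"

# One-shot service that performs the renewal.
cat > "$UNIT_DIR/certbot-renew.service" <<'EOF'
[Unit]
Description=Renew TLS certificates with certbot (standalone)

[Service]
Type=oneshot
ExecStart=/usr/bin/certbot renew --standalone --pre-hook "systemctl stop nginx" --post-hook "systemctl start nginx"
EOF

# Timer that triggers the service twice a day with a random delay.
cat > "$UNIT_DIR/certbot-renew.timer" <<'EOF'
[Unit]
Description=Run certbot renew twice daily

[Timer]
OnCalendar=*-*-* 00,12:00:00
RandomizedDelaySec=1h
Persistent=true

[Install]
WantedBy=timers.target
EOF

echo "Wrote units to $UNIT_DIR"
```

After review, the files would be copied to `/etc/systemd/system`, followed by `systemctl daemon-reload` and `systemctl enable --now certbot-renew.timer`. Some distribution packages already ship their own certbot timer, so it is worth checking `systemctl list-timers` first to avoid running two.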