From 10a9c44a4eb8c7c4c485c40d6445660a8d98b209 Mon Sep 17 00:00:00 2001 From: doctator Date: Thu, 1 May 2025 10:01:49 +0000 Subject: [PATCH] =?UTF-8?q?Add=20=F0=9F=9A=80=20Genesis=20Radio=20-=20Heal?= =?UTF-8?q?thcheck=20Response=20Runbook?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- ... Radio - Healthcheck Response Runbook.-.md | 153 ++++++++++++++++++ 1 file changed, 153 insertions(+) create mode 100644 %F0%9F%9A%80 Genesis Radio - Healthcheck Response Runbook.-.md diff --git a/%F0%9F%9A%80 Genesis Radio - Healthcheck Response Runbook.-.md b/%F0%9F%9A%80 Genesis Radio - Healthcheck Response Runbook.-.md new file mode 100644 index 0000000..df597c7 --- /dev/null +++ b/%F0%9F%9A%80 Genesis Radio - Healthcheck Response Runbook.-.md @@ -0,0 +1,153 @@ +Welcome to the Wiki.# +## Purpose +When an alert fires (Critical or Warning), this guide tells you what to do so that **any team member** can react quickly, even if the admin is not available. + +--- + +## 🛠️ How to Use +- Every Mastodon DM or Dashboard alert gives you a **timestamp**, **server name**, and **issue**. +- Look up the type of issue in the table below. +- Follow the recommended action immediately. + +--- + +## 📋 Quick Response Table + +| Type of Alert | Emoji | What it Means | Immediate Action | +|:---|:---|:---|:---| +| [Critical Service Failure](#critical-service-failure-) | 🔚 | A key service (like Mastodon, MinIO) is **down** | SSH into the server, try `systemctl restart `. | A key service (like Mastodon, MinIO) is **down** | SSH into the server, try `systemctl restart `. | +| [Disk Filling Up](#disk-filling-up-) | 📈 | Disk space critically low (under 10%) | SSH in and delete old logs/backups. Free up space **immediately**. | Disk space critically low (under 10%) | SSH in and delete old logs/backups. Free up space **immediately**. | +| [Rclone Mount Error](#rclone-mount-error-) | 🐢 | Cache failed, mount not healthy | Restart the rclone mount process. (Usually a `systemctl restart rclone@`, or remount manually.) | Cache failed, mount not healthy | Restart the rclone mount process. (Usually a `systemctl restart rclone@`, or remount manually.) | +| [PostgreSQL Replication Lag](#postgresql-replication-lag-) | 💥 | Database replicas are falling behind | Check database health. Restart replication if needed. Alert admin if lag is >5 minutes. | Database replicas are falling behind | Check database health. Restart replication if needed. Alert admin if lag is >5 minutes. | +| [RAID Degraded](#raid-degraded-) | 🧸 | RAID array is degraded (missing a disk) | Open server console. Identify failed drive. Replace drive if possible. Otherwise escalate immediately. | RAID array is degraded (missing a disk) | Open server console. Identify failed drive. Replace drive if possible. Otherwise escalate immediately. | +| [Log File Warnings](#log-file-warnings-) | ⚠️ | Error patterns found in logs | Investigate. If system is healthy, **log it for later**. If errors worsen, escalate. | Error patterns found in logs | Investigate. If system is healthy, **log it for later**. If errors worsen, escalate. | + +--- + +## 💻 If Dashboard Shows +- ✅ **All Green** = No action needed. +- ⚠️ **Warnings** = Investigate soon. Not urgent unless repeated. +- 🚨 **Criticals** = Drop everything and act immediately. + +--- + +## 🛡️ Emergency Contacts +| Role | Name | Contact | +|:----|:-----|:--------| +| Primary Admin | (You) | [845-453-0820] | +| Secondary | Brice | [BRICE CONTACT INFO] | + +(Replace placeholders with actual contact details.) + +--- + +## ✍️ Example Cheat Sheet for Brice + +**Sample Mastodon DM:** +> 🚨 Genesis Radio Critical Healthcheck 2025-04-28 14:22:33 🚨 +> ⚡ 1 critical issue found: +> - 🔚 [mastodon] CRITICAL: Service mastodon-web not running! + +**Brice should:** +1. SSH into Mastodon server. +2. Run `systemctl restart mastodon-web`. +3. Confirm the service is running again. +4. If it fails or stays down, escalate to admin. + +--- + +# 🌟 TL;DR +- 🚨 Criticals: Act immediately. +- ⚠️ Warnings: Investigate soon. +- ✅ Healthy: No action needed. + +--- + +# 🛠️ Genesis Radio - Detailed Ops Playbook + +## Critical Service Failure (🔚) +**Symptoms:** Service marked as CRITICAL. + +**Fix:** +1. SSH into server. +2. `sudo systemctl status ` +3. `sudo systemctl restart ` +4. Confirm running. Check logs if it fails. + +--- + +## Disk Filling Up (📈) +**Symptoms:** Disk space critically low. + +**Fix:** +1. SSH into server. +2. `df -h` +3. Delete old logs: + ```bash + sudo rm -rf /var/log/*.gz /var/log/*.[0-9] + sudo journalctl --vacuum-time=2d + ``` +4. If still low, find big files and clean. + +--- + +## Rclone Mount Error (🐢) +**Symptoms:** Mount failure or slowness. + +**Fix:** +1. SSH into SPL server. +2. Unmount & remount: + ```bash + sudo fusermount -uz /path/to/mount + sudo systemctl restart rclone@ + ``` +3. Confirm mount is active. + +--- + +## PostgreSQL Replication Lag (💥) +**Symptoms:** Replica database lagging. + +**Fix:** +1. SSH into replica server. +2. Check lag: + ```bash + sudo -u postgres psql -c "SELECT * FROM pg_stat_replication;" + ``` +3. Restart PostgreSQL if stuck. +4. Monitor replication logs. + +--- + +## RAID Degraded (🧸) +**Symptoms:** RAID missing a disk. + +**Fix:** +1. SSH into server. +2. `cat /proc/mdstat` +3. Find failed drive: + ```bash + sudo mdadm --detail /dev/md0 + ``` +4. Replace failed disk, rebuild array: + ```bash + sudo mdadm --add /dev/md0 /dev/sdX + ``` + +--- + +## Log File Warnings (⚠️) +**Symptoms:** Errors in syslog or nginx. + +**Fix:** +1. SSH into server. +2. Review logs: + ```bash + grep ERROR /var/log/syslog + ``` +3. Investigate. Escalate if necessary. + +--- + +**Stay sharp. Early fixes prevent major downtime!** 🛡️💪 +