# 🚀 Genesis Radio - Healthcheck Response Runbook
## Purpose
When an alert fires (Critical or Warning), this guide tells you what to do so that anyone can react quickly, even if the admin is not available.

---
## 🛠️ How to Use
- Every Mastodon DM or Dashboard alert gives you a **timestamp**, **server name**, and **issue**.
- Look up the type of issue in the table below.
- Follow the recommended action immediately.
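
As a rough illustration of the first bullet, the sketch below pulls the server name, severity, and issue out of a single alert line. The `[server] SEVERITY: message` layout is assumed from the sample DM in this runbook, and `parse_alert` is a hypothetical helper name, not part of the healthcheck tooling.

```shell
#!/bin/sh
# Hypothetical helper: read alert lines on stdin and print the fields
# found in lines shaped like "[server] SEVERITY: message".
# Lines that do not match the assumed layout are silently skipped.
parse_alert() {
    sed -n 's/^.*\[\([^]]*\)\] \([A-Z]*\): \(.*\)$/server=\1 severity=\2 issue=\3/p'
}

# Example: feed it the issue line from an alert DM.
printf '%s\n' '- 🔚 [mastodon] CRITICAL: Service mastodon-web not running!' | parse_alert
```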
---
## 📋 Quick Response Table
| Type of Alert | Emoji | What it Means | Immediate Action |
|:---|:---|:---|:---|
| Critical Service Failure | 🔚 | A key service (like Mastodon, MinIO) is **down** | SSH into the server, try `systemctl restart <service>`. |
| Disk Filling Up | 📈 | Disk space critically low (less than 10% free) | SSH in and delete old logs/backups. Free up space **immediately**. |
| Rclone Mount Error | 🐢 | Cache failed, mount not healthy | Restart the rclone mount process, usually with `systemctl restart rclone@<mount>`. If that fails, remount manually. |
| PostgreSQL Replication Lag | 💥 | Database replicas are falling behind | Check database health. Restart replication if needed. Alert admin if lag is >5 minutes. |
| RAID Degraded | 🧸 | RAID array is degraded (missing a disk) | Open server console. Identify failed drive. Replace drive if possible. Otherwise escalate immediately. |
| Log File Warnings | ⚠️ | Error patterns found in logs | Investigate. If system is healthy, **log it for later**. If errors worsen, escalate. |
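
For the "Disk Filling Up" row, a minimal sketch like the one below can confirm which filesystem is actually short on space before anything is deleted. It assumes the 10% threshold means free space, that mount points contain no spaces, and that logs live under `/var/log`. The function names are hypothetical.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "<mount point> <percent free>"
# for each filesystem (the header line is skipped).
percent_free() {
    awk 'NR > 1 { used = $5; sub(/%/, "", used); print $6, 100 - used }'
}

# Report only the filesystems below the 10% free threshold.
low_disk() {
    percent_free | awk -v min=10 '$2 < min { print $1 " has only " $2 "% free" }'
}

# List the ten largest files and directories (sizes in KiB) under a
# directory, so they can be reviewed before anything is removed.
# The default path is an assumption.
biggest_logs() {
    du -ak "${1:-/var/log}" 2>/dev/null | sort -rn | head -n 10
}

# Check every mounted filesystem on this server.
df -P | low_disk
```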
---
## 💻 If Dashboard Shows
- ✅ **All Green** = No action needed.
- ⚠️ **Warnings** = Investigate soon. Not urgent unless repeated.
- 🚨 **Criticals** = Drop everything and act immediately.
---
## 🛡️ Emergency Contacts
| Role | Name | Contact |
|:----|:-----|:--------|
| Primary Admin | (You) | [YOUR CONTACT INFO] |
| Secondary | Brice | [BRICE CONTACT INFO] |

(Replace placeholders with actual contact details.)

---
## ✍️ Example Cheat Sheet for Brice
**Sample Mastodon DM:**
> 🚨 Genesis Radio Critical Healthcheck 2025-04-28 14:22:33 🚨
> ⚡ 1 critical issue found:
> - 🔚 [mastodon] CRITICAL: Service mastodon-web not running!

**Brice should:**
1. SSH into the Mastodon server.
2. Run `systemctl restart mastodon-web`.
3. Confirm the service is running again.
4. If it fails or stays down, escalate to admin.
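
Steps 2 to 4 above could be wrapped in a small helper along these lines. This is only a sketch: `restart_and_verify` and `SETTLE_SECONDS` are hypothetical names, the 5-second wait is an arbitrary assumption, and `systemctl restart` will typically need root privileges.

```shell
#!/bin/sh
# Restart a systemd unit, wait briefly, then confirm it is active.
# A non-zero return status means the problem should be escalated.
restart_and_verify() {
    unit="$1"
    if ! systemctl restart "$unit"; then
        echo "ESCALATE: restart of $unit failed"
        return 1
    fi
    # Give the service a moment to start (or crash) before checking it.
    sleep "${SETTLE_SECONDS:-5}"
    if systemctl is-active --quiet "$unit"; then
        echo "OK: $unit is running"
    else
        echo "ESCALATE: $unit is still down"
        return 1
    fi
}

# Usage, run on the affected server:
#   restart_and_verify mastodon-web
```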
---
## 🌟 TL;DR
- 🚨 Criticals: Act immediately.
- ⚠️ Warnings: Investigate soon.
- ✅ Healthy: No action needed.
---
**Stay sharp. Our uptime and service quality depend on quick, calm responses!** 🛡️💪