diff --git a/OPS.md b/OPS.md
index c8cce49..78a0608 100644
--- a/OPS.md
+++ b/OPS.md
@@ -16,12 +16,12 @@ When an alert fires (Critical or Warning), this guide tells you what to do so th
 
 | Type of Alert | Emoji | What it Means | Immediate Action |
 |:---|:---|:---|:---|
-| Critical Service Failure | 🔚 | A key service (like Mastodon, MinIO) is **down** | SSH into the server, try `systemctl restart `. |
-| Disk Filling Up | 📈 | Disk space critically low (under 10%) | SSH in and delete old logs/backups. Free up space **immediately**. |
-| Rclone Mount Error | 🐢 | Cache failed, mount not healthy | Restart the rclone mount process. (Usually a `systemctl restart rclone@`, or remount manually.) |
-| PostgreSQL Replication Lag | 💥 | Database replicas are falling behind | Check database health. Restart replication if needed. Alert admin if lag is >5 minutes. |
-| RAID Degraded | 🧸 | RAID array is degraded (missing a disk) | Open server console. Identify failed drive. Replace drive if possible. Otherwise escalate immediately. |
-| Log File Warnings | ⚠️ | Error patterns found in logs | Investigate. If system is healthy, **log it for later**. If errors worsen, escalate. |
+| [Critical Service Failure](#critical-service-failure-) | 🔚 | A key service (like Mastodon, MinIO) is **down** | SSH into the server, try `systemctl restart `. | A key service (like Mastodon, MinIO) is **down** | SSH into the server, try `systemctl restart `. |
+| [Disk Filling Up](#disk-filling-up-) | 📈 | Disk space critically low (under 10%) | SSH in and delete old logs/backups. Free up space **immediately**. | Disk space critically low (under 10%) | SSH in and delete old logs/backups. Free up space **immediately**. |
+| [Rclone Mount Error](#rclone-mount-error-) | 🐢 | Cache failed, mount not healthy | Restart the rclone mount process. (Usually a `systemctl restart rclone@`, or remount manually.) | Cache failed, mount not healthy | Restart the rclone mount process. (Usually a `systemctl restart rclone@`, or remount manually.) |
+| [PostgreSQL Replication Lag](#postgresql-replication-lag-) | 💥 | Database replicas are falling behind | Check database health. Restart replication if needed. Alert admin if lag is >5 minutes. | Database replicas are falling behind | Check database health. Restart replication if needed. Alert admin if lag is >5 minutes. |
+| [RAID Degraded](#raid-degraded-) | 🧸 | RAID array is degraded (missing a disk) | Open server console. Identify failed drive. Replace drive if possible. Otherwise escalate immediately. | RAID array is degraded (missing a disk) | Open server console. Identify failed drive. Replace drive if possible. Otherwise escalate immediately. |
+| [Log File Warnings](#log-file-warnings-) | ⚠️ | Error patterns found in logs | Investigate. If system is healthy, **log it for later**. If errors worsen, escalate. | Error patterns found in logs | Investigate. If system is healthy, **log it for later**. If errors worsen, escalate. |
 
 ---
 
@@ -35,7 +35,7 @@ When an alert fires (Critical or Warning), this guide tells you what to do so th
 ## 🛡️ Emergency Contacts
 | Role | Name | Contact |
 |:----|:-----|:--------|
-| Primary Admin | (You) | [YOUR CONTACT INFO] |
+| Primary Admin | (You) | [845-453-0820] |
 | Secondary | Brice | [BRICE CONTACT INFO] |
 
 (Replace placeholders with actual contact details.)