From fc66fc65c909618406ee6fca9c5a97595fa0f5ce Mon Sep 17 00:00:00 2001
From: DocTator
Date: Thu, 1 May 2025 10:38:49 -0400
Subject: [PATCH] Auto-commit from giteapush.sh at 2025-05-01 10:38:49

---
 incident_response.md | 128 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 128 insertions(+)
 create mode 100644 incident_response.md

diff --git a/incident_response.md b/incident_response.md
new file mode 100644
index 0000000..412f671
--- /dev/null
+++ b/incident_response.md
@@ -0,0 +1,128 @@
+# ⚠️ Incident Response Checklists for Common Failures
+
+These checklists are designed to normalize responses and reduce stress during downtime in your infrastructure.
+
+---
+
+## 🔌 Node Reboot or Power Loss
+
+- [ ] Verify ZFS pools are imported: `zpool status`
+- [ ] Check all ZFS mounts: `mount | grep /mnt`
+- [ ] Confirm Proxmox VM auto-start behavior
+- [ ] Validate system services: PostgreSQL, Mastodon, MinIO, etc.
+- [ ] Run `genesis-tools/healthcheck.sh` or equivalent
+
+---
+
+## 🐘 PostgreSQL Database Failure
+
+- [ ] Ping cluster VIP
+- [ ] Check replication lag: `pg_stat_replication`
+- [ ] Inspect ClusterControl / Patroni node status
+- [ ] Verify HAProxy is routing to the correct primary
+- [ ] If failover occurred, verify application connections
+
+---
+
+## 🌐 Network Drop or Routing Issue
+
+- [ ] Check interface status: `ip a`, `nmcli`
+- [ ] Ping gateway and internal/external hosts
+- [ ] Test inter-VM connectivity
+- [ ] Inspect HAProxy or Keepalived logs for failover triggers
+- [ ] Validate DNS and NTP services are accessible
+
+---
+
+## 📦 Object Storage Outage (MinIO / rclone)
+
+- [ ] Confirm rclone mounts: `mount | grep rclone`
+- [ ] View VFS cache stats: `rclone rc vfs/stats`
+- [ ] Verify MinIO service and disk health
+- [ ] Check cache disk space: `df -h`
+- [ ] Restart rclone mounts if needed
+
+---
+
+## 🧠 Split Brain in PostgreSQL Cluster (ClusterControl)
+
+### Symptoms:
+- Two nodes think they're primary
+- WAL timelines diverge
+- Errors in ClusterControl, or inconsistent data in apps
+
+### Immediate Actions:
+- [ ] Use `pg_controldata` to verify state and timeline on both nodes
+- [ ] Temporarily pause failover automation
+- [ ] Identify true primary (most recent WAL, longest uptime, etc.)
+- [ ] Stop false primary immediately: `systemctl stop postgresql`
+
+### Fix the Broken Replica:
+- [ ] Rebuild broken node:
+  ```bash
+  pg_basebackup -h <primary-host> -D /var/lib/postgresql/XX/main -U replication -P --wal-method=stream
+  ```
+- [ ] Restart replication and confirm sync
+
+### Post-Mortem:
+- [ ] Audit any split writes for data integrity
+- [ ] Review Keepalived/HAProxy fencing logic
+- [ ] Add dual-primary alerts with `pg_is_in_recovery()` checks
+- [ ] Document findings and update HA policies
+
+---
+
+## 🐘 PostgreSQL Replication Lag / Sync Delay
+
+- [ ] Query replication status:
+  ```sql
+  SELECT client_addr, state, sync_state, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;
+  ```
+- [ ] Compare LSNs for lag
+- [ ] Check for disk I/O, CPU, or network bottlenecks
+- [ ] Ensure WAL retention and streaming are healthy
+- [ ] Restart replica or sync service if needed
+
+---
+
+## 🪦 MinIO Bucket Inaccessibility or Failure
+
+- [ ] Run `mc admin info local` to check node status
+- [ ] Confirm MinIO access credentials/environment
+- [ ] Check rclone and MinIO logs
+- [ ] Restart MinIO service: `systemctl restart minio`
+- [ ] Check storage backend health/mounts
+
+---
+
+## 🐳 Dockerized Service Crash (e.g., AzuraCast)
+
+- [ ] Inspect containers: `docker ps -a`
+- [ ] View logs: `docker logs <container>`
+- [ ] Check disk space: `df -h`
+- [ ] Restart with Docker or Compose:
+  ```bash
+  docker restart <container>
+  docker-compose down && docker-compose up -d
+  ```
+
+---
+
+## 🔒 Fail2Ban or Genesis Shield Alert Triggered
+
+- [ ] Tail logs:
+  ```bash
+  journalctl -u fail2ban
+  tail -f /var/log/fail2ban.log
+  ```
+- [ ] Inspect logs for false positives
+- [ ] Unban IP if needed:
+  ```bash
+  fail2ban-client set <jail> unbanip <ip>
+  ```
+- [ ] Notify via Mastodon/Telegram alert system
+- [ ] Tune jail thresholds or IP exemptions
+
+---
+
+> ✅ Store these in a Gitea wiki or `/root/checklists/` for quick access under pressure.
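
Note on the "Compare LSNs for lag" step in the replication lag checklist: the patch does not say how to compare the values, so the following is only a minimal sketch of one way to do it by hand. It assumes the `sent_lsn` and `replay_lsn` values have already been copied out of the `pg_stat_replication` query. The helper names `lsn_to_bytes` and `lsn_lag_bytes` are hypothetical and are not part of PostgreSQL or of this commit.

```bash
#!/usr/bin/env bash
# Sketch only: convert PostgreSQL LSNs (e.g. "0/3000060") into byte offsets
# so two of them can be subtracted. Helper names are illustrative.

# An LSN is two hexadecimal numbers separated by "/": the high 32 bits and
# the low 32 bits of a 64-bit WAL position.
lsn_to_bytes() {
  local lsn=$1
  local hi=${lsn%%/*}   # hex digits before the slash
  local lo=${lsn##*/}   # hex digits after the slash
  echo $(( (16#$hi << 32) + 16#$lo ))
}

# Bytes between two WAL positions: first LSN minus second LSN.
# Typical use: sent_lsn as the first argument, replay_lsn as the second.
lsn_lag_bytes() {
  local ahead behind
  ahead=$(lsn_to_bytes "$1")
  behind=$(lsn_to_bytes "$2")
  echo $(( ahead - behind ))
}
```

For example, `lsn_lag_bytes 0/3000060 0/3000000` reports how many bytes of WAL the replica has received but not yet replayed. When a live session on the primary is available, PostgreSQL's built-in `pg_wal_lsn_diff(sent_lsn, replay_lsn)` gives the same figure server-side, and is probably the better choice; the shell version is mainly useful when only pasted output is at hand.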