⚠️ Incident Response Checklists for Common Failures

These checklists are meant to standardize responses and reduce stress during infrastructure downtime.


🔌 Node Reboot or Power Loss

  • Verify ZFS pools are imported: zpool status
  • Check all ZFS mounts: mount | grep /mnt
  • Confirm Proxmox VM auto-start behavior
  • Validate system services: PostgreSQL, Mastodon, MinIO, etc.
  • Run genesis-tools/healthcheck.sh or equivalent
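
A minimal post-reboot healthcheck sketch in the spirit of genesis-tools/healthcheck.sh; the pool, mount point, and service names here are assumptions, so substitute your own:

    #!/usr/bin/env bash
    # Post-reboot healthcheck sketch -- pool, mount, and service names are examples
    set -u

    zpool status -x                          # prints "all pools are healthy" when OK

    # Expected mounts present? (adjust /mnt/tank to your layout)
    mountpoint -q /mnt/tank || echo "MISSING MOUNT: /mnt/tank"

    # Core services running? (service list is an assumption)
    for svc in postgresql minio mastodon-web; do
      systemctl is-active --quiet "$svc" || echo "SERVICE DOWN: $svc"
    done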

🐘 PostgreSQL Database Failure

  • Ping cluster VIP
  • Check replication lag: pg_stat_replication
  • Inspect ClusterControl / Patroni node status
  • Verify HAProxy is routing to the correct primary (see the sketch after this list)
  • If failover occurred, verify application connections
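
A quick sketch for confirming the real primary (node addresses are placeholders; pg_is_in_recovery() returns f on the primary and t on replicas):

    # "f" = primary, "t" = replica -- run against each cluster node
    for node in 10.0.0.11 10.0.0.12; do     # assumption: replace with your node IPs
      echo -n "$node: "
      psql -h "$node" -U postgres -tAc "SELECT pg_is_in_recovery();"
    done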

🌐 Network Drop or Routing Issue

  • Check interface status: ip a, nmcli
  • Ping gateway and internal/external hosts
  • Test inter-VM connectivity
  • Inspect HAProxy or Keepalived logs for failover triggers
  • Validate DNS and NTP services are accessible
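
A quick triage sketch covering the checks above (the external IP and domain are examples):

    ip -br a                               # interface state at a glance
    GW=$(ip route | awk '/^default/ {print $3; exit}')
    ping -c 3 "$GW"                        # gateway reachable?
    ping -c 3 9.9.9.9                      # external reachability (example IP)
    dig +short example.com                 # DNS via the system resolver
    timedatectl | grep -i synchronized     # NTP sync state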

📦 Object Storage Outage (MinIO / rclone)

  • Confirm rclone mounts: mount | grep rclone
  • View VFS cache stats: rclone rc vfs/stats
  • Verify MinIO service and disk health
  • Check cache disk space: df -h
  • Restart rclone mounts if needed
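
A mount-triage sketch; the systemd unit name and cache path are assumptions, and rclone rc only answers if the mount was started with --rc:

    mount | grep rclone                     # is the FUSE mount present?
    rclone rc vfs/stats                     # VFS cache stats (mount must run with --rc)
    df -h /var/cache/rclone                 # assumption: adjust to your cache path
    systemctl restart rclone-media.service  # assumption: your mount's unit name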

🧠 Split Brain in PostgreSQL Cluster (ClusterControl)

Symptoms:

  • Two nodes think they're primary
  • WAL timelines diverge
  • Errors in ClusterControl, or inconsistent data in apps

Immediate Actions:

  • Use pg_controldata to verify cluster state and timeline on both nodes (see the sketch after this list)
  • Temporarily pause failover automation
  • Identify true primary (most recent WAL, longest uptime, etc.)
  • Stop false primary immediately: systemctl stop postgresql
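
A pg_controldata comparison sketch (paths assume a Debian-style layout; replace XX with your PostgreSQL major version):

    # Run on BOTH nodes: the true primary should report "in production"
    # on the highest timeline
    /usr/lib/postgresql/XX/bin/pg_controldata /var/lib/postgresql/XX/main \
      | grep -E 'cluster state|TimeLineID'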

Fix the Broken Replica:

  • Rebuild the demoted node (pg_basebackup requires an empty target directory, so stop PostgreSQL and move the old data directory aside first; a full sketch follows below):
    pg_basebackup -h <true-primary> -D /var/lib/postgresql/XX/main -U replication -P --wal-method=stream

  • Restart replication and confirm sync
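
A fuller rebuild sketch for the demoted node, under the same path and version assumptions as above:

    systemctl stop postgresql
    mv /var/lib/postgresql/XX/main /var/lib/postgresql/XX/main.split-$(date +%F)
    sudo -u postgres pg_basebackup -h <true-primary> -D /var/lib/postgresql/XX/main \
      -U replication -P --wal-method=stream -R   # -R writes the replication config
    systemctl start postgresql
    sudo -u postgres psql -tAc "SELECT pg_is_in_recovery();"   # expect "t" here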

Post-Mortem:

  • Audit any split writes for data integrity
  • Review Keepalived/HAProxy fencing logic
  • Add dual-primary alerts with pg_is_in_recovery() checks (see the sketch after this list)
  • Document findings and update HA policies
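
A hypothetical dual-primary alert, runnable from cron or a monitoring host (the node IPs and the alert hook are assumptions):

    #!/usr/bin/env bash
    # Counts nodes that report "not in recovery"; more than one means split brain
    NODES="10.0.0.11 10.0.0.12"     # assumption: replace with your node IPs
    primaries=0
    for n in $NODES; do
      r=$(psql -h "$n" -U postgres -tAc "SELECT pg_is_in_recovery();" 2>/dev/null)
      [ "$r" = "f" ] && primaries=$((primaries + 1))
    done
    if [ "$primaries" -gt 1 ]; then
      echo "ALERT: $primaries nodes claim the primary role" >&2   # wire to your alerting
    fi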

🐘 PostgreSQL Replication Lag / Sync Delay

  • Query replication status:
    SELECT client_addr, state, sync_state, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;
    
  • Compare LSNs to gauge lag (see the lag query after this list)
  • Check for disk I/O, CPU, or network bottlenecks
  • Ensure WAL retention (e.g. replication slots or wal_keep_size) and WAL streaming are healthy
  • Restart replica or sync service if needed
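
To turn those LSNs into a concrete number, pg_wal_lsn_diff() gives replay lag in bytes (PostgreSQL 10+); run this on the primary:

    sudo -u postgres psql -c "
      SELECT client_addr,
             pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
      FROM pg_stat_replication;"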

🪦 MinIO Bucket Inaccessibility or Failure

  • Run mc admin info local to check node status
  • Confirm MinIO access credentials/environment
  • Check rclone and MinIO logs
  • Restart MinIO service: systemctl restart minio
  • Check storage backend health/mounts
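
A combined triage sketch; the alias local matches the checklist above, but the data path is an assumption:

    mc admin info local               # node, drive, and healing status
    journalctl -u minio -n 100        # recent service logs
    df -h /mnt/minio-data             # assumption: adjust to your data path
    systemctl restart minio           # last resort, after the checks above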

🐳 Dockerized Service Crash (e.g., AzuraCast)

  • Inspect containers: docker ps -a
  • View logs: docker logs <container>
  • Check disk space: df -h
  • Restart with Docker or Compose:
    docker restart <container>
    docker-compose down && docker-compose up -d
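
A short crash-triage sketch that also checks the exit code and OOM flag (<container> is your service name, e.g. the AzuraCast container):

    docker ps -a --filter status=exited        # what died, and when
    docker logs --tail 200 <container>         # last output before the crash
    docker inspect -f '{{.State.ExitCode}} OOM={{.State.OOMKilled}}' <container>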
    

🔒 Fail2Ban or Genesis Shield Alert Triggered

  • Tail logs:
    journalctl -u fail2ban
    tail -f /var/log/fail2ban.log
    
  • Inspect logs for false positives
  • Unban IP if needed:
    fail2ban-client set <jail> unbanip <ip>
    
  • Notify via Mastodon/Telegram alert system (see the unban-and-notify sketch below)
  • Tune jail thresholds or IP exemptions
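
A hypothetical unban-and-notify helper; the jail and IP are examples, and BOT_TOKEN/CHAT_ID assume a Telegram bot is your alert channel:

    JAIL="sshd"; IP="203.0.113.7"    # examples -- substitute the real jail and IP
    fail2ban-client set "$JAIL" unbanip "$IP"
    curl -s "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
      -d chat_id="${CHAT_ID}" \
      -d text="fail2ban: unbanned ${IP} from jail ${JAIL} after manual review"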

Store these in a Gitea wiki or /root/checklists/ for quick access under pressure.