Alert response
Audience
PLANA staff on-call. Customer-facing alerts (from BOS) are handled separately — see BOS → Alerts.
Platform alerts fire from Prometheus + Alertmanager into the #alerts Matrix room. PLANA does not page over email or SMS. This page covers the runbooks for the most common alerts and the general triage process.
Where alerts land
| Channel | Where |
|---|---|
| Matrix room | #alerts on matrix.planapulse.com |
| Optional pager | PagerDuty for Enterprise tier on-call (P0 only) |
All staff are in #alerts by default. Critical alerts auto-mention the on-call.
The triage process
When an alert fires:
- Acknowledge in
#alerts— type "ack" with the alert ID - Investigate — open Grafana, the alert's "Runbook" link, the relevant Loki query
- Communicate — if the alert affects customers, post to the workspace's Matrix room ("we're aware, investigating")
- Mitigate — apply the immediate fix
- Resolve — let the alert auto-resolve (Prometheus stops firing) OR manually resolve in Alertmanager if the alert is wrong
- Postmortem — for P0 / P1, write a postmortem within 48h (see Incident postmortems)
Top alert runbooks
HighPodRestartRate
Pod is restarting more than expected.
Diagnose:
kubectl -n <namespace> get pods -l <selector>
kubectl -n <namespace> logs <pod> --previous --tail=200Common causes + fixes:
- OOM kill → bump the pod's memory limit
- CrashLoopBackOff on startup → image regression; roll back via
kubectl rollout undo - Liveness probe failing → check the application's
/healthz
TenantOdooDown
A tenant's worker pool reports unhealthy.
Diagnose:
kubectl -n plana-odoo-18 get pods -l app=worker-odoo
kubectl -n plana-odoo-18 logs deploy/worker-odoo --tail=200 | grep ERROR
curl -sI https://<tenant>.planapulse.app/web/healthCommon causes + fixes:
- All replicas crashloop → roll back the image; see also recent TenantUpgrade activity
- Specific tenant DB issue → check pg01 for the tenant DB
- Filestore mount failure → check NFS
CertExpiringSoon
A Let's Encrypt cert is within 7 days of expiry without renewal succeeding.
Diagnose:
kubectl -n envoy-gateway-system describe certificate <name>Common causes + fixes:
- DNS-01 challenge failing → check Exoscale DNS API token
- Rate limit hit at Let's Encrypt → wait it out; consider staging cert for tests
BackupJobFailed
A nightly tenant backup failed.
Diagnose:
kubectl -n backup get cronjob | grep <tenant>
kubectl -n backup logs job/<failed-job-name> --tail=200Common causes + fixes:
- SOS upload error → SOS credentials may have rotated; check secret
- pg01 connectivity → check pg01 health
- Disk full on holder pod → SOS write timeout
IngressLBUnhealthy
The Exoscale NLB reports unhealthy.
Diagnose: Exoscale console → Load Balancers → plana-pulse-eg-lb → instances.
Common causes:
- A node is rebooting or being replaced — usually self-heals
- Envoy Gateway pods crashed — restart via Flux suspend/resume
CrowdSecOrCorazaUnreachable
The fail-closed posture means an extauthz outage breaks all traffic.
Diagnose:
kubectl -n crowdsec get pods
kubectl -n envoy-gateway-system logs deploy/envoy --tail=100 | grep extauthzFix: Restart the affected component. The flag stays fail-closed (do not flip to fail-open even temporarily — that mistake was made in 2026-05-13).
P0 vs P1 vs P2
| Severity | Definition | Response time |
|---|---|---|
| P0 | Customer-facing outage; production down | 15 min ack + active work |
| P1 | Degraded service; some customers affected | 1 hour ack |
| P2 | Non-customer issue; internal tools affected | Next business day |
| P3 | Informational; no immediate action | Triage during business hours |
Where to read more
- Incident postmortems — the index
- Restoring from backup — when an alert leads to a restore
- Flux GitOps — how fixes deploy
- Source:
infra/k8s/monitoring/alerts/