Alert response

Audience

PLANA staff on-call. Customer-facing alerts (from BOS) are handled separately — see BOS → Alerts.

Platform alerts fire from Prometheus + Alertmanager into the #alerts Matrix room. PLANA does not page over email or SMS. This page covers the runbooks for the most common alerts and the general triage process.

Where alerts land

| Channel | Where |
| --- | --- |
| Matrix room | `#alerts` on `matrix.planapulse.com` |
| Optional pager | PagerDuty for Enterprise tier on-call (P0 only) |

All staff are in #alerts by default. Critical alerts auto-mention the on-call.

The triage process

When an alert fires:

1. Acknowledge in #alerts: type "ack" with the alert ID.
2. Investigate: open Grafana, the alert's "Runbook" link, and the relevant Loki query.
3. Communicate: if the alert affects customers, post to the workspace's Matrix room ("we're aware, investigating").
4. Mitigate: apply the immediate fix.
5. Resolve: let the alert auto-resolve once Prometheus stops firing it, or resolve it manually in Alertmanager if the alert itself is wrong.
6. Postmortem: for P0 / P1, write a postmortem within 48 hours (see Incident postmortems).

Top alert runbooks

`HighPodRestartRate`

A pod is restarting more often than expected.

Diagnose:

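```bash
# List the pods behind the alert, including their restart counts
kubectl -n <namespace> get pods -l <selector>
# Logs from the previous (crashed) container instance
kubectl -n <namespace> logs <pod> --previous --tail=200
```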

Common causes + fixes:

OOM kill → bump the pod's memory limit
CrashLoopBackOff on startup → likely an image regression; roll back via `kubectl rollout undo`
Liveness probe failing → check the application's `/healthz` endpoint
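To separate an OOM kill from other crashes quickly, the container's last termination reason can be read from the pod status. The sketch below is illustrative only: `last_termination_reason` and `runbook_hint` are hypothetical helper names, only the first container of the pod is inspected, and the hints simply restate the causes listed above.

```shell
# Hypothetical helper: print the last termination reason of a pod's first
# container (for example "OOMKilled" or "Error"). Prints nothing if the
# container has no recorded previous termination.
# Arguments: $1 = namespace, $2 = pod name
last_termination_reason() {
  kubectl -n "$1" get pod "$2" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
}

# Hypothetical helper: map a termination reason to a step in this runbook.
runbook_hint() {
  case "$1" in
    OOMKilled) echo "OOM kill: bump the pod's memory limit" ;;
    Error)     echo "Non-zero exit: read --previous logs; roll back if the image is new; check /healthz" ;;
    "")        echo "No previous termination recorded: check events with kubectl describe pod" ;;
    *)         echo "Unmapped reason '$1': check events with kubectl describe pod" ;;
  esac
}
```

A possible invocation is `runbook_hint "$(last_termination_reason <namespace> <pod>)"`. The mapping is deliberately coarse; it points at a section of this runbook and does not replace reading the logs.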

`TenantOdooDown`

A tenant's worker pool reports unhealthy.

Diagnose:

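```bash
# Worker pod status
kubectl -n plana-odoo-18 get pods -l app=worker-odoo
# Recent errors (kubectl picks one pod of the deployment)
kubectl -n plana-odoo-18 logs deploy/worker-odoo --tail=200 | grep ERROR
# Response headers from the tenant's health endpoint
curl -sI https://<tenant>.planapulse.app/web/health
```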

Common causes + fixes:

All replicas in CrashLoopBackOff → roll back the image; also check recent TenantUpgrade activity
Specific tenant DB issue → check pg01 for the tenant DB
Filestore mount failure → check NFS
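When several tenants need checking, it can help to reduce the health probe to a status code and a first triage direction. This is a minimal sketch under assumptions: `tenant_health_code` and `health_verdict` are hypothetical names, the probe uses a GET request with a 10-second timeout, and the verdict text only points back at the causes listed above.

```shell
# Hypothetical helper: print only the HTTP status code of a tenant's health
# endpoint. curl prints 000 when no HTTP response was received at all.
# Argument: $1 = tenant subdomain
tenant_health_code() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' \
    "https://$1.planapulse.app/web/health"
}

# Hypothetical helper: turn a status code into a first triage direction.
health_verdict() {
  case "$1" in
    2??) echo "healthy" ;;
    000) echo "no response: the worker pool or the path to it is unreachable" ;;
    5??) echo "server error: check worker-odoo logs, pg01, and the filestore mount" ;;
    *)   echo "unexpected status $1: inspect the response manually" ;;
  esac
}
```

For example, `health_verdict "$(tenant_health_code <tenant>)"` prints a one-line verdict for a single tenant.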

`CertExpiringSoon`

A Let's Encrypt certificate is within 7 days of expiry and renewal has not succeeded.

Diagnose:

```bash
# Shows the certificate's status conditions and recent events
kubectl -n envoy-gateway-system describe certificate <name>
```

Common causes + fixes:

DNS-01 challenge failing → check Exoscale DNS API token
Rate limit hit at Let's Encrypt → wait it out; use the Let's Encrypt staging environment for tests
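To confirm how much time is actually left, the expiry date can be compared against the alert's 7-day window. The sketch below assumes the certificate has already been exported to a PEM file, and that GNU `date` and the `openssl` CLI are available; all three function names are hypothetical.

```shell
# Whole days between two Unix timestamps, truncated toward zero.
# Arguments: $1 = now, $2 = expiry
days_left() {
  echo $(( ($2 - $1) / 86400 ))
}

# Succeeds when the expiry falls inside the 7-day window (604800 seconds).
# Arguments: $1 = now, $2 = expiry
expiring_soon() {
  [ $(( $2 - $1 )) -lt 604800 ]
}

# Hypothetical helper: expiry of a PEM certificate file as a Unix timestamp.
# Assumes GNU date (for -d) and the openssl CLI.
# Argument: $1 = path to the PEM file
cert_expiry_epoch() {
  date -d "$(openssl x509 -noout -enddate -in "$1" | cut -d= -f2)" +%s
}
```

A possible invocation is `days_left "$(date +%s)" "$(cert_expiry_epoch tls.crt)"`, where `tls.crt` stands for whatever file the certificate was exported to.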

`BackupJobFailed`

A nightly tenant backup failed.

Diagnose:

```bash
# Find the tenant's backup CronJob
kubectl -n backup get cronjob | grep <tenant>
# Logs from the failed Job run
kubectl -n backup logs job/<failed-job-name> --tail=200
```

Common causes + fixes:

SOS upload error → SOS credentials may have rotated; check secret
pg01 connectivity → check pg01 health
Disk full on holder pod → appears as an SOS write timeout; check the pod's disk usage

`IngressLBUnhealthy`

The Exoscale NLB reports unhealthy.

Diagnose: Exoscale console → Load Balancers → plana-pulse-eg-lb → instances.

Common causes:

A node is rebooting or being replaced — usually self-heals
Envoy Gateway pods crashed — restart via Flux suspend/resume

`CrowdSecOrCorazaUnreachable`

CrowdSec or Coraza is unreachable. Because the posture is fail-closed, an extauthz outage breaks all traffic.

Diagnose:

```bash
# CrowdSec pod status
kubectl -n crowdsec get pods
# Recent extauthz messages from Envoy
kubectl -n envoy-gateway-system logs deploy/envoy --tail=100 | grep extauthz
```

Fix: Restart the affected component. Keep the flag fail-closed: do not flip it to fail-open, even temporarily (that mistake was made on 2026-05-13).

P0 vs P1 vs P2 vs P3

| Severity | Definition | Response time |
| --- | --- | --- |
| P0 | Customer-facing outage; production down | 15 min ack + active work |
| P1 | Degraded service; some customers affected | 1 hour ack |
| P2 | Non-customer issue; internal tools affected | Next business day |
| P3 | Informational; no immediate action | Triage during business hours |

Where to read more

Incident postmortems — the index
Restoring from backup — when an alert leads to a restore
Flux GitOps — how fixes deploy
Source: infra/k8s/monitoring/alerts/
