Skip to content

Alert response

Audience

PLANA staff on-call. Customer-facing alerts (from BOS) are handled separately — see BOS → Alerts.

Platform alerts fire from Prometheus + Alertmanager into the #alerts Matrix room. PLANA does not page over email or SMS. This page covers the runbooks for the most common alerts and the general triage process.

Where alerts land

ChannelWhere
Matrix room#alerts on matrix.planapulse.com
Optional pagerPagerDuty for Enterprise tier on-call (P0 only)

All staff are in #alerts by default. Critical alerts auto-mention the on-call.

The triage process

When an alert fires:

  1. Acknowledge in #alerts — type "ack" with the alert ID
  2. Investigate — open Grafana, the alert's "Runbook" link, the relevant Loki query
  3. Communicate — if the alert affects customers, post to the workspace's Matrix room ("we're aware, investigating")
  4. Mitigate — apply the immediate fix
  5. Resolve — let the alert auto-resolve (Prometheus stops firing) OR manually resolve in Alertmanager if the alert is wrong
  6. Postmortem — for P0 / P1, write a postmortem within 48h (see Incident postmortems)

Top alert runbooks

HighPodRestartRate

Pod is restarting more than expected.

Diagnose:

bash
kubectl -n <namespace> get pods -l <selector>
kubectl -n <namespace> logs <pod> --previous --tail=200

Common causes + fixes:

  • OOM kill → bump the pod's memory limit
  • CrashLoopBackOff on startup → image regression; roll back via kubectl rollout undo
  • Liveness probe failing → check the application's /healthz

TenantOdooDown

A tenant's worker pool reports unhealthy.

Diagnose:

bash
kubectl -n plana-odoo-18 get pods -l app=worker-odoo
kubectl -n plana-odoo-18 logs deploy/worker-odoo --tail=200 | grep ERROR
curl -sI https://<tenant>.planapulse.app/web/health

Common causes + fixes:

  • All replicas crashloop → roll back the image; see also recent TenantUpgrade activity
  • Specific tenant DB issue → check pg01 for the tenant DB
  • Filestore mount failure → check NFS

CertExpiringSoon

A Let's Encrypt cert is within 7 days of expiry without renewal succeeding.

Diagnose:

bash
kubectl -n envoy-gateway-system describe certificate <name>

Common causes + fixes:

  • DNS-01 challenge failing → check Exoscale DNS API token
  • Rate limit hit at Let's Encrypt → wait it out; consider staging cert for tests

BackupJobFailed

A nightly tenant backup failed.

Diagnose:

bash
kubectl -n backup get cronjob | grep <tenant>
kubectl -n backup logs job/<failed-job-name> --tail=200

Common causes + fixes:

  • SOS upload error → SOS credentials may have rotated; check secret
  • pg01 connectivity → check pg01 health
  • Disk full on holder pod → SOS write timeout

IngressLBUnhealthy

The Exoscale NLB reports unhealthy.

Diagnose: Exoscale console → Load Balancers → plana-pulse-eg-lb → instances.

Common causes:

  • A node is rebooting or being replaced — usually self-heals
  • Envoy Gateway pods crashed — restart via Flux suspend/resume

CrowdSecOrCorazaUnreachable

The fail-closed posture means an extauthz outage breaks all traffic.

Diagnose:

bash
kubectl -n crowdsec get pods
kubectl -n envoy-gateway-system logs deploy/envoy --tail=100 | grep extauthz

Fix: Restart the affected component. The flag stays fail-closed (do not flip to fail-open even temporarily — that mistake was made in 2026-05-13).

P0 vs P1 vs P2

SeverityDefinitionResponse time
P0Customer-facing outage; production down15 min ack + active work
P1Degraded service; some customers affected1 hour ack
P2Non-customer issue; internal tools affectedNext business day
P3Informational; no immediate actionTriage during business hours

Where to read more

© PLANA Digital Ltd.