Incident postmortems

Audience

PLANA staff. This page is the index of postmortems for production incidents. They are internal, but can be shared with affected customers on request.

PLANA writes a postmortem for every P0 / P1 incident within 48 hours of resolution. Postmortems are blameless, focus on system causes rather than individual errors, and end with concrete action items.

The template

Every postmortem has these sections:

Headline — one-line summary of what broke
Timeline — minute-by-minute log: detection, escalation, fix, resolution
Impact — which customers were affected; what they saw; what they couldn't do
Root cause — the underlying technical or process cause
Contributing factors — what made the cause more likely or more severe
What went well — what we did right (tools, people, process)
What went poorly — what slowed us down or what we missed
Action items — concrete follow-ups with owners and dates
Lessons learned — patterns to apply elsewhere

Postmortems are stored in infra/docs/postmortems/<YYYY-MM-DD>-<slug>.md. They are internal, but can be shared with a customer on request when that customer's workspace was affected.
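As a rough illustration of the naming convention and template above, the following Python sketch builds a postmortem path and an empty skeleton. The slug rule (lowercase, hyphen-separated), the use of the incident date in the filename, and the function names are assumptions for illustration, not an existing PLANA tool.

```python
import re
from datetime import date

# Section names taken from the template list above.
SECTIONS = [
    "Headline", "Timeline", "Impact", "Root cause", "Contributing factors",
    "What went well", "What went poorly", "Action items", "Lessons learned",
]

def slugify(title: str) -> str:
    """Lowercase the title and join alphanumeric runs with hyphens (assumed slug rule)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def postmortem_path(incident_date: date, title: str) -> str:
    """Build infra/docs/postmortems/<YYYY-MM-DD>-<slug>.md for an incident."""
    return f"infra/docs/postmortems/{incident_date.isoformat()}-{slugify(title)}.md"

def skeleton(title: str) -> str:
    """Return an empty postmortem with every template section and a TODO placeholder."""
    lines = [title, ""]
    for section in SECTIONS:
        lines += [section, "", "TODO", ""]
    return "\n".join(lines)

if __name__ == "__main__":
    print(postmortem_path(date(2026, 5, 20), "Forgejo SSH outage"))
    print(skeleton("Forgejo SSH outage"))
```

Generating the skeleton from a single list keeps the section order identical across postmortems, which makes them easier to compare.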

Recent postmortems

2026-05-14 — Wave 5 cleanup wiped 10 ClusterRoleBindings (P0)

Headline: kubectl delete --all against a namespace also caught cluster-scoped resources, removing 10 ClusterRoleBindings that the SKS konnectivity agent depended on. Konnectivity outage lasted 14 hours.

Root cause: kubectl silently ignores -n <namespace> for cluster-scoped resources. The cleanup script was meant to be scoped to a single namespace, but for these resources the flag had no effect.

Action items:

✅ Drift checker added (catches this in <1h, not 14h)
✅ RBAC source-of-truth in infra/k8s/rbac-system/ with Flux reconciliation
✅ Linter on kubectl delete --all to flag cluster-scoped resources

Lessons: Never bundle cluster-scoped and namespaced resources in a single --all operation.
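The kind of check the linter performs can be sketched in Python as below. This is not the actual linter: the function name, the hard-coded set of cluster-scoped kinds, and the example commands are illustrative assumptions. A real check would more likely take the list from kubectl api-resources --namespaced=false.

```python
import shlex

# Illustrative subset of cluster-scoped kinds and their short names (assumption:
# a real linter would load this from `kubectl api-resources --namespaced=false`).
CLUSTER_SCOPED = {
    "clusterrolebinding", "clusterrolebindings",
    "clusterrole", "clusterroles",
    "namespace", "namespaces", "ns",
    "node", "nodes", "no",
    "persistentvolume", "persistentvolumes", "pv",
    "storageclass", "storageclasses", "sc",
    "customresourcedefinition", "customresourcedefinitions", "crd", "crds",
}

# Flags whose following token is a value, not a resource type.
FLAGS_WITH_VALUE = {"-n", "--namespace", "-l", "--selector", "--context", "--kubeconfig"}

def cluster_scoped_in_delete_all(command):
    """Return the cluster-scoped kinds targeted by a `kubectl delete ... --all` command."""
    tokens = shlex.split(command)
    if "delete" not in tokens or "--all" not in tokens:
        return []
    positional = []
    skip_next = False
    for tok in tokens[tokens.index("delete") + 1:]:
        if skip_next:
            skip_next = False
        elif tok in FLAGS_WITH_VALUE:
            skip_next = True
        elif not tok.startswith("-"):
            positional.append(tok)
    if not positional:
        return []
    # The first positional argument is a comma-separated list of resource types;
    # drop any API group suffix such as ".rbac.authorization.k8s.io".
    kinds = [kind.split(".")[0].lower() for kind in positional[0].split(",")]
    return [kind for kind in kinds if kind in CLUSTER_SCOPED]

if __name__ == "__main__":
    # Hypothetical command, not the one from the incident.
    cmd = "kubectl delete pods,clusterrolebindings --all -n example-namespace"
    print(cluster_scoped_in_delete_all(cmd))
```

A non-empty result means the -n flag will not protect those kinds, so the command should be split or rejected.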

2026-05-20 — Forgejo SSH outage during gateway adoption (P1)

Headline: Flux adoption of the gateway exposed a latent bug in the Forgejo SSH TCP listener config. SSH push broke for ~2 hours.

Root cause: The TCP listener for git.planapulse.com:22 was configured against a stale Service ClusterIP. The NLB does not hairpin, so in-cluster CI runners couldn't push.

Action items:

✅ TCP listener now configured against the in-cluster Service URL
✅ Hairpin-NAT note added to documentation (under Architecture → Kubernetes)

2026-05-13 — saas-orchestrator namespace deleted (Wave 5)

Planned change; ran without incident. Not a postmortem per se, but the follow-on 2026-05-14 incident traces back to this cleanup's --all flag.

2026-04-26 — Penpot + ai-marketing crashloop noise (P2)

Headline: Two services in crashloop generated alert fatigue. HighPodRestartRate fired ~50 times in one day.

Root cause (Penpot): The memory limit was too low after the Authentik upgrade.

Root cause (ai-marketing): Missing env var caused startup failure.

Action items:

✅ Bumped Penpot memory limit
✅ ai-marketing startup hardened with explicit env validation
✅ HighPodRestartRate rule tuned (require 5 restarts in 30min, not 1 in 5min)
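To make the effect of the retuned threshold concrete, here is a small Python sketch of a sliding-window check. It is not the alerting rule itself. It assumes "5 restarts in 30min" means at least 5 restarts within the trailing 30 minutes, and the function name and timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

def should_fire(restart_times, now, threshold, window):
    """Fire when at least `threshold` restarts fall inside the trailing `window`."""
    recent = [t for t in restart_times if now - window < t <= now]
    return len(recent) >= threshold

if __name__ == "__main__":
    now = datetime(2026, 4, 26, 12, 0)
    # Three restarts in the last 20 minutes (hypothetical data).
    restarts = [now - timedelta(minutes=m) for m in (20, 10, 2)]

    old_rule = should_fire(restarts, now, threshold=1, window=timedelta(minutes=5))
    new_rule = should_fire(restarts, now, threshold=5, window=timedelta(minutes=30))
    print("old rule fires:", old_rule)
    print("new rule fires:", new_rule)
```

With the old settings a single recent restart is enough to alert. The new settings stay quiet until restarts accumulate, which reduces repeat firing from a pod that restarts occasionally.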

Postmortem culture

The rules:

Rule	Why
Blameless	Focus on systems, not people. We hire smart people who make mistakes; we want systems that make mistakes hard or recoverable
Within 48h	Memory fades. Capture details while they're fresh
Action items have owners + dates	Otherwise they don't ship
Tracked to closure	Open action items appear in the team's weekly review until done
Shareable with customers	On request; we don't volunteer customer-facing PMs but we never hide them
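The "within 48h" and "tracked to closure" rules can be modelled as in the Python sketch below. The ActionItem fields, the function names, and the sample data are assumptions for illustration. The source does not describe how action items are actually tracked.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta

@dataclass
class ActionItem:
    description: str
    owner: str          # every action item has an owner
    due: date           # ...and a date
    done: bool = False

def postmortem_deadline(resolved_at: datetime) -> datetime:
    """A postmortem is due within 48 hours of incident resolution."""
    return resolved_at + timedelta(hours=48)

def open_action_items(items):
    """Items that still belong in the weekly review, earliest due date first."""
    return sorted((i for i in items if not i.done), key=lambda i: i.due)

if __name__ == "__main__":
    # Hypothetical action items.
    items = [
        ActionItem("Add drift check", "alice", date(2026, 6, 1)),
        ActionItem("Document hairpin behaviour", "bob", date(2026, 5, 25)),
        ActionItem("Tune alert rule", "carol", date(2026, 5, 20), done=True),
    ]
    print(postmortem_deadline(datetime(2026, 5, 20, 10, 0)))
    for item in open_action_items(items):
        print(item.due, item.owner, item.description)
```

Requiring owner and due as fields without defaults means an action item cannot be created without them, which mirrors the rule in the table.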

Where to read more

Alert response — the runbooks before the incident
Restoring from backup — common recovery step
Source: infra/docs/postmortems/ (internal)
