Incident postmortems
Audience
PLANA staff. The index of public-internal postmortems for production incidents.
PLANA writes a postmortem for every P0 / P1 incident within 48 hours of resolution. Postmortems are blameless, focus on system causes rather than individual errors, and end with concrete action items.
The template
Every postmortem has these sections:
- Headline — one-line summary of what broke
- Timeline — minute-by-minute log: detection, escalation, fix, resolution
- Impact — which customers were affected; what they saw; what they couldn't do
- Root cause — the underlying technical or process cause
- Contributing factors — what made the cause more likely or more severe
- What went well — what we did right (tools, people, process)
- What went poorly — what slowed us down or missed it
- Action items — concrete follow-ups with owners and dates
- Lessons learned — patterns to apply elsewhere
Stored in infra/docs/postmortems/<YYYY-MM-DD>-<slug>.md. Internal but shareable with customers on request when their workspace was affected.
Recent postmortems
2026-05-14 — Wave 5 cleanup wiped 10 ClusterRoleBindings (P0)
Headline: kubectl delete --all against a namespace also caught cluster-scoped resources, removing 10 ClusterRoleBindings that the SKS konnectivity agent depended on. Konnectivity outage lasted 14 hours.
Root cause: -n <namespace> is silently ignored on cluster-scoped resources. The cleanup script intended to scope to the namespace; the flag had no effect.
Action items:
- ✅ Drift checker added (catches this in <1h, not 14h)
- ✅ RBAC source-of-truth in
infra/k8s/rbac-system/with Flux reconciliation - ✅ Linter on
kubectl delete --allto flag cluster-scoped resources
Lessons: Never bundle cluster-scoped and namespaced resources in a single --all operation.
2026-05-20 — Forgejo SSH outage during gateway adoption (P1)
Headline: Flux adoption of the gateway exposed a latent bug in the Forgejo SSH TCP listener config. SSH push broke for ~2 hours.
Root cause: The TCP listener for git.planapulse.com:22 was configured against a stale Service ClusterIP. The NLB does not hairpin, so in-cluster CI runners couldn't push.
Action items:
- ✅ TCP listener now configured against the in-cluster Service URL
- ✅ Hairpin-NAT note added to documentation (under Architecture → Kubernetes)
2026-05-13 — saas-orchestrator namespace deleted (Wave 5)
Planned change; ran without incident. Not a postmortem per se, but the follow-on 2026-05-14 incident traces back to this cleanup's --all flag.
2026-04-26 — Penpot + ai-marketing crashloop noise (P2)
Headline: Two services in crashloop generated alert fatigue. HighPodRestartRate fired ~50 times in one day.
Root cause (Penpot): Memory limit too low post Authentik upgrade.
Root cause (ai-marketing): Missing env var caused startup failure.
Action items:
- ✅ Bumped Penpot memory limit
- ✅ ai-marketing startup hardened with explicit env validation
- ✅ HighPodRestartRate rule tuned (require 5 restarts in 30min, not 1 in 5min)
Postmortem culture
The rules:
| Rule | Why |
|---|---|
| Blameless | Focus on systems, not people. We hire smart people who make mistakes; we want systems that make mistakes hard or recoverable |
| Within 48h | Memory fades. Capture details while they're fresh |
| Action items have owners + dates | Otherwise they don't ship |
| Tracked to closure | Open action items appear in the team's weekly review until done |
| Shareable with customers | On request; we don't volunteer customer-facing PMs but we never hide them |
Where to read more
- Alert response — the runbooks before the incident
- Restoring from backup — common recovery step
- Source:
infra/docs/postmortems/(internal)