Skip to content

Incident postmortems

Audience

PLANA staff. The index of public-internal postmortems for production incidents.

PLANA writes a postmortem for every P0 / P1 incident within 48 hours of resolution. Postmortems are blameless, focus on system causes rather than individual errors, and end with concrete action items.

The template

Every postmortem has these sections:

  1. Headline — one-line summary of what broke
  2. Timeline — minute-by-minute log: detection, escalation, fix, resolution
  3. Impact — which customers were affected; what they saw; what they couldn't do
  4. Root cause — the underlying technical or process cause
  5. Contributing factors — what made the cause more likely or more severe
  6. What went well — what we did right (tools, people, process)
  7. What went poorly — what slowed us down or missed it
  8. Action items — concrete follow-ups with owners and dates
  9. Lessons learned — patterns to apply elsewhere

Stored in infra/docs/postmortems/<YYYY-MM-DD>-<slug>.md. Internal but shareable with customers on request when their workspace was affected.

Recent postmortems

2026-05-14 — Wave 5 cleanup wiped 10 ClusterRoleBindings (P0)

Headline: kubectl delete --all against a namespace also caught cluster-scoped resources, removing 10 ClusterRoleBindings that the SKS konnectivity agent depended on. Konnectivity outage lasted 14 hours.

Root cause: -n <namespace> is silently ignored on cluster-scoped resources. The cleanup script intended to scope to the namespace; the flag had no effect.

Action items:

  • ✅ Drift checker added (catches this in <1h, not 14h)
  • ✅ RBAC source-of-truth in infra/k8s/rbac-system/ with Flux reconciliation
  • ✅ Linter on kubectl delete --all to flag cluster-scoped resources

Lessons: Never bundle cluster-scoped and namespaced resources in a single --all operation.

2026-05-20 — Forgejo SSH outage during gateway adoption (P1)

Headline: Flux adoption of the gateway exposed a latent bug in the Forgejo SSH TCP listener config. SSH push broke for ~2 hours.

Root cause: The TCP listener for git.planapulse.com:22 was configured against a stale Service ClusterIP. The NLB does not hairpin, so in-cluster CI runners couldn't push.

Action items:

  • ✅ TCP listener now configured against the in-cluster Service URL
  • ✅ Hairpin-NAT note added to documentation (under Architecture → Kubernetes)

2026-05-13 — saas-orchestrator namespace deleted (Wave 5)

Planned change; ran without incident. Not a postmortem per se, but the follow-on 2026-05-14 incident traces back to this cleanup's --all flag.

2026-04-26 — Penpot + ai-marketing crashloop noise (P2)

Headline: Two services in crashloop generated alert fatigue. HighPodRestartRate fired ~50 times in one day.

Root cause (Penpot): Memory limit too low post Authentik upgrade.

Root cause (ai-marketing): Missing env var caused startup failure.

Action items:

  • ✅ Bumped Penpot memory limit
  • ✅ ai-marketing startup hardened with explicit env validation
  • ✅ HighPodRestartRate rule tuned (require 5 restarts in 30min, not 1 in 5min)

Postmortem culture

The rules:

RuleWhy
BlamelessFocus on systems, not people. We hire smart people who make mistakes; we want systems that make mistakes hard or recoverable
Within 48hMemory fades. Capture details while they're fresh
Action items have owners + datesOtherwise they don't ship
Tracked to closureOpen action items appear in the team's weekly review until done
Shareable with customersOn request; we don't volunteer customer-facing PMs but we never hide them

Where to read more

© PLANA Digital Ltd.