08 — Incident Response Runbook
An incident is any unplanned event that affects participants, clients, staff, data, or our public standing. This runbook tells anyone in the company what to do in the first 60 minutes.
Severity definitions
| Sev | Definition | Examples | Initial response |
|---|---|---|---|
| Sev 1 | Serious harm to a participant, material data breach, or existential business event | Suicidal ideation disclosure mishandled; PII leak; major media story; FCA breach | Page on-call within 15 min; CEO + DSL + DPO notified; comms freeze pending review |
| Sev 2 | Significant disruption to a client or cohort; safeguarding flag requiring same-day action; dashboard outage during a live review | Coach pulled mid-session; safeguarding flag with active risk indicator; >1h platform outage | On-call responds within 1h; HoD + relevant function lead notified; client comms drafted within 4h |
| Sev 3 | Localised issue, single participant or single session impact | Missed 1:1; pulse survey not sent; non-PII tooling glitch | Logged in Linear (incident label); resolved by owner within 5 working days |
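For tooling that needs the table above in machine-readable form (for example a paging bot), the initial-response targets could be captured roughly as follows. This is a minimal sketch: the names `InitialResponse`, `INITIAL_RESPONSE`, and `initial_response` are hypothetical, and only the time targets and notification lists are taken from the table.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class InitialResponse:
    # None means there is no paging target: Sev 3 is logged in Linear instead.
    respond_within: Optional[timedelta]
    notify: Tuple[str, ...]


# Values transcribed from the severity table; the structure is illustrative only.
INITIAL_RESPONSE = {
    1: InitialResponse(timedelta(minutes=15), ("CEO", "DSL", "DPO")),
    2: InitialResponse(timedelta(hours=1), ("HoD", "relevant function lead")),
    3: InitialResponse(None, ()),
}


def initial_response(severity: int) -> InitialResponse:
    """Look up the initial-response target for a severity level (1, 2 or 3)."""
    try:
        return INITIAL_RESPONSE[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```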
On-call
- Weekly rota: one Delivery on-call + one Tech on-call
- DSL is always on-call for safeguarding (no rota — DSL or named deputy)
- DPO is always on-call for data
- Contact details + escalation tree in the on-call doc (linked from the runbook in Notion)
First-60-minutes playbook (Sev 1 / 2)
T+0 Whoever spots it raises the alarm in #incidents and pages on-call
T+5 On-call acknowledges, opens an Incident Doc (template below), declares severity
T+10 Comms freeze: no external statements until the incident lead approves
T+15 Sev 1 → CEO, DSL, DPO, Board chair informed (whichever apply)
T+30 Containment actions in flight; named scribe taking timeline
T+60 Initial client / participant comms drafted (if applicable) and queued for approval
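The offsets in the playbook above can be turned into wall-clock checkpoints once T+0 is known. The sketch below is illustrative: the `CHECKPOINTS` list and the `schedule` function are hypothetical names, and it assumes the T+15 step is skipped for Sev 2 because the playbook marks it as Sev 1 only.

```python
from datetime import datetime, timedelta, timezone

# (offset in minutes, Sev 1 only?, action), transcribed from the playbook.
CHECKPOINTS = [
    (0, False, "Raise the alarm in #incidents and page on-call"),
    (5, False, "On-call acknowledges, opens Incident Doc, declares severity"),
    (10, False, "Comms freeze"),
    (15, True, "Inform CEO, DSL, DPO, Board chair (whichever apply)"),
    (30, False, "Containment actions in flight; scribe taking timeline"),
    (60, False, "Initial client / participant comms drafted and queued"),
]


def schedule(t0: datetime, severity: int) -> list:
    """Return (HH:MM in UTC, action) pairs for a Sev 1 or Sev 2 incident."""
    if severity not in (1, 2):
        raise ValueError("the first-60-minutes playbook covers Sev 1 / 2 only")
    start = t0.astimezone(timezone.utc)
    rows = []
    for offset, sev1_only, action in CHECKPOINTS:
        if sev1_only and severity != 1:
            continue
        due = start + timedelta(minutes=offset)
        rows.append((due.strftime("%H:%M"), action))
    return rows


if __name__ == "__main__":
    t0 = datetime(2024, 3, 1, 9, 50, tzinfo=timezone.utc)
    for when, action in schedule(t0, severity=1):
        print(when, action)
```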
Incident doc template
Incident ID: INC-####
Severity: Sev 1 / 2 / 3
Opened: <UTC timestamp> | Opened by: <name>
Lead: <name> | Scribe: <name>
DSL/DPO/CEO informed at: <timestamps>
Summary (1 paragraph, plain English)
Timeline (UTC)
HH:MM - <event>
HH:MM - <event>
Impact
Participants affected:
Clients affected:
Data affected:
Services affected:
Containment (what we did to stop the bleeding)
Mitigation (what we did to reduce harm)
Communications
Internal: <when, to whom>
Client: <when, channel, approver>
Participants: <when, channel, approver>
Regulators: <when, body, reference>
Public: <when, channel, approver>
Status: Open / Contained / Resolved / Closed
Linked risk(s): RR-### (from §06)
Linked controls: which prevented harm, which failed
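If the template is ever backed by a script rather than filled in by hand, the header fields and the UTC timeline could be modelled roughly like this. It is a sketch under stated assumptions: the class `IncidentDoc` and its methods are hypothetical, `INC-####` is read as exactly four digits, and entries can only be appended, in keeping with the rule that the timeline is not rewritten after the fact.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Tuple

# Assumption: "INC-####" means "INC-" followed by exactly four digits.
INCIDENT_ID = re.compile(r"INC-\d{4}")


@dataclass
class IncidentDoc:
    incident_id: str
    severity: int
    lead: str
    scribe: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not INCIDENT_ID.fullmatch(self.incident_id):
            raise ValueError(f"bad incident ID: {self.incident_id!r}")
        if self.severity not in (1, 2, 3):
            raise ValueError(f"bad severity: {self.severity!r}")

    def log(self, when: datetime, event: str) -> None:
        """Append a timeline entry; there is deliberately no edit or delete."""
        if when.tzinfo is None:
            raise ValueError("timeline timestamps must be timezone-aware")
        self.timeline.append((when.astimezone(timezone.utc), event))

    def render_timeline(self) -> str:
        """Render entries as 'HH:MM - <event>' lines in UTC, oldest first."""
        ordered = sorted(self.timeline, key=lambda entry: entry[0])
        return "\n".join(f"{when:%H:%M} - {event}" for when, event in ordered)
```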
Comms templates
Templates live in the sibling repo path of docs/curriculum/sprint-3-onboarding/04-launch-comms-pack.md. For incidents, use the Incident Comms templates:
- Client holding statement (within 4h)
- Client full update (within 24h)
- Participant safeguarding follow-up (DSL approved, same day)
- Regulator notification (DPO approved; 72h max for personal data breaches)
- Public statement (CEO + Board chair approved; only if material)
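The hour-based windows in the list above are simple offsets from a start time. The sketch below computes them; `COMMS_WINDOWS` and `comms_deadlines` are hypothetical names, and the choice of start time is an assumption, since this runbook does not say when each clock starts. The DPO should confirm the correct starting point for the 72h regulator window.

```python
from datetime import datetime, timedelta, timezone

# Windows transcribed from the list above; "same day" items are not modelled.
COMMS_WINDOWS = {
    "client_holding_statement": timedelta(hours=4),
    "client_full_update": timedelta(hours=24),
    "regulator_notification": timedelta(hours=72),
}


def comms_deadlines(start: datetime) -> dict:
    """Return the latest send time for each hour-based comms template."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    return {name: start_utc + window for name, window in COMMS_WINDOWS.items()}


if __name__ == "__main__":
    opened = datetime(2024, 3, 1, 9, 0, tzinfo=timezone.utc)
    for name, due in comms_deadlines(opened).items():
        print(f"{name}: {due:%Y-%m-%d %H:%M} UTC")
```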
Post-incident review (PIR)
A PIR is held within 5 working days of an incident being Resolved.
- Chair: COO (or CEO for Sev 1)
- Attendees: incident lead, scribe, owners of affected functions, DSL/DPO when in scope
- Output: blameless write-up published to internal Notion within 10 working days, with:
  - Timeline
  - What went well
  - What went badly
  - Root cause (5-whys; no "human error" as a root cause)
  - Actions with owners + due dates
  - Risk register write-back: which RR-### was this, did controls work, what changes to §06 / §07
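Working-day deadlines are easy to miscount across a weekend, so a small helper can do the arithmetic. This is a minimal sketch: `add_working_days` is a hypothetical name, it treats Monday to Friday as working days and ignores public holidays, and it assumes both the 5-day and 10-day windows are counted from the date the incident was marked Resolved.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Return the date `days` working days after `start` (Mon-Fri only).

    Assumption: public holidays are not excluded.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


if __name__ == "__main__":
    resolved = date(2024, 3, 1)  # a Friday
    print("PIR held by:     ", add_working_days(resolved, 5))
    print("Write-up due by: ", add_working_days(resolved, 10))
```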
What we will not do
- We will not blame an individual in a PIR write-up.
- We will not delete or rewrite the timeline after the fact.
- We will not issue a public statement without CEO + Board chair sign-off.
- We will not close an incident with open actions.
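The last rule above lends itself to an automated check in whatever tool tracks incidents. The sketch below shows one way such a guard could look; the `Action` and `Incident` classes are hypothetical and are not tied to Linear or Notion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    description: str
    owner: str
    done: bool = False


@dataclass
class Incident:
    incident_id: str
    status: str = "Open"  # Open / Contained / Resolved / Closed
    actions: List[Action] = field(default_factory=list)

    def close(self) -> None:
        """Move to Closed, refusing while any action is still open."""
        open_actions = [a for a in self.actions if not a.done]
        if open_actions:
            owners = ", ".join(sorted({a.owner for a in open_actions}))
            raise RuntimeError(
                f"{self.incident_id} has {len(open_actions)} open action(s); "
                f"owners: {owners}"
            )
        self.status = "Closed"
```

Raising an error rather than silently ignoring the request keeps the refusal visible to whoever tried to close the incident.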
