Runbooks
Overview

Operational playbooks for PAGE alerts (24/7 escalation, master spec §II.6).

Goal of a runbook: within the 15-minute ack window, the on-call engineer knows what to do without having to know the architecture.

Contents

| File | Alert symptom | Severity |
|---|---|---|
| health_down.md | /health returns non-200 OR no response | PAGE |
| database_unreachable.md | Neon Postgres connection errors | PAGE |
| liveblocks_outage.md | Liveblocks API errors > 5% OR connection refused | PAGE |
| webhook_signature_spike.md | 401 on /webhooks/github > 10/min (potential attack) | PAGE |
| url_import_ssrf_spike.md | SSRF guard rejections > 10/min | INVESTIGATE |
| url_import_deployment.md | (operational guide, not an alert) | OPERATIONAL |
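
To make the numeric thresholds in the table above concrete, here is a minimal sketch of how they could be evaluated. It is an illustration only, not the actual alerting configuration: the `Window` dataclass, its field names, and the one-minute aggregation window for the Liveblocks error rate are assumptions that do not come from the master spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Counters for one minute of traffic (hypothetical shape, not a real metric schema)."""
    liveblocks_requests: int
    liveblocks_errors: int
    webhook_401s: int
    ssrf_rejections: int


def triggered(w: Window) -> list[tuple[str, str]]:
    """Return (runbook file, severity) pairs whose threshold is exceeded."""
    alerts = []
    # Liveblocks API errors > 5% (assumed: errors / requests within the window)
    if w.liveblocks_requests and w.liveblocks_errors / w.liveblocks_requests > 0.05:
        alerts.append(("liveblocks_outage.md", "PAGE"))
    # 401 on /webhooks/github > 10/min
    if w.webhook_401s > 10:
        alerts.append(("webhook_signature_spike.md", "PAGE"))
    # SSRF guard rejections > 10/min
    if w.ssrf_rejections > 10:
        alerts.append(("url_import_ssrf_spike.md", "INVESTIGATE"))
    return alerts
```

All comparisons are strict, matching the `>` in the table: exactly 10 rejections per minute or exactly 5% errors does not trigger.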

Quarterly review

Per master spec §II.9, every quarter:

  1. Walk through each runbook
  2. Verify that links and URLs are still current
  3. Remove resolved gotchas
  4. Add new failure modes encountered
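
Part of step 2 can be automated. The sketch below assumes the runbooks are Markdown files in a single directory and that links use the inline `[text](target)` form. It only checks relative file links; external URLs would still need a separate HTTP check or a manual pass. The function names are hypothetical.

```python
import re
from pathlib import Path

# Inline Markdown links: [text](target)
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def extract_links(markdown: str) -> list[str]:
    """Return every inline link target, in document order."""
    return LINK_RE.findall(markdown)


def broken_local_links(runbook_dir: Path) -> list[tuple[str, str]]:
    """Return (file name, link target) for relative links that point to a missing file."""
    broken = []
    for md in sorted(runbook_dir.glob("*.md")):
        for target in extract_links(md.read_text(encoding="utf-8")):
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue  # external URLs and in-page anchors are not checked here
            path = target.split("#", 1)[0]  # drop any anchor suffix
            if not (md.parent / path).exists():
                broken.append((md.name, target))
    return broken
```

Running `broken_local_links(Path("."))` from the runbook directory gives a list to work through during the review; an empty list means every relative link resolves.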

Template for a new runbook

# Runbook: <alert name>
 
## Symptoms
- What the on-call engineer sees (precise wording)
- Where (Grafana panel / Sentry / etc.)
 
## Severity & escalation
- PAGE 24/7 / NOTIFY / TICKET
- Ack window: 15 min
- Escalate if not resolved within: 30 min → engineering lead
 
## Immediate actions (< 5 min)
1. ...
2. ...
 
## Diagnosis (5-20 min)
- Check X: if Y, go to the Z runbook
- Check W: ...
 
## Recovery
- Steps to restore
- Verification: what success looks like
 
## Aftermath
- Post-mortem trigger: if downtime > X min
- Backfill needed: ...
 
## Known false positives
- ...
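
A new runbook can be checked against this template mechanically. The following is a minimal sketch, assuming each runbook keeps the template's `##` headings; the section list is taken from the template above, while the function name is hypothetical. Headings are matched by prefix so that suffixes such as `(< 5 min)` can vary.

```python
import re

# Section titles from the template, in order
REQUIRED_SECTIONS = [
    "Symptoms",
    "Severity & escalation",
    "Immediate actions",
    "Diagnosis",
    "Recovery",
    "Aftermath",
    "Known false positives",
]


def missing_sections(markdown: str) -> list[str]:
    """Return the template sections that have no matching level-2 heading."""
    headings = [
        m.group(1).strip()
        for m in re.finditer(r"^##[ \t]+(.+)$", markdown, re.MULTILINE)
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(h.startswith(section) for h in headings)
    ]
```

An empty result means every section is present; it says nothing about whether the sections are filled in, which is still a job for the quarterly walk-through.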