Runbook: /health endpoint down

Symptoms

  • Grafana alert: up{job="arno-api"} == 0 OR http_request_duration_ms{path="/health"} timeout
  • Healthchecks.io ping missed (configured for https://arno-api.vadimpianof.workers.dev/health)
  • Direct check: curl https://arno-api.vadimpianof.workers.dev/health returns non-200 OR hangs

Severity & escalation

  • PAGE 24/7 — core API down = no writes, no Liveblocks auth, no GitHub sync
  • Ack window: 15 min
  • Escalate to the engineering lead if not resolved within 30 min
  • If the investigation points to a Cloudflare-side outage, check https://www.cloudflarestatus.com/ first; in that case the ETA depends on CF
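The timing rules above can be sketched as a small decision function. This is a minimal illustration, not part of any paging tool: the function name, the step labels, and the boundary handling (`>=`) are assumptions; only the 15 and 30 minute thresholds come from this runbook.

```typescript
// Hypothetical helper; step names are illustrative only.
type EscalationStep =
  | "resolved"
  | "wait-for-ack"
  | "ack-overdue"
  | "in-progress"
  | "escalate-to-engineering-lead";

const ACK_WINDOW_MIN = 15; // ack window from the runbook
const ESCALATE_AFTER_MIN = 30; // escalation threshold from the runbook

function escalationStep(
  minutesSincePage: number,
  acked: boolean,
  resolved: boolean,
): EscalationStep {
  if (resolved) return "resolved";
  // Unresolved at the 30 min mark: escalate regardless of ack state.
  if (minutesSincePage >= ESCALATE_AFTER_MIN) return "escalate-to-engineering-lead";
  if (!acked) {
    return minutesSincePage >= ACK_WINDOW_MIN ? "ack-overdue" : "wait-for-ack";
  }
  return "in-progress";
}
```

For example, `escalationStep(20, false, false)` yields `"ack-overdue"`, while `escalationStep(30, true, false)` yields `"escalate-to-engineering-lead"`.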

Immediate actions (< 5 min)

  1. Reproduce: curl -i https://arno-api.vadimpianof.workers.dev/health
  2. Check Cloudflare status: https://www.cloudflarestatus.com/. If there is a Workers / Dashboard incident, wait for resolution and log the incident in the incident channel
  3. Check recent deploy: gh run list --repo vadimpianov/arno --limit 3. If the latest CI push was ≤30 min ago, the cause may be our own bug
  4. Tail Worker logs: cd apps/api && npx wrangler tail --config wrangler.toml --format=pretty
    • If you see runtime errors (uncaught throw, OOM), go to the Recovery step "rollback"
    • If there are no logs (Worker not running), treat it as a CF-side issue
  5. Sentry check: arno-backend project → most recent errors. Sort by first_seen desc

Diagnosis (5-20 min)

Branch A: CF status page shows an incident

  • If an ETA is given, wait
  • Communicate the ETA in the incident channel and on the status page (if there is one)
  • Skip the remaining diagnostics: this is not our problem

Branch B: Recent deploy (≤30 min)

  • Rollback: gh workflow run deploy.yml -f rollback=true --ref main OR direct wrangler rollback
  • Verify: curl /health returns 200 ≤2 min after rollback
  • Open incident retro task

Branch C: No recent deploy, CF up

  • Probably config drift OR external dep cascade
  • Check wrangler secret list --config wrangler.toml — secrets present (DATABASE_URL, JWT_SECRET etc)
  • Check Neon dashboard → if Postgres is suspended/down → go to database_unreachable.md
  • Check Liveblocks dashboard → if down → go to liveblocks_outage.md. Note: /health is NOT supposed to depend on Liveblocks; if Liveblocks is down and our health check turns out to hit it, fix /health to remove that dependency

Branch D: Worker exists but returns 500

  • Tail wrangler — find stack trace
  • Common causes:
    • JWT_SECRET malformed (rotation gone wrong)
    • DATABASE_URL connection string wrong
    • Required ENV not set
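The four branches above can be summarized as a decision function. This is a sketch under stated assumptions: the `TriageSignals` shape and the `pickBranch` name are hypothetical, and the runbook does not say whether Branch C or Branch D takes precedence, so checking for a 500 before falling back to C is an assumption.

```typescript
// Hypothetical inputs gathered during "Immediate actions".
interface TriageSignals {
  cfIncidentOpen: boolean; // Cloudflare status page shows a Workers/Dashboard incident
  minutesSinceLastDeploy: number | null; // null if the last deploy time is unknown
  healthStatus: number | null; // HTTP status from curl, null if the request hung
}

type Branch = "A" | "B" | "C" | "D";

const RECENT_DEPLOY_WINDOW_MIN = 30; // "recent deploy" threshold from the runbook

function pickBranch(s: TriageSignals): Branch {
  // Branch A: a CF incident explains the outage; nothing else to diagnose.
  if (s.cfIncidentOpen) return "A";
  // Branch B: a deploy within the last 30 min is the prime suspect.
  if (
    s.minutesSinceLastDeploy !== null &&
    s.minutesSinceLastDeploy <= RECENT_DEPLOY_WINDOW_MIN
  ) {
    return "B";
  }
  // Branch D: the Worker answers, but with a 500 (assumed to outrank C).
  if (s.healthStatus === 500) return "D";
  // Branch C: no recent deploy, CF up; suspect config drift or a dependency.
  return "C";
}
```

For example, `pickBranch({ cfIncidentOpen: false, minutesSinceLastDeploy: 10, healthStatus: 500 })` selects Branch B, because a fresh deploy is checked before the 500 response.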

Recovery

Issue                       Action
Worker crashed by code bug  wrangler rollback (last known good) → fix → re-deploy
Secret missing              echo -n "value" | wrangler secret put NAME --config wrangler.toml
Dependency down             Add a degraded-mode flag (return 503 with body={degraded:true} from /ready; keep /health always 200 if the Worker is alive)
CF outage                   Wait. Inform users via the status page if downtime > 15 min
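The "Dependency down" action separates liveness (/health) from readiness (/ready). A minimal sketch of that split, written as pure functions so it does not presume how the actual Worker is structured: `healthProbe`, `readyProbe`, the `ProbeResult` type, and the extra `down` field in the body are all hypothetical; only the 200, 503, and `degraded:true` behavior comes from this runbook.

```typescript
// Plain result object; a real handler would turn this into an HTTP response.
interface ProbeResult {
  status: number;
  body: Record<string, unknown>;
}

// Liveness: returns 200 whenever the Worker is executing code at all.
// It deliberately checks no dependencies.
function healthProbe(nowMs: number): ProbeResult {
  return { status: 200, body: { ok: true, ts: nowMs } };
}

// Readiness: returns 503 with degraded:true if any dependency is down.
// `deps` maps a dependency name to whether it is currently reachable.
function readyProbe(deps: Record<string, boolean>): ProbeResult {
  const down = Object.keys(deps).filter((name) => !deps[name]);
  if (down.length > 0) {
    // `down` is an illustrative extra field, not required by the runbook.
    return { status: 503, body: { degraded: true, down } };
  }
  return { status: 200, body: { degraded: false } };
}
```

In a Worker fetch handler, a result `r` could be returned as `new Response(JSON.stringify(r.body), { status: r.status })`. Keeping /health free of dependency checks means an external outage degrades /ready without paging for a Worker that is still alive.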

Verification

  • curl https://arno-api.vadimpianof.workers.dev/health returns {"ok":true,"ts":<recent>}
  • Healthchecks.io shows green ping ≤2 min later
  • Sentry error rate returns to baseline ≤5 min later
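The first check above can be automated by validating the /health response body. A minimal sketch, assuming `ts` is a Unix timestamp in milliseconds and that "recent" means within 60 seconds; the runbook specifies neither, and `isHealthyBody` is a hypothetical name.

```typescript
// Assumption: `ts` is epoch milliseconds; the runbook only says "<recent>".
function isHealthyBody(
  raw: string,
  nowMs: number,
  maxAgeMs: number = 60_000, // assumed freshness window
): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not JSON at all, e.g. an HTML error page
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const { ok, ts } = parsed as { ok?: unknown; ts?: unknown };
  if (ok !== true || typeof ts !== "number") return false;
  // Accept small clock skew in either direction.
  return Math.abs(nowMs - ts) <= maxAgeMs;
}
```

Pass the body returned by curl together with `Date.now()`; a stale `ts` suggests a cached response rather than a live Worker.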

Aftermath

  • Post-mortem trigger: downtime ≥ 5 min
  • Document: timeline, root cause, action items
  • Update this runbook if a new failure mode was found

Known false positives

  • Network blip from Healthchecks.io probes: a single missed ping is not PAGE-worthy. Healthchecks config: alert after 2 consecutive misses
  • Cold start latency on the free Workers tier: we are on Paid, so this does not apply
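The "2 consecutive misses" rule can be illustrated with a small function over the ping history. This models the rule as stated in this runbook, not the Healthchecks.io API; `shouldPage` and the boolean history format are hypothetical.

```typescript
// history: oldest first; true = ping received, false = ping missed.
function shouldPage(
  history: boolean[],
  consecutiveMissesToAlert: number = 2, // threshold from the runbook
): boolean {
  // Count misses at the end of the history (the current streak).
  let streak = 0;
  for (let i = history.length - 1; i >= 0 && !history[i]; i--) {
    streak++;
  }
  return streak >= consecutiveMissesToAlert;
}
```

A single trailing miss (`[true, false]`) does not page; two in a row (`[true, false, false]`) does. Misses that were followed by a successful ping are ignored because only the current streak counts.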