Symptoms
- Grafana alert: `up{job="arno-api"} == 0` OR `http_request_duration_ms{path="/health"}` timeout
- Healthchecks.io ping missed (configured for https://arno-api.vadimpianof.workers.dev/health)
- Direct check: `curl https://arno-api.vadimpianof.workers.dev/health` returns non-200 OR hangs
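The direct check has two distinct failure shapes (non-200 vs. hang), and they point to different branches later on. A minimal sketch of telling them apart from curl's exit code and the HTTP status; `classify_health` is a hypothetical helper, not something from the repo, and the 10-second timeout is an assumed value:

```shell
# classify_health CURL_EXIT HTTP_CODE
# Prints one of: ok, non-200, hang, unreachable.
classify_health() {
  if [ "$1" -eq 28 ]; then
    echo "hang"          # curl exit code 28 = operation timed out
  elif [ "$1" -ne 0 ]; then
    echo "unreachable"   # DNS, TLS or connection failure
  elif [ "$2" = "200" ]; then
    echo "ok"
  else
    echo "non-200"
  fi
}

# Usage (timeout value is an assumption):
#   code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
#     https://arno-api.vadimpianof.workers.dev/health)
#   classify_health "$?" "$code"
```

Without `--max-time`, a hanging endpoint would block the shell indefinitely, which is why the sketch sets one.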
Severity & escalation
- PAGE 24/7 — core API down = no writes, no Liveblocks auth, no GitHub sync
- Ack window: 15 min
- Escalate to the engineering lead if not resolved within 30 min
- If the investigation points to a Cloudflare-side outage, check https://www.cloudflarestatus.com/ first; in that case the ETA depends on CF
Immediate actions (< 5 min)
- Reproduce: `curl -i https://arno-api.vadimpianof.workers.dev/health`
- Check Cloudflare status: https://www.cloudflarestatus.com/. If there is a Workers / Dashboard incident, wait for resolution and log the incident in the incident channel
- Check recent deploy: `gh run list --repo vadimpianov/arno --limit 3`. If the last CI push was ≤30 min ago, it may be our bug
- Tail Worker logs: `cd apps/api && npx wrangler tail --config wrangler.toml --format=pretty`
- If you see runtime errors (uncaught throw, OOM), go to Recovery step "rollback"
- If there are no logs (Worker not running), it is a CF-side issue
- Sentry check: arno-backend project → most recent errors. Sort by `first_seen` desc
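The "last CI push ≤30 min ago" check can be scripted instead of eyeballed. A sketch under assumptions: `is_recent_deploy` is a hypothetical helper, `gh` is authenticated, and the newest run in the list is taken to be the latest deploy (the runbook does not say which workflow to filter on):

```shell
# is_recent_deploy CREATED_EPOCH NOW_EPOCH [WINDOW_SEC]
# Exit 0 when the run was created within the window (default 1800 s = 30 min).
is_recent_deploy() {
  age=$(( $2 - $1 ))
  [ "$age" -ge 0 ] && [ "$age" -le "${3:-1800}" ]
}

# Usage: convert the newest run's createdAt timestamp to epoch seconds via gh's
# built-in jq filter, then compare against the current time.
#   created=$(gh run list --repo vadimpianov/arno --limit 1 \
#     --json createdAt --jq '.[0].createdAt | fromdateiso8601')
#   if is_recent_deploy "$created" "$(date +%s)"; then
#     echo "deploy within the last 30 min: suspect our bug (Branch B)"
#   fi
```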
Diagnosis (5-20 min)
Branch A: CF status page shows an incident
- ETA given → wait
- Communicate the ETA in the incident channel + status page (if there is one)
- Skip the remaining diagnostics: this is not our problem
Branch B: Recent deploy (≤30 min)
- Rollback: `gh workflow run deploy.yml -f rollback=true --ref main` OR direct `wrangler rollback`
- Verify: `curl https://arno-api.vadimpianof.workers.dev/health` returns 200 ≤2 min after rollback
- Open incident retro task
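The "200 within 2 min after rollback" verification is a polling loop. A minimal sketch, assuming a 5-second poll interval; `wait_for_health` and `probe_health` are hypothetical helpers, and the probe is passed in by name so it can be swapped out:

```shell
# wait_for_health PROBE_CMD [MAX_WAIT_SEC] [INTERVAL_SEC]
# Runs PROBE_CMD (must print an HTTP status code) until it prints 200 or
# MAX_WAIT_SEC elapses. INTERVAL_SEC must be >= 1. Exit 0 on success, 1 on timeout.
wait_for_health() {
  probe="$1"; max_wait="${2:-120}"; interval="${3:-5}"
  elapsed=0
  while [ "$elapsed" -le "$max_wait" ]; do
    if [ "$($probe)" = "200" ]; then
      return 0
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 1
}

# Probe that prints only the HTTP status code of /health.
probe_health() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    https://arno-api.vadimpianof.workers.dev/health
}

# Usage: wait_for_health probe_health 120 5 || echo "still failing after rollback"
```

If the loop times out, the rollback did not fix the problem and the deploy was likely not the cause; continue with Branch C or D.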
Branch C: No recent deploy, CF up
- Probably config drift OR external dep cascade
- Check `wrangler secret list --config wrangler.toml`: secrets present (DATABASE_URL, JWT_SECRET, etc.)
- Check Neon dashboard → if Postgres is suspended/down → go to database_unreachable.md
- Check Liveblocks dashboard → if down → go to liveblocks_outage.md. Note: `/health` does NOT depend on Liveblocks, but if they are down and our health check hits them, fix `/health` so that it no longer depends on them
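The secrets check can be reduced to "which required names are missing". A sketch that assumes `wrangler secret list` prints a JSON array of objects with a `"name"` field (verify against your Wrangler version); `missing_secrets` is a hypothetical helper and the required list is only the two names this runbook mentions:

```shell
# missing_secrets NAME...
# Reads the output of `wrangler secret list` on stdin and prints every
# required NAME that does not appear as a "name" value.
missing_secrets() {
  listing=$(cat)
  for name in "$@"; do
    if ! printf '%s\n' "$listing" | grep -q "\"name\": *\"$name\""; then
      echo "$name"
    fi
  done
}

# Usage:
#   npx wrangler secret list --config wrangler.toml \
#     | missing_secrets DATABASE_URL JWT_SECRET
```

Empty output means every listed secret exists. It says nothing about whether the values are valid, so a malformed secret still has to be found through the stack trace.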
Branch D: Worker exists but returns 500
- Tail wrangler, find the stack trace
- Common causes:
- JWT_SECRET malformed (rotation gone wrong)
- DATABASE_URL connection string wrong
- Required ENV not set
Recovery
| Issue | Action |
|---|---|
| Worker crashed by code bug | `wrangler rollback` (last known good) → fix → re-deploy |
| Secret missing | `echo -n "value" \| wrangler secret put NAME --config wrangler.toml` |
| Dependency down | Add degraded mode flag (return 503 with body `{degraded:true}` from `/ready`; keep `/health` always 200 if the Worker is alive) |
| CF outage | Wait. Inform users via status page if downtime > 15 min |
Verification
- `curl https://arno-api.vadimpianof.workers.dev/health` returns `{"ok":true,"ts":<recent>}`
- Healthchecks.io shows green ping ≤2 min later
- Sentry error rate returns to baseline ≤5 min
Aftermath
- Post-mortem trigger: downtime ≥ 5 min
- Document: timeline, root cause, action items
- Update the runbook if this is a new failure mode
Known false positives
- Network blip from Healthchecks.io probes: a single missed ping is not PAGE-worthy. Healthchecks config: alert after 2 consecutive misses
- Cold start latency on the free Workers tier: we are on Paid, so this does not apply