Symptoms
- Sentry: spike in `error_type: NeonConnectionError` OR `PostgresError`
- API endpoints return 500 with `code: "internal"` and a message mentioning the DB
- `/ready` endpoint reports `degraded: true` with hard dep `"postgres"`
- Grafana: `pg_up == 0` OR Postgres connection latency > 5s
Severity & escalation
- PAGE 24/7: writes are fully disabled. The read fallback (snapshots in KV) stays valid for 60s after the last refresh; after that it is stale
- Ack window: 15 min
- Escalate at 30 min → engineering lead
- For a Neon-side outage with no ETA → consider PITR restore to a secondary region (Phase 16+, post-MVP)
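The timing rules above can be written down as a small sketch. The function and type names are hypothetical (they are not from the ARNO codebase); only the thresholds come from this section. It also assumes the 30-minute escalation applies to any unresolved incident, acknowledged or not.

```typescript
// Illustrative only: names are hypothetical, thresholds are copied from the list above.
const ACK_WINDOW_MIN = 15;     // ack window
const ESCALATE_AT_MIN = 30;    // escalate to engineering lead
const SNAPSHOT_FRESH_SEC = 60; // KV snapshot read fallback validity

type OnCallStep = "handle" | "ack-overdue" | "escalate-to-engineering-lead";

/** What the process expects N minutes after the page fired (incident still open). */
function onCallStep(minutesSincePage: number, acked: boolean): OnCallStep {
  if (minutesSincePage >= ESCALATE_AT_MIN) return "escalate-to-engineering-lead";
  if (!acked && minutesSincePage >= ACK_WINDOW_MIN) return "ack-overdue";
  return "handle";
}

/** Reads served from KV snapshots are stale once the last refresh is older than 60s. */
function isSnapshotStale(secondsSinceRefresh: number): boolean {
  return secondsSinceRefresh > SNAPSHOT_FRESH_SEC;
}
```

For example, `onCallStep(16, false)` yields `"ack-overdue"`, and `isSnapshotStale(61)` is `true`, which is why reads degrade roughly a minute after Postgres becomes unreachable.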
Immediate actions (< 5 min)
- Reproduce: `wrangler tail` should show a stream of DB errors:
  `cd apps/api && npx wrangler tail --config wrangler.toml --format=pretty`
- Check Neon status: https://neon.tech/status
- Open Neon dashboard: https://console.neon.tech → project `arno-prod` (Frankfurt)
  - If it shows Suspended (free tier auto-suspend) → click to open the project and wake it up. Resume takes ~10s
  - If it shows Error / Down → wait OR contact Neon support
- Verify connection from CLI:
psql 'postgresql://neondb_owner:***@ep-dry-block-al36bkvg.c-3.eu-central-1.aws.neon.tech/neondb?sslmode=require' -c 'SELECT 1'
Diagnosis (5-20 min)
Branch A: Neon suspended (free tier)
- This is expected behaviour after an idle period > 5 min on the free tier
- Mitigation: upgrade to Neon Launch ($19/mo) when DAU > 50 OR the service is business-critical (per cost ladder)
- Open the project in the dashboard → resume → verify `/ready` is green
Branch B: Neon up but Workers can't connect
- Check the `DATABASE_URL` wrangler secret format: it should be `postgresql://user:pass@host/db?sslmode=require`
- Check Neon allowlist (IP): Workers exit IPs are unstable, so Neon must accept any IP (default for the serverless tier)
- Check rate limit: Neon free allows 100 concurrent connections; we use the Drizzle HTTP driver (stateless, no pool), so the API itself should not be holding connections open
- Run a direct query via `tools/migrate/` (Pool driver) → if it works → HTTP driver issue (unlikely)
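The `DATABASE_URL` format check can be done mechanically before re-uploading the secret. Below is a minimal sketch using the standard `URL` parser; `checkDatabaseUrl` is a hypothetical helper, not something that exists in the repo, and the whitespace check is an extra assumption (a secret piped from a file may carry a trailing newline).

```typescript
// Hypothetical pre-flight check for the DATABASE_URL secret; not part of the ARNO codebase.
// Returns a list of problems; an empty list means the URL matches the expected shape
// postgresql://user:pass@host/db?sslmode=require
function checkDatabaseUrl(raw: string): string[] {
  const problems: string[] = [];
  if (raw !== raw.trim()) {
    problems.push("leading/trailing whitespace (for example a newline at the end of the file)");
  }
  let url: URL;
  try {
    url = new URL(raw.trim());
  } catch {
    return [...problems, "not a parseable URL"];
  }
  if (url.protocol !== "postgresql:") {
    problems.push(`scheme is "${url.protocol}//", expected "postgresql://"`);
  }
  if (url.username === "" || url.password === "") problems.push("missing user or password");
  if (url.hostname === "") problems.push("missing host");
  if (url.pathname.length <= 1) problems.push("missing database name");
  if (url.searchParams.get("sslmode") !== "require") problems.push("sslmode=require is not set");
  return problems;
}
```

The helper only reports which part is wrong and never echoes the password, so its output is safe to paste into an incident channel.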
Branch C: Neon-side outage
- Status page shows an incident → wait
- Inform users via the status page if downtime > 5 min
- Consider PITR restore: Neon → branch → PITR to point before outage
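Choosing between branches A, B and C can be summarised as a small decision function. This is a sketch under assumptions: the observation fields and the `"active"` state name are hypothetical labels for what the dashboard, the status page and the `psql` check show, and the `"unclear"` outcome is added here for the case the runbook does not cover.

```typescript
// Illustrative triage helper; field and type names are hypothetical.
type ProjectState = "active" | "suspended" | "error";
type Branch = "A" | "B" | "C" | "unclear";

interface Observations {
  projectState: ProjectState;  // as shown in the Neon dashboard
  statusPageIncident: boolean; // Neon status page lists an incident
  psqlSelect1Works: boolean;   // result of the CLI `SELECT 1` check
}

function pickBranch(o: Observations): Branch {
  // Branch C: Neon itself reports trouble.
  if (o.statusPageIncident || o.projectState === "error") return "C";
  // Branch A: free tier auto-suspend.
  if (o.projectState === "suspended") return "A";
  // Branch B: Neon answers from the CLI, so the problem is on the Workers side.
  if (o.psqlSelect1Works) return "B";
  // Project looks active, no incident posted, yet the CLI cannot connect.
  return "unclear";
}
```

An `"unclear"` result is a hint to re-check the dashboard and status page after a minute, since an incident may not have been posted yet.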
Recovery
| Issue | Action |
|---|---|
| Suspended (free tier) | Resume in dashboard; wake takes ~10s |
| Bad DATABASE_URL | wrangler secret put DATABASE_URL --config wrangler.toml < new-url.txt |
| Region outage | Wait for Neon, OR (emergency only) create a branch in another region + repoint DATABASE_URL (PITR-based) |
| Rate-limit | Investigate runaway queries (Sentry traces) → kill stuck connections via Neon dashboard |
Verification
- `/ready` endpoint returns `degraded: false`
- API endpoints with writes (`POST /api/v1/projects`) succeed
- Sentry error rate returns to baseline ≤ 5 min after recovery
- Grafana `pg_up == 1` persistent for 5 min
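The "persistent for 5 min" condition is easy to misread as "the latest sample is 1". A minimal sketch of the intended check follows; the `Sample` and `RecoveryEvidence` shapes are hypothetical, and the Sentry baseline check is left out because the baseline value is not defined in this runbook.

```typescript
// Illustrative recovery check; shapes and names are hypothetical.
interface Sample {
  tSec: number; // scrape time, seconds
  pgUp: 0 | 1;  // value of pg_up at that time
}

/** True when the trailing run of pg_up == 1 samples spans at least windowSec. */
function pgUpStableFor(samples: Sample[], nowSec: number, windowSec: number = 300): boolean {
  const newestFirst = [...samples].sort((a, b) => b.tSec - a.tSec);
  let since: number | null = null;
  for (const s of newestFirst) {
    if (s.pgUp !== 1) break; // the run of healthy samples ends here
    since = s.tSec;
  }
  return since !== null && nowSec - since >= windowSec;
}

interface RecoveryEvidence {
  readyDegraded: boolean;   // `degraded` field from /ready
  writeProbeStatus: number; // HTTP status of a write such as POST /api/v1/projects
  pgUpSamples: Sample[];
}

function isRecovered(e: RecoveryEvidence, nowSec: number): boolean {
  const writeOk = e.writeProbeStatus >= 200 && e.writeProbeStatus < 300;
  return !e.readyDegraded && writeOk && pgUpStableFor(e.pgUpSamples, nowSec);
}
```

With samples every 60s, a single `pg_up == 0` reading inside the last five minutes resets the clock, so a flapping database does not count as recovered.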
Aftermath
- Post-mortem trigger: downtime ≥ 10 min
- Document: timeline, root cause, was PITR needed?
- If free tier suspends happen often (> 1/week) → upgrade to Launch
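The two thresholds in this list can be checked from incident records. The sketch below uses hypothetical helper names and assumes "> 1/week" means more than one suspend-caused incident in the trailing seven days.

```typescript
// Illustrative helpers; names are hypothetical, thresholds come from the list above.
const POST_MORTEM_MIN_DOWNTIME_MIN = 10;
const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function needsPostMortem(downtimeMin: number): boolean {
  return downtimeMin >= POST_MORTEM_MIN_DOWNTIME_MIN;
}

/** True when more than one suspend-caused incident falls in the 7 days before nowMs. */
function shouldUpgradeToLaunch(suspendIncidentsMs: number[], nowMs: number): boolean {
  const recent = suspendIncidentsMs.filter((t) => t <= nowMs && nowMs - t <= WEEK_MS);
  return recent.length > 1;
}
```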
Known false positives
- PITR backups window (nightly, ~03:00 UTC): short read latency spike, not unreachable. Do not PAGE
- Test queries from ARNO CI: may briefly increase the connection count; do not PAGE