Sermon postmortem
INCIDENT-0814 ✓ RESOLVED 2026-04-28 · 14m 18s

CPU spike on web-02 — checkout deploy starved the worker pool

CPU stayed at or above 80% for 12 minutes. Sentinel paged the operator and proposed a two-step remediation, which shipped on approval. A fresh worker came up within 4 seconds of the intervention and CPU returned to baseline.

host web-02 · prod-edge · sfo-1
metric cpu.usage_pct
detected 14:32:18 UTC · by Sentinel
paged 14:44:00 UTC · email + sms + imessage
resolved 14:46:30 UTC · operator: david
duration 14m 18s · 12m above threshold
impact p99 latency 1.82s on /api/checkout · ~140k req affected
actions 2 · 1 supervised, 1 reverted

Summary

Deploy api/v2.4.1 at 14:31:42 added a synchronous database call inside POST /api/checkout. At baseline 140 req/s the call latency was tolerable. When traffic climbed to 530 req/s starting 14:30:08 (3.8× normal), the gunicorn worker pool starved and CPU pegged.

Sentinel detected sustained ≥80% CPU at 14:32:18, opened ALERT-0814, pulled the request log and parent service status, and paged the on-call operator at 14:44. The operator chose to fix; Sentinel proposed a two-step remediation (SIGTERM the starved worker, revert the deploy). On approval at 14:48:02 both shipped.
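
A minimal sketch of what a "sustained ≥80%" check could look like. This report does not describe how Sentinel's detector works, so the class name, the 30-second window, and the sample values below are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires once a metric has stayed at or above `threshold` for `window_s` seconds."""

    threshold: float = 80.0   # percent; the >=80% rule quoted in this report
    window_s: float = 30.0    # hypothetical; the real window is not stated
    _breach_start: Optional[float] = None

    def observe(self, ts: float, value: float) -> bool:
        """Feed one sample (timestamp in seconds, metric value); return True while firing."""
        if value < self.threshold:
            # Any sample below the threshold resets the breach clock.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = ts
        return ts - self._breach_start >= self.window_s


if __name__ == "__main__":
    detector = SustainedThreshold()
    for ts, cpu in [(0, 85.0), (15, 90.0), (30, 96.4), (40, 50.0)]:
        print(ts, cpu, detector.observe(ts, cpu))
```

Keeping the breach start time rather than a sample buffer means the same object could also drive a separate, shorter time-to-page timer.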

A fresh worker came up in 4 seconds, and CPU recovered to the 31% baseline by 14:48:30. The revert PR merged via the standard CI gate.

cpu.usage_pct · web-02 · 2026-04-28 · 14:20–14:50 UTC
[chart: CPU % (0–100) with 80% threshold line; markers at 14:32 SPIKE and 14:46 KILLED]
peak 96.4% · 12m above threshold · recovered → 31.2% baseline

Timeline

  1. 14:31:42 ci deploy api/v2.4.1 shipped via heimann/api#412
  2. 14:32:11 kernel gunicorn:worker pid 18472 spawned by api.gunicorn.service
  3. 14:32:18 sentinel detected sustained ≥80% CPU on web-02 · ALERT-0814 opened
  4. 14:32:30 sentinel pulled request log, top-proc snapshot, parent service status; correlated cpu × /api/checkout traffic 1:1
  5. 14:44:00 sentinel paged david · email + sms + imessage delivered
  6. 14:45:18 david replied "fix it"
  7. 14:45:22 sentinel drafted action plan: SIGTERM pid 18472 + revert api/v2.4.1
  8. 14:48:02 david approved the plan to ship
  9. 14:48:04 sentinel SIGTERM sent to pid 18472 (supervised, reversible inside 4s)
  10. 14:48:08 kernel fresh gunicorn worker spawned by api.gunicorn.service
  11. 14:48:30 sentinel cpu recovered to 31.2% baseline · alert resolved
  12. 14:48:45 ci api#412 revert PR opened, merged via standard CI gate
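
The gaps between timeline entries can be worked out directly from the timestamps. The sketch below uses only times listed in the timeline; the dictionary keys and helper names are illustrative.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all UTC, same day).
TIMELINE = {
    "detected": "14:32:18",
    "paged": "14:44:00",
    "replied": "14:45:18",
    "approved": "14:48:02",
    "sigterm": "14:48:04",
    "respawned": "14:48:08",
    "recovered": "14:48:30",
}


def elapsed_s(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def fmt_duration(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    return f"{minutes}m {secs:02d}s"


def gap(a: str, b: str) -> str:
    return fmt_duration(elapsed_s(TIMELINE[a], TIMELINE[b]))


if __name__ == "__main__":
    print("detect -> page:      ", gap("detected", "paged"))     # 11m 42s
    print("page -> reply:       ", gap("paged", "replied"))      # 1m 18s
    print("reply -> approval:   ", gap("replied", "approved"))   # 2m 44s
    print("SIGTERM -> respawn:  ", gap("sigterm", "respawned"))  # 0m 04s
    print("SIGTERM -> recovered:", gap("sigterm", "recovered"))  # 0m 26s
```

Most of the incident window is the wait before the page; the intervention itself took seconds.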

Root cause

api/v2.4.1 added a synchronous call to orders.get_recent(user_id) inside the request handler for POST /api/checkout. The query was intentional — supporting a new "recommended next purchase" panel — but the call ran on the request thread instead of the existing async queue.
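
As an illustration of the difference, here is a minimal sketch of a handler that enqueues the lookup instead of waiting on it. The service's actual async queue is not described in this report, so the stdlib queue, the worker thread, the stubbed `get_recent`, and the `handle_checkout` name are all stand-ins.

```python
import queue
import threading

recommendation_jobs: "queue.Queue" = queue.Queue()
recommendations: dict = {}  # user_id -> result, filled in by the background worker


def get_recent(user_id):
    """Hypothetical stub for the slow orders.get_recent(user_id) query."""
    return [f"order-for-{user_id}"]


def recommendation_worker():
    """Drain the queue off the request path; a None job stops the worker."""
    while True:
        user_id = recommendation_jobs.get()
        try:
            if user_id is None:
                return
            recommendations[user_id] = get_recent(user_id)
        finally:
            recommendation_jobs.task_done()


def start_worker() -> threading.Thread:
    thread = threading.Thread(target=recommendation_worker, daemon=True)
    thread.start()
    return thread


def handle_checkout(user_id):
    """Request handler: enqueue the lookup and return without waiting on the DB."""
    recommendation_jobs.put(user_id)
    return {"status": "ok", "user_id": user_id}


if __name__ == "__main__":
    worker = start_worker()
    print(handle_checkout(42))
    recommendation_jobs.join()   # wait only so the demo can show the result
    print(recommendations[42])
    recommendation_jobs.put(None)
    worker.join()
```

The handler's latency no longer depends on the query, at the cost of the panel data arriving after the checkout response.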

Per-request latency went from p99 38ms to p99 1820ms. The gunicorn worker count (4 per host, 8 hosts) couldn't keep up with a burst to 530 req/s. Workers spent >90% of their time blocked on the DB pool. CPU pegged on the worker that drew the bulk of the traffic.
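
A rough capacity check using Little's law (requests in flight = arrival rate × latency) shows the scale of the mismatch. It assumes each worker serves one request at a time, which this report does not state, and it uses the p99 figures as if they were typical latency, so the second number is an upper bound rather than a measurement.

```python
# Figures quoted in this report.
WORKERS_PER_HOST = 4
HOSTS = 8
TOTAL_WORKERS = WORKERS_PER_HOST * HOSTS  # 32

BURST_REQ_PER_S = 530
P99_BEFORE_S = 0.038  # 38 ms
P99_AFTER_S = 1.820   # 1820 ms


def busy_workers_needed(req_per_s: float, latency_s: float) -> float:
    """Little's law: average number of requests in flight at a given rate and latency."""
    return req_per_s * latency_s


before = busy_workers_needed(BURST_REQ_PER_S, P99_BEFORE_S)  # about 20.1
after = busy_workers_needed(BURST_REQ_PER_S, P99_AFTER_S)    # about 964.6

if __name__ == "__main__":
    print(f"workers available:       {TOTAL_WORKERS}")
    print(f"in flight at 38 ms p99:  {before:.1f}")
    print(f"in flight at 1820 ms p99: {after:.1f}")
```

At 38 ms the burst fits inside 32 workers; at 1820 ms it would need roughly thirty times the pool.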

Actions taken

  • 14:48:04 · SIGTERM sent to pid 18472 · supervised by api.gunicorn.service · fresh worker spawned in 4s
  • 14:48:45 · heimann/api#412 reverted (v2.4.1 → v2.4.0) · merged via standard CI gate · operator: david

Follow-ups

  • Linear ENG-419 (Todo): Add a load test for POST /api/checkout at 5× baseline req/s; gate the deploy on it
  • Linear ENG-420 (Todo): Move orders.get_recent(user_id) to the async queue; render the panel client-side after first paint
  • @sentinel register a watch for new synchronous DB calls inside request handlers in PR diffs
  • @sentinel shorten time-to-page on sustained ≥80% CPU; current 12m is too long
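
The PR-diff watch in the follow-ups could start as a plain text scan of added lines. This is a sketch under assumptions: the pattern list is hypothetical and only covers the call named in this report, and the scan does not try to tell whether a line sits inside a request handler.

```python
import re

# Hypothetical watch list; only the call named in this report is included.
DEFAULT_PATTERNS = (re.compile(r"\borders\.get_recent\("),)


def find_sync_db_calls(diff_text, patterns=DEFAULT_PATTERNS):
    """Return (file, added_line) pairs for added lines in a unified diff that match a pattern."""
    findings = []
    current_file = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # File header of the new version, e.g. "+++ b/api/checkout.py".
            path = line[4:].strip()
            current_file = path[2:] if path.startswith("b/") else path
            continue
        if line.startswith("+"):
            added = line[1:]
            if any(p.search(added) for p in patterns):
                findings.append((current_file, added.strip()))
    return findings


if __name__ == "__main__":
    sample = "\n".join([
        "--- a/api/checkout.py",
        "+++ b/api/checkout.py",
        "@@ -10,3 +10,4 @@",
        " def post_checkout(request):",
        "+    recent = orders.get_recent(request.user_id)",
        "     return ok()",
    ])
    for path, text in find_sync_db_calls(sample):
        print(f"{path}: {text}")
```

A text scan will miss renamed or wrapped calls; it is a first filter, with the load test in ENG-419 as the backstop.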

authored by Sentinel · 2026-04-28 14:48 UTC · rendered from agent transcript · revisions tracked in source

The agent wrote this from the structured signals it pulled during the incident. Sermon publishes postmortems automatically when an alert resolves; the operator can edit before sharing or let the draft auto-publish after the cooldown window.