CPU spike on web-02 — checkout deploy starved the worker pool
Sustained ≥80% CPU for 12 minutes. Sentinel paged the operator, proposed a two-step remediation, and the change shipped on approval. Recovered to baseline within 4 seconds of intervention.
Summary
Deploy api/v2.4.1
at 14:31:42 added a synchronous database call inside POST /api/checkout. At baseline 140 req/s the call latency was
tolerable. When traffic climbed to 530 req/s starting 14:30:08 (3.8× normal),
the gunicorn worker pool starved and CPU pegged.
Sentinel detected sustained ≥80% CPU at 14:32:18, opened ALERT-0814, pulled the request log and parent service status, and paged the on-call operator at 14:44. The operator chose to fix; Sentinel proposed a two-step remediation (SIGTERM the starved worker, revert the deploy). On approval at 14:48:02 both shipped.
The fresh worker came up in 4 seconds. CPU recovered to 31% baseline by 14:46:30. The revert PR merged via the standard CI gate.
cpu.usage_pct · web-02
Timeline
-
14:31:42
ci
deploy
api/v2.4.1shipped viaheimann/api#412 - 14:32:11 kernel gunicorn:worker pid 18472 spawned by api.gunicorn.service
- 14:32:18 sentinel detected sustained ≥80% CPU on web-02 · ALERT-0814 opened
-
14:32:30
sentinel
pulled request log, top-proc snapshot, parent service status; correlated cpu ×
/api/checkouttraffic 1:1 - 14:44:00 sentinel paged david · email + sms + imessage delivered
-
14:45:18
david
replied
fix it -
14:45:22
sentinel
drafted action plan: SIGTERM pid 18472 + revert
api/v2.4.1 - 14:48:02 david approved & ship
- 14:48:04 sentinel SIGTERM sent to pid 18472 (supervised, reversible inside 4s)
- 14:48:08 kernel fresh gunicorn worker spawned by api.gunicorn.service
- 14:48:30 sentinel cpu recovered to 31.2% baseline · alert resolved
-
14:48:45
ci
api#412revert PR opened, merged via standard CI gate
Root cause
api/v2.4.1
added a synchronous call to orders.get_recent(user_id)
inside the request handler for POST /api/checkout. The query was
intentional — supporting a new "recommended next purchase" panel — but the call
ran on the request thread instead of the existing async queue.
Per-request latency went from p99 38ms to p99 1820ms. The gunicorn worker count (4 per host, 8 hosts) couldn't keep up with a burst to 530 req/s. Workers spent >90% of their time blocked on the DB pool. CPU pegged on the worker that drew the bulk of the traffic.
Actions taken
heimann/api#412 reverted (v2.4.1 → v2.4.0)
merged via standard CI gate · operator: david
Follow-ups
-
Created Linear issue
Linear ENG-419 TodoAdd load test for
POST /api/checkoutat 5× baseline req/s; gate the deploy on it -
Created Linear issue
Linear ENG-420 TodoMove
orders.get_recent/1to the async queue; render the panel client-side after first paint - @sentinel register a watch for new synchronous DB calls inside request handlers in PR diffs
- @sentinel shorten time-to-page on sustained ≥80% CPU; current 12m is too long