INCIDENT-0814 ✓ RESOLVED 2026-04-28 · 16m 12s

CPU spike on web-02 — checkout deploy starved the worker pool

Sustained ≥80% CPU for 12 minutes. Sermon paged the operator and proposed a two-step remediation, which shipped on approval. A fresh worker came up within 4 seconds of intervention, and CPU returned to baseline shortly after.

host web-02 · prod-edge · sfo-1

metric cpu.usage_pct

detected 14:32:18 UTC · by Sermon

paged 14:44:00 UTC · email + sms + imessage

resolved 14:48:30 UTC · operator: david

duration 16m 12s · 12m above threshold

impact p99 latency 1.82s on /api/checkout · ~140k req affected

actions 2 · 1 supervised, 1 reverted

Summary

Deploy api/v2.4.1 at 14:31:42 added a synchronous database call inside POST /api/checkout. At the 140 req/s baseline the added latency was tolerable. Traffic had been climbing since 14:30:08, and once it reached 530 req/s (3.8× normal) the gunicorn worker pool starved and CPU pegged.

Sermon detected sustained ≥80% CPU at 14:32:18, opened ALERT-0814, pulled the request log and parent service status, and paged the on-call operator at 14:44. The operator chose to fix; Sermon proposed a two-step remediation (SIGTERM the starved worker, revert the deploy). On approval at 14:48:02 both shipped.

The fresh worker came up in 4 seconds. CPU recovered to the 31.2% baseline by 14:48:30. The revert PR was merged via the standard CI gate.

Chart: cpu.usage_pct · web-02 · 2026-04-28 14:20–14:50 UTC · 80% threshold

peak 96.4% · 12m above threshold · recovered → 31.2% baseline

Timeline

14:31:42 ci deploy api/v2.4.1 shipped via heimann/api#412
14:32:11 kernel gunicorn:worker pid 18472 spawned by api.gunicorn.service
14:32:18 sermon detected sustained ≥80% CPU on web-02 · ALERT-0814 opened
14:32:30 sermon pulled request log, top-proc snapshot, parent service status; correlated cpu × /api/checkout traffic 1:1
14:44:00 sermon paged david · email + sms + imessage delivered
14:45:18 david replied "fix it"
14:45:22 sermon drafted action plan: SIGTERM pid 18472 + revert api/v2.4.1
14:48:02 david approved & ship
14:48:04 sermon SIGTERM sent to pid 18472 (supervised, reversible inside 4s)
14:48:08 kernel fresh gunicorn worker spawned by api.gunicorn.service
14:48:30 sermon cpu recovered to 31.2% baseline · alert resolved
14:48:45 ci api#412 revert PR opened, merged via standard CI gate
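
The gaps between the timeline events above can be checked with a short calculation. This is a minimal sketch: the timestamps are copied from the timeline, while the `EVENTS` table and the `gap` helper are names introduced here for illustration only.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline (UTC, 2026-04-28).
EVENTS = {
    "deploy": "14:31:42",
    "detected": "14:32:18",
    "paged": "14:44:00",
    "reply": "14:45:18",
    "approved": "14:48:02",
    "sigterm": "14:48:04",
    "recovered": "14:48:30",
}

def gap(start: str, end: str) -> timedelta:
    """Elapsed time between two named timeline events (illustrative helper)."""
    fmt = "%H:%M:%S"
    return datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)

if __name__ == "__main__":
    pairs = [("detected", "paged"), ("paged", "reply"),
             ("reply", "approved"), ("sigterm", "recovered")]
    for start, end in pairs:
        print(f"{start} -> {end}: {gap(start, end)}")
```

On these timestamps, detection to page takes 11m 42s, page to reply 1m 18s, reply to approval 2m 44s, and SIGTERM to recovery 26s. Most of the incident was spent between detection and page.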

Root cause

api/v2.4.1 added a synchronous call to orders.get_recent(user_id) inside the request handler for POST /api/checkout. The query was intentional — supporting a new "recommended next purchase" panel — but the call ran on the request thread instead of the existing async queue.
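
The difference between running the lookup on the request thread and handing it to a queue can be sketched as follows. This is an illustration, not the api service's code: `get_recent`, `checkout_sync`, `checkout_deferred`, and the in-process `queue.Queue` consumer are hypothetical stand-ins for orders.get_recent(user_id), the checkout handler, and the existing async queue, whose real interface is not shown in this report.

```python
import queue
import threading
import time

def get_recent(user_id: int) -> list:
    """Hypothetical stand-in for orders.get_recent(user_id): a slow lookup."""
    time.sleep(0.05)  # simulated database latency (assumed value)
    return [f"order-{user_id}-1"]

def checkout_sync(user_id: int) -> dict:
    # The lookup runs on the request thread, so the response waits on it.
    return {"status": "ok", "recent": get_recent(user_id)}

jobs: queue.Queue = queue.Queue()   # holds user ids, or None as a stop marker
results: dict = {}                  # user id -> recent orders, filled by the worker

def worker() -> None:
    # Background consumer: processes queued user ids until it sees None.
    while True:
        user_id = jobs.get()
        if user_id is None:
            jobs.task_done()
            return
        results[user_id] = get_recent(user_id)
        jobs.task_done()

def checkout_deferred(user_id: int) -> dict:
    # The handler only enqueues; the panel data is fetched off the request path.
    jobs.put(user_id)
    return {"status": "ok", "recent": None}

if __name__ == "__main__":
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    print(checkout_deferred(42))   # returns without waiting on the lookup
    jobs.join()                    # wait for the background lookup to finish
    print(results[42])
    jobs.put(None)                 # stop the worker
    t.join(timeout=1)
```

In the deferred version the handler's response time no longer depends on the lookup, which is the property the request path lost in v2.4.1.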

p99 per-request latency went from 38ms to 1820ms. With 4 gunicorn workers per host across 8 hosts, the pool couldn't keep up with a burst to 530 req/s. Workers spent >90% of their time blocked on the DB pool. CPU pegged on the worker that drew the bulk of the traffic.
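
The starvation follows from a rough capacity estimate. The sketch below assumes each gunicorn worker serves one request at a time (the worker class is not stated in this report) and uses the p99 latency as a pessimistic stand-in for per-request service time. The function name is illustrative.

```python
def pool_capacity_rps(workers_per_host: int, hosts: int, service_time_s: float) -> float:
    """Upper bound on throughput when each worker serves one request at a time."""
    return workers_per_host * hosts / service_time_s

if __name__ == "__main__":
    before = pool_capacity_rps(4, 8, 0.038)  # p99 before v2.4.1
    after = pool_capacity_rps(4, 8, 1.820)   # p99 after v2.4.1
    print(f"before: {before:.0f} req/s, after: {after:.1f} req/s, burst: 530 req/s")
```

Under these assumptions the 32 workers could absorb roughly 842 req/s before the deploy and roughly 17.6 req/s after it, far below the 530 req/s burst. Because p99 is a tail figure and was likely inflated by queueing during the burst, treat this as a bound rather than a measurement.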

Actions taken

✓ 14:48:04 · SIGTERM sent to pid 18472 · supervised by api.gunicorn.service · fresh worker spawned in 4s

✓ 14:48:45 · heimann/api#412 reverted (v2.4.1 → v2.4.0) · merged via standard CI gate · operator: david

Follow-ups

Linear ENG-419 · Todo · Add load test for POST /api/checkout at 5× baseline req/s; gate the deploy on it
Linear ENG-420 · Todo · Move orders.get_recent/1 to the async queue; render the panel client-side after first paint
@sermon register a watch for new synchronous DB calls inside request handlers in PR diffs
@sermon shorten time-to-page on sustained ≥80% CPU; current 12m is too long
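
One way to reason about the time-to-page follow-up is a sustained-threshold check with an explicit window. This is a sketch only: Sermon's actual detection and paging logic is not described in this report, and `first_sustained_breach`, the sample format, and the default window are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

def first_sustained_breach(
    samples: Iterable[Tuple[int, float]],
    threshold: float = 80.0,
    window_s: int = 120,
) -> Optional[int]:
    """Return the timestamp at which usage has stayed >= threshold for window_s.

    samples: (unix_seconds, cpu_pct) pairs in ascending time order.
    Returns None if no unbroken run of breaching samples lasts window_s.
    """
    run_start: Optional[int] = None
    for ts, pct in samples:
        if pct >= threshold:
            if run_start is None:
                run_start = ts  # first breaching sample of a new run
            if ts - run_start >= window_s:
                return ts
        else:
            run_start = None  # a dip below the threshold resets the run
    return None

if __name__ == "__main__":
    steady = [(t, 90.0) for t in range(0, 600, 10)]
    print(first_sustained_breach(steady))
```

A shorter `window_s` pages sooner at the cost of more pages on brief spikes, so the value is a tuning decision and not something this sketch can settle.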

authored by Sermon · 2026-04-28 14:48 UTC · rendered from agent transcript · revisions tracked in source

The agent wrote this from the structured signals it pulled during the incident. Sermon publishes postmortems automatically when an alert resolves; the operator can edit before sharing or let the draft auto-publish after the cooldown window.