SEV-1: Thundering Herd Cron Scheduling Incident
Post-mortem of the Feb 21, 2026 outage caused by synchronized cron schedules overwhelming the database connection pool through PgBouncer.
CoderDen Team
Engineering
Summary
On February 21, 2026, the CoderDen platform experienced a full API outage lasting approximately 5-10 minutes. All requests returned errors or timed out. The root cause was a thundering herd created by multiple heavy cron-scheduled tasks all firing at the exact same time, overwhelming our PgBouncer connection pool and starving API requests of database connections.
Timeline
- Cron schedules align: 28 context refresh tasks, the home state batch (37 users), the leaderboard refresh, and dirty problems processing all fire simultaneously on the sage queue
- All 3 sage workers occupied, each running a serial loop of 5+ DB queries per user; ~185 queries hit PgBouncer within seconds
- Batch processing completes; the system appears to recover briefly
- Second wave fires: another round of home state refresh (38 users), leaderboard refresh, stale contexts batch, and dirty problems, all at the :30 mark again
- API becomes unresponsive; all user-facing requests fail or time out
- Engineer identifies the system as unresponsive; no OOM detected, no panics in kernel logs
- Full VPS reboot initiated; PgBouncer restarts
- Backend restarts; first attempt fails (DB still starting up), second attempt succeeds
- Full recovery confirmed; all health checks passing, API serving requests
Root Cause
The Thundering Herd
Our Asynq-based task scheduler had accumulated multiple heavy periodic tasks over several feature launches. The problem: five of them were all scheduled at the :00 and :30 marks, creating a perfectly synchronized storm of database queries every 30 minutes.
Here’s what fired simultaneously at every :00:
| Task | Schedule | Queue | DB Impact |
|---|---|---|---|
| Refresh stale contexts | 0 * * * * | sage | 16-28 users x 6+ queries each |
| Batch home state refresh | */30 * * * * | sage | 37-38 users x 5+ queries each |
| Process dirty problems | */30 * * * * | sage | Vector similarity queries |
| Quiz leaderboard refresh | 0 * * * * | low | Heavy aggregation query |
| Leaderboard refresh | */30 * * * * | low | Materialized view refresh |
With only 3 sage workers available, each processing users in a serial loop, the workers were occupied for 7+ seconds while hundreds of queries competed for PgBouncer’s 60 database connections.
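For illustration, here is roughly how these registrations looked in our Asynq scheduler setup. This is a minimal sketch: the task type names and the Redis address are placeholders, not the exact identifiers in our codebase.

```go
package main

import (
	"log"

	"github.com/hibiken/asynq"
)

func main() {
	scheduler := asynq.NewScheduler(
		asynq.RedisClientOpt{Addr: "localhost:6379"}, // placeholder address
		nil,
	)

	register := func(spec, typename string, opts ...asynq.Option) {
		if _, err := scheduler.Register(spec, asynq.NewTask(typename, nil), opts...); err != nil {
			log.Fatalf("register %s: %v", typename, err)
		}
	}

	// Before the fix: every heavy task aligned on the :00 and :30 marks.
	register("0 * * * *", "contexts:refresh_stale", asynq.Queue("sage"))      // :00
	register("*/30 * * * *", "home_state:batch_refresh", asynq.Queue("sage")) // :00 and :30
	register("*/30 * * * *", "problems:process_dirty", asynq.Queue("sage"))   // :00 and :30
	register("0 * * * *", "leaderboard:quiz_refresh", asynq.Queue("low"))     // :00
	register("*/30 * * * *", "leaderboard:refresh", asynq.Queue("low"))       // :00 and :30

	if err := scheduler.Run(); err != nil {
		log.Fatal(err)
	}
}
```

Individually each registration looks harmless; the problem only appears when you read the cron specs side by side and notice how many of them collide on the same minute.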
Connection Pool Architecture
Our setup runs through PgBouncer in transaction pooling mode:
Go App (pgx pool: 50 conns) → PgBouncer (50+10 reserve = 60 DB conns) → PostgreSQL
Under normal conditions, this architecture handles load well — PgBouncer multiplexes connections efficiently. But when 185+ queries arrive within seconds from batch tasks, the queue depth inside PgBouncer grows rapidly. Meanwhile, API requests also need connections and get stuck waiting behind the batch queries. PgBouncer’s query timeout is set to 60 seconds — long enough that hundreds of queued queries can pile up before any are killed, turning a burst into a sustained blockage.
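The relevant PgBouncer settings, roughly as they stood at the time (an illustrative excerpt; the databases and auth sections are omitted, and only the values stated above are taken from our actual config):

```ini
; Illustrative pgbouncer.ini excerpt
[pgbouncer]
pool_mode = transaction     ; server connections are released at transaction end
default_pool_size = 50      ; base server connections to PostgreSQL
reserve_pool_size = 10      ; extra connections under pressure (50 + 10 = 60 total)
query_timeout = 60          ; 60s before a query is killed; long enough for a backlog to build
```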
Resolution
We deployed two categories of fixes:
1. Schedule Staggering
We offset all heavy tasks so they never compete for the same workers:
| Task | Before | After |
|---|---|---|
| Stale contexts refresh | 0 * * * * (:00) | 0 * * * * — unchanged, heaviest gets priority |
| Process dirty problems | */30 * * * * (:00/:30) | 5,35 * * * * (:05/:35) |
| Leaderboard refresh | */30 * * * * (:00/:30) | 7,37 * * * * (:07/:37) |
| Batch home state refresh | */30 * * * * (:00/:30) | 10,40 * * * * (:10/:40) |
| Quiz leaderboard | 0 * * * * (:00) | 3 * * * * (:03) |
The new schedule distributes work across the hour:
- :00 stale_contexts (heavy — gets all 3 sage workers)
- :03 quiz_leaderboard (low queue, separate workers)
- :05 dirty_problems + LLM_metrics (sage workers now free)
- :07 leaderboards (low queue)
- :10 batch_home_states (sage workers definitely free)
- :15 stale_embeddings (sage)
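Reusing the illustrative `register` helper from the sketch above, the staggered registrations look roughly like this (same caveat: the task type names are placeholders):

```go
// After the fix: heavy sage tasks no longer share a minute with each other.
register("0 * * * *", "contexts:refresh_stale", asynq.Queue("sage"))       // :00, unchanged
register("3 * * * *", "leaderboard:quiz_refresh", asynq.Queue("low"))      // :03
register("5,35 * * * *", "problems:process_dirty", asynq.Queue("sage"))    // :05 and :35
register("7,37 * * * *", "leaderboard:refresh", asynq.Queue("low"))        // :07 and :37
register("10,40 * * * *", "home_state:batch_refresh", asynq.Queue("sage")) // :10 and :40
```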
2. Task Timeouts
Added 5-minute timeouts to the two heaviest batch tasks. If a batch takes longer than 5 minutes, Asynq kills it and frees the worker — preventing indefinite blocking.
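In Asynq this can be expressed with the `asynq.Timeout` option on the scheduled task. A sketch using the same placeholder `register` helper (add "time" to the imports); note that the handler must respect the context cancellation for the worker slot to actually free up:

```go
// Cap the heaviest batch tasks at 5 minutes; when the timeout elapses the
// task's context is canceled and the task is failed/retried instead of
// blocking a worker indefinitely.
register("0 * * * *", "contexts:refresh_stale",
	asynq.Queue("sage"), asynq.Timeout(5*time.Minute))
register("10,40 * * * *", "home_state:batch_refresh",
	asynq.Queue("sage"), asynq.Timeout(5*time.Minute))
```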
Action Items
- Stagger all cron schedules — No two heavy sage tasks fire at the same minute
- Add timeouts to batch tasks — 5-minute cap on stale context and home state refreshes
- Tune PgBouncer query timeout — Evaluate reducing QUERY_TIMEOUT from 60s to 15-30s so queued queries fail fast instead of piling up
- Add connection pool monitoring — Alert when PgBouncer client wait time exceeds 5 seconds
- Add Asynq queue depth alerting — Alert when sage queue depth exceeds 50 for 2+ minutes
- Evaluate per-user task enqueueing — Instead of serial batch loops, enqueue individual lightweight tasks per user with deduplication (see the sketch after this list)
- Persist container logs — Configure Docker log driver to write to persistent storage so pre-crash logs survive reboots
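One possible shape for the per-user enqueueing item, sketched with Asynq's task-ID based deduplication. The function, task type, and payload shape are hypothetical, not code we have shipped:

```go
package tasks

import (
	"context"
	"encoding/json"
	"errors"
	"time"

	"github.com/hibiken/asynq"
)

// enqueueHomeStateRefreshes fans out one small task per user instead of one
// serial batch task. A deterministic TaskID deduplicates overlapping runs of
// the scheduler, so a slow wave can't stack duplicate work behind itself.
func enqueueHomeStateRefreshes(ctx context.Context, client *asynq.Client, userIDs []string) error {
	for _, id := range userIDs {
		payload, err := json.Marshal(map[string]string{"user_id": id})
		if err != nil {
			return err
		}
		_, err = client.EnqueueContext(ctx,
			asynq.NewTask("home_state:refresh_user", payload),
			asynq.Queue("sage"),
			asynq.TaskID("home_state:refresh_user:"+id), // dedup key
			asynq.Timeout(30*time.Second),
		)
		if err != nil && !errors.Is(err, asynq.ErrTaskIDConflict) {
			return err
		}
	}
	return nil
}
```

Each per-user task is small enough that API queries interleave between them, instead of waiting behind a single multi-second batch.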
Lessons Learned
- Cron alignment is a silent time bomb. Each feature added “just one more every-30-minutes job” without considering what else fires at the top and half of the hour. The aggregate effect was never assessed. When scheduling periodic tasks, always check what else runs at the same time.
- Batch tasks need timeouts and backpressure. A serial loop processing 37 users with no timeout can block a worker indefinitely. Bounded execution time and queue depth limits are essential safety nets.
- Pre-crash logs are invaluable. Because Docker container logs don’t survive recreation, we lost the exact error that preceded the outage. Persistent log storage (or a log shipping service) would have made diagnosis immediate instead of forensic.
- A 60-second query timeout can still cascade. Our PgBouncer had a 60-second query timeout, but with hundreds of queries queuing up within seconds, that’s long enough for the backlog to grow faster than it drains. A shorter timeout (15-30s) combined with application-level retries would let the system shed load and recover instead of collapsing.