
SEV-1: Thundering Herd Cron Scheduling Incident

Post-mortem of the Feb 21, 2026 outage caused by synchronized cron schedules overwhelming the database connection pool through PgBouncer.


CoderDen Team

Engineering

incident-report postgres pgbouncer reliability scheduling

Summary

On February 21, 2026, the CoderDen platform experienced a full API outage lasting approximately 5-10 minutes. All requests returned errors or timed out. The root cause was a thundering herd created by multiple heavy cron-scheduled tasks all firing at the exact same time, overwhelming our PgBouncer connection pool and starving API requests of database connections.

Timeline

  • Cron schedules align: 28 context refresh tasks, the home state batch (37 users), the leaderboard refresh, and dirty problems processing all fire simultaneously on the sage queue
  • All 3 sage workers are occupied, each running a serial loop of 5+ DB queries per user; ~185 queries hit PgBouncer within seconds
  • Batch processing completes; the system appears to recover briefly
  • Second wave fires: another round of home state refresh (38 users), leaderboard refresh, the stale contexts batch, and dirty problems, all at the :30 mark again
  • API becomes unresponsive; all user-facing requests fail or time out
  • Engineer identifies the system as unresponsive; no OOM detected, no panics in kernel logs
  • Full VPS reboot initiated; PgBouncer restarts
  • Backend restarts; the first attempt fails (DB still starting up), the second succeeds
  • Full recovery confirmed: all health checks passing, API serving requests

Root Cause

The Thundering Herd

Our Asynq-based task scheduler had accumulated multiple heavy periodic tasks over several feature launches. The problem: five of them were all scheduled at the :00 and :30 marks, creating a perfectly synchronized storm of database queries every 30 minutes.

Here’s what fired simultaneously at every :00:

Task                      Schedule      Queue  DB Impact
Refresh stale contexts    0 * * * *     sage   16-28 users x 6+ queries each
Batch home state refresh  */30 * * * *  sage   37-38 users x 5+ queries each
Process dirty problems    */30 * * * *  sage   Vector similarity queries
Quiz leaderboard refresh  0 * * * *     low    Heavy aggregation query
Leaderboard refresh       */30 * * * *  low    Materialized view refresh

With only 3 sage workers available, each processing users in a serial loop, the workers were occupied for 7+ seconds while hundreds of queries competed for PgBouncer’s 60 database connections.

Connection Pool Architecture

Our setup runs through PgBouncer in transaction pooling mode:

Go App (pgx pool: 50 conns) → PgBouncer (50+10 reserve = 60 DB conns) → PostgreSQL

Under normal conditions, this architecture handles load well — PgBouncer multiplexes connections efficiently. But when 185+ queries arrive within seconds from batch tasks, the queue depth inside PgBouncer grows rapidly. Meanwhile, API requests also need connections and get stuck waiting behind the batch queries. PgBouncer’s query timeout is set to 60 seconds — long enough that hundreds of queued queries can pile up before any are killed, turning a burst into a sustained blockage.
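The relevant knobs all live in pgbouncer.ini. A sketch of roughly how ours is shaped (values mirror the numbers above; this is illustrative, not a copy of our exact file):

```ini
; Illustrative pgbouncer.ini fragment matching the architecture above
[pgbouncer]
pool_mode = transaction      ; server conns are shared per-transaction
default_pool_size = 50       ; base server connections per db/user pair
reserve_pool_size = 10       ; extra conns released under pressure (50+10 = 60)
query_timeout = 60           ; seconds a query may run/queue before being killed
```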

Resolution

We deployed two categories of fixes:

1. Schedule Staggering

We offset all heavy tasks so they never compete for the same workers:

Task                      Before                  After
Stale contexts refresh    0 * * * * (:00)         0 * * * * (unchanged; the heaviest task gets priority)
Process dirty problems    */30 * * * * (:00/:30)  5,35 * * * * (:05/:35)
Leaderboard refresh       */30 * * * * (:00/:30)  7,37 * * * * (:07/:37)
Batch home state refresh  */30 * * * * (:00/:30)  10,40 * * * * (:10/:40)
Quiz leaderboard          0 * * * * (:00)         3 * * * * (:03)

The new schedule distributes work across the hour:

:00  stale_contexts (heavy — gets all 3 sage workers)
:03  quiz_leaderboard (low queue, separate workers)
:05  dirty_problems + LLM_metrics (sage workers now free)
:07  leaderboards (low queue)
:10  batch_home_states (sage workers definitely free)
:15  stale_embeddings (sage)

2. Task Timeouts

Added 5-minute timeouts to the two heaviest batch tasks. If a batch runs longer than 5 minutes, Asynq cancels it and frees the worker, preventing one slow batch from blocking a worker indefinitely.

Action Items

  • Stagger all cron schedules — No two heavy sage tasks fire at the same minute
  • Add timeouts to batch tasks — 5-minute cap on stale context and home state refreshes
  • Tune PgBouncer query timeout — Evaluate reducing QUERY_TIMEOUT from 60s to 15-30s so queued queries fail fast instead of piling up
  • Add connection pool monitoring — Alert when PgBouncer client wait time exceeds 5 seconds
  • Add Asynq queue depth alerting — Alert when sage queue depth exceeds 50 for 2+ minutes
  • Evaluate per-user task enqueueing — Instead of serial batch loops, enqueue individual lightweight tasks per user with deduplication
  • Persist container logs — Configure Docker log driver to write to persistent storage so pre-crash logs survive reboots

Lessons Learned

  1. Cron alignment is a silent time bomb. Each feature added “just one more every-30-minutes job” without considering what else fires at the top and half of the hour. The aggregate effect was never assessed. When scheduling periodic tasks, always check what else runs at the same time.

  2. Batch tasks need timeouts and backpressure. A serial loop processing 37 users with no timeout can block a worker indefinitely. Bounded execution time and queue depth limits are essential safety nets.

  3. Pre-crash logs are invaluable. Because Docker container logs don’t survive recreation, we lost the exact error that preceded the outage. Persistent log storage (or a log shipping service) would have made diagnosis immediate instead of forensic.

  4. A 60-second query timeout can still cascade. Our PgBouncer had a 60-second query timeout, but with hundreds of queries queuing up within seconds, that’s long enough for the backlog to grow faster than it drains. A shorter timeout (15-30s) combined with application-level retries would let the system shed load and recover instead of collapsing.