
SEV-1: Thundering Herd Cron Scheduling Incident

Post-mortem of the Feb 21, 2026 outage caused by synchronized cron schedules overwhelming the database connection pool through PgBouncer.


CoderDen Team

Engineering

incident-report postgres pgbouncer reliability scheduling

Summary

On February 21, 2026, the CoderDen platform experienced a full API outage lasting approximately 5-10 minutes. All requests returned errors or timed out. The root cause was a thundering herd created by multiple heavy cron-scheduled tasks all firing at the exact same time, overwhelming our PgBouncer connection pool and starving API requests of database connections.

Timeline

  • Cron schedules align: 28 context refresh tasks, the home state batch (37 users), the leaderboard refresh, and dirty problems processing all fire simultaneously on the sage queue
  • All 3 sage workers are occupied, each running a serial loop of 5+ DB queries per user; ~185 queries hit PgBouncer within seconds
  • Batch processing completes; the system appears to recover briefly
  • Second wave fires: another round of home state refresh (38 users), leaderboard refresh, the stale contexts batch, and dirty problems, all at the :30 mark again
  • API becomes unresponsive; all user-facing requests fail or time out
  • Engineer identifies the system as unresponsive; no OOM detected, no panics in kernel logs
  • Full VPS reboot initiated; PgBouncer restarts
  • Backend restarts; the first attempt fails (DB still starting up), the second succeeds
  • Full recovery confirmed: all health checks passing, API serving requests

Root Cause

The Thundering Herd

Our Asynq-based task scheduler had accumulated multiple heavy periodic tasks over several feature launches. The problem: five of them were all scheduled at the :00 and :30 marks, creating a perfectly synchronized storm of database queries every 30 minutes.

Here’s what fired simultaneously at every :00:

Task                      Schedule      Queue  DB Impact
Refresh stale contexts    0 * * * *     sage   16-28 users x 6+ queries each
Batch home state refresh  */30 * * * *  sage   37-38 users x 5+ queries each
Process dirty problems    */30 * * * *  sage   Vector similarity queries
Quiz leaderboard refresh  0 * * * *     low    Heavy aggregation query
Leaderboard refresh       */30 * * * *  low    Materialized view refresh

With only 3 sage workers available, each processing users in a serial loop, the workers were occupied for 7+ seconds while hundreds of queries competed for PgBouncer’s 60 database connections.

Connection Pool Architecture

Our setup runs through PgBouncer in transaction pooling mode:

Go App (pgx pool: 50 conns) → PgBouncer (50+10 reserve = 60 DB conns) → PostgreSQL

Under normal conditions, this architecture handles load well — PgBouncer multiplexes connections efficiently. But when 185+ queries arrive within seconds from batch tasks, the queue depth inside PgBouncer grows rapidly. Meanwhile, API requests also need connections and get stuck waiting behind the batch queries. PgBouncer’s query timeout is set to 60 seconds — long enough that hundreds of queued queries can pile up before any are killed, turning a burst into a sustained blockage.
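The relevant knobs all live in pgbouncer.ini. A sketch of roughly how ours is shaped (values mirror the numbers above; this is illustrative, not a copy of our exact file):

```ini
; Illustrative pgbouncer.ini fragment matching the architecture above
[pgbouncer]
pool_mode = transaction      ; server conns are shared per-transaction
default_pool_size = 50       ; base server connections per db/user pair
reserve_pool_size = 10       ; extra conns released under pressure (50+10 = 60)
query_timeout = 60           ; seconds a query may run/queue before being killed
```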

Resolution

We deployed two categories of fixes:

1. Schedule Staggering

We offset all heavy tasks so they never compete for the same workers:

Task                      Before                  After
Stale contexts refresh    0 * * * * (:00)         0 * * * * (unchanged; the heaviest task gets priority)
Process dirty problems    */30 * * * * (:00/:30)  5,35 * * * * (:05/:35)
Leaderboard refresh       */30 * * * * (:00/:30)  7,37 * * * * (:07/:37)
Batch home state refresh  */30 * * * * (:00/:30)  10,40 * * * * (:10/:40)
Quiz leaderboard          0 * * * * (:00)         3 * * * * (:03)

The new schedule distributes work across the hour:

:00  stale_contexts (heavy — gets all 3 sage workers)
:03  quiz_leaderboard (low queue, separate workers)
:05  dirty_problems + LLM_metrics (sage workers now free)
:07  leaderboards (low queue)
:10  batch_home_states (sage workers definitely free)
:15  stale_embeddings (sage)

2. Task Timeouts

Added 5-minute timeouts to the two heaviest batch tasks. If a batch runs longer than 5 minutes, Asynq cancels it and frees the worker, preventing one slow batch from blocking a worker indefinitely.

Action Items

  • Stagger all cron schedules — No two heavy sage tasks fire at the same minute
  • Add timeouts to batch tasks — 5-minute cap on stale context and home state refreshes
  • Tune PgBouncer query timeout — Evaluate reducing QUERY_TIMEOUT from 60s to 15-30s so queued queries fail fast instead of piling up
  • Add connection pool monitoring — Alert when PgBouncer client wait time exceeds 5 seconds
  • Add Asynq queue depth alerting — Alert when sage queue depth exceeds 50 for 2+ minutes
  • Evaluate per-user task enqueueing — Instead of serial batch loops, enqueue individual lightweight tasks per user with deduplication
  • Persist container logs — Configure Docker log driver to write to persistent storage so pre-crash logs survive reboots

Lessons Learned

  1. Cron alignment is a silent time bomb. Each feature added “just one more every-30-minutes job” without considering what else fires at the top and half of the hour. The aggregate effect was never assessed. When scheduling periodic tasks, always check what else runs at the same time.

  2. Batch tasks need timeouts and backpressure. A serial loop processing 37 users with no timeout can block a worker indefinitely. Bounded execution time and queue depth limits are essential safety nets.

  3. Pre-crash logs are invaluable. Because Docker container logs don’t survive recreation, we lost the exact error that preceded the outage. Persistent log storage (or a log shipping service) would have made diagnosis immediate instead of forensic.

  4. A 60-second query timeout can still cascade. Our PgBouncer had a 60-second query timeout, but with hundreds of queries queuing up within seconds, that’s long enough for the backlog to grow faster than it drains. A shorter timeout (15-30s) combined with application-level retries would let the system shed load and recover instead of collapsing.