SEV-1: Database Connection Exhaustion Incident

Post-mortem of the Feb 19, 2026 outage caused by aggressive background task scheduling and unclosed database connections.

Akash

Founder & Lead Engineer

Tags: incident-report, postgres, reliability

Summary

On February 19, 2026, the CoderDen platform experienced a full outage lasting approximately 45 minutes. All API requests returned 503 errors. The root cause was database connection pool exhaustion driven by two compounding issues:

  1. Overly aggressive background task scheduling — periodic jobs running every 1-10 minutes were saturating the PostgreSQL connection pool
  2. An unclosed database query in a high-traffic API endpoint — a missing defer rows.Close() caused connection leaks under load

Timeline

  • Monitoring alerts fire: API response times spike to >10s
  • Health checks begin failing; 503s across all endpoints
  • On-call engineer begins investigation; PostgreSQL max_connections at limit
  • Identified: background workers holding 80%+ of available connections
  • Identified: connection leak in a high-traffic endpoint amplifying the problem
  • Fix deployed: reduced cron frequencies across all scheduled tasks
  • Fix deployed: added proper connection cleanup and error handling
  • Connection pool returning to normal; APIs recovering
  • Full recovery confirmed; all health checks passing

Root Cause

Issue 1: Aggressive Scheduler Intervals

Our task scheduler had accumulated many periodic jobs over several feature launches, each with aggressive intervals. The total number of concurrent background workers was 14 across multiple queues — each holding a database connection during execution. With tasks firing every 1-10 minutes, the connection pool was perpetually near capacity.

The fix was straightforward: we audited every periodic task and relaxed intervals where real-time freshness wasn’t needed. Leaderboard refreshes went from every 10 minutes to every 30. Notification processing from every minute to every 5. Overall background task load was reduced by roughly 60%.
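The scheduler code isn't included in this write-up, but the shape of the change is simple to sketch. Assuming a cron-style scheduler such as robfig/cron/v3, with hypothetical task names and the example intervals above (not the real CoderDen job list), the relaxed configuration looks roughly like this:

```go
package main

import (
	"log"

	"github.com/robfig/cron/v3"
)

func refreshLeaderboards()  { log.Println("refreshing leaderboards") }
func processNotifications() { log.Println("processing notifications") }

func main() {
	c := cron.New()

	// Leaderboard refresh: was every 10 minutes, relaxed to every 30.
	if _, err := c.AddFunc("@every 30m", refreshLeaderboards); err != nil {
		log.Fatalf("schedule leaderboard refresh: %v", err)
	}
	// Notification processing: was every minute, relaxed to every 5.
	if _, err := c.AddFunc("@every 5m", processNotifications); err != nil {
		log.Fatalf("schedule notification processing: %v", err)
	}

	c.Start()
	select {} // block; a real service ties this into its shutdown path
}
```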

Issue 2: Connection Leak in a High-Traffic Endpoint

A frequently accessed API endpoint contained a raw SQL query where the error return was being silently discarded. When the query succeeded, the returned rows object wasn’t always closed — particularly under early-return or panic-recovery paths. This meant the underlying database connection was never returned to the pool.

Under normal load this went unnoticed. But under the pressure of a nearly-full connection pool, even a small leak rate compounded the problem and pushed us over the edge.

The fix: check the error, and always defer rows.Close() immediately after a successful query.
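The endpoint's actual code isn't reproduced here, but the before/after pattern with database/sql looks like this (the query, table, and Problem type are placeholders):

```go
package store

import (
	"database/sql"
	"fmt"
)

// Problem is a hypothetical row type for illustration; the real endpoint
// and schema are not shown in this post.
type Problem struct {
	ID    int64
	Title string
}

// Buggy shape (before): the error was discarded and rows was not always
// closed, so the underlying connection never went back to the pool.
//
//	rows, _ := db.Query("SELECT id, title FROM problems WHERE active = true")
//	for rows.Next() { ... }

// listProblems shows the fixed shape: check the error, then defer
// rows.Close() immediately so every exit path releases the connection.
func listProblems(db *sql.DB) ([]Problem, error) {
	rows, err := db.Query("SELECT id, title FROM problems WHERE active = true")
	if err != nil {
		return nil, fmt.Errorf("query problems: %w", err)
	}
	defer rows.Close() // connection returns to the pool on every path, including early returns and panics

	var problems []Problem
	for rows.Next() {
		var p Problem
		if err := rows.Scan(&p.ID, &p.Title); err != nil {
			return nil, fmt.Errorf("scan problem: %w", err)
		}
		problems = append(problems, p)
	}
	return problems, rows.Err()
}
```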

Resolution

Two changes were deployed together:

  1. Reduced all scheduler intervals — Tasks that don’t need real-time freshness were relaxed to longer intervals. Total background task load reduced by ~60%.

  2. Fixed the connection leak — Added proper error checking and deferred cleanup on the affected endpoint. Also audited other raw queries across the codebase for similar patterns.

Action Items

  • Add connection pool metrics to monitoring — Alert when pool utilization exceeds 70% (a sketch of pool metrics and tuning follows this list)
  • Static analysis for unclosed rows — Add a lint rule or code review checklist item for raw database queries that don’t use deferred cleanup
  • Load test background task frequency — Simulate production-like task volume in staging before deploying new periodic jobs
  • Tune connection pool settings — Review idle connection limits and max connection lifetime for better resource reuse
  • Evaluate task deduplication — Prevent overlapping executions of the same background task
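
The first and fourth action items lend themselves to a small sketch. Neither our monitoring stack nor the final pool settings appear in this post, so the numbers below are illustrative only; the mechanism is database/sql's built-in DBStats plus the standard pool knobs:

```go
package dbmon

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// configurePool applies the kind of tuning the action items describe.
// The values are illustrative, not the ones CoderDen actually runs with.
func configurePool(db *sql.DB) {
	db.SetMaxOpenConns(50)                  // hard cap below PostgreSQL's max_connections
	db.SetMaxIdleConns(10)                  // keep a few warm connections for reuse
	db.SetConnMaxLifetime(30 * time.Minute) // recycle long-lived connections
	db.SetConnMaxIdleTime(5 * time.Minute)  // release idle connections sooner
}

// watchPool samples sql.DBStats and logs when utilization crosses 70%,
// standing in for whatever alerting backend is wired up in production.
func watchPool(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(15 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			stats := db.Stats()
			if stats.MaxOpenConnections == 0 {
				continue // unbounded pool; utilization is not meaningful
			}
			utilization := float64(stats.InUse) / float64(stats.MaxOpenConnections)
			if utilization > 0.70 {
				log.Printf("connection pool at %.0f%% utilization (in use: %d, waiting: %d)",
					utilization*100, stats.InUse, stats.WaitCount)
			}
		}
	}
}
```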

Lessons Learned

  1. Background task frequency compounds — Each new feature added “just one more cron job,” but the aggregate effect on the connection pool was never assessed holistically.

  2. Discarding errors is a bug — Silently ignoring error returns from database calls masked a critical signal. In Go, always handle errors.

  3. Production load patterns differ from staging — The scheduler ran fine in staging because there was no concurrent user traffic competing for connections. Load testing should include background task simulation.