SEV-1: Redis Freeze — KVM Host Contention on Shared VPS

Post-mortem of the recurring Redis outages from Feb 21-24, 2026, caused by KVM hypervisor memory overcommit on a shared VPS, which froze Redis in kernel D-state and blocked all Asynq task processing.

CoderDen Team

Engineering

incident-report redis infrastructure reliability devops

Summary

Between February 21-24, 2026, the CoderDen platform experienced three separate full outages, each lasting 4-17 hours. All Asynq background task processing stopped, AI course generation returned 500 errors, and users couldn’t access features dependent on Redis. The root cause was KVM hypervisor memory overcommit on our shared VPS host — the physical server was paging out our VM’s memory, causing Redis to enter an unrecoverable kernel D-state (uninterruptible sleep) during RDB snapshot forks. Each outage required a full VPS reboot to recover.

Timeline

  • Redis completes its last successful RDB snapshot. Logs stop — no crash, no error, no OOM. The process silently enters kernel D-state.
  • Backend starts logging 'dial tcp 10.0.1.9:6379: i/o timeout' across all Asynq queues. Health check returns 200 (bug — Redis failure wasn't flagged as unhealthy).
  • User hits the AI course generation endpoint and gets a 500 — 'failed to store session in Redis'. User retries 4 times; all fail.
  • Engineer investigates. Redis container shows 'Up 12 hours (unhealthy)'. docker exec redis-cli ping hangs indefinitely.
  • docker kill and docker restart both hang — the process is in kernel D-state, unkillable without a reboot.
  • Full VPS reboot. All services recover. Redis loads the last RDB snapshot from 17:48.
  • Second freeze. Identical pattern — Redis RDB save succeeds, then silence. The process enters D-state.
  • Second VPS reboot to recover.
  • Third freeze. Redis container shows 6,100 stuck PIDs, 0% CPU, 140MB memory. Completely frozen.
  • dmesg analysis reveals kvm_async_pf_task_wait_schedule — hypervisor page fault stalls. Root cause identified.
  • Third VPS reboot. RDB saves disabled to prevent future fork-triggered freezes. Hostinger support ticket filed.

Root Cause

KVM Async Page Fault Stalls

Our VPS runs on a shared KVM hypervisor at Hostinger. The physical host machine is overcommitted — it allocates more virtual memory to guest VMs than it has physical RAM. When other VMs on the same host consume memory, the hypervisor pages out our VM’s memory to the host’s disk.

This is invisible to our VM. Our system reports 12GB of RAM free. But at the hypervisor level, those memory pages may not be backed by physical RAM.
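One signal does leak through to the guest: CPU steal time, reported as the eighth value on the aggregate `cpu` line of /proc/stat (fields: user, nice, system, idle, iowait, irq, softirq, steal, ...). As a rough sketch of what a %steal check looks like — the function name and the sample tick values below are illustrative, not from our incident data:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stealPercent computes the share of CPU time stolen by the hypervisor
// between two samples of the aggregate "cpu " line from /proc/stat.
func stealPercent(before, after string) (float64, error) {
	parse := func(line string) ([]uint64, error) {
		fields := strings.Fields(line)
		if len(fields) < 9 || fields[0] != "cpu" {
			return nil, fmt.Errorf("not an aggregate cpu line: %q", line)
		}
		vals := make([]uint64, 0, len(fields)-1)
		for _, f := range fields[1:] {
			v, err := strconv.ParseUint(f, 10, 64)
			if err != nil {
				return nil, err
			}
			vals = append(vals, v)
		}
		return vals, nil
	}
	b, err := parse(before)
	if err != nil {
		return 0, err
	}
	a, err := parse(after)
	if err != nil {
		return 0, err
	}
	var totalDelta uint64
	for i := range a {
		totalDelta += a[i] - b[i]
	}
	if totalDelta == 0 {
		return 0, nil
	}
	stealDelta := a[7] - b[7] // field 8 of /proc/stat is steal ticks
	return 100 * float64(stealDelta) / float64(totalDelta), nil
}

func main() {
	// Hypothetical samples taken a few seconds apart on a contended host.
	before := "cpu 100 0 50 800 10 0 5 35 0 0"
	after := "cpu 150 0 70 900 15 0 10 155 0 0"
	pct, _ := stealPercent(before, after)
	fmt.Printf("steal: %.1f%%\n", pct) // → steal: 40.0%
}
```

On a healthy dedicated host %steal sits near zero; sustained double-digit values mean the hypervisor is giving your "free" CPU and RAM to someone else.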

The kernel logs told the full story:

INFO: task node:2794 blocked for more than 614 seconds.
  kvm_async_pf_task_wait_schedule+0x171/0x1b0
  __kvm_handle_async_pf+0x5c/0xe0
  exc_page_fault+0xb6/0x1b0

This call stack means:

  1. exc_page_fault — A process accessed a memory address not in physical RAM
  2. __kvm_handle_async_pf — The KVM hypervisor intercepted the fault — the page was swapped out at the host level
  3. kvm_async_pf_task_wait_schedule — The kernel put the process to sleep waiting for the host to fetch the page from disk
  4. blocked for 614 seconds — The host never fetched it fast enough. Process stuck for 10+ minutes
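Our triage amounted to grepping dmesg for the hung-task watchdog lines. A minimal sketch of that step — the regex targets the standard kernel message format, and the sample input is the log excerpt above:

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// hungTaskRe matches the kernel hung-task watchdog message, e.g.
// "INFO: task node:2794 blocked for more than 614 seconds."
var hungTaskRe = regexp.MustCompile(`INFO: task (\S+) blocked for more than (\d+) seconds`)

// findHungTasks scans dmesg output and returns the processes the
// kernel reports as stuck in uninterruptible sleep (D-state).
func findHungTasks(dmesg string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(dmesg))
	for sc.Scan() {
		if m := hungTaskRe.FindStringSubmatch(sc.Text()); m != nil {
			hits = append(hits, fmt.Sprintf("task=%s blocked=%ss", m[1], m[2]))
		}
	}
	return hits
}

func main() {
	sample := `[12345.6] INFO: task node:2794 blocked for more than 614 seconds.
[12345.7]  kvm_async_pf_task_wait_schedule+0x171/0x1b0`
	for _, h := range findHungTasks(sample) {
		fmt.Println(h) // → task=node:2794 blocked=614s
	}
}
```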

Why Redis Specifically?

Every 5 minutes, Redis performs an RDB snapshot by calling fork(). The child process shares the parent’s memory pages via copy-on-write and iterates through all of them to serialize data to disk. This is the one moment Redis touches all its memory pages at once.

On a healthy server, this takes milliseconds. On an overcommitted host, it’s a lottery. If the hypervisor has paged out any of those pages, the forked child hits a kvm_async_pf, enters D-state, and since it holds a lock on the RDB file, the parent process also stalls. Redis becomes completely unresponsive.

Normal server:
  fork() → touch all pages → pages in RAM → done in 80ms

Our VPS:
  fork() → touch all pages → page not in RAM
         → kvm_async_pf → wait for host disk
         → host disk is slow (shared) → wait forever
         → D-state → unkillable without reboot

What Didn’t Cause This

We initially suspected our Asynq cron jobs (thundering herd), but the data ruled it out:

Metric at freeze time   Value              Thundering herd?
CPU utilization         85% idle           No — server was barely working
Redis memory            6.84 MB            No — trivially small
Redis keys              1,782              No — tiny dataset
Redis clients           17                 No — light load
Disk I/O                0.3-0.5% util      No — no disk pressure
Slowlog                 10-20ms dequeues   No — normal Asynq operations

The freezes also didn’t correlate with cron schedules. They happened at 17:48, 16:58, and 10:18 — no alignment with our :00/:05/:15/:30/:45 cron windows.

A Hidden Bug: Health Check

During the first outage, our /status health check endpoint returned HTTP 200 while Redis was completely down. The code checked Redis health and logged the failure, but never set the isHealthy flag to false:

// Before: Redis failure was logged but didn't affect the response
if err := h.server.Redis.Ping(ctx).Err(); err != nil {
    checks["redis"] = map[string]interface{}{
        "status": "unhealthy",
        "error":  err.Error(),
    }
    // BUG: isHealthy was never set to false here
    logger.Error().Err(err).Msg("redis health check failed")
}

This meant our container orchestrator (Coolify) kept the backend running and routing traffic to it, even though it couldn’t serve Redis-dependent requests. Users saw 500 errors while monitoring showed “healthy.”

Resolution

Immediate (during incidents)

  • VPS reboots to recover from D-state (3 reboots over 4 days)
  • Disabled RDB snapshots (CONFIG SET save "") to eliminate the fork that triggers page faults
  • Installed Uptime Kuma with Redis TCP monitoring and Discord alerts for immediate notification

Code Fixes

  • Health check bug fix: Added isHealthy = false when Redis ping fails, so /status returns 503 and the orchestrator can detect and restart the container
  • Redis slowlog enabled: slowlog-log-slower-than 10000 for ongoing monitoring

Infrastructure

  • Filed support ticket with Hostinger including kvm_async_pf evidence, %steal metrics, and hung task logs
  • Hostinger confirmed host contention on shared VPS — no dedicated vCPU tier available on their platform
  • Installed sysstat for continuous I/O and CPU history collection

Action Items

  • Disable RDB saves — Eliminates the fork syscall that triggers kvm_async_pf during host contention
  • Fix health check — Redis failure now returns 503, enabling automatic container restart
  • Enable Redis slowlog — Captures evidence of slow commands for future diagnosis
  • Deploy Uptime Kuma — Redis TCP monitor + Discord notifications for instant alerting
  • Install sysstat — Host-level CPU/IO/memory history for forensic analysis
  • File Hostinger ticket — Documented with kernel evidence for host migration request
  • Migrate Redis to managed service — Upstash or similar to isolate Redis from VPS host issues
  • Evaluate hosting migration — Move to a provider with dedicated vCPU (Hetzner CX/CCX series) to eliminate %steal entirely
  • Deploy health check fix to production — Currently in dev branch, needs to ship
  • Add Redis connection circuit breaker — Fail fast on Redis-dependent endpoints instead of waiting for i/o timeout

Lessons Learned

  1. Shared VPS is unsuitable for latency-sensitive services. Redis, Asynq workers, and WebSocket servers are all sensitive to CPU steal and memory page faults. A shared hypervisor that overcommits physical resources will randomly freeze these services in ways that are impossible to predict or prevent from inside the VM. The %steal metric should be monitored as a first-class SLI.

  2. Health checks must cover all critical dependencies. Our health check logged the Redis failure but returned 200 anyway. For 17 hours during the first outage, our orchestrator thought the service was healthy. A health check that doesn’t fail when dependencies are down is worse than no health check — it provides false confidence.

  3. kvm_async_pf is invisible from inside the VM. None of our application metrics showed anything wrong before each freeze. CPU was idle, memory was free, disk was quiet. The only evidence was in dmesg (kernel page fault logs) and sar (%steal). If you’re on a shared VPS and processes randomly enter D-state, check these two things first.

  4. RDB fork is Redis’s Achilles heel on overcommitted hosts. The fork() + copy-on-write pattern that makes RDB snapshots efficient on dedicated hardware becomes a liability on shared infrastructure. The child process must touch every memory page, and if any page is swapped out at the host level, the entire process can freeze. Disabling RDB saves or using AOF (append-only file) avoids this entirely.

  5. Monitoring infrastructure should be independent of what it monitors. Our initial monitoring ran on the same VPS. When the VPS froze, we had no alerts. Adding Uptime Kuma with external health checks and Discord notifications meant we learned about the third outage within 30 seconds instead of hours.