SEV-1: Redis Freeze — KVM Host Contention on Shared VPS

Post-mortem of the recurring Redis outages from Feb 21-24, 2026, caused by KVM hypervisor memory overcommit on a shared VPS, which froze Redis in kernel D-state and blocked all Asynq task processing.

CoderDen Team

Engineering

incident-report redis infrastructure reliability devops

Summary

Between February 21-24, 2026, the CoderDen platform experienced three separate full outages, each lasting 4-17 hours. All Asynq background task processing stopped, AI course generation returned 500 errors, and users couldn’t access features dependent on Redis. The root cause was KVM hypervisor memory overcommit on our shared VPS host — the physical server was paging out our VM’s memory, causing Redis to enter an unrecoverable kernel D-state (uninterruptible sleep) during RDB snapshot forks. Each outage required a full VPS reboot to recover.

Timeline

  • Redis completes its last successful RDB snapshot. Logs stop — no crash, no error, no OOM. The process silently enters kernel D-state.
  • Backend starts logging 'dial tcp 10.0.1.9:6379: i/o timeout' across all Asynq queues. Health check returns 200 (bug — Redis failure wasn't flagged as unhealthy).
  • User hits the AI course generation endpoint and gets a 500 — 'failed to store session in Redis'. User retries 4 times; all fail.
  • Engineer investigates. Redis container shows 'Up 12 hours (unhealthy)'. docker exec redis-cli ping hangs indefinitely.
  • docker kill and docker restart both hang — the process is in kernel D-state, unkillable without a reboot.
  • Full VPS reboot. All services recover. Redis loads the last RDB snapshot from 17:48.
  • Second freeze. Identical pattern — Redis RDB save succeeds, then silence. The process enters D-state.
  • Second VPS reboot to recover.
  • Third freeze. Redis container shows 6,100 stuck PIDs, 0% CPU, 140MB memory. Completely frozen.
  • dmesg analysis reveals kvm_async_pf_task_wait_schedule — hypervisor page fault stalls. Root cause identified.
  • Third VPS reboot. RDB saves disabled to prevent future fork-triggered freezes. Hostinger support ticket filed.

Root Cause

KVM Async Page Fault Stalls

Our VPS runs on a shared KVM hypervisor at Hostinger. The physical host machine is overcommitted — it allocates more virtual memory to guest VMs than it has physical RAM. When other VMs on the same host consume memory, the hypervisor pages out our VM’s memory to the host’s disk.

This is invisible to our VM. Our system reports 12GB of RAM free. But at the hypervisor level, those memory pages may not be backed by physical RAM.
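One signal does leak through to the guest: CPU steal time, reported as the eighth value on the aggregate `cpu` line of /proc/stat (fields: user, nice, system, idle, iowait, irq, softirq, steal, ...). As a rough sketch of what a %steal check looks like — the function name and the sample tick values below are illustrative, not from our incident data:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// stealPercent computes the share of CPU time stolen by the hypervisor
// between two samples of the aggregate "cpu " line from /proc/stat.
func stealPercent(before, after string) (float64, error) {
	parse := func(line string) ([]uint64, error) {
		fields := strings.Fields(line)
		if len(fields) < 9 || fields[0] != "cpu" {
			return nil, fmt.Errorf("not an aggregate cpu line: %q", line)
		}
		vals := make([]uint64, 0, len(fields)-1)
		for _, f := range fields[1:] {
			v, err := strconv.ParseUint(f, 10, 64)
			if err != nil {
				return nil, err
			}
			vals = append(vals, v)
		}
		return vals, nil
	}
	b, err := parse(before)
	if err != nil {
		return 0, err
	}
	a, err := parse(after)
	if err != nil {
		return 0, err
	}
	var totalDelta uint64
	for i := range a {
		totalDelta += a[i] - b[i]
	}
	if totalDelta == 0 {
		return 0, nil
	}
	stealDelta := a[7] - b[7] // field 8 of /proc/stat is steal ticks
	return 100 * float64(stealDelta) / float64(totalDelta), nil
}

func main() {
	// Hypothetical samples taken a few seconds apart on a contended host.
	before := "cpu 100 0 50 800 10 0 5 35 0 0"
	after := "cpu 150 0 70 900 15 0 10 155 0 0"
	pct, _ := stealPercent(before, after)
	fmt.Printf("steal: %.1f%%\n", pct) // → steal: 40.0%
}
```

On a healthy dedicated host %steal sits near zero; sustained double-digit values mean the hypervisor is giving your "free" CPU and RAM to someone else.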

The kernel logs told the full story:

INFO: task node:2794 blocked for more than 614 seconds.
  kvm_async_pf_task_wait_schedule+0x171/0x1b0
  __kvm_handle_async_pf+0x5c/0xe0
  exc_page_fault+0xb6/0x1b0

This call stack means:

  1. exc_page_fault — A process accessed a memory address not in physical RAM
  2. __kvm_handle_async_pf — The KVM hypervisor intercepted the fault — the page was swapped out at the host level
  3. kvm_async_pf_task_wait_schedule — The kernel put the process to sleep waiting for the host to fetch the page from disk
  4. blocked for 614 seconds — The host never fetched it fast enough. Process stuck for 10+ minutes
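Our triage amounted to grepping dmesg for the hung-task watchdog lines. A minimal sketch of that step — the regex targets the standard kernel message format, and the sample input is the log excerpt above:

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// hungTaskRe matches the kernel hung-task watchdog message, e.g.
// "INFO: task node:2794 blocked for more than 614 seconds."
var hungTaskRe = regexp.MustCompile(`INFO: task (\S+) blocked for more than (\d+) seconds`)

// findHungTasks scans dmesg output and returns the processes the
// kernel reports as stuck in uninterruptible sleep (D-state).
func findHungTasks(dmesg string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(dmesg))
	for sc.Scan() {
		if m := hungTaskRe.FindStringSubmatch(sc.Text()); m != nil {
			hits = append(hits, fmt.Sprintf("task=%s blocked=%ss", m[1], m[2]))
		}
	}
	return hits
}

func main() {
	sample := `[12345.6] INFO: task node:2794 blocked for more than 614 seconds.
[12345.7]  kvm_async_pf_task_wait_schedule+0x171/0x1b0`
	for _, h := range findHungTasks(sample) {
		fmt.Println(h) // → task=node:2794 blocked=614s
	}
}
```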

Why Redis Specifically?

Every 5 minutes, Redis performs an RDB snapshot by calling fork(). The child process shares the parent’s memory pages via copy-on-write and iterates through all of them to serialize data to disk. This is the one moment Redis touches all its memory pages at once.

On a healthy server, this takes milliseconds. On an overcommitted host, it’s a lottery. If the hypervisor has paged out any of those pages, the forked child hits a kvm_async_pf, enters D-state, and since it holds a lock on the RDB file, the parent process also stalls. Redis becomes completely unresponsive.

Normal server:
  fork() → touch all pages → pages in RAM → done in 80ms

Our VPS:
  fork() → touch all pages → page not in RAM
         → kvm_async_pf → wait for host disk
         → host disk is slow (shared) → wait forever
         → D-state → unkillable without reboot

What Didn’t Cause This

We initially suspected our Asynq cron jobs (thundering herd), but the data ruled it out:

Metric at freeze time   Value              Thundering herd?
CPU utilization         85% idle           No — server was barely working
Redis memory            6.84 MB            No — trivially small
Redis keys              1,782              No — tiny dataset
Redis clients           17                 No — light load
Disk I/O                0.3-0.5% util      No — no disk pressure
Slowlog                 10-20ms dequeues   No — normal Asynq operations

The freezes also didn’t correlate with cron schedules. They happened at 17:48, 16:58, and 10:18 — no alignment with our :00/:05/:15/:30/:45 cron windows.

A Hidden Bug: Health Check

During the first outage, our /status health check endpoint returned HTTP 200 while Redis was completely down. The code checked Redis health and logged the failure, but never set the isHealthy flag to false:

// Before: Redis failure was logged but didn't affect the response
if err := h.server.Redis.Ping(ctx).Err(); err != nil {
    checks["redis"] = map[string]interface{}{
        "status": "unhealthy",
        "error":  err.Error(),
    }
    // BUG: isHealthy was never set to false here
    logger.Error().Err(err).Msg("redis health check failed")
}

This meant our container orchestrator (Coolify) kept the backend running and routing traffic to it, even though it couldn’t serve Redis-dependent requests. Users saw 500 errors while monitoring showed “healthy.”

Resolution

Immediate (during incidents)

  • VPS reboots to recover from D-state (3 reboots over 4 days)
  • Disabled RDB snapshots (CONFIG SET save "") to eliminate the fork that triggers page faults
  • Installed Uptime Kuma with Redis TCP monitoring and Discord alerts for immediate notification

Code Fixes

  • Health check bug fix: Added isHealthy = false when Redis ping fails, so /status returns 503 and the orchestrator can detect and restart the container
  • Redis slowlog enabled: slowlog-log-slower-than 10000 for ongoing monitoring

Infrastructure

  • Filed support ticket with Hostinger including kvm_async_pf evidence, %steal metrics, and hung task logs
  • Hostinger confirmed host contention on shared VPS — no dedicated vCPU tier available on their platform
  • Installed sysstat for continuous I/O and CPU history collection

Action Items

  • Disable RDB saves — Eliminates the fork syscall that triggers kvm_async_pf during host contention
  • Fix health check — Redis failure now returns 503, enabling automatic container restart
  • Enable Redis slowlog — Captures evidence of slow commands for future diagnosis
  • Deploy Uptime Kuma — Redis TCP monitor + Discord notifications for instant alerting
  • Install sysstat — Host-level CPU/IO/memory history for forensic analysis
  • File Hostinger ticket — Documented with kernel evidence for host migration request
  • Migrate Redis to managed service — Upstash or similar to isolate Redis from VPS host issues
  • Evaluate hosting migration — Move to a provider with dedicated vCPU (Hetzner CX/CCX series) to eliminate %steal entirely
  • Deploy health check fix to production — Currently in dev branch, needs to ship
  • Add Redis connection circuit breaker — Fail fast on Redis-dependent endpoints instead of waiting for i/o timeout

Lessons Learned

  1. Shared VPS is unsuitable for latency-sensitive services. Redis, Asynq workers, and WebSocket servers are all sensitive to CPU steal and memory page faults. A shared hypervisor that overcommits physical resources will randomly freeze these services in ways that are impossible to predict or prevent from inside the VM. The %steal metric should be monitored as a first-class SLI.

  2. Health checks must cover all critical dependencies. Our health check logged the Redis failure but returned 200 anyway. For 17 hours during the first outage, our orchestrator thought the service was healthy. A health check that doesn’t fail when dependencies are down is worse than no health check — it provides false confidence.

  3. kvm_async_pf is invisible from inside the VM. None of our application metrics showed anything wrong before each freeze. CPU was idle, memory was free, disk was quiet. The only evidence was in dmesg (kernel page fault logs) and sar (%steal). If you’re on a shared VPS and processes randomly enter D-state, check these two things first.

  4. RDB fork is Redis’s Achilles heel on overcommitted hosts. The fork() + copy-on-write pattern that makes RDB snapshots efficient on dedicated hardware becomes a liability on shared infrastructure. The child process must touch every memory page, and if any page is swapped out at the host level, the entire process can freeze. Disabling RDB saves or using AOF (append-only file) avoids this entirely.

  5. Monitoring infrastructure should be independent of what it monitors. Our initial monitoring ran on the same VPS. When the VPS froze, we had no alerts. Adding Uptime Kuma with external health checks and Discord notifications meant we learned about the third outage within 30 seconds instead of hours.