SEV-1: Redis Freeze — KVM Host Contention on Shared VPS
Post-mortem of the recurring Redis outages from Feb 21-24, 2026 caused by KVM hypervisor memory overcommit on a shared VPS, freezing Redis in kernel D-state and blocking all Asynq task processing.
CoderDen Team
Engineering
Summary
Between February 21-24, 2026, the CoderDen platform experienced three separate full outages, each lasting 4-17 hours. All Asynq background task processing stopped, AI course generation returned 500 errors, and users couldn’t access features dependent on Redis. The root cause was KVM hypervisor memory overcommit on our shared VPS host — the physical server was paging out our VM’s memory, causing Redis to enter an unrecoverable kernel D-state (uninterruptible sleep) during RDB snapshot forks. Each outage required a full VPS reboot to recover.
Timeline
Redis completes its last successful RDB snapshot. Logs stop — no crash, no error, no OOM. Process silently enters kernel D-state
Backend starts logging 'dial tcp 10.0.1.9:6379: i/o timeout' across all Asynq queues. Health check returns 200 (bug — Redis failure wasn't flagged as unhealthy)
User hits AI course generation endpoint, gets 500 — 'failed to store session in Redis'. User retries 4 times, all fail
Engineer investigates. Redis container shows 'Up 12 hours (unhealthy)'. docker exec redis-cli ping hangs indefinitely
docker kill and docker restart both hang — process is in kernel D-state, unkillable without reboot
Full VPS reboot. All services recover. Redis loads last RDB snapshot from 17:48
Second freeze. Identical pattern — Redis RDB save succeeds, then silence. Process enters D-state
Second VPS reboot to recover
Third freeze. Redis container shows 6,100 stuck PIDs, 0% CPU, 140MB memory. Completely frozen
dmesg analysis reveals kvm_async_pf_task_wait_schedule — hypervisor page fault stalls. Root cause identified
Third VPS reboot. RDB saves disabled to prevent future fork-triggered freezes. Hostinger support ticket filed
Root Cause
KVM Async Page Fault Stalls
Our VPS runs on a shared KVM hypervisor at Hostinger. The physical host machine is overcommitted — it allocates more virtual memory to guest VMs than it has physical RAM. When other VMs on the same host consume memory, the hypervisor pages out our VM’s memory to the host’s disk.
This is invisible to our VM. Our system reports 12GB of RAM free. But at the hypervisor level, those memory pages may not be backed by physical RAM.
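One counter that does leak through to the guest is steal time: CPU cycles the hypervisor gave to someone else while our VM wanted to run. A quick way to read it from inside the VM (a sketch; `sar -u` from sysstat reports the same signal as `%steal` over time):

```shell
# Steal time is the 9th field of the aggregate "cpu" line in /proc/stat:
# user nice system idle iowait irq softirq steal guest guest_nice
awk '/^cpu /{print "steal ticks since boot:", $9}' /proc/stat
```

A steadily climbing steal counter on an "idle" VM is a strong hint that the physical host, not your workload, is the bottleneck.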
The kernel logs told the full story:
INFO: task node:2794 blocked for more than 614 seconds.
kvm_async_pf_task_wait_schedule+0x171/0x1b0
__kvm_handle_async_pf+0x5c/0xe0
exc_page_fault+0xb6/0x1b0
This call stack means:
- `exc_page_fault` — A process accessed a memory address not in physical RAM
- `__kvm_handle_async_pf` — The KVM hypervisor intercepted the fault; the page was swapped out at the host level
- `kvm_async_pf_task_wait_schedule` — The kernel put the process to sleep waiting for the host to fetch the page from disk
- `blocked for 614 seconds` — The host never fetched it fast enough; the process was stuck for 10+ minutes
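If `ps` itself still responds during an incident, you can spot D-state tasks and the kernel function they are blocked in (a sketch using standard procps columns):

```shell
# List tasks in uninterruptible sleep (STAT starts with "D") along with
# the kernel function they are blocked in (wchan).
ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'
```

During our freezes, the wchan column for Redis would have pointed at the same `kvm_async_pf` wait seen in `dmesg`.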
Why Redis Specifically?
Every 5 minutes, Redis performs an RDB snapshot by calling fork(). The child process shares the parent’s memory pages via copy-on-write and iterates through all of them to serialize data to disk. This is the one moment Redis touches all its memory pages at once.
On a healthy server, this takes milliseconds. On an overcommitted host, it’s a lottery. If the hypervisor has paged out any of those pages, the forked child hits a kvm_async_pf, enters D-state, and since it holds a lock on the RDB file, the parent process also stalls. Redis becomes completely unresponsive.
Normal server:
fork() → touch all pages → pages in RAM → done in 80ms
Our VPS:
fork() → touch all pages → page not in RAM
→ kvm_async_pf → wait for host disk
→ host disk is slow (shared) → wait forever
→ D-state → unkillable without reboot
What Didn’t Cause This
We initially suspected our Asynq cron jobs (thundering herd), but the data ruled it out:
| Metric at freeze time | Value | Thundering herd? |
|---|---|---|
| CPU | 15% utilized (85% idle) | No — server was barely working |
| Redis memory | 6.84 MB | No — trivially small |
| Redis keys | 1,782 | No — tiny dataset |
| Redis clients | 17 | No — light load |
| Disk I/O | 0.3-0.5% util | No — no disk pressure |
| Slowlog | 10-20ms dequeues | No — normal Asynq operations |
The freezes also didn’t correlate with cron schedules. They happened at 17:48, 16:58, and 10:18 — no alignment with our :00/:05/:15/:30/:45 cron windows.
A Hidden Bug: Health Check
During the first outage, our /status health check endpoint returned HTTP 200 while Redis was completely down. The code checked Redis health and logged the failure, but never set the isHealthy flag to false:
// Before: Redis failure was logged but didn't affect the response
if err := h.server.Redis.Ping(ctx).Err(); err != nil {
checks["redis"] = map[string]interface{}{
"status": "unhealthy",
"error": err.Error(),
}
// BUG: isHealthy was never set to false here
logger.Error().Err(err).Msg("redis health check failed")
}
This meant our container orchestrator (Coolify) kept the backend running and routing traffic to it, even though it couldn’t serve Redis-dependent requests. Users saw 500 errors while monitoring showed “healthy.”
Resolution
Immediate (during incidents)
- VPS reboots to recover from D-state (3 reboots over 4 days)
- Disabled RDB snapshots (`CONFIG SET save ""`) to eliminate the fork that triggers page faults
- Installed Uptime Kuma with Redis TCP monitoring and Discord alerts for immediate notification
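`CONFIG SET` does not survive a restart, so the same change belongs in `redis.conf` as well. A sketch (whether to enable AOF is a separate durability decision for your workload):

```
# redis.conf — disable RDB snapshotting so BGSAVE never forks
save ""

# Optional: AOF provides durability without RDB's full-memory fork.
# Note: AOF background rewrites (BGREWRITEAOF) also fork, so tune or
# schedule them deliberately on contended hosts.
appendonly yes
```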
Code Fixes
- Health check bug fix: Added `isHealthy = false` when Redis ping fails, so `/status` returns 503 and the orchestrator can detect and restart the container
- Redis slowlog enabled: `slowlog-log-slower-than 10000` (microseconds, i.e. 10ms) for ongoing monitoring
Infrastructure
- Filed support ticket with Hostinger including `kvm_async_pf` evidence, `%steal` metrics, and hung task logs
- Hostinger confirmed host contention on shared VPS — no dedicated vCPU tier available on their platform
- Installed `sysstat` for continuous I/O and CPU history collection
Action Items
- Disable RDB saves — Eliminates the fork syscall that triggers `kvm_async_pf` during host contention
- Fix health check — Redis failure now returns 503, enabling automatic container restart
- Enable Redis slowlog — Captures evidence of slow commands for future diagnosis
- Deploy Uptime Kuma — Redis TCP monitor + Discord notifications for instant alerting
- Install sysstat — Host-level CPU/IO/memory history for forensic analysis
- File Hostinger ticket — Documented with kernel evidence for host migration request
- Migrate Redis to managed service — Upstash or similar to isolate Redis from VPS host issues
- Evaluate hosting migration — Move to a provider with dedicated vCPU (e.g. Hetzner's CCX series) to eliminate `%steal` entirely
- Deploy health check fix to production — Currently in dev branch, needs to ship
- Add Redis connection circuit breaker — Fail fast on Redis-dependent endpoints instead of waiting for i/o timeout
Lessons Learned
- Shared VPS is unsuitable for latency-sensitive services. Redis, Asynq workers, and WebSocket servers are all sensitive to CPU steal and memory page faults. A shared hypervisor that overcommits physical resources will randomly freeze these services in ways that are impossible to predict or prevent from inside the VM. The `%steal` metric should be monitored as a first-class SLI.
- Health checks must cover all critical dependencies. Our health check logged the Redis failure but returned 200 anyway. For 17 hours during the first outage, our orchestrator thought the service was healthy. A health check that doesn't fail when dependencies are down is worse than no health check — it provides false confidence.
- `kvm_async_pf` is invisible from inside the VM. None of our application metrics showed anything wrong before each freeze. CPU was idle, memory was free, disk was quiet. The only evidence was in `dmesg` (kernel page fault logs) and `sar` (`%steal`). If you're on a shared VPS and processes randomly enter D-state, check these two things first.
- RDB fork is Redis's Achilles heel on overcommitted hosts. The `fork()` + copy-on-write pattern that makes RDB snapshots efficient on dedicated hardware becomes a liability on shared infrastructure. The child process must touch every memory page, and if any page is swapped out at the host level, the entire process can freeze. Disabling RDB saves, or moving durability to AOF (append-only file), avoids the snapshot fork — though AOF's background rewrite (`BGREWRITEAOF`) also forks, so automatic rewrites deserve the same scrutiny.
- Monitoring infrastructure should be independent of what it monitors. Our initial monitoring ran on the same VPS. When the VPS froze, we had no alerts. Adding Uptime Kuma with external health checks and Discord notifications meant we learned about the third outage within 30 seconds instead of hours.