SEV-1: Redis AOF Disk Starvation — Unfinished Action Items Strike Back
Post-mortem of the March 1, 2026 outage caused by Redis AOF fsync blocking on slow shared disk I/O, escalating over 4 days until the VPS became completely unreachable. A direct consequence of incomplete remediation from the February KVM host contention incident.
CoderDen Team
Engineering
Summary
On March 1, 2026, the CoderDen VPS became completely unreachable, requiring a hard reboot. The root cause was Redis AOF (Append-Only File) fsync operations blocking on slow shared disk I/O, compounded by RDB snapshot forks that were supposed to have been disabled after the February KVM host contention incident. The save "" setting applied in February was never persisted to the Redis container's start command, so it was lost on a routine container restart. Over 4 days, disk I/O contention escalated from occasional slow fsyncs to sustained blocking, eventually freezing Redis and the entire VPS.
Timeline
- Feb 26: First AOF fsync warnings appear: 'Asynchronous AOF fsync is taking too long (disk is busy?)'. Two occurrences at midnight.
- Feb 27: Two more slow fsyncs at midnight, plus five additional warnings during the day (12:34-13:53 UTC). The pattern is worsening.
- Feb 28: Two slow fsyncs at midnight, then 14 consecutive slow fsyncs from 02:37-02:51 UTC — a 15-minute sustained disk I/O stall.
- Mar 1, 00:00-00:01: Eight consecutive slow fsyncs in the first two minutes of midnight. The most aggressive cluster yet.
- Mar 1: VPS becomes completely unreachable. SSH hangs, no response to pings. Redis likely entered D-state during a combined AOF fsync + RDB fork.
- Mar 1: Hard reboot from the Hostinger panel. All containers restart. Redis loads from AOF in 0.6 seconds.
- Mar 1, post-recovery: Redis immediately triggers an RDB save — '10000 changes in 60 seconds. Saving...' — confirming RDB snapshots were never permanently disabled.
- Mar 1: Applied runtime fixes: disabled RDB saves, set appendfsync to 'no', raised the AOF rewrite threshold to 500%. Persisted via Coolify Redis config.
Root Cause
Three compounding factors
1. RDB snapshots were never permanently disabled.
After the February incident, we ran CONFIG SET save "" at runtime to disable RDB snapshots. But the Redis container’s startup command was still:
redis-server --requirepass <password> --appendonly yes
No --save "" flag. When the container restarted (routine redeployment, host maintenance, etc.), RDB snapshots were silently re-enabled with the default save 3600 1 300 100 60 10000 policy. The post-recovery log confirmed this:
14:07:45.076 * 10000 changes in 60 seconds. Saving...
14:07:45.077 * Background saving started by pid 96
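The durable equivalent would have been to bake the flag into the container's start command itself (a sketch; the password placeholder stands for whatever Coolify injects):

```shell
redis-server --requirepass <password> --appendonly yes --save ""
```

With --save "" on the command line, the empty save policy survives every restart instead of living only in the running process.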
2. AOF fsync was blocking on slow shared disk.
With appendfsync everysec (the default), Redis calls fsync() every second to flush the AOF buffer to disk. On our shared KVM host with overcommitted I/O, this frequently took longer than expected. Redis logs the warning when fsync takes over 2 seconds:
Asynchronous AOF fsync is taking too long (disk is busy?).
Writing the AOF buffer without waiting for fsync to complete,
this may slow down Redis.
The escalation pattern over 4 days shows the disk I/O contention worsening:
| Date | Slow fsyncs | Pattern |
|---|---|---|
| Feb 26 | 2 | Midnight only |
| Feb 27 | 7 | Midnight + daytime (12:34-13:53) |
| Feb 28 | 16 | Midnight + sustained 15-min burst (02:37-02:51) |
| Mar 1 | 8+ | 8 in first 2 minutes of midnight, then VPS died |
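The escalation in the table above could have been surfaced mechanically. A minimal sketch of a per-day warning count over a Redis log — the sample lines below are illustrative, not our real log, and in production the script would read the actual container log path:

```shell
#!/bin/sh
# Count "AOF fsync is taking too long" warnings per day from a Redis log.
# Illustrative sample log; real Redis lines carry the same "pid:role day month year" prefix.
cat > /tmp/redis-sample.log <<'EOF'
96:M 26 Feb 2026 00:00:01.123 * Asynchronous AOF fsync is taking too long (disk is busy?).
96:M 26 Feb 2026 00:00:03.456 * Asynchronous AOF fsync is taking too long (disk is busy?).
96:M 27 Feb 2026 12:34:00.789 * Asynchronous AOF fsync is taking too long (disk is busy?).
96:M 27 Feb 2026 13:53:00.001 * Background saving started by pid 97
EOF
# Fields 2-4 of each line are day/month/year; tally warnings per date and sort.
counts=$(awk '/AOF fsync is taking too long/ {c[$2" "$3" "$4]++}
              END {for (d in c) print d, c[d]}' /tmp/redis-sample.log | sort)
echo "$counts"
```

A count that climbs day over day, as ours did (2 → 7 → 16), is exactly the trend an alert threshold should catch.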
3. AOF rewrite forks compounded the pressure.
Redis was auto-rewriting the AOF file at ~2500% growth (default threshold: 100%). Each rewrite calls fork(), creating a child process that reads all memory pages — the same mechanism that caused the February D-state freezes. Combined with fsync blocking and RDB snapshot forks, three separate forking operations competed for the same slow disk.
Resolution
Runtime fixes (applied immediately after reboot)
CONFIG SET save "" # Disable RDB snapshots
CONFIG SET appendfsync no # Let OS handle fsync timing
CONFIG SET auto-aof-rewrite-percentage 500 # Reduce rewrite frequency
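When Redis is started from a config file, runtime changes can be made durable in the same session with CONFIG REWRITE (a sketch; our container is started with command-line args only, so Redis has no config file to rewrite and this command returns an error — hence the Coolify route):

```shell
redis-cli -a <password> CONFIG SET save ""
redis-cli -a <password> CONFIG REWRITE   # writes the running config back to redis.conf
```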
Persistent fix (applied via Coolify)
Updated the Redis service configuration in Coolify’s custom config field:
save ""
appendfsync no
auto-aof-rewrite-percentage 500
Coolify automatically applies requirepass from the password field, so these three directives are all that’s needed. The settings now survive container restarts.
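After the next redeploy, the settings can be spot-checked by hand (a sketch, assuming redis-cli access to the container; the password placeholder is illustrative):

```shell
redis-cli -a <password> CONFIG GET save                        # expect an empty value
redis-cli -a <password> CONFIG GET appendfsync                 # expect "no"
redis-cli -a <password> CONFIG GET auto-aof-rewrite-percentage # expect "500"
```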
What each setting does
| Setting | Before | After | Why |
|---|---|---|---|
save | 3600 1 300 100 60 10000 | "" (disabled) | Eliminates RDB fork — the primary trigger for D-state on overcommitted hosts |
appendfsync | everysec | no | Stops Redis from blocking on slow disk. OS flushes every ~30s instead. Trade-off: up to 30s of data loss on hard crash vs 1s. Acceptable for our workload (task queues, caching) |
auto-aof-rewrite-percentage | 100 | 500 | AOF rewrites less frequently, reducing fork frequency. AOF must grow 5x before compaction instead of 2x |
What Didn’t Cause This
| Hypothesis | Evidence against |
|---|---|
| Application bug / new deployment | Container image unchanged for 7 hours before crash |
| Memory exhaustion | Redis using 6.78 MB at recovery. VPS had ample free memory |
| Thundering herd cron | Cron schedules were staggered after February fix |
| Network issue | VPS was unreachable at OS level, not just app level |
Action Items
- Disable RDB saves persistently — Via Coolify custom Redis config, not just runtime
- Set appendfsync to no — Eliminates fsync blocking on slow shared disk
- Raise AOF rewrite threshold — 500% growth before fork, reducing compaction frequency
- Persist config via Coolify — Settings survive container restarts automatically
- Add monitoring for Redis persistence warnings — Alert on “AOF fsync is taking too long” log pattern via Uptime Kuma or Loki
- Migrate Redis to managed service — Upstash or Railway to isolate from VPS disk I/O entirely (carried over from February)
- Evaluate hosting migration — Dedicated vCPU hosting (Hetzner CCX) to eliminate shared disk contention (carried over from February)
- Add config drift detection — Monitor that critical Redis settings (save, appendfsync) remain as expected after restarts
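The drift check itself can be a few lines of shell run after every restart. A sketch of the approach (not existing tooling): here redis_config_get is stubbed with the expected post-fix values so the logic is demonstrable without a live server; in production it would wrap redis-cli CONFIG GET.

```shell
#!/bin/sh
# Config drift check: compare live Redis settings against expected values.
redis_config_get() {
  # Production: redis-cli -a "$REDIS_PASSWORD" CONFIG GET "$1" | sed -n 2p
  # (CONFIG GET prints the key on one line, the value on the next.)
  # Stub values below simulate a healthy post-fix instance:
  case "$1" in
    save) printf '\n' ;;
    appendfsync) printf 'no\n' ;;
    auto-aof-rewrite-percentage) printf '500\n' ;;
  esac
}

drift=0
for pair in 'save=' 'appendfsync=no' 'auto-aof-rewrite-percentage=500'; do
  key=${pair%%=*}; want=${pair#*=}
  got=$(redis_config_get "$key")
  if [ "$got" != "$want" ]; then
    echo "DRIFT: $key is '$got', expected '$want'"
    drift=1
  fi
done
[ "$drift" -eq 0 ] && echo "redis config OK"
```

Wired into cron or a container post-start hook, the non-zero exit on drift becomes an alert instead of a silent revert.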
Lessons Learned
- Runtime config changes are not fixes. CONFIG SET without CONFIG REWRITE or persistent container args is a ticking time bomb. Any restart — planned deployment, host maintenance, OOM kill — silently reverts the config. If a runtime change fixed a SEV-1, it must be persisted in the same incident response, not added to a backlog.
- Action items from incidents must be tracked to completion. The February post-mortem listed “Disable RDB saves” as completed with a checkmark. It was completed at runtime, but never persisted. The difference between “we ran the command” and “this will survive a restart” is the difference between a fix and a temporary workaround.
- Escalating warnings are a leading indicator. The AOF fsync warnings followed a clear 4-day escalation pattern: 2 → 7 → 16 → crash. If we had been alerting on these warnings, we could have intervened before the VPS became unreachable. Log-based alerting on infrastructure warnings is cheap insurance.
- Shared disk I/O is the constraint, not memory or CPU. The February incident focused on kvm_async_pf (memory page faults). This incident was pure disk I/O contention — fsync() blocking because other tenants on the same physical host were saturating the disk. On shared infrastructure, every resource is a shared resource.