SEV-1: Redis AOF Disk Starvation — Unfinished Action Items Strike Back

Post-mortem of the March 1, 2026 outage caused by Redis AOF fsync blocking on slow shared disk I/O, escalating over 4 days until the VPS became completely unreachable. A direct consequence of incomplete remediation from the February KVM host contention incident.

CoderDen Team

Engineering

incident-report redis infrastructure reliability devops

Summary

On March 1, 2026, the CoderDen VPS became completely unreachable, requiring a hard reboot. The root cause was Redis AOF (Append-Only File) fsync operations blocking on slow shared disk I/O, compounding with RDB snapshot forks that were supposed to have been disabled after the February KVM host contention incident. The save "" config from February was never persisted in the Redis container's startup command — it was lost on a routine container restart. Over 4 days, disk I/O contention escalated from occasional slow fsyncs to sustained blocking, eventually freezing Redis and the entire VPS.

Timeline

Feb 26: First AOF fsync warnings appear: 'Asynchronous AOF fsync is taking too long (disk is busy?)'. Two occurrences at midnight.

Feb 27: Two more slow fsyncs at midnight, plus five additional warnings during the day (12:34-13:53 UTC). The pattern is worsening.

Feb 28: Two slow fsyncs at midnight, then 14 consecutive slow fsyncs from 02:37-02:51 UTC: a 15-minute sustained disk I/O stall.

Mar 1, 00:00 UTC: Eight consecutive slow fsyncs in the first 2 minutes of midnight (00:00-00:01). The most aggressive cluster yet.

Mar 1: VPS becomes completely unreachable. SSH hangs, no response to pings. Redis likely entered D-state during a combined AOF fsync + RDB fork.

Mar 1, ~14:07 UTC: Hard reboot from the Hostinger panel. All containers restart. Redis loads from AOF in 0.6 seconds.

Mar 1: Post-recovery, Redis immediately triggers an RDB save — '10000 changes in 60 seconds. Saving...' — confirming RDB snapshots were never permanently disabled.

Mar 1: Applied runtime fixes: disabled RDB saves, set appendfsync to 'no', raised the AOF rewrite threshold to 500%. Persisted via Coolify's Redis config.

Root Cause

Three compounding factors:

1. RDB snapshots were never permanently disabled.

After the February incident, we ran CONFIG SET save "" at runtime to disable RDB snapshots. But the Redis container’s startup command was still:

redis-server --requirepass <password> --appendonly yes

No --save "" flag. When the container restarted (routine redeployment, host maintenance, etc.), RDB snapshots were silently re-enabled with the default save 3600 1 300 100 60 10000 policy. The post-recovery log confirmed this:

14:07:45.076 * 10000 changes in 60 seconds. Saving...
14:07:45.077 * Background saving started by pid 96
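
Keeping the runtime fix would have taken a single extra flag on that same startup command. A minimal sketch (the actual remediation went through Coolify's config field instead; <password> is a placeholder as in the original command):

```shell
# Corrected startup command: --save "" disables RDB snapshots at the
# process level, so a container restart cannot silently re-enable them.
redis-server --requirepass <password> --appendonly yes --save ""
```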

2. AOF fsync was blocking on slow shared disk.

With appendfsync everysec (the default), Redis calls fsync() every second to flush the AOF buffer to disk. On our shared KVM host with overcommitted I/O, this frequently took longer than expected. Redis logs the warning when fsync takes over 2 seconds:

Asynchronous AOF fsync is taking too long (disk is busy?).
Writing the AOF buffer without waiting for fsync to complete,
this may slow down Redis.

The escalation pattern over 4 days shows the disk I/O contention worsening:

| Date | Slow fsyncs | Pattern |
| --- | --- | --- |
| Feb 26 | 2 | Midnight only |
| Feb 27 | 7 | Midnight + daytime (12:34-13:53) |
| Feb 28 | 16 | Midnight + sustained 15-min burst (02:37-02:51) |
| Mar 1 | 8+ | 8 in first 2 minutes of midnight, then VPS died |
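
This escalation was visible in the logs for days before the crash. A minimal sketch of the kind of log counting that would have surfaced it (the line format is assumed from Redis's default `pid:role day month year time` log prefix; a real deployment would wire this into Loki or Uptime Kuma as noted in the action items):

```python
import re
from collections import Counter

# Assumed Redis log prefix, e.g.:
#   "1:M 26 Feb 2026 00:00:01.123 # Asynchronous AOF fsync is taking too long ..."
LINE_RE = re.compile(r"^\d+:[A-Z] (\d{1,2} \w{3} \d{4})")
WARNING = "Asynchronous AOF fsync is taking too long"

def slow_fsyncs_per_day(log_lines):
    """Count slow-fsync warnings per calendar day, for trend alerting."""
    counts = Counter()
    for line in log_lines:
        if WARNING in line:
            match = LINE_RE.match(line)
            if match:
                counts[match.group(1)] += 1
    return dict(counts)
```

An alert on any day exceeding a small threshold (say, 5) would have fired on Feb 27, two days before the outage.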

3. AOF rewrite forks compounded the pressure.

Redis was auto-rewriting the AOF file at ~2500% growth (default threshold: 100%). Each rewrite calls fork(), creating a child process that reads all memory pages — the same mechanism that caused the February D-state freezes. Combined with blocking fsyncs and RDB snapshot forks, multiple disk-heavy operations were competing for the same slow disk.

Resolution

Runtime fixes (applied immediately after reboot)

CONFIG SET save ""              # Disable RDB snapshots
CONFIG SET appendfsync no       # Let OS handle fsync timing
CONFIG SET auto-aof-rewrite-percentage 500  # Reduce rewrite frequency
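
The runtime values can be read back to confirm they took effect. A sketch, assuming redis-cli access to the instance (add -a <password> since requirepass is set):

```shell
# CONFIG GET returns the parameter name followed by its current value
redis-cli CONFIG GET save                         # expect an empty string
redis-cli CONFIG GET appendfsync                  # expect "no"
redis-cli CONFIG GET auto-aof-rewrite-percentage  # expect "500"
```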

Persistent fix (applied via Coolify)

Updated the Redis service configuration in Coolify’s custom config field:

save ""
appendfsync no
auto-aof-rewrite-percentage 500

Coolify automatically applies requirepass from the password field, so these three directives are all that’s needed. The settings now survive container restarts.

What each setting does

| Setting | Before | After | Why |
| --- | --- | --- | --- |
| save | 3600 1 300 100 60 10000 | "" (disabled) | Eliminates the RDB fork — the primary trigger for D-state on overcommitted hosts |
| appendfsync | everysec | no | Stops Redis from blocking on slow disk. The OS flushes every ~30s instead. Trade-off: up to 30s of data loss on a hard crash vs 1s. Acceptable for our workload (task queues, caching) |
| auto-aof-rewrite-percentage | 100 | 500 | AOF rewrites less frequently, reducing fork frequency. The AOF must grow 5x before compaction instead of 2x |

What Didn’t Cause This

| Hypothesis | Evidence against |
| --- | --- |
| Application bug / new deployment | Container image unchanged for 7 hours before the crash |
| Memory exhaustion | Redis using 6.78 MB at recovery; the VPS had ample free memory |
| Thundering herd cron | Cron schedules were staggered after the February fix |
| Network issue | The VPS was unreachable at the OS level, not just the app level |

Action Items

  • Disable RDB saves persistently — Via Coolify custom Redis config, not just runtime CONFIG SET
  • Set appendfsync to no — Eliminates fsync blocking on slow shared disk
  • Raise AOF rewrite threshold — 500% growth before fork, reducing compaction frequency
  • Persist config via Coolify — Settings survive container restarts automatically
  • Add monitoring for Redis persistence warnings — Alert on “AOF fsync is taking too long” log pattern via Uptime Kuma or Loki
  • Migrate Redis to managed service — Upstash or Railway to isolate from VPS disk I/O entirely (carried over from February)
  • Evaluate hosting migration — Dedicated vCPU hosting (Hetzner CCX) to eliminate shared disk contention (carried over from February)
  • Add config drift detection — Monitor that critical Redis settings (save, appendfsync) remain as expected after restarts
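
The config drift check above can be as small as comparing live values against an expected map. A minimal sketch: the expected values mirror this incident's fixes, and fetching the live config (e.g. via redis-py's config_get()) is left to the caller so the check itself stays dependency-free:

```python
# Settings that must survive restarts, per this incident's remediation
EXPECTED = {
    "save": "",                            # RDB snapshots disabled
    "appendfsync": "no",                   # OS-controlled flushing
    "auto-aof-rewrite-percentage": "500",  # rewrite at 5x growth
}

def config_drift(actual: dict) -> dict:
    """Return {setting: (expected, actual)} for every drifted setting.

    `actual` is the live config as a flat dict, e.g. from redis-py's
    Redis.config_get(). An empty result means no drift.
    """
    return {
        key: (want, actual.get(key))
        for key, want in EXPECTED.items()
        if actual.get(key) != want
    }
```

Run on a schedule after every restart; a non-empty result is an alert.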

Lessons Learned

  1. Runtime config changes are not fixes. CONFIG SET without CONFIG REWRITE or persistent container args is a ticking time bomb. Any restart — planned deployment, host maintenance, OOM kill — silently reverts the config. If a runtime change fixed a SEV-1, it must be persisted in the same incident response, not added to a backlog.

  2. Action items from incidents must be tracked to completion. The February post-mortem listed “Disable RDB saves” as completed with a checkmark. It was completed at runtime, but never persisted. The difference between “we ran the command” and “this will survive a restart” is the difference between a fix and a temporary workaround.

  3. Escalating warnings are a leading indicator. The AOF fsync warnings followed a clear 4-day escalation pattern: 2 → 7 → 16 → crash. If we had been alerting on these warnings, we could have intervened before the VPS became unreachable. Log-based alerting on infrastructure warnings is cheap insurance.

  4. Shared disk I/O is the constraint, not memory or CPU. The February incident focused on kvm_async_pf (memory page faults). This incident was pure disk I/O contention — fsync() blocking because other tenants on the same physical host were saturating the disk. On shared infrastructure, every resource is a shared resource.