Black box from ICARUS: what happens when a VPS drowns in its own garbage

April 21, 2026 · 9 min read

"If you're reading this, ICARUS is gone. This recording contains everything that went wrong and how to stop it happening to you. Don't ignore it."

— Last transmission, March 2026

What ICARUS was

Provider: Hostinger VPS
OS: Arch Linux
Specs: 2 vCPU, 8GB RAM, 100GB SSD
Main engine: k3s (lightweight Kubernetes)
Secondary engine: CapRover (Docker Swarm) — coexisting on the same host
Lifespan: August 2024 — March 2026

Running at the end: 35 pods, ~4.3GB RAM, 12 services. Poste.io email across 4 domains, two n8n instances (personal + client), KaraKEEP, Firefly III, Etebase (CalDAV / CardDAV), MongoDB, WireGuard, CloudBeaver, plus a custom Node.js API for a SaaS product. Domains served: rolandoahuja.com, centrocristianogosen.org, blindandosueños.com, solutions45.com.

How it died

Not from traffic. Not from an intrusion. Not from provider issues. ICARUS drowned in its own garbage.

After 18 months, df -h / reported 93% used — 90GB of 100GB. Panic. Except du -x -sh / reported only 23GB of real data. Something was counting things many times over.

That something was overlayfs snapshots.

The errors, in order of severity

1. Trusting `df` on a container host

df sees every overlayfs mount that k3s stacks up for every container. Every time a pod restarts, every time an image is pulled, new layers get mounted from /var/lib/rancher/k3s/agent/containerd/. The kernel reports those layers as belonging to /dev/sda3, so df counts them multiple times.

Reality: 23GB on disk. df reported 90GB. The gap was ghosts.

# Real usage
du -x -d1 -h / 2>/dev/null | sort -rh | head -15
# Image footprint
crictl images | sort -k4 -rh    # SIZE is the fourth column

2. 496 containerd snapshots nobody ever cleaned (14GB)

Every image pull — every n8n upgrade, every app bump — left the old layers as orphan snapshots in /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/. k3s does not configure garbage collection by default. In 18 months: 496 snapshots, 14GB of dead weight, zero reclaim.
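A quick way to see how far the snapshot store has grown is to count its subdirectories and total their size. This is a minimal sketch that assumes the k3s default path quoted above; `snapshot_report` is a made-up helper name, not a k3s command.

```shell
#!/usr/bin/env bash
# Default path is the k3s overlayfs snapshot store named in the text.
SNAP_DIR_DEFAULT=/var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots

# snapshot_report [DIR]: print "<count> snapshots, <size> on disk" for DIR.
# (Hypothetical helper for illustration.)
snapshot_report() {
    local dir="${1:-$SNAP_DIR_DEFAULT}"
    local count size
    # Each snapshot lives in its own numbered subdirectory.
    count=$(find "$dir" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' ')
    size=$(du -sh "$dir" 2>/dev/null | awk '{print $1}')
    printf '%s snapshots, %s on disk\n' "$count" "$size"
}
```

Not every snapshot is an orphan, since running containers and present images hold some of them. A count that only ever goes up between runs is the signal worth chasing.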

3. No log rotation

No /etc/logrotate.d/k3s
No /etc/logrotate.d/containerd
/var/lib/kubelet/ logs hit 1.4GB
No per-container log size limit
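Before adding rotation, it helps to know which logs are actually big. The sketch below lists oversized log files under a directory, biggest first. The helper name `big_logs` is hypothetical, and `/var/log/pods` in the example is an assumption about where the kubelet keeps container logs on your host.

```shell
#!/usr/bin/env bash
# big_logs DIR [MIN_MB]: list *.log files under DIR larger than MIN_MB MiB,
# biggest first, as "<bytes><TAB><path>". Relies on GNU find for -printf.
big_logs() {
    local dir="$1" min_mb="${2:-10}"
    find "$dir" -type f -name '*.log' -size +"${min_mb}"M -printf '%s\t%p\n' 2>/dev/null |
        sort -rn
}

# Example (assumed path):
# big_logs /var/log/pods 10
```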

4. Pacman cache unbounded

/var/cache/pacman/pkg/ accumulated 1.6GB of downloaded packages. paccache.timer was never enabled.

5. Two swap files

Created one 512MB swap at install. Months later when RAM pressure hit, added a second 4GB swap. Now there were two, totaling 4.5GB. Wasted disk and wasted operational sanity.

6. `:latest` tags on images

Some manifests used image: xxx:latest. Why this is a disaster:

You can't tell what version is actually running without inspecting the container.
Every pull downloads a new version but leaves the old as an untagged, invisible snapshot.
No clean rollback is possible.

Apps that had :latest: Etebase, CloudBeaver, Firefly, Poste.io, WireGuard.
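An image with no tag at all resolves to :latest too, so a plain grep for ":latest" misses half the problem. Here is a rough sketch that flags both cases in a directory of manifests. `find_unpinned` is a hypothetical helper, and it only understands simple one-line `image:` entries.

```shell
#!/usr/bin/env bash
# find_unpinned DIR: print "<file>:<line> <image>" for every image reference in
# *.yaml / *.yml under DIR that is tagged :latest or carries no tag at all.
# (Hypothetical helper; assumes file paths contain no colons.)
find_unpinned() {
    local dir="$1"
    grep -rnE --include='*.yaml' --include='*.yml' \
        '^[[:space:]-]*image:[[:space:]]*[^[:space:]]' "$dir" |
    while IFS= read -r hit; do
        file="${hit%%:*}"            # path before the first colon
        rest="${hit#*:}"
        line="${rest%%:*}"           # line number
        ref="${rest#*image:}"        # everything after "image:"
        ref="${ref%%#*}"             # drop a trailing comment
        ref=$(printf '%s' "$ref" | tr -d "\"' \t\r")
        name="${ref##*/}"            # last path component, e.g. etebase:0.14.2
        case "$name" in
            *@sha256:*) ;;                                        # pinned by digest
            "") ;;                                                # nothing to check
            *:latest) printf '%s:%s %s\n' "$file" "$line" "$ref" ;;
            *:*) ;;                                               # explicit tag
            *) printf '%s:%s %s\n' "$file" "$line" "$ref" ;;      # no tag means :latest
        esac
    done
}
```

Checking only the last path component keeps a registry port such as `registry.local:5000/nginx` from being mistaken for a tag.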

7. No monitoring, no alerts

The disk hit 93% (apparent) with zero warning. No monitoring cron. No alerts. No dashboards. Found out by accident while investigating something unrelated.

8. No backups

Zero backup CronJobs. Zero backup scripts. Every piece of production state (emails, workflow definitions, financial records, contacts) lived on one disk with no copy.

What I'd do differently, from day zero

Before installing anything

# Single swap, final size
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap defaults 0 0' >> /etc/fstab

# Pacman cache auto-clean
pacman -S pacman-contrib
systemctl enable --now paccache.timer

# Cap systemd journal
mkdir -p /etc/systemd/journald.conf.d
cat > /etc/systemd/journald.conf.d/size.conf << 'EOF'
[Journal]
SystemMaxUse=200M
EOF
systemctl restart systemd-journald

Configure k3s BEFORE installing it

Create the config before running the installer. k3s reads /etc/rancher/k3s/config.yaml every time it starts, so a file that exists up front applies from the very first boot:

disable:
  - traefik
  - servicelb
kubelet-arg:
  - "image-gc-high-threshold=80"
  - "image-gc-low-threshold=70"
  - "eviction-hard=nodefs.available<10%,imagefs.available<10%"
  - "container-log-max-files=3"
  - "container-log-max-size=10Mi"

image-gc-high-threshold=80 — when image disk hits 80%, start deleting old images
image-gc-low-threshold=70 — delete until below 70%
eviction-hard — if disk free goes below 10%, evict pods
container-log-max-files=3 + container-log-max-size=10Mi — cap per-container logs

Weekly cleanup cron

# Needs a cron daemon; Arch does not ship one by default (install cronie, enable cronie.service)
cat > /etc/cron.weekly/k3s-cleanup << 'SCRIPT'
#!/bin/bash
/usr/local/bin/k3s crictl rmi --prune 2>/dev/null
echo "$(date): k3s cleanup ran" >> /var/log/k3s-cleanup.log
SCRIPT
chmod +x /etc/cron.weekly/k3s-cleanup

Disk monitoring (using `du`, not `df`)

cat > /etc/cron.daily/disk-check << 'SCRIPT'
#!/bin/bash
USAGE_KB=$(du -x -s / 2>/dev/null | awk '{print $1}')
USAGE_GB=$((USAGE_KB / 1024 / 1024))
if [ "$USAGE_GB" -gt 75 ]; then
    echo "DISK ALERT: real usage ${USAGE_GB}GB / 100GB" | logger -t disk-alert
fi
SCRIPT
chmod +x /etc/cron.daily/disk-check

Pipe the alert to a Telegram bot, a webhook, or n8n if you want actual notifications instead of syslog entries.
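For the Telegram route, a minimal sketch could look like this. It uses the Bot API's sendMessage method. The function names and the `TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID` environment variables are my own placeholders, so wire in whatever secret handling you already have.

```shell
#!/usr/bin/env bash
# format_alert USED_GB LIMIT_GB: build the one-line alert text.
format_alert() {
    printf 'DISK ALERT on %s: real usage %sGB / %sGB' "$(uname -n)" "$1" "$2"
}

# send_telegram TEXT: post TEXT through the Telegram Bot API (sendMessage).
# TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are assumed environment variables.
send_telegram() {
    local text="$1"
    if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
        echo "send_telegram: TELEGRAM_BOT_TOKEN / TELEGRAM_CHAT_ID not set" >&2
        return 1
    fi
    # --data-urlencode handles spaces and punctuation in the message.
    curl -fsS --max-time 10 \
        "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
        --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
        --data-urlencode "text=${text}" >/dev/null
}
```

Inside disk-check, the alert branch could call `send_telegram "$(format_alert "$USAGE_GB" 100)"` and fall back to logger when that returns non-zero, so a dead bot never silences the alert completely.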

Golden rule for manifests

Always pinned tags. Never :latest.

Before writing a manifest, look up the latest stable tag on Docker Hub and pin it:

# NEVER
image: victorrds/etebase:latest

# ALWAYS
image: victorrds/etebase:0.14.2

To upgrade: change the tag, kubectl apply. The old image gets garbage-collected once disk usage crosses the image GC threshold set in the k3s config above, or at the next weekly prune.

Port exposure: the most expensive lesson

Standard ports (email, SSH): use `hostPort`

containers:
- name: poste
  ports:
  - containerPort: 25
    hostPort: 25       # binds directly on host interface
    name: smtp

k3s CNI adds the iptables DNAT automatically. Works immediately. Equivalent to Docker's -p 25:25.

Web services coexisting with another reverse proxy: use `NodePort`

# Service manifest (fragment)
kind: Service
spec:
  type: NodePort
  ports:
  - port: 80
    nodePort: 31080    # high port, no collision

Never use NodePort plus manual iptables for standard ports

I tried this for email. It was a disaster. hostPort exists for exactly this case.

What should have been backed up (and wasn't)

All of this lived in /var/lib/rancher/k3s/storage/ as PVCs from the local-path provisioner:

Data	Real size	Priority
Poste.io data + DB (emails, DKIM, SSL, domains)	~200MB	CRITICAL
Firefly III MariaDB	~222MB	CRITICAL
n8n client Postgres (workflows, credentials)	~217MB	CRITICAL
n8n personal Postgres	~90MB	HIGH
MongoDB (client backend data)	~458MB	HIGH
Etebase (calendar, contacts)	~65MB	HIGH
KaraKEEP + Meilisearch	~97MB	MEDIUM

Total: ~1.4GB. A single daily CronJob tarring these PVCs to an off-site blob would have saved every critical piece of state. It never ran.

DKIM keys, SSL certs, and virtual domain config all live inside the poste-data-pvc volume (mounted at /data). Back up that PVC, you back up the full email identity. Restore it, emails don't land in spam.
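The backup that never ran does not need to be clever. Below is a sketch of the tar step only, assuming the local-path storage directory named above. `backup_pvcs` and the destination path are placeholders, and the off-site upload is deliberately left out because it depends on where you ship archives.

```shell
#!/usr/bin/env bash
# backup_pvcs SRC DEST [KEEP]: tar SRC into DEST/pvc-backup-<date>.tar.gz and
# keep only the KEEP newest archives (default 7). Prints the archive path.
# (Hypothetical helper; DEST must live outside SRC.)
backup_pvcs() {
    local src="$1" dest="$2" keep="${3:-7}"
    local archive
    archive="$dest/pvc-backup-$(date +%Y-%m-%d).tar.gz"
    mkdir -p "$dest" || return 1
    tar -C "$src" -czf "$archive" . || return 1
    # Remove everything older than the KEEP newest archives.
    ls -1t "$dest"/pvc-backup-*.tar.gz | tail -n +"$((keep + 1))" | xargs -r rm -f --
    printf '%s\n' "$archive"
}

# Example (destination is a placeholder):
# backup_pvcs /var/lib/rancher/k3s/storage /var/backups/pvc
```

One caveat: tarring the data directory of a live database can capture a half-written state. For Postgres, MariaDB, and MongoDB, a dump taken with the engine's own tool before the tar is the safer input.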

PVC declared sizes (20Gi, 10Gi, 5Gi) are theoretical limits, not reservations. The local-path provisioner creates an empty directory. A "20Gi" PVC with 53MB of data occupies 53MB. Don't multiply declared sizes when budgeting disk.

Post-install checklist

Run this after you build the next one. Every line must say OK:

[ "$(swapon --show --noheadings | wc -l)" -eq 1 ] && echo "OK: single swap" || echo "FAIL: expected exactly one swap"
grep -q "image-gc" /etc/rancher/k3s/config.yaml && echo "OK: image GC" || echo "FAIL: no image GC"
grep -q "container-log-max" /etc/rancher/k3s/config.yaml && echo "OK: log limits" || echo "FAIL: no log limits"
[ -x /etc/cron.weekly/k3s-cleanup ] && echo "OK: cleanup cron" || echo "FAIL: no cleanup cron"
systemctl is-active paccache.timer &>/dev/null && echo "OK: paccache.timer" || echo "FAIL: paccache.timer inactive"
[ -f /etc/systemd/journald.conf.d/size.conf ] && echo "OK: journal capped" || echo "FAIL: journal uncapped"
[ -x /etc/cron.daily/disk-check ] && echo "OK: disk monitor" || echo "FAIL: no disk monitor"
grep -r "image:.*:latest" /root/k3s-manifests/ &>/dev/null && echo "FAIL: :latest tag present" || echo "OK: no :latest tags"

The black box became a checklist

The checklist above is what I said I'd do. The interesting part is what actually shipped on the replacement host. When the backend services moved off ICARUS to a fresh VPS, I didn't leave the lessons as a markdown file I'd forget to read. I turned every failure mode into a systemd timer that runs whether or not I'm paying attention.

Three of those errors — phantom disk space, version bloat, and stale container images — each got a dedicated unit. They all run on idle scheduling: Nice=19 plus idle CPU and I/O priority, so they never compete with anything a paying client is touching. Hygiene shouldn't cost you a latency spike.

disk-check — never trust `df` again

This is error #1 wired into a daemon. The whole post-mortem hinges on df reporting 90GB while du saw 23GB. So the check stopped trusting either number alone and started watching the gap between them:

# /usr/local/bin/disk-check — conceptual shape
DU_KB=$(du -x -s / 2>/dev/null | awk '{print $1}')
DF_PCT=$(df --output=pcent / | tail -1 | tr -dc '0-9')
DU_GB=$((DU_KB / 1024 / 1024))

# Phantom space = df says full, du says it isn't.
# Deleted-but-open files, overlayfs layers, orphan snapshots.
if [ "$DU_GB" -gt 75 ] || [ "$DF_PCT" -gt 85 ]; then
    logger -t disk-check "ALERT: du=${DU_GB}GB df=${DF_PCT}% — investigate the delta"
fi

du walks real files. df reports what the kernel thinks the filesystem holds, including every overlayfs mount and every deleted file still pinned open by a process. When the two disagree, that disagreement is the diagnosis — it's exactly the ghost that killed ICARUS, caught the moment it appears instead of 18 months later. Daily timer, RandomizedDelaySec so a fleet of hosts doesn't all wake at once.
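If the gap itself is the diagnosis, it can be computed directly instead of eyeballed from two thresholds. This is a sketch under my own assumptions, not the script that shipped: the helper names and the 10GB default limit are arbitrary.

```shell
#!/usr/bin/env bash
# delta_gb DF_USED_KB DU_KB: whole GB that df counts as used but du cannot find.
delta_gb() {
    echo $(( ($1 - $2) / 1024 / 1024 ))
}

# check_delta [MOUNT] [LIMIT_GB]: log when the df/du gap exceeds LIMIT_GB.
# (Hypothetical helper; the 10GB default is an arbitrary starting point.)
check_delta() {
    local mount="${1:-/}" limit_gb="${2:-10}"
    local df_kb du_kb gap
    df_kb=$(df -B1K --output=used "$mount" | tail -1 | tr -dc '0-9')
    du_kb=$(du -x -s -k "$mount" 2>/dev/null | awk '{print $1}')
    gap=$(delta_gb "$df_kb" "$du_kb")
    if [ "$gap" -gt "$limit_gb" ]; then
        logger -t disk-check "ALERT: df and du disagree by ${gap}GB on ${mount}"
    fi
}
```

Fed the ICARUS numbers (90GB from df, 23GB from du), `delta_gb` reports a 67GB gap. When the alert fires and lsof is installed, `lsof +L1` lists deleted files that a process still holds open, which is the first thing to rule out.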

claude-prune — keep the latest, kill the rest

Claude Code drops a fresh self-update into ~/.local/share/claude/versions/ and never deletes the old binaries. Each one is hundreds of MB. Left alone for a year, that directory becomes its own little ICARUS: invisible growth, nobody watching.

The script keeps exactly two classes of version and deletes everything else:

# /usr/local/bin/claude-prune — conceptual shape
VERSIONS_DIR="$HOME/.local/share/claude/versions"

# In-use: any binary currently mapped by a running process
in_use=$(for exe in /proc/*/exe; do readlink -f "$exe" 2>/dev/null; done \
         | grep "$VERSIONS_DIR" | sort -u)

# Latest: newest by mtime, always retained
latest=$(ls -t "$VERSIONS_DIR" | head -1)

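# Delete anything that's neither the latest nor actively running
# (assumes each version is a file or directory directly under VERSIONS_DIR)
for path in "$VERSIONS_DIR"/*; do
    [ -e "$path" ] || continue                          # empty directory: glob matched nothing
    [ "$(basename "$path")" = "$latest" ] && continue   # keep the newest version
    real=$(readlink -f "$path")
    printf '%s\n' "$in_use" | awk -v p="$real" '$0 == p || index($0, p "/") == 1 { f = 1 } END { exit !f }' && continue
    rm -rf -- "$path"
done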

Same rule as the manifests: keep what you can name and what's actually live, garbage-collect the rest. A running session never loses the binary out from under it, because the /proc/*/exe scan pins it.

podman-prune — hourly image GC, free

Error #2 was 496 orphan snapshots because k3s never garbage-collected. The new host runs Podman, not k3s, but the failure mode is identical: every image pull leaves the old layers untagged and invisible. So podman image prune -af runs hourly (OnCalendar=hourly with a randomized delay), again at Nice=19. Old layers never get a chance to pile up into a 14GB surprise, because nothing survives more than 60 minutes of being unreferenced.

It's the GC config from the k3s section, made unconditional and put on a clock.
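For reference, here is roughly what such a unit pair could look like, written out with the same heredoc pattern used earlier. Treat it as a sketch: the unit names, the `/usr/bin/podman` path, and the 5 minute delay are assumptions, and it targets system units run as root rather than rootless user units.

```shell
#!/usr/bin/env bash
# write_prune_units DIR: write podman-prune.service and podman-prune.timer
# into DIR (normally /etc/systemd/system). Unit names are illustrative.
write_prune_units() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    cat > "$dir/podman-prune.service" << 'EOF'
[Unit]
Description=Prune unused Podman images

[Service]
Type=oneshot
ExecStart=/usr/bin/podman image prune -af
# Idle scheduling: lowest CPU and I/O priority
Nice=19
CPUSchedulingPolicy=idle
IOSchedulingClass=idle
EOF
    cat > "$dir/podman-prune.timer" << 'EOF'
[Unit]
Description=Hourly Podman image prune

[Timer]
OnCalendar=hourly
RandomizedDelaySec=5min
Persistent=true

[Install]
WantedBy=timers.target
EOF
}
```

To activate it: `write_prune_units /etc/systemd/system`, then `systemctl daemon-reload` and `systemctl enable --now podman-prune.timer`. `systemctl list-timers` shows when it last ran and when it fires next.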

Why timers, not cron, not a runbook

A runbook is a promise to your future self that you'll remember to run something. ICARUS is what that promise is worth after 18 months. systemd timers are the opposite contract: they fire on schedule, they log to the journal, they survive reboots (Persistent=true catches up on missed runs), and the idle scheduling means the cost of being careful rounds to zero. The black box recording turned into four unattended units doing exactly what the recording begged the next operator to do.

"ICARUS didn't die from lack of power. It drowned in its own trash — ghost snapshots, unrotated logs, and the illusion of a full disk that wasn't. Don't let it happen to you."