fix(k3s): audit logs via journald + etcd recovery #13
Summary
Two changes prompted by today's etcd raft panic on worker1-k8s0 (tocommit(19232321) is out of range [lastIndex(19232320)]) and the cascading disk pressure that surfaced underneath it.

Audit logs through journald
- kube-apiserver now uses audit-log-path=- so audit events flow to k3s.service stdout and into journald instead of growing files in /var/log/kubernetes.
- The previous setup combined the apiserver's internal rotation with a logrotate *.log glob that double-rotated rotated files into permanent orphans, observed at 7+ GB on worker0/labmaster.
- journald-limits operation writes a SystemMaxUse=2G drop-in so audit volume cannot fill /var/log even under bursty load.
- log-rotation operation repurposed to decommission the obsolete logrotate rule and reap leftover audit files. Idempotent on fresh installs.

Etcd member recovery codified
- recoverEtcdMember({broken, peer, brokenHostname}) does the documented k3s recovery: stop k3s, etcdctl member remove, wipe /var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for rejoin.

Already deployed live
The fix has been applied to the running cluster (worker0/1/2-k8s0, labmaster, spark-2935): rolling k3s restart on the 3 control planes, all 3 etcd endpoints healthy, audit events confirmed flowing through journalctl -u k3s.

Disk reclaim from removing dead orphan files:
- /var/log: 96% → 9%

Test plan
- vitest --run src/modules/modules/k3s: 54 pass, 0 fail (7 new tests cover both decommission paths and the recovery procedure)
- journalctl -u k3s on all 4 control planes
- /var/log/kubernetes directory removed on all control planes
- /var/log/kubernetes absent (not yet exercised)

Two changes prompted by today's etcd raft panic on worker1-k8s0 (tocommit out of range, lost-write on follower) and the cascading disk pressure that surfaced underneath it.

Audit logs to journald
- kube-apiserver now uses audit-log-path=- so audit events flow to k3s.service stdout and into journald instead of growing files in /var/log/kubernetes. The previous setup combined apiserver's internal rotation with a logrotate *.log glob that double-rotated the rotated files into permanent orphans (observed: 7+ GB).
- New journald-limits operation writes a SystemMaxUse=2G drop-in so audit volume cannot fill /var/log even under bursty load.
- log-rotation operation repurposed to decommission the obsolete logrotate rule and reap leftover audit files. Idempotent: no-op on fresh installs.

Etcd member recovery
- New recoverEtcdMember(broken, peer, hostname) codifies the documented k3s recovery: stop k3s, etcdctl member remove, wipe /var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for rejoin. Refuses to operate when cluster size < 3 to preserve quorum.

Tests
- 7 new unit tests covering both decommission paths and the recovery procedure (54 total, all green).
- install.test.ts asserts the file-based audit args are gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
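
For reviewers who want the recovery sequence at a glance, the steps listed for recoverEtcdMember (stop k3s, etcdctl member remove, wipe server state, restart, poll for rejoin, with the quorum guard) could be sketched as a pure planning function. This is an illustration only, not the code in this PR: planEtcdRecovery, RecoveryInput, Step, memberId and clusterSize are hypothetical names. The sketch assumes the etcd member ID has already been looked up, and it omits the endpoint and certificate flags etcdctl would need against a real k3s cluster.

```typescript
// Hypothetical input shape; the real recoverEtcdMember signature may differ.
interface RecoveryInput {
  broken: string;         // SSH target of the broken node
  peer: string;           // SSH target of a healthy control plane
  brokenHostname: string; // name the broken member is registered under
  memberId: string;       // assumed to be looked up earlier via `etcdctl member list`
  clusterSize: number;    // current number of etcd members
}

interface Step {
  host: string;    // where the command should run
  command: string; // shell command to execute
  note: string;    // human-readable purpose of the step
}

const SERVER_DIR = "/var/lib/rancher/k3s/server";

// Builds the ordered recovery steps without executing anything,
// so the sequence can be reviewed or unit-tested in isolation.
function planEtcdRecovery(input: RecoveryInput): Step[] {
  // Guard from the commit message: refuse when cluster size < 3 to preserve quorum.
  if (input.clusterSize < 3) {
    throw new Error(
      `refusing to recover ${input.brokenHostname}: cluster size ${input.clusterSize} < 3`,
    );
  }
  return [
    { host: input.broken, command: "systemctl stop k3s", note: "stop k3s on the broken node" },
    {
      host: input.peer,
      // Connection flags (endpoints, certificates) are deliberately omitted here.
      command: `etcdctl member remove ${input.memberId}`,
      note: `remove ${input.brokenHostname} from the etcd cluster via a healthy peer`,
    },
    {
      host: input.broken,
      command: `rm -rf ${SERVER_DIR}/db ${SERVER_DIR}/tls ${SERVER_DIR}/cred`,
      note: "wipe local etcd data, TLS material and credentials",
    },
    { host: input.broken, command: "systemctl start k3s", note: "restart k3s so the node rejoins" },
    {
      host: input.peer,
      command: "etcdctl member list",
      note: `poll until ${input.brokenHostname} is listed again`,
    },
  ];
}
```

Keeping the plan separate from execution means a caller can hand each Step to whatever remote executor the module already uses, and the ordering and quorum guard can be asserted without touching a cluster.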