Two changes prompted by today's etcd raft panic on worker1-k8s0
(tocommit out of range, lost-write on follower) and the cascading
disk pressure that surfaced underneath it.
Audit logs to journald
- kube-apiserver now uses audit-log-path=- so audit events flow to
k3s.service stdout and into journald instead of growing files in
/var/log/kubernetes. The previous setup combined apiserver's
internal rotation with a logrotate *.log glob that double-rotated
the rotated files into permanent orphans (observed: 7+ GB).
- New journald-limits operation writes a SystemMaxUse=2G drop-in so
audit volume cannot fill /var/log even under bursty load.
- log-rotation operation repurposed to decommission the obsolete
logrotate rule and reap leftover audit files. Idempotent: no-op
on fresh installs.
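
A minimal sketch of how the journald-limits drop-in could be rendered and written idempotently. The function names and the drop-in file name are assumptions for illustration; the commit only specifies the `SystemMaxUse=2G` setting.

```typescript
// Hypothetical drop-in location; the real operation may use a different file name.
const JOURNALD_DROPIN_PATH = "/etc/systemd/journald.conf.d/10-lab-limits.conf";

// Render the drop-in body. journald accepts plain byte counts or K/M/G/T/P/E suffixes.
function renderJournaldLimits(systemMaxUse: string = "2G"): string {
  if (!/^\d+[KMGTPE]?$/.test(systemMaxUse)) {
    throw new Error(`invalid SystemMaxUse value: ${systemMaxUse}`);
  }
  return ["[Journal]", `SystemMaxUse=${systemMaxUse}`, ""].join("\n");
}

// Only write when the on-disk content differs, so repeated runs are no-ops.
// `current` is null when the drop-in does not exist yet.
function needsWrite(current: string | null, desired: string): boolean {
  return current !== desired;
}
```

After writing the file, journald has to be restarted (for example with `systemctl restart systemd-journald`) for the new limit to take effect.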
Etcd member recovery
- New recoverEtcdMember(broken, peer, hostname) codifies the
documented k3s recovery: stop k3s, etcdctl member remove, wipe
/var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for
rejoin. Refuses to operate when cluster size < 3 to preserve
quorum.
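
The recovery sequence above can be sketched as a pure planning step that yields the commands in order. This is not the actual `recoverEtcdMember` implementation: `planEtcdMemberRecovery` and `EtcdMember` are hypothetical names, the member lookup assumes the etcd member name begins with the node hostname, and the TLS flags that `etcdctl` needs against k3s embedded etcd are omitted.

```typescript
interface EtcdMember {
  id: string;   // hex member ID as printed by `etcdctl member list`
  name: string; // etcd member name
}

const K3S_SERVER_DIR = "/var/lib/rancher/k3s/server";

// Build the ordered recovery steps, refusing when the cluster is too small.
function planEtcdMemberRecovery(members: EtcdMember[], hostname: string): string[] {
  if (members.length < 3) {
    throw new Error(`refusing recovery: cluster has ${members.length} member(s), need at least 3`);
  }
  // Assumption: the member name starts with the hostname of the broken node.
  const matches = members.filter((m) => m.name.indexOf(hostname) === 0);
  if (matches.length !== 1) {
    throw new Error(`expected exactly one etcd member for ${hostname}, found ${matches.length}`);
  }
  const dirs = ["db", "tls", "cred"].map((d) => `${K3S_SERVER_DIR}/${d}`).join(" ");
  return [
    "on broken: systemctl stop k3s",
    `on peer: etcdctl member remove ${matches[0].id}`,
    `on broken: rm -rf ${dirs}`,
    "on broken: systemctl start k3s",
    `on peer: poll etcdctl member list until ${hostname} reappears`,
  ];
}
```

Keeping the plan separate from SSH execution makes the quorum guard and step ordering easy to unit test without a live cluster.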
Tests
- 7 new unit tests covering both decommission paths and the
recovery procedure (54 total, all green).
- install.test.ts asserts the file-based audit args are gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add `labctl provision recheck` to refresh hardware info via SSH
- Preserve hardware info in InstalledInfo when install completes
- Fix /ks-auto: run nested %pre scripts from included kickstarts
- Add command-discover WebSocket routing for hw info updates
- Fix k3s join: clean stale TLS/cred when joining existing cluster
- Add --tls-verify=false for internal HTTP registry pushes
- Add fix-ssh-root.sh script for root SSH access on all nodes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip Cilium install for joining servers (already in cluster via daemonset)
- Longhorn annotation for workers: SSH to server node from CLI to apply
kubectl annotation (workers don't have kubectl access)
- Default SSH user for k3s/app commands changed to 'root' (these operations
need root privileges; using the 'lab' user broke installs)
- k3s server config: cluster-init for initial server, server+token for joins
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Server config now uses cluster-init: true for initial server (enables
embedded etcd). Joining servers get server: + token: in config.
- Cilium install already checks for existing installation, so joining
servers skip it gracefully (the "release name in use" error is non-fatal)
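
A minimal sketch of the two config shapes described above, rendered as k3s `config.yaml` content. `ServerRole` and `renderK3sServerConfig` are hypothetical names, and the real config presumably carries more keys than shown here.

```typescript
// The initial server bootstraps embedded etcd; joining servers point at an existing one.
type ServerRole =
  | { kind: "init" }
  | { kind: "join"; serverUrl: string; token: string };

function renderK3sServerConfig(role: ServerRole): string {
  if (role.kind === "init") {
    return "cluster-init: true\n";
  }
  // JSON string literals are valid YAML double-quoted scalars, which keeps
  // tokens containing ':' or '#' from being misparsed.
  return [
    `server: ${JSON.stringify(role.serverUrl)}`,
    `token: ${JSON.stringify(role.token)}`,
    "",
  ].join("\n");
}
```

For example, `renderK3sServerConfig({ kind: "join", serverUrl: "https://worker0-k8s0:6443", token: "..." })` would produce the `server:` and `token:` pair for a joining server.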
Cluster rebuilt as etcd HA:
  worker0-k8s0  control-plane,etcd  (initial server, cluster-init)
  worker1-k8s0  control-plane,etcd  (joined server, Mac Studio aarch64)
  spark-2935    worker              (DGX Spark, aarch64)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace sed with grep -v / awk for fstab manipulation (Asahi Fedora's
sed doesn't support \| delimiter or \? quantifier)
- Use idempotent write_lab_fstab function: removes all old entries first,
comments out conflicting btrfs subvol entries, adds fresh LVM entries
- Fix sed for SSH hardening: use #* instead of \? (POSIX compatible)
- Tested on Mac Studio: no duplicate fstab entries after multiple runs
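
The idempotent rewrite is a shell function in the actual change; the same idea is sketched below in TypeScript for illustration. The BEGIN/END marker lines, the `FstabEntry` shape, and the example device names are assumptions, not taken from the script.

```typescript
const BEGIN = "# BEGIN lab-managed"; // hypothetical marker lines
const END = "# END lab-managed";

interface FstabEntry {
  device: string;
  mountPoint: string;
  fsType: string;
  options: string;
}

// True for an active btrfs subvol line whose mount point we are about to manage.
function conflicts(line: string, managed: string[]): boolean {
  const trimmed = line.trim();
  if (trimmed === "" || trimmed.charAt(0) === "#") return false;
  const f = trimmed.split(/\s+/);
  return f.length >= 4 && f[2] === "btrfs" &&
    f[3].indexOf("subvol=") !== -1 && managed.indexOf(f[1]) !== -1;
}

function writeLabFstab(fstab: string, entries: FstabEntry[]): string {
  const managed = entries.map((e) => e.mountPoint);
  const kept: string[] = [];
  let inBlock = false;
  for (const line of fstab.split("\n")) {
    if (line === BEGIN) { inBlock = true; continue; }
    if (line === END) { inBlock = false; continue; }
    if (inBlock) continue; // drop entries written by a previous run
    kept.push(conflicts(line, managed) ? "# " + line : line);
  }
  // Trim trailing blank lines so repeated runs produce identical output.
  while (kept.length > 0 && kept[kept.length - 1].trim() === "") kept.pop();
  const fresh = entries.map(
    (e) => `${e.device} ${e.mountPoint} ${e.fsType} ${e.options} 0 0`
  );
  return kept.concat([BEGIN], fresh, [END]).join("\n") + "\n";
}
```

Because old managed entries are removed before the fresh ones are appended, running the function on its own output returns the same text, which is the property the Mac Studio test checks.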
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Firstboot script defaults the admin user to 'lab' instead of the bastion's
config.adminUser (which was 'michal', inherited from the host system)
- iSCSI OS detection uses case-insensitive match for 'fedora'
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
k3s host prep:
- Add iSCSI initiator install+enable (Fedora: iscsi-initiator-utils,
Ubuntu: open-iscsi) — required by Longhorn
- Add Longhorn disk label to k3s server+agent configs
- Add Longhorn disk annotation operation in post-install hardening
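
As an illustration of the per-distro package choice above, combined with case-insensitive OS matching, a small sketch follows. `iscsiPackageFor` is a hypothetical helper name; only the two package names come from this change.

```typescript
// Map an OS name (e.g. the NAME or ID field from /etc/os-release) to the
// iSCSI initiator package that Longhorn requires on that distro.
function iscsiPackageFor(osName: string): string {
  const os = osName.toLowerCase(); // match 'Fedora', 'fedora', 'FEDORA', ...
  if (os.indexOf("fedora") !== -1) return "iscsi-initiator-utils";
  if (os.indexOf("ubuntu") !== -1) return "open-iscsi";
  throw new Error(`unsupported OS for iSCSI initiator install: ${osName}`);
}
```

Failing loudly on an unknown OS is a design choice in this sketch; the real operation may instead skip or warn.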
CLI:
- Add `labctl provision asahi` command with interactive install guide
- Change default SSH user from "michal" to "lab" in all commands
- Change admin user in bastion progress callback to "lab"
Asahi provisioning fixes:
- Download installer_data.json locally (installer reads it as file)
- Use REPO_BASE to serve upstream ZIP from bastion (LAN speed)
- Fix ZIP32 vs ZIP64: serve original upstream ZIP unmodified
(our repackaged ZIP used ZIP64 which breaks Asahi urlcache)
- Add /data/asahi-repo fallback path for k3s container PVC mount
- Deploy script syncs asahi-repo to bastion pod after deployment
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove hardcoded devices/directRoutingDevice from Cilium install (let
Cilium auto-detect per node — needed for heterogeneous NICs like eno1 vs enP7s7)
- Set k8sServiceHost=127.0.0.1 k8sServicePort=6444 so Cilium init
containers can reach the API via k3s agent's local LB proxy
- Add node-role.kubernetes.io/worker label to agent config
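
A sketch of the resulting Cilium value overrides, assuming the install passes them as `--set key=value` flags. `ciliumSetArgs` is a hypothetical name, and any other values the real install sets are left out.

```typescript
// Values kept after this change. Note there is no `devices` or
// `directRoutingDevice` entry: Cilium auto-detects those per node.
const CILIUM_VALUES: Record<string, string> = {
  k8sServiceHost: "127.0.0.1", // k3s agent's local load-balancer proxy
  k8sServicePort: "6444",
};

// Flatten the values into CLI arguments: ["--set", "k=v", "--set", "k=v", ...]
function ciliumSetArgs(values: Record<string, string> = CILIUM_VALUES): string[] {
  const args: string[] = [];
  for (const key of Object.keys(values)) {
    args.push("--set", `${key}=${values[key]}`);
  }
  return args;
}
```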
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kickstart installs on real hardware failed silently — no error reporting,
only 3 progress callbacks, zero log streaming. This overhaul makes every
install fully observable.
Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts
Bastion API:
- POST /api/log — receives raw log lines from kickstart (single or batch)
- InstallLogBuffer — per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac — now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow — uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)
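
A minimal sketch of a per-MAC ring buffer in the spirit of InstallLogBuffer. It is not the actual class: file persistence is omitted, the class and method names are illustrative, and treating `log_total` as the count of all lines ever received (including evicted ones) is an assumption.

```typescript
interface LogSnapshot {
  log_lines: string[]; // most recent lines, oldest first
  log_total: number;   // lines received so far, including evicted ones (assumed meaning)
}

class InstallLogBufferSketch {
  private capacity: number;
  private buffers: Record<string, { lines: string[]; total: number }> = {};

  constructor(capacity: number = 2000) {
    this.capacity = capacity;
  }

  // Accepts a single line or a batch, mirroring POST /api/log.
  append(mac: string, input: string | string[]): void {
    const key = mac.toLowerCase();
    const lines = typeof input === "string" ? [input] : input;
    const buf = this.buffers[key] || (this.buffers[key] = { lines: [], total: 0 });
    for (const line of lines) {
      buf.lines.push(line);
      buf.total += 1;
    }
    // Evict the oldest lines once the per-MAC capacity is exceeded.
    if (buf.lines.length > this.capacity) {
      buf.lines.splice(0, buf.lines.length - this.capacity);
    }
  }

  snapshot(mac: string): LogSnapshot {
    const buf = this.buffers[mac.toLowerCase()];
    return buf
      ? { log_lines: buf.lines.slice(), log_total: buf.total }
      : { log_lines: [], log_total: 0 };
  }
}
```

Returning both the retained lines and the running total lets a client detect that older lines were dropped when `log_total` exceeds `log_lines.length`.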
dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility
Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)
201 unit tests passing (11 new).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>