Commit Graph

9 Commits

Author SHA1 Message Date
Michal
dd92147341 fix(k3s): route audit logs through journald, codify etcd member recovery
Some checks failed
CI/CD / typecheck (pull_request) Failing after 13s
CI/CD / lint (pull_request) Failing after 23s
CI/CD / test (pull_request) Failing after 10s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Two changes prompted by today's etcd raft panic on worker1-k8s0
(tocommit out of range, lost-write on follower) and the cascading
disk pressure that surfaced underneath it.

Audit logs to journald
- kube-apiserver now uses audit-log-path=- so audit events flow to
  k3s.service stdout and into journald instead of growing files in
  /var/log/kubernetes. The previous setup combined apiserver's
  internal rotation with a logrotate *.log glob that double-rotated
  the rotated files into permanent orphans (observed: 7+ GB).
- New journald-limits operation writes a SystemMaxUse=2G drop-in so
  audit volume cannot fill /var/log even under bursty load.
- log-rotation operation repurposed to decommission the obsolete
  logrotate rule and reap leftover audit files. Idempotent: no-op
  on fresh installs.

Etcd member recovery
- New recoverEtcdMember(broken, peer, hostname) codifies the
  documented k3s recovery: stop k3s, etcdctl member remove, wipe
  /var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for
  rejoin. Refuses to operate when cluster size < 3 to preserve
  quorum.

Tests
- 7 new unit tests covering both decommission paths and the
  recovery procedure (54 total, all green).
- install.test.ts asserts the file-based audit args are gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:29:16 +01:00
Michal
9ddab24931 feat: provision recheck, hardware info preservation, ISO boot fixes
Some checks failed
CI/CD / lint (pull_request) Failing after 1m26s
CI/CD / typecheck (pull_request) Failing after 11s
CI/CD / test (pull_request) Failing after 11s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Add `labctl provision recheck` to refresh hardware info via SSH
- Preserve hardware info in InstalledInfo when install completes
- Fix /ks-auto: run nested %pre scripts from included kickstarts
- Add command-discover WebSocket routing for hw info updates
- Fix k3s join: clean stale TLS/cred when joining existing cluster
- Add --tls-verify=false for internal HTTP registry pushes
- Add fix-ssh-root.sh script for root SSH access on all nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 17:59:39 +01:00
Michal
06fc40a857 fix: k3s install automation — skip Cilium on join, Longhorn via server, default root user
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 9s
CI/CD / lint (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Skip Cilium install for joining servers (already in cluster via daemonset)
- Longhorn annotation for workers: SSH to server node from CLI to apply
  kubectl annotation (workers don't have kubectl access)
- Default SSH user for k3s/app commands changed to 'root' (operations
  need root privileges, using 'lab' user broke installs)
- k3s server config: cluster-init for initial server, server+token for joins

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 16:02:19 +01:00
Michal
a68d6d617e feat: k3s cluster-init for etcd HA, fix Cilium duplicate install
Some checks failed
CI/CD / lint (push) Failing after 11s
CI/CD / test (push) Failing after 10s
CI/CD / typecheck (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Server config now uses cluster-init: true for initial server (enables
  embedded etcd). Joining servers get server: + token: in config.
- Cilium install already checks for existing installation, so joining
  servers skip it gracefully (the "release name in use" error is non-fatal)

Cluster rebuilt as etcd HA:
  worker0-k8s0  control-plane,etcd  (initial server, cluster-init)
  worker1-k8s0  control-plane,etcd  (joined server, Mac Studio aarch64)
  spark-2935    worker              (DGX Spark, aarch64)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:53:18 +01:00
Michal
c49a650888 fix: firstboot fstab handling — no duplicates, compatible with Asahi sed
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 11s
CI/CD / lint (push) Failing after 23s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Replace sed with grep -v / awk for fstab manipulation (Asahi Fedora's
  sed doesn't support \| delimiter or \? quantifier)
- Use idempotent write_lab_fstab function: removes all old entries first,
  comments out conflicting btrfs subvol entries, adds fresh LVM entries
- Fix sed for SSH hardening: use #* instead of \? (POSIX compatible)
- Tested on Mac Studio: no duplicate fstab entries after multiple runs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:40:29 +01:00
Michal
87e09af941 fix: default admin user to 'lab', case-insensitive OS detection for iSCSI
Some checks failed
CI/CD / typecheck (push) Failing after 10s
CI/CD / test (push) Failing after 10s
CI/CD / lint (push) Failing after 22s
CI/CD / build (push) Has been skipped
CI/CD / publish-rpm (push) Has been skipped
CI/CD / publish-deb (push) Has been skipped
- Firstboot script defaults admin user to 'lab' instead of bastion's
  config.adminUser (which was 'michal' from host system)
- iSCSI OS detection uses case-insensitive match for 'fedora'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 15:13:53 +01:00
Michal
bb8f37ef7d feat: iSCSI, Longhorn disk labels, labctl asahi command, ZIP32 fix
Some checks failed
CI/CD / typecheck (pull_request) Failing after 12s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / test (pull_request) Failing after 10s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
k3s host prep:
- Add iSCSI initiator install+enable (Fedora: iscsi-initiator-utils,
  Ubuntu: open-iscsi) — required by Longhorn
- Add Longhorn disk label to k3s server+agent configs
- Add Longhorn disk annotation operation in post-install hardening

CLI:
- Add `labctl provision asahi` command with interactive install guide
- Change default SSH user from "michal" to "lab" in all commands
- Change admin user in bastion progress callback to "lab"

Asahi provisioning fixes:
- Download installer_data.json locally (installer reads it as file)
- Use REPO_BASE to serve upstream ZIP from bastion (LAN speed)
- Fix ZIP32 vs ZIP64: serve original upstream ZIP unmodified
  (our repackaged ZIP used ZIP64 which breaks Asahi urlcache)
- Add /data/asahi-repo fallback path for k3s container PVC mount
- Deploy script syncs asahi-repo to bastion pod after deployment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 23:32:38 +01:00
Michal
aea28b5a0f fix: Cilium multi-node support — auto-detect NIC, k3s agent API port, worker label
Some checks failed
CI/CD / typecheck (pull_request) Failing after 10s
CI/CD / lint (pull_request) Failing after 22s
CI/CD / test (pull_request) Failing after 7m8s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
- Remove hardcoded devices/directRoutingDevice from Cilium install (let
  Cilium auto-detect per node — needed for heterogeneous NICs like eno1 vs enP7s7)
- Set k8sServiceHost=127.0.0.1 k8sServicePort=6444 so Cilium init
  containers can reach the API via k3s agent's local LB proxy
- Add node-role.kubernetes.io/worker label to agent config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 01:35:51 +01:00
Michal
46b017d77e feat: install logging, error trapping, PXE/ISO integration tests
Some checks failed
CI/CD / lint (pull_request) Failing after 13s
CI/CD / test (pull_request) Failing after 10s
CI/CD / typecheck (pull_request) Failing after 36s
CI/CD / build (pull_request) Has been skipped
CI/CD / publish-rpm (pull_request) Has been skipped
CI/CD / publish-deb (pull_request) Has been skipped
Kickstart installs on real hardware failed silently — no error reporting,
only 3 progress callbacks, zero log streaming. This overhaul makes every
install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log — receives raw log lines from kickstart (single or batch)
- InstallLogBuffer — per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac — now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow — uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 22:26:33 +00:00