michal/lab

Author	SHA1	Message	Date
Michal	dd92147341	fix(k3s): route audit logs through journald, codify etcd member recovery	2026-05-05 21:29:16 +01:00

Two changes prompted by today's etcd raft panic on worker1-k8s0 (tocommit out of range, lost-write on follower) and the cascading disk pressure that surfaced underneath it.

Audit logs to journald
- kube-apiserver now uses audit-log-path=- so audit events flow to k3s.service stdout and into journald instead of growing files in /var/log/kubernetes. The previous setup combined apiserver's internal rotation with a logrotate *.log glob that double-rotated the rotated files into permanent orphans (observed: 7+ GB).
- New journald-limits operation writes a SystemMaxUse=2G drop-in so audit volume cannot fill /var/log even under bursty load.
- log-rotation operation repurposed to decommission the obsolete logrotate rule and reap leftover audit files. Idempotent: no-op on fresh installs.

Etcd member recovery
- New recoverEtcdMember(broken, peer, hostname) codifies the documented k3s recovery: stop k3s, etcdctl member remove, wipe /var/lib/rancher/k3s/server/{db,tls,cred}, restart, poll for rejoin. Refuses to operate when cluster size < 3 to preserve quorum.

Tests
- 7 new unit tests covering both decommission paths and the recovery procedure (54 total, all green).
- install.test.ts asserts the file-based audit args are gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
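
The recovery sequence and quorum guard described in dd92147341 can be sketched as a pure planning function. This is a minimal illustration, not the repository's code: `planEtcdMemberRecovery`, `RecoveryStep`, the step wording, and the choice of which host runs each step are all assumptions. Only the step order, the three wiped directories, and the "refuse below 3 members" rule come from the commit message.

```typescript
// Hypothetical sketch: names and step wording are illustrative only.
interface RecoveryStep {
  host: string;   // node the step would run on (assumed, not confirmed by the source)
  action: string; // human-readable description of the step
}

// The commit message states that recovery is refused below this size to preserve quorum.
const MIN_CLUSTER_SIZE = 3;

function planEtcdMemberRecovery(
  broken: string,
  peer: string,
  clusterSize: number,
): RecoveryStep[] {
  if (clusterSize < MIN_CLUSTER_SIZE) {
    throw new Error(
      `refusing to recover ${broken}: cluster size ${clusterSize} < ${MIN_CLUSTER_SIZE}`,
    );
  }
  const serverDir = "/var/lib/rancher/k3s/server";
  return [
    { host: broken, action: "stop k3s" },
    // Connection flags and the member ID lookup are omitted here.
    { host: peer, action: `etcdctl member remove (member for ${broken})` },
    ...["db", "tls", "cred"].map((dir) => ({
      host: broken,
      action: `wipe ${serverDir}/${dir}`,
    })),
    { host: broken, action: "start k3s" },
    { host: peer, action: `poll until ${broken} rejoins` },
  ];
}
```

Separating the plan from its execution keeps the guard easy to unit-test without SSH access; a real implementation would run each step remotely and stop on the first failure.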
Michal	9ddab24931	feat: provision recheck, hardware info preservation, ISO boot fixes	2026-04-01 17:59:39 +01:00

- Add `labctl provision recheck` to refresh hardware info via SSH
- Preserve hardware info in InstalledInfo when install completes
- Fix /ks-auto: run nested %pre scripts from included kickstarts
- Add command-discover WebSocket routing for hw info updates
- Fix k3s join: clean stale TLS/cred when joining existing cluster
- Add --tls-verify=false for internal HTTP registry pushes
- Add fix-ssh-root.sh script for root SSH access on all nodes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal	06fc40a857	fix: k3s install automation - skip Cilium on join, Longhorn via server, default root user	2026-04-01 16:02:19 +01:00

- Skip Cilium install for joining servers (already in cluster via daemonset)
- Longhorn annotation for workers: SSH to server node from CLI to apply kubectl annotation (workers don't have kubectl access)
- Default SSH user for k3s/app commands changed to 'root' (operations need root privileges, using 'lab' user broke installs)
- k3s server config: cluster-init for initial server, server+token for joins

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal	a68d6d617e	feat: k3s cluster-init for etcd HA, fix Cilium duplicate install	2026-04-01 15:53:18 +01:00

- Server config now uses cluster-init: true for initial server (enables embedded etcd). Joining servers get server: + token: in config.
- Cilium install already checks for existing installation, so joining servers skip it gracefully (the "release name in use" error is non-fatal)

Cluster rebuilt as etcd HA:
  worker0-k8s0  control-plane,etcd  (initial server, cluster-init)
  worker1-k8s0  control-plane,etcd  (joined server, Mac Studio aarch64)
  spark-2935    worker              (DGX Spark, aarch64)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal	c49a650888	fix: firstboot fstab handling - no duplicates, compatible with Asahi sed	2026-04-01 15:40:29 +01:00

- Replace sed with grep -v / awk for fstab manipulation (Asahi Fedora's sed doesn't support \\| delimiter or \? quantifier)
- Use idempotent write_lab_fstab function: removes all old entries first, comments out conflicting btrfs subvol entries, adds fresh LVM entries
- Fix sed for SSH hardening: use #* instead of \? (POSIX compatible)
- Tested on Mac Studio: no duplicate fstab entries after multiple runs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
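
The idempotent rewrite described in c49a650888 (remove old entries, comment out conflicting btrfs subvol entries, add fresh entries) can be illustrated as a pure text transformation. The original is a shell function built on grep -v and awk; the TypeScript below is only a sketch of the same idea. The `# BEGIN lab-managed` / `# END lab-managed` markers, the definition of "conflicting" (an active btrfs `subvol=` entry sharing a mount point with a fresh entry), and the example device path are assumptions, not details taken from the repository.

```typescript
// Hypothetical markers used to recognise the block written by a previous run.
const BEGIN = "# BEGIN lab-managed";
const END = "# END lab-managed";

// fstab fields: device, mount point, fs type, options, dump, pass.
function fstabFields(line: string): string[] {
  return line.trim().split(/\s+/);
}

function isConflictingBtrfs(line: string, mountPoints: Set<string>): boolean {
  if (line.trim().startsWith("#")) return false; // already a comment
  const f = fstabFields(line);
  if (f.length < 4) return false; // blank or malformed line
  return f[2] === "btrfs" && f[3].includes("subvol=") && mountPoints.has(f[1]);
}

function writeLabFstab(current: string, freshEntries: string[]): string {
  const mountPoints = new Set(freshEntries.map((e) => fstabFields(e)[1]));
  const kept: string[] = [];
  let inManagedBlock = false;
  for (const line of current.split("\n")) {
    if (line === BEGIN) { inManagedBlock = true; continue; }
    if (line === END) { inManagedBlock = false; continue; }
    if (inManagedBlock) continue; // drop entries from the previous run
    kept.push(isConflictingBtrfs(line, mountPoints) ? `# ${line}` : line);
  }
  // Drop trailing blank lines so repeated runs do not accumulate them.
  while (kept.length > 0 && kept[kept.length - 1] === "") kept.pop();
  return [...kept, BEGIN, ...freshEntries, END].join("\n") + "\n";
}
```

Running the function on its own output yields the same text, which is the property the commit reports testing on the Mac Studio ("no duplicate fstab entries after multiple runs").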
Michal	87e09af941	fix: default admin user to 'lab', case-insensitive OS detection for iSCSI	2026-04-01 15:13:53 +01:00

- Firstboot script defaults admin user to 'lab' instead of bastion's config.adminUser (which was 'michal' from host system)
- iSCSI OS detection uses case-insensitive match for 'fedora'

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal	bb8f37ef7d	feat: iSCSI, Longhorn disk labels, labctl asahi command, ZIP32 fix	2026-03-31 23:32:38 +01:00

k3s host prep:
- Add iSCSI initiator install+enable (Fedora: iscsi-initiator-utils, Ubuntu: open-iscsi), required by Longhorn
- Add Longhorn disk label to k3s server+agent configs
- Add Longhorn disk annotation operation in post-install hardening

CLI:
- Add `labctl provision asahi` command with interactive install guide
- Change default SSH user from "michal" to "lab" in all commands
- Change admin user in bastion progress callback to "lab"

Asahi provisioning fixes:
- Download installer_data.json locally (installer reads it as file)
- Use REPO_BASE to serve upstream ZIP from bastion (LAN speed)
- Fix ZIP32 vs ZIP64: serve original upstream ZIP unmodified (our repackaged ZIP used ZIP64 which breaks Asahi urlcache)
- Add /data/asahi-repo fallback path for k3s container PVC mount
- Deploy script syncs asahi-repo to bastion pod after deployment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal	aea28b5a0f	fix: Cilium multi-node support - auto-detect NIC, k3s agent API port, worker label	2026-03-31 01:35:51 +01:00

- Remove hardcoded devices/directRoutingDevice from Cilium install (let Cilium auto-detect per node, needed for heterogeneous NICs like eno1 vs enP7s7)
- Set k8sServiceHost=127.0.0.1 k8sServicePort=6444 so Cilium init containers can reach the API via k3s agent's local LB proxy
- Add node-role.kubernetes.io/worker label to agent config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Michal	46b017d77e	feat: install logging, error trapping, PXE/ISO integration tests	2026-03-26 22:26:33 +00:00

Kickstart installs on real hardware failed silently: no error reporting, only 3 progress callbacks, zero log streaming. This overhaul makes every install fully observable.

Kickstart improvements:
- Error trapping in %pre and %post (trap ERR sends failure details to bastion)
- 12+ granular progress stages (was 3): SSH, hostname, k3s prep, EFI boot, metadata
- Background log streamer: tails %post output and batch-sends to /api/log
- bastion_log() function for explicit log lines from kickstart scripts

Bastion API:
- POST /api/log: receives raw log lines from kickstart (single or batch)
- InstallLogBuffer: per-MAC ring buffer (2000 lines) + file persistence
- GET /api/logs/:mac: now returns log_lines + log_total alongside stages
- SSE /api/logs/:mac/follow: uses named events (event: stage vs event: log)
- Progress events forwarded to labd via bastion-progress WebSocket message
- Post-provision k3s logs routed through progressBus (was console-only)

dnsmasq fixes found during VM testing:
- HTTP Boot filename: ipxe-real.efi → ipxe.efi (leftover from old 2-stage approach)
- pxe-service directives: only in proxy mode (breaks OVMF PXE in full mode)
- PXEClient vendor class echo for UEFI firmware compatibility

Integration tests:
- PXE boot test: blank UEFI VM → dnsmasq → HTTP Boot → iPXE → bastion → install
- ISO boot test: blank VM boots from bastion-generated ISO → same flow
- Shared helpers: pxe-network (no DHCP, nftables fix), pxe-vm (UEFI + ISO boot)
- test-provision.sh: runs both PXE + ISO tests with prerequisite checks
- 250GB sparse QCOW2 disk (LVM layout needs ~204GB)

201 unit tests passing (11 new).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
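
The per-MAC ring buffer mentioned in 46b017d77e can be sketched as follows. This is an illustration of the general technique rather than the actual InstallLogBuffer: the class name `LogRingBuffer`, the `appendLog` helper, and the lower-casing of MAC keys are assumptions, and file persistence is left out. The 2000-line capacity and the idea of reporting retained lines alongside a running total (log_lines and log_total) come from the commit message.

```typescript
// Hypothetical sketch of a fixed-capacity ring buffer for install log lines.
class LogRingBuffer {
  private readonly capacity: number;
  private readonly slots: string[] = [];
  private next = 0;  // slot the next line overwrites once the buffer is full
  private total = 0; // lines ever pushed, including those already overwritten

  constructor(capacity = 2000) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
  }

  push(line: string): void {
    if (this.slots.length < this.capacity) {
      this.slots.push(line);
    } else {
      this.slots[this.next] = line; // overwrite the oldest retained line
    }
    this.next = (this.next + 1) % this.capacity;
    this.total += 1;
  }

  // Retained lines, oldest first.
  lines(): string[] {
    if (this.slots.length < this.capacity) return [...this.slots];
    return [...this.slots.slice(this.next), ...this.slots.slice(0, this.next)];
  }

  get logTotal(): number {
    return this.total;
  }
}

// One buffer per MAC address; keys are lower-cased here as an assumption.
const buffersByMac = new Map<string, LogRingBuffer>();

function appendLog(mac: string, batch: string[]): LogRingBuffer {
  const key = mac.toLowerCase();
  let buffer = buffersByMac.get(key);
  if (!buffer) {
    buffer = new LogRingBuffer();
    buffersByMac.set(key, buffer);
  }
  for (const line of batch) buffer.push(line);
  return buffer;
}
```

A fixed capacity bounds memory per installing host, while the running total lets a client detect that older lines have been dropped.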

9 Commits